-
NeuroMoCo: A Neuromorphic Momentum Contrast Learning Method for Spiking Neural Networks
Authors:
Yuqi Ma,
Huamin Wang,
Hangchi Shen,
Xuemei Chen,
Shukai Duan,
Shiping Wen
Abstract:
Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened compl…
▽ More
Recently, brain-inspired spiking neural networks (SNNs) have attracted great research attention owing to their inherent bio-interpretability, event-triggered properties and powerful perception of spatiotemporal information, which is beneficial to handling event-based neuromorphic datasets. In contrast to conventional static image datasets, event-based neuromorphic datasets present heightened complexity in feature extraction due to their distinctive time series and sparsity characteristics, which influences their classification accuracy. To overcome this challenge, a novel approach termed Neuromorphic Momentum Contrast Learning (NeuroMoCo) for SNNs is introduced in this paper by extending the benefits of self-supervised pre-training to SNNs to effectively stimulate their potential. This is the first time that self-supervised learning (SSL) based on momentum contrastive learning is realized in SNNs. In addition, we devise a novel loss function named MixInfoNCE tailored to their temporal characteristics to further increase the classification accuracy of neuromorphic datasets, which is verified through rigorous ablation experiments. Finally, experiments on DVS-CIFAR10, DVS128Gesture and N-Caltech101 have shown that NeuroMoCo of this paper establishes new state-of-the-art (SOTA) benchmarks: 83.6% (Spikformer-2-256), 98.62% (Spikformer-2-256), and 84.4% (SEW-ResNet-18), respectively.
△ Less
Submitted 10 June, 2024;
originally announced June 2024.
-
Expressive Power of Graph Neural Networks for (Mixed-Integer) Quadratic Programs
Authors:
Ziang Chen,
Xiaohan Chen,
Jialin Liu,
Xinshang Wang,
Wotao Yin
Abstract:
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless the…
▽ More
Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless there is an effective precondition.
Recently, graph neural networks (GNNs) opened new possibilities for QP. Some promising empirical studies of applying GNNs for QP tasks show that GNNs can capture key characteristics of an optimization instance and provide adaptive guidance accordingly to crucial configurations during the solving process, or directly provide an approximate solution. Despite notable empirical observations, theoretical foundations are still lacking. In this work, we investigate the expressive or representative power of GNNs, a crucial aspect of neural network theory, specifically in the context of QP tasks, with both continuous and mixed-integer settings. We prove the existence of message-passing GNNs that can reliably represent key properties of quadratic programs, including feasibility, optimal objective value, and optimal solution. Our theory is validated by numerical results.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models
Authors:
Ziqiang Liu,
Feiteng Fang,
Xi Feng,
Xinrun Du,
Chenhao Zhang,
Zekun Wang,
Yuelin Bai,
Qixuan Zhao,
Liyang Fan,
Chengguang Gan,
Hongquan Lin,
Jiaming Li,
Yuansheng Ni,
Haihong Wu,
Yaswanth Narsupalli,
Zhigang Zheng,
Chengming Li,
Xiping Hu,
Ruifeng Xu,
Xiaojun Chen,
Min Yang,
Jiaheng Liu,
Ruibo Liu,
Wenhao Huang,
Ge Zhang
, et al. (1 additional authors not shown)
Abstract:
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap,…
▽ More
The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
MaLa-ASR: Multimedia-Assisted LLM-Based ASR
Authors:
Guanrou Yang,
Ziyang Ma,
Fan Yu,
Zhifu Gao,
Shiliang Zhang,
Xie Chen
Abstract:
As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate…
▽ More
As more and more information-rich data like video become available, utilizing multi-modal auxiliary information to enhance audio tasks has sparked widespread research interest. The recent surge in research on LLM-based audio models provides fresh perspectives for tackling audio tasks. Given that LLM can flexibly ingest multiple inputs, we propose MaLa-ASR, an LLM-based ASR model that can integrate textual keywords extracted from presentation slides to improve recognition of conference content. MaLa-ASR yields average WERs of 9.4% and 11.7% on the L95 and S95 subsets of the SlideSpeech corpus, representing a significant relative WER drop of 27.9% and 44.7% over the baseline model reported in SlideSpeech. MaLa-ASR underscores LLM's strong performance in speech tasks and the capability to integrate auxiliary information conveniently. By adding keywords to the input prompt, the biased word error rate (B-WER) reduces relatively by 46.0% and 44.2%, establishing a new SOTA on this dataset.
△ Less
Submitted 9 June, 2024;
originally announced June 2024.
-
Write Summary Step-by-Step: A Pilot Study of Stepwise Summarization
Authors:
Xiuying Chen,
Shen Gao,
Mingzhe Li,
Qingqing Zhu,
Xin Gao,
Xiangliang Zhang
Abstract:
Nowadays, neural text generation has made tremendous progress in abstractive summarization tasks. However, most of the existing summarization models take in the whole document all at once, which sometimes cannot meet the needs in practice. Practically, social text streams such as news events and tweets keep growing from time to time, and can only be fed to the summarization system step by step. He…
▽ More
Nowadays, neural text generation has made tremendous progress in abstractive summarization tasks. However, most of the existing summarization models take in the whole document all at once, which sometimes cannot meet the needs in practice. Practically, social text streams such as news events and tweets keep growing from time to time, and can only be fed to the summarization system step by step. Hence, in this paper, we propose the task of Stepwise Summarization, which aims to generate a new appended summary each time a new document is proposed. The appended summary should not only summarize the newly added content but also be coherent with the previous summary, to form an up-to-date complete summary. To tackle this challenge, we design an adversarial learning model, named Stepwise Summary Generator (SSG). First, SSG selectively processes the new document under the guidance of the previous summary, obtaining polished document representation. Next, SSG generates the summary considering both the previous summary and the document. Finally, a convolutional-based discriminator is employed to determine whether the newly generated summary is coherent with the previous summary. For the experiment, we extend the traditional two-step update summarization setting to a multi-step stepwise setting, and re-propose a large-scale stepwise summarization dataset based on a public story generation dataset. Extensive experiments on this dataset show that SSG achieves state-of-the-art performance in terms of both automatic metrics and human evaluations. Ablation studies demonstrate the effectiveness of each module in our framework. We also discuss the benefits and limitations of recent large language models on this task.
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
Flexible and Adaptable Summarization via Expertise Separation
Authors:
Xiuying Chen,
Mingzhe Li,
Shen Gao,
Xin Cheng,
Qingqing Zhu,
Rui Yan,
Xin Gao,
Xiangliang Zhang
Abstract:
A proficient summarization model should exhibit both flexibility -- the capacity to handle a range of in-domain summarization tasks, and adaptability -- the competence to acquire new knowledge and adjust to unseen out-of-domain tasks. Unlike large language models (LLMs) that achieve this through parameter scaling, we propose a more parameter-efficient approach in this study. Our motivation rests o…
▽ More
A proficient summarization model should exhibit both flexibility -- the capacity to handle a range of in-domain summarization tasks, and adaptability -- the competence to acquire new knowledge and adjust to unseen out-of-domain tasks. Unlike large language models (LLMs) that achieve this through parameter scaling, we propose a more parameter-efficient approach in this study. Our motivation rests on the principle that the general summarization ability to capture salient information can be shared across different tasks, while the domain-specific summarization abilities need to be distinct and tailored. Concretely, we propose MoeSumm, a Mixture-of-Expert Summarization architecture, which utilizes a main expert for gaining the general summarization capability and deputy experts that selectively collaborate to meet specific summarization task requirements. We further propose a max-margin loss to stimulate the separation of these abilities. Our model's distinct separation of general and domain-specific summarization abilities grants it with notable flexibility and adaptability, all while maintaining parameter efficiency. MoeSumm achieves flexibility by managing summarization across multiple domains with a single model, utilizing a shared main expert and selected deputy experts. It exhibits adaptability by tailoring deputy experts to cater to out-of-domain few-shot and zero-shot scenarios. Experimental results on 11 datasets show the superiority of our model compared with recent baselines and LLMs. We also provide statistical and visual evidence of the distinct separation of the two abilities in MoeSumm (https://github.com/iriscxy/MoE_Summ).
△ Less
Submitted 8 June, 2024;
originally announced June 2024.
-
3D-GRAND: Towards Better Grounding and Less Hallucination for 3D-LLMs
Authors:
Jianing Yang,
Xuweiyi Chen,
Nikhil Madaan,
Madhavan Iyengar,
Shengyi Qian,
David F. Fouhey,
Joyce Chai
Abstract:
The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datase…
▽ More
The integration of language and 3D perception is crucial for developing embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the absence of large-scale datasets that provide dense grounding between language and 3D scenes. In this paper, we introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons among future models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the critical role of large-scale 3D-text datasets in advancing embodied AI research. Notably, our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with essential resources and insights, setting the stage for more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
Targeted Mining Precise-positioning Episode Rules
Authors:
Jian Zhu,
Xiaoye Chen,
Wensheng Gan,
Zefeng Chen,
Philip S. Yu
Abstract:
The era characterized by an exponential increase in data has led to the widespread adoption of data intelligence as a crucial task. Within the field of data mining, frequent episode mining has emerged as an effective tool for extracting valuable and essential information from event sequences. Various algorithms have been developed to discover frequent episodes and subsequently derive episode rules…
▽ More
The era characterized by an exponential increase in data has led to the widespread adoption of data intelligence as a crucial task. Within the field of data mining, frequent episode mining has emerged as an effective tool for extracting valuable and essential information from event sequences. Various algorithms have been developed to discover frequent episodes and subsequently derive episode rules using the frequency function and anti-monotonicity principles. However, currently, there is a lack of algorithms specifically designed for mining episode rules that encompass user-specified query episodes. To address this challenge and enable the mining of target episode rules, we introduce the definition of targeted precise-positioning episode rules and formulate the problem of targeted mining precise-positioning episode rules. Most importantly, we develop an algorithm called Targeted Mining Precision Episode Rules (TaMIPER) to address the problem and optimize it using four proposed strategies, leading to significant reductions in both time and space resource requirements. As a result, TaMIPER offers high accuracy and efficiency in mining episode rules of user interest and holds promising potential for prediction tasks in various domains, such as weather observation, network intrusion, and e-commerce. Experimental results on six real datasets demonstrate the exceptional performance of TaMIPER.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
CRAG -- Comprehensive RAG Benchmark
Authors:
Xiao Yang,
Kai Sun,
Hao Xin,
Yushi Sun,
Nikita Bhalla,
Xiangsen Chen,
Sajal Choudhary,
Rongze Daniel Gui,
Ziran Will Jiang,
Ziyu Jiang,
Lingkun Kong,
Brian Moran,
Jiaqi Wang,
Yifan Ethan Xu,
An Yan,
Chenyu Yang,
Eting Yuan,
Hanwen Zha,
Nan Tang,
Lei Chen,
Nicolas Scheffer,
Yue Liu,
Nirav Shah,
Rakesh Wanga,
Anuj Kumar
, et al. (2 additional authors not shown)
Abstract:
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering bench…
▽ More
Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering benchmark of 4,409 question-answer pairs and mock APIs to simulate web and Knowledge Graph (KG) search. CRAG is designed to encapsulate a diverse array of questions across five domains and eight question categories, reflecting varied entity popularity from popular to long-tail, and temporal dynamisms ranging from years to seconds. Our evaluation on this benchmark highlights the gap to fully trustworthy QA. Whereas most advanced LLMs achieve <=34% accuracy on CRAG, adding RAG in a straightforward manner improves the accuracy only to 44%. State-of-the-art industry RAG solutions only answer 63% questions without any hallucination. CRAG also reveals much lower accuracy in answering questions regarding facts with higher dynamism, lower popularity, or higher complexity, suggesting future research directions. The CRAG benchmark laid the groundwork for a KDD Cup 2024 challenge, attracting thousands of participants and submissions within the first 50 days of the competition. We commit to maintaining CRAG to serve research communities in advancing RAG solutions and general QA solutions.
△ Less
Submitted 7 June, 2024;
originally announced June 2024.
-
MUSE: Flexible Voiceprint Receptive Fields and Multi-Path Fusion Enhanced Taylor Transformer for U-Net-based Speech Enhancement
Authors:
Zizhen Lin,
Xiaoting Chen,
Junyu Wang
Abstract:
Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the Unet architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, whic…
▽ More
Achieving a balance between lightweight design and high performance remains a challenging task for speech enhancement. In this paper, we introduce Multi-path Enhanced Taylor (MET) Transformer based U-net for Speech Enhancement (MUSE), a lightweight speech enhancement network built upon the Unet architecture. Our approach incorporates a novel Multi-path Enhanced Taylor (MET) Transformer block, which integrates Deformable Embedding (DE) to enable flexible receptive fields for voiceprints. The MET Transformer is uniquely designed to fuse Channel and Spatial Attention (CSA) branches, facilitating channel information exchange and addressing spatial attention deficits within the Taylor-Transformer framework. Through extensive experiments conducted on the VoiceBank+DEMAND dataset, we demonstrate that MUSE achieves competitive performance while significantly reducing both training and deployment costs, boasting a mere 0.51M parameters.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Neural Codec-based Adversarial Sample Detection for Speaker Verification
Authors:
Xuanjun Chen,
Jiawei Du,
Haibin Wu,
Jyh-Shing Roger Jang,
Hung-yi Lee
Abstract:
Automatic Speaker Verification (ASV), increasingly used in security-critical applications, faces vulnerabilities from rising adversarial attacks, with few effective defenses available. In this paper, we propose a neural codec-based adversarial sample detection method for ASV. The approach leverages the codec's ability to discard redundant perturbations and retain essential information. Specificall…
▽ More
Automatic Speaker Verification (ASV), increasingly used in security-critical applications, faces vulnerabilities from rising adversarial attacks, with few effective defenses available. In this paper, we propose a neural codec-based adversarial sample detection method for ASV. The approach leverages the codec's ability to discard redundant perturbations and retain essential information. Specifically, we distinguish between genuine and adversarial samples by comparing ASV score differences between original and re-synthesized audio (by codec models). This comprehensive study explores all open-source neural codecs and their variant models for experiments. The Descript-audio-codec model stands out by delivering the highest detection rate among 15 neural codecs and surpassing seven prior state-of-the-art (SOTA) detection methods. Note that, our single-model method even outperforms a SOTA ensemble method by a large margin.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
NATURAL PLAN: Benchmarking LLMs on Natural Language Planning
Authors:
Huaixiu Steven Zheng,
Swaroop Mishra,
Hugh Zhang,
Xinyun Chen,
Minmin Chen,
Azade Nova,
Le Hou,
Heng-Tze Cheng,
Quoc V. Le,
Ed H. Chi,
Denny Zhou
Abstract:
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for…
▽ More
We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for a tool-use environment for evaluating LLMs on Planning. We observe that NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively. We find that model performance drops drastically as the complexity of the problem increases: all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs. We also conduct extensive ablation studies on NATURAL PLAN to further shed light on the (in)effectiveness of approaches such as self-correction, few-shot generalization, and in-context planning with long-contexts on improving LLM planning.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Untrained Neural Nets for Snapshot Compressive Imaging: Theory and Algorithms
Authors:
Mengyu Zhao,
Xi Chen,
Xin Yuan,
Shirin Jalali
Abstract:
Snapshot compressive imaging (SCI) recovers high-dimensional (3D) data cubes from a single 2D measurement, enabling diverse applications like video and hyperspectral imaging to go beyond standard techniques in terms of acquisition speed and efficiency. In this paper, we focus on SCI recovery algorithms that employ untrained neural networks (UNNs), such as deep image prior (DIP), to model source st…
▽ More
Snapshot compressive imaging (SCI) recovers high-dimensional (3D) data cubes from a single 2D measurement, enabling diverse applications like video and hyperspectral imaging to go beyond standard techniques in terms of acquisition speed and efficiency. In this paper, we focus on SCI recovery algorithms that employ untrained neural networks (UNNs), such as deep image prior (DIP), to model source structure. Such UNN-based methods are appealing as they have the potential of avoiding the computationally intensive retraining required for different source models and different measurement scenarios. We first develop a theoretical framework for characterizing the performance of such UNN-based methods. The theoretical framework, on the one hand, enables us to optimize the parameters of data-modulating masks, and on the other hand, provides a fundamental connection between the number of data frames that can be recovered from a single measurement to the parameters of the untrained NN. We also employ the recently proposed bagged-deep-image-prior (bagged-DIP) idea to develop SCI Bagged Deep Video Prior (SCI-BDVP) algorithms that address the common challenges faced by standard UNN solutions. Our experimental results show that in video SCI our proposed solution achieves state-of-the-art among UNN methods, and in the case of noisy measurements, it even outperforms supervised solutions.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion
Authors:
Hao Wen,
Zehuan Huang,
Yaohui Wang,
Xinyuan Chen,
Yu Qiao,
Lu Sheng
Abstract:
Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which in…
▽ More
Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion framework with 3D-aware feedback unites the entire process and improves geometric consistency.Experiments show that our framework outperforms separation of these two stages and existing methods that combine them at the inference phase. Project page: https://costwen.github.io/Ouroboros3D/
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Phy-Diff: Physics-guided Hourglass Diffusion Model for Diffusion MRI Synthesis
Authors:
Juanhua Zhang,
Ruodan Yan,
Alessandro Perelli,
Xi Chen,
Chao Li
Abstract:
Diffusion MRI (dMRI) is an important neuroimaging technique with high acquisition costs. Deep learning approaches have been used to enhance dMRI and predict diffusion biomarkers through undersampled dMRI. To generate more comprehensive raw dMRI, generative adversarial network based methods are proposed to include b-values and b-vectors as conditions, but they are limited by unstable training and l…
▽ More
Diffusion MRI (dMRI) is an important neuroimaging technique with high acquisition costs. Deep learning approaches have been used to enhance dMRI and predict diffusion biomarkers through undersampled dMRI. To generate more comprehensive raw dMRI, generative adversarial network based methods are proposed to include b-values and b-vectors as conditions, but they are limited by unstable training and less desirable diversity. The emerging diffusion model (DM) promises to improve generative performance. However, it remains challenging to include essential information in conditioning DM for more relevant generation, i.e., the physical principles of dMRI and white matter tract structures. In this study, we propose a physics-guided diffusion model to generate high-quality dMRI. Our model introduces the physical principles of dMRI in the noise evolution in the diffusion process and introduce a query-based conditional mapping within the difussion model. In addition, to enhance the anatomical fine detials of the generation, we introduce the XTRACT atlas as prior of white matter tracts by adopting an adapter technique. Our experiment results show that our method outperforms other state-of-the-art methods and has the potential to advance dMRI enhancement.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Readability-guided Idiom-aware Sentence Simplification (RISS) for Chinese
Authors:
Jingshen Zhang,
Xinglu Chen,
Xinying Qiu,
Zhimin Wang,
Wenhe Feng
Abstract:
Chinese sentence simplification faces challenges due to the lack of large-scale labeled parallel corpora and the prevalence of idioms. To address these challenges, we propose Readability-guided Idiom-aware Sentence Simplification (RISS), a novel framework that combines data augmentation techniques with lexcial simplification. RISS introduces two key components: (1) Readability-guided Paraphrase Se…
▽ More
Chinese sentence simplification faces challenges due to the lack of large-scale labeled parallel corpora and the prevalence of idioms. To address these challenges, we propose Readability-guided Idiom-aware Sentence Simplification (RISS), a novel framework that combines data augmentation techniques with lexcial simplification. RISS introduces two key components: (1) Readability-guided Paraphrase Selection (RPS), a method for mining high-quality sentence pairs, and (2) Idiom-aware Simplification (IAS), a model that enhances the comprehension and simplification of idiomatic expressions. By integrating RPS and IAS using multi-stage and multi-task learning strategies, RISS outperforms previous state-of-the-art methods on two Chinese sentence simplification datasets. Furthermore, RISS achieves additional improvements when fine-tuned on a small labeled dataset. Our approach demonstrates the potential for more effective and accessible Chinese text simplification.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
DDGS-CT: Direction-Disentangled Gaussian Splatting for Realistic Volume Rendering
Authors:
Zhongpai Gao,
Benjamin Planche,
Meng Zheng,
Xiao Chen,
Terrence Chen,
Ziyan Wu
Abstract:
Digitally reconstructed radiographs (DRRs) are simulated 2D X-ray images generated from 3D CT volumes, widely used in preoperative settings but limited in intraoperative applications due to computational bottlenecks, especially for accurate but heavy physics-based Monte Carlo methods. While analytical DRR renderers offer greater efficiency, they overlook anisotropic X-ray image formation phenomena…
▽ More
Digitally reconstructed radiographs (DRRs) are simulated 2D X-ray images generated from 3D CT volumes, widely used in preoperative settings but limited in intraoperative applications due to computational bottlenecks, especially for accurate but heavy physics-based Monte Carlo methods. While analytical DRR renderers offer greater efficiency, they overlook anisotropic X-ray image formation phenomena, such as Compton scattering. We present a novel approach that marries realistic physics-inspired X-ray simulation with efficient, differentiable DRR generation using 3D Gaussian splatting (3DGS). Our direction-disentangled 3DGS (DDGS) method separates the radiosity contribution into isotropic and direction-dependent components, approximating complex anisotropic interactions without intricate runtime simulations. Additionally, we adapt the 3DGS initialization to account for tomography data properties, enhancing accuracy and efficiency. Our method outperforms state-of-the-art techniques in image accuracy. Furthermore, our DDGS shows promise for intraoperative applications and inverse problems such as pose registration, delivering superior registration accuracy and runtime performance compared to analytical DRR methods.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Authors:
Dongchao Yang,
Dingdong Wang,
Haohan Guo,
Xueyuan Chen,
Xixin Wu,
Helen Meng
Abstract:
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compac…
▽ More
In this study, we propose a simple and efficient Non-Autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simpleness shows in three aspects: (1) It can be trained on the speech-only dataset, without any alignment information; (2) It directly takes plain text as input and generates speech through an NAR way; (3) It tries to model speech in a finite and compact latent space, which alleviates the modeling difficulty of diffusion. More specifically, we propose a novel speech codec model (SQ-Codec) with scalar quantization, SQ-Codec effectively maps the complex speech signal into a finite and compact latent space, named scalar latent space. Benefits from SQ-Codec, we apply a novel transformer diffusion model in the scalar latent space of SQ-Codec. We train SimpleSpeech on 4k hours of a speech-only dataset, it shows natural prosody and voice cloning ability. Compared with previous large-scale TTS models, it presents significant speech quality and generation speed improvement. Demos are released.
△ Less
Submitted 5 June, 2024; v1 submitted 4 June, 2024;
originally announced June 2024.
-
LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer
Authors:
Jiapin Wang,
Xiangping Zhang,
Chenlei Tang,
Xiang Chen,
Tao Lu
Abstract:
PCIe devices, such as SSDs and GPUs, are pivotal in modern data centers, and their value is set to grow amidst the emergence of AI and large models. However, these devices face onboard DRAM shortage issue due to internal space limitation, preventing accommodation of sufficient DRAM modules alongside flash or GPU processing chips. Current solutions either curb device-internal memory usage or supple…
▽ More
PCIe devices, such as SSDs and GPUs, are pivotal in modern data centers, and their value is set to grow amidst the emergence of AI and large models. However, these devices face onboard DRAM shortage issue due to internal space limitation, preventing accommodation of sufficient DRAM modules alongside flash or GPU processing chips. Current solutions either curb device-internal memory usage or supplement slower non-DRAM mediums, prove inadequate or performance-compromising. This paper introduces the Linked Memory Buffer (LMB), a scalable solution utilizing the CXL memory expander to tackle device onboard memory deficiencies. The low-latency of CXL enables LMB to utilize emerging DRAM memory expander to efficiently supplement device onboard DRAM with minimal impact on performance.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
It Takes Two: A Peer-Prediction Solution for Blockchain Verifier's Dilemma
Authors:
Zishuo Zhao,
Xi Chen,
Yuan Zhou
Abstract:
The security of blockchain systems is fundamentally based on the decentralized consensus in which the majority of parties behave honestly, and the process of content verification is essential to keep the robustness of blockchain systems. However, the phenomenon that a secure blockchain system with few or no cheaters could not provide sufficient incentive for verifiers to honestly perform the costl…
▽ More
The security of blockchain systems is fundamentally based on the decentralized consensus in which the majority of parties behave honestly, and the process of content verification is essential to keep the robustness of blockchain systems. However, the phenomenon that a secure blockchain system with few or no cheaters could not provide sufficient incentive for verifiers to honestly perform the costly verification, referred to as the Verifier's Dilemma, could severely undermine the fundamental security of blockchain systems. While existing works have attempted to insert deliberate errors to disincentivize lazy verification, the decentralized environment makes it impossible to judge the correctness of verification or detect malicious verifiers directly.
In this paper, we initiate the research that leverages the peer prediction approach towards the design of Bayesian truthful mechanisms for the decentralized verification game among multiple verifiers, incentivizing all verifiers to perform honest verification without access to the ground truth even in the presence of noisy observations in the verification process. With theoretically guaranteed truthfulness of our mechanism for the verification game, our work provides a framework of verification mechanisms that enhances the security and robustness of the blockchain and potentially other decentralized systems.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
Towards Flexible Interactive Reflection Removal with Human Guidance
Authors:
Xiao Chen,
Xudong Jiang,
Yunkang Tao,
Zhen Lei,
Qing Li,
Chenyang Lei,
Zhaoxiang Zhang
Abstract:
Single image reflection removal is inherently ambiguous, as both the reflection and transmission components requiring separation may follow natural image statistics. Existing methods attempt to address the issue by using various types of low-level and physics-based cues as sources of reflection signals. However, these cues are not universally applicable, since they are only observable in specific…
▽ More
Single image reflection removal is inherently ambiguous, as both the reflection and transmission components requiring separation may follow natural image statistics. Existing methods attempt to address the issue by using various types of low-level and physics-based cues as sources of reflection signals. However, these cues are not universally applicable, since they are only observable in specific capture scenarios. This leads to a significant performance drop when test images do not align with their assumptions. In this paper, we aim to explore a novel flexible interactive reflection removal approach that leverages various forms of sparse human guidance, such as points and bounding boxes, as auxiliary high-level prior to achieve robust reflection removal. However, incorporating the raw user guidance naively into the existing reflection removal network does not result in performance gains. To this end, we innovatively transform raw user input into a unified form -- reflection masks using an Interactive Segmentation Foundation Model. Such a design absorbs the quintessence of the foundational segmentation model and flexible human guidance, thereby mitigating the challenges of reflection separations. Furthermore, to fully utilize user guidance and reduce user annotation costs, we design a mask-guided reflection removal network, comprising our proposed self-adaptive prompt block. This block adaptively incorporates user guidance as anchors and refines transmission features via cross-attention mechanisms. Extensive results on real-world images validate that our method demonstrates state-of-the-art performance on various datasets with the help of flexible and sparse user guidance. Our code and dataset will be publicly available here https://github.com/ShawnChenn/FlexibleReflectionRemoval.
△ Less
Submitted 3 June, 2024;
originally announced June 2024.
-
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Authors:
Ding Jia,
Jianyuan Guo,
Kai Han,
Han Wu,
Chao Zhang,
Chang Xu,
Xinghao Chen
Abstract:
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods which replace less informative tokens with inter-modal features, and demonstrate exchange based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its u…
▽ More
Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token exchange methods which replace less informative tokens with inter-modal features, and demonstrate exchange based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection and arbitrary-modal semantic segmentation tasks, including RGB, depth, LiDAR, event data, etc. demonstrate the superior performance of our GeminiFusion against leading-edge techniques. The PyTorch code is available at https://github.com/JiaDingCN/GeminiFusion
△ Less
Submitted 3 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
Efficient Behavior Tree Planning with Commonsense Pruning and Heuristic
Authors:
Xinglin Chen,
Yishuai Cai,
Yunxin Mao,
Minglong Li,
Zhou Yang,
Wen Shanghua,
Wenjing Yang,
Weixia Xu,
Ji Wang
Abstract:
Behavior Tree (BT) planning is crucial for autonomous robot behavior control, yet its application in complex scenarios is hampered by long planning times. Pruning and heuristics are common techniques to accelerate planning, but it is difficult to design general pruning strategies and heuristic functions for BT planning problems. This paper proposes improving BT planning efficiency for everyday ser…
▽ More
Behavior Tree (BT) planning is crucial for autonomous robot behavior control, yet its application in complex scenarios is hampered by long planning times. Pruning and heuristics are common techniques to accelerate planning, but it is difficult to design general pruning strategies and heuristic functions for BT planning problems. This paper proposes improving BT planning efficiency for everyday service robots leveraging commonsense reasoning provided by Large Language Models (LLMs), leading to model-free pre-planning action space pruning and heuristic generation. This approach takes advantage of the modularity and interpretability of BT nodes, represented by predicate logic, to enable LLMs to predict the task-relevant action predicates and objects, and even the optimal path, without an explicit action model. We propose the Heuristic Optimal Behavior Tree Expansion Algorithm (HOBTEA) with two heuristic variants and provide a formal comparison and discussion of their efficiency and optimality. We introduce a learnable and transferable commonsense library to enhance the LLM's reasoning performance without fine-tuning. The action space expansion based on the commonsense library can further increase the success rate of planning. Experiments show the theoretical bounds of commonsense pruning and heuristic, and demonstrate the actual performance of LLM learning and reasoning with the commonsense library. Results in four datasets showcase the practical effectiveness of our approach in everyday service robot applications.
△ Less
Submitted 3 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
Maximum-Entropy Regularized Decision Transformer with Reward Relabelling for Dynamic Recommendation
Authors:
Xiaocong Chen,
Siyu Wang,
Lina Yao
Abstract:
Reinforcement learning-based recommender systems have recently gained popularity. However, due to the typical limitations of simulation environments (e.g., data inefficiency), most of the work cannot be broadly applied in all domains. To counter these challenges, recent advancements have leveraged offline reinforcement learning methods, notable for their data-driven approach utilizing offline data…
▽ More
Reinforcement learning-based recommender systems have recently gained popularity. However, due to the typical limitations of simulation environments (e.g., data inefficiency), most of the work cannot be broadly applied in all domains. To counter these challenges, recent advancements have leveraged offline reinforcement learning methods, notable for their data-driven approach utilizing offline datasets. A prominent example of this is the Decision Transformer. Despite its popularity, the Decision Transformer approach has inherent drawbacks, particularly evident in recommendation methods based on it. This paper identifies two key shortcomings in existing Decision Transformer-based methods: a lack of stitching capability and limited effectiveness in online adoption. In response, we introduce a novel methodology named Max-Entropy enhanced Decision Transformer with Reward Relabeling for Offline RLRS (EDT4Rec). Our approach begins with a max entropy perspective, leading to the development of a max entropy enhanced exploration strategy. This strategy is designed to facilitate more effective exploration in online environments. Additionally, to augment the model's capability to stitch sub-optimal trajectories, we incorporate a unique reward relabeling technique. To validate the effectiveness and superiority of EDT4Rec, we have conducted comprehensive experiments across six real-world offline datasets and in an online simulator.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
SimSAM: Zero-shot Medical Image Segmentation via Simulated Interaction
Authors:
Benjamin Towle,
Xin Chen,
Ke Zhou
Abstract:
The recently released Segment Anything Model (SAM) has shown powerful zero-shot segmentation capabilities through a semi-automatic annotation setup in which the user can provide a prompt in the form of clicks or bounding boxes. There is growing interest around applying this to medical imaging, where the cost of obtaining expert annotations is high, privacy restrictions may limit sharing of patient…
▽ More
The recently released Segment Anything Model (SAM) has shown powerful zero-shot segmentation capabilities through a semi-automatic annotation setup in which the user can provide a prompt in the form of clicks or bounding boxes. There is growing interest around applying this to medical imaging, where the cost of obtaining expert annotations is high, privacy restrictions may limit sharing of patient data, and model generalisation is often poor. However, there are large amounts of inherent uncertainty in medical images, due to unclear object boundaries, low-contrast media, and differences in expert labelling style. Currently, SAM is known to struggle in a zero-shot setting to adequately annotate the contours of the structure of interest in medical images, where the uncertainty is often greatest, thus requiring significant manual correction. To mitigate this, we introduce \textbf{Sim}ulated Interaction for \textbf{S}egment \textbf{A}nything \textbf{M}odel (\textsc{\textbf{SimSAM}}), an approach that leverages simulated user interaction to generate an arbitrary number of candidate masks, and uses a novel aggregation approach to output the most compatible mask. Crucially, our method can be used during inference directly on top of SAM, without any additional training requirement. Quantitatively, we evaluate our method across three publicly available medical imaging datasets, and find that our approach leads to up to a 15.5\% improvement in contour segmentation accuracy compared to zero-shot SAM. Our code is available at \url{https://github.com/BenjaminTowle/SimSAM}.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
Intelligent Text-Conditioned Music Generation
Authors:
Zhouyao Xie,
Nikhil Yadala,
Xinyi Chen,
Jing Xi Liu
Abstract:
CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption given an image. It has been used extensively in image generation by connecting its output with a generative model such as VQGAN, with the most notable example being OpenAI's DALLE-2. In this project, we apply a similar approach to bridge the gap bet…
▽ More
CLIP (Contrastive Language-Image Pre-Training) is a multimodal neural network trained on (text, image) pairs to predict the most relevant text caption given an image. It has been used extensively in image generation by connecting its output with a generative model such as VQGAN, with the most notable example being OpenAI's DALLE-2. In this project, we apply a similar approach to bridge the gap between natural language and music. Our model is split into two steps: first, we train a CLIP-like model on pairs of text and music over contrastive loss to align a piece of music with its most probable text caption. Then, we combine the alignment model with a music decoder to generate music. To the best of our knowledge, this is the first attempt at text-conditioned deep music generation. Our experiments show that it is possible to train the text-music alignment model using contrastive loss and train a decoder to generate music from text prompts.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
Making Recommender Systems More Knowledgeable: A Framework to Incorporate Side Information
Authors:
Yukun Jiang,
Leo Guo,
Xinyi Chen,
Jing Xi Liu
Abstract:
Session-based recommender systems typically focus on using only the triplet (user_id, timestamp, item_id) to make predictions of users' next actions. In this paper, we aim to utilize side information to help recommender systems catch patterns and signals otherwise undetectable. Specifically, we propose a general framework for incorporating item-specific side information into the recommender system…
▽ More
Session-based recommender systems typically focus on using only the triplet (user_id, timestamp, item_id) to make predictions of users' next actions. In this paper, we aim to utilize side information to help recommender systems catch patterns and signals otherwise undetectable. Specifically, we propose a general framework for incorporating item-specific side information into the recommender system to enhance its performance without much modification on the original model architecture. Experimental results on several models and datasets prove that with side information, our recommender system outperforms state-of-the-art models by a considerable margin and converges much faster. Additionally, we propose a new type of loss to regularize the attention mechanism used by recommender systems and evaluate its influence on model performance. Furthermore, through analysis, we put forward a few insights on potential further improvements.
△ Less
Submitted 2 June, 2024;
originally announced June 2024.
-
Multi-variable Adversarial Time-Series Forecast Model
Authors:
Xiaoqiao Chen
Abstract:
Short-term industrial enterprises power system forecasting is an important issue for both load control and machine protection. Scientists focus on load forecasting but ignore other valuable electric-meters which should provide guidance of power system protection. We propose a new framework, multi-variable adversarial time-series forecasting model, which regularizes Long Short-term Memory (LSTM) mo…
▽ More
Short-term industrial enterprises power system forecasting is an important issue for both load control and machine protection. Scientists focus on load forecasting but ignore other valuable electric-meters which should provide guidance of power system protection. We propose a new framework, multi-variable adversarial time-series forecasting model, which regularizes Long Short-term Memory (LSTM) models via an adversarial process. The novel model forecasts all variables (may in different type, such as continue variables, category variables, etc.) in power system at the same time and helps trade-off process between forecasting accuracy of single variable and variable-variable relations. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples. The predict results of electricity consumption of industrial enterprises by multi-variable adversarial time-series forecasting model show that the proposed approach is able to achieve better prediction accuracy. We also applied this model to real industrial enterprises power system data we gathered from several large industrial enterprises via advanced power monitors, and got impressed forecasting results.
△ Less
Submitted 1 June, 2024;
originally announced June 2024.
-
Memory-guided Network with Uncertainty-based Feature Augmentation for Few-shot Semantic Segmentation
Authors:
Xinyue Chen,
Miaojing Shi
Abstract:
The performance of supervised semantic segmentation methods highly relies on the availability of large-scale training data. To alleviate this dependence, few-shot semantic segmentation (FSS) is introduced to leverage the model trained on base classes with sufficient data into the segmentation of novel classes with few data. FSS methods face the challenge of model generalization on novel classes du…
▽ More
The performance of supervised semantic segmentation methods highly relies on the availability of large-scale training data. To alleviate this dependence, few-shot semantic segmentation (FSS) is introduced to leverage the model trained on base classes with sufficient data into the segmentation of novel classes with few data. FSS methods face the challenge of model generalization on novel classes due to the distribution shift between base and novel classes. To overcome this issue, we propose a class-shared memory (CSM) module consisting of a set of learnable memory vectors. These memory vectors learn elemental object patterns from base classes during training whilst re-encoding query features during both training and inference, thereby improving the distribution alignment between base and novel classes. Furthermore, to cope with the performance degradation resulting from the intra-class variance across images, we introduce an uncertainty-based feature augmentation (UFA) module to produce diverse query features during training for improving the model's robustness. We integrate CSM and UFA into representative FSS works, with experimental results on the widely-used PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrating the superior performance of ours over state of the art.
△ Less
Submitted 9 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
Non-destructive Degradation Pattern Decoupling for Ultra-early Battery Prototype Verification Using Physics-informed Machine Learning
Authors:
Shengyu Tao,
Mengtian Zhang,
Zixi Zhao,
Haoyang Li,
Ruifei Ma,
Yunhong Che,
Xin Sun,
Lin Su,
Xiangyu Chen,
Zihao Zhou,
Heng Chang,
Tingwei Cao,
Xiao Xiao,
Yaojun Liu,
Wenjun Yu,
Zhongling Xu,
Yang Li,
Han Hao,
Xuan Zhang,
Xiaosong Hu,
Guangmin ZHou
Abstract:
Manufacturing complexities and uncertainties have impeded the transition from material prototypes to commercial batteries, making prototype verification critical to quality assessment. A fundamental challenge involves deciphering intertwined chemical processes to characterize degradation patterns and their quantitative relationship with battery performance. Here we show that a physics-informed mac…
▽ More
Manufacturing complexities and uncertainties have impeded the transition from material prototypes to commercial batteries, making prototype verification critical to quality assessment. A fundamental challenge involves deciphering intertwined chemical processes to characterize degradation patterns and their quantitative relationship with battery performance. Here we show that a physics-informed machine learning approach can quantify and visualize temporally resolved losses concerning thermodynamics and kinetics only using electric signals. Our method enables non-destructive degradation pattern characterization, expediting temperature-adaptable predictions of entire lifetime trajectories, rather than end-of-life points. The verification speed is 25 times faster yet maintaining 95.1% accuracy across temperatures. Such advances facilitate more sustainable management of defective prototypes before massive production, establishing a 19.76 billion USD scrap material recycling market by 2060 in China. By incorporating stepwise charge acceptance as a measure of the initial manufacturing variability of normally identical batteries, we can immediately identify long-term degradation variations. We attribute the predictive power to interpreting machine learning insights using material-agnostic featurization taxonomy for degradation pattern decoupling. Our findings offer new possibilities for dynamic system analysis, such as battery prototype degradation, demonstrating that complex pattern evolutions can be accurately predicted in a non-destructive and data-driven fashion by integrating physics-informed machine learning.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
Towards LLM-Powered Verilog RTL Assistant: Self-Verification and Self-Correction
Authors:
Hanxian Huang,
Zhenghan Lin,
Zixuan Wang,
Xin Chen,
Ke Ding,
Jishen Zhao
Abstract:
We explore the use of Large Language Models (LLMs) to generate high-quality Register-Transfer Level (RTL) code with minimal human interference. The traditional RTL design workflow requires human experts to manually write high-quality RTL code, which is time-consuming and error-prone. With the help of emerging LLMs, developers can describe their requirements to LLMs which then generate correspondin…
▽ More
We explore the use of Large Language Models (LLMs) to generate high-quality Register-Transfer Level (RTL) code with minimal human interference. The traditional RTL design workflow requires human experts to manually write high-quality RTL code, which is time-consuming and error-prone. With the help of emerging LLMs, developers can describe their requirements to LLMs which then generate corresponding code in Python, C, Java, and more. Adopting LLMs to generate RTL design in hardware description languages is not trivial, given the complex nature of hardware design and the generated design has to meet the timing and physical constraints.
We propose VeriAssist, an LLM-powered programming assistant for Verilog RTL design workflow. VeriAssist takes RTL design descriptions as input and generates high-quality RTL code with corresponding test benches. VeriAssist enables the LLM to self-correct and self-verify the generated code by adopting an automatic prompting system and integrating RTL simulator in the code generation loop. To generate an RTL design, VeriAssist first generates the initial RTL code and corresponding test benches, followed by a self-verification step that walks through the code with test cases to reason the code behavior at different time steps, and finally it self-corrects the code by reading the compilation and simulation results and generating final RTL code that fixes errors in compilation and simulation. This design fully leverages the LLMs' capabilities on multi-turn interaction and chain-of-thought reasoning to improve the quality of the generated code. We evaluate VeriAssist with various benchmark suites and find it significantly improves both syntax and functionality correctness over existing LLM implementations, thus minimizing human intervention and making RTL design more accessible to novice designers.
△ Less
Submitted 31 May, 2024;
originally announced June 2024.
-
BadRAG: Identifying Vulnerabilities in Retrieval Augmented Generation of Large Language Models
Authors:
Jiaqi Xue,
Mengxin Zheng,
Yebowen Hu,
Fei Liu,
Xun Chen,
Qian Lou
Abstract:
Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as "hallucinations." Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to…
▽ More
Large Language Models (LLMs) are constrained by outdated information and a tendency to generate incorrect data, commonly referred to as "hallucinations." Retrieval-Augmented Generation (RAG) addresses these limitations by combining the strengths of retrieval-based methods and generative models. This approach involves retrieving relevant information from a large, up-to-date dataset and using it to enhance the generation process, leading to more accurate and contextually appropriate responses. Despite its benefits, RAG introduces a new attack surface for LLMs, particularly because RAG databases are often sourced from public data, such as the web. In this paper, we propose \TrojRAG{} to identify the vulnerabilities and attacks on retrieval parts (RAG database) and their indirect attacks on generative parts (LLMs). Specifically, we identify that poisoning several customized content passages could achieve a retrieval backdoor, where the retrieval works well for clean queries but always returns customized poisoned adversarial queries. Triggers and poisoned passages can be highly customized to implement various attacks. For example, a trigger could be a semantic group like "The Republican Party, Donald Trump, etc." Adversarial passages can be tailored to different contents, not only linked to the triggers but also used to indirectly attack generative LLMs without modifying them. These attacks can include denial-of-service attacks on RAG and semantic steering attacks on LLM generations conditioned by the triggers. Our experiments demonstrate that by just poisoning 10 adversarial passages can induce 98.2\% success rate to retrieve the adversarial passages. Then, these passages can increase the reject ratio of RAG-based GPT-4 from 0.01\% to 74.6\% or increase the rate of negative responses from 0.22\% to 72\% for targeted queries.
△ Less
Submitted 6 June, 2024; v1 submitted 2 June, 2024;
originally announced June 2024.
-
LocMoE+: Enhanced Router with Token Feature Awareness for Efficient LLM Pre-Training
Authors:
Jing Li,
Zhijie Sun,
Dachao Lin,
Xuan He,
Yi Lin,
Binfan Zheng,
Li Zeng,
Rongqian Zhao,
Xin Chen
Abstract:
Mixture-of-Experts (MoE) architectures have recently gained increasing popularity within the domain of large language models (LLMs) due to their ability to significantly reduce training and inference overhead. However, MoE architectures face challenges, such as significant disparities in the number of tokens assigned to each expert and a tendency toward homogenization among experts, which adversel…
▽ More
Mixture-of-Experts (MoE) architectures have recently gained increasing popularity within the domain of large language models (LLMs) due to their ability to significantly reduce training and inference overhead. However, MoE architectures face challenges, such as significant disparities in the number of tokens assigned to each expert and a tendency toward homogenization among experts, which adversely affects the model's semantic generation capabilities. In this paper, we introduce LocMoE+, a refined version of the low-overhead LocMoE, incorporating the following enhancements: (1) Quantification and definition of the affinity between experts and tokens. (2) Implementation of a global-level adaptive routing strategy to rearrange tokens based on their affinity scores. (3) Reestimation of the lower bound for expert capacity, which has been shown to progressively decrease as the token feature distribution evolves. Experimental results demonstrate that, without compromising model convergence or efficacy, the number of tokens each expert processes can be reduced by over 60%. Combined with communication optimizations, this leads to an average improvement in training efficiency ranging from 5.4% to 46.6%. After fine-tuning, LocMoE+ exhibits a performance improvement of 9.7% to 14.1% across the GDAD, C-Eval, and TeleQnA datasets.
△ Less
Submitted 23 May, 2024;
originally announced June 2024.
-
Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
Authors:
Feiteng Fang,
Yuelin Bai,
Shiwen Ni,
Min Yang,
Xiaojun Chen,
Ruifeng Xu
Abstract:
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges, including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can potentially hinder the LLMs' capac…
▽ More
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges, including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can potentially hinder the LLMs' capacity to generate comprehensive and high-quality responses. Prior RAG studies on the robustness of retrieval noises often confine themselves to a limited set of noise types, deviating from real-world retrieval environments and limiting practical applicability. In this study, we initially investigate retrieval noises and categorize them into three distinct types, reflecting real-world environments. We analyze the impact of these various retrieval noises on the robustness of LLMs. Subsequently, we propose a novel RAG approach known as Retrieval-augmented Adaptive Adversarial Training (RAAT). RAAT leverages adaptive adversarial training to dynamically adjust the model's training process in response to retrieval noises. Concurrently, it employs multi-task learning to ensure the model's capacity to internally recognize noisy contexts. Extensive experiments demonstrate that the LLaMA-2 7B model trained using RAAT exhibits significant improvements in F1 and EM scores under diverse noise conditions. For reproducibility, we release our code and data at: https://github.com/calubkk/RAAT.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
MeshXL: Neural Coordinate Field for Generative 3D Foundation Models
Authors:
Sijin Chen,
Xin Chen,
Anqi Pang,
Xianfang Zeng,
Wei Cheng,
Yijun Fu,
Fukun Yin,
Yanru Wang,
Zhibin Wang,
Chi Zhang,
Jingyi Yu,
Gang Yu,
Bin Fu,
Tao Chen
Abstract:
The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, which is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation…
▽ More
The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, which is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate the Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple-yet-effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models, which addresses the process of 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various down-stream applications.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
4Diffusion: Multi-view Video Diffusion Model for 4D Generation
Authors:
Haiyu Zhang,
Xinyuan Chen,
Yaohui Wang,
Xihui Liu,
Yunhong Wang,
Yu Qiao
Abstract:
Current 4D generation methods have achieved noteworthy efficacy with the aid of advanced diffusion generative models. However, these methods lack multi-view spatial-temporal modeling and encounter challenges in integrating diverse prior knowledge from multiple diffusion models, resulting in inconsistent temporal appearance and flickers. In this paper, we propose a novel 4D generation pipeline, nam…
▽ More
Current 4D generation methods have achieved noteworthy efficacy with the aid of advanced diffusion generative models. However, these methods lack multi-view spatial-temporal modeling and encounter challenges in integrating diverse prior knowledge from multiple diffusion models, resulting in inconsistent temporal appearance and flickers. In this paper, we propose a novel 4D generation pipeline, namely 4Diffusion aimed at generating spatial-temporally consistent 4D content from a monocular video. We first design a unified diffusion model tailored for multi-view video generation by incorporating a learnable motion module into a frozen 3D-aware diffusion model to capture multi-view spatial-temporal correlations. After training on a curated dataset, our diffusion model acquires reasonable temporal consistency and inherently preserves the generalizability and spatial consistency of the 3D-aware diffusion model. Subsequently, we propose 4D-aware Score Distillation Sampling loss, which is based on our multi-view video diffusion model, to optimize 4D representation parameterized by dynamic NeRF. This aims to eliminate discrepancies arising from multiple diffusion models, allowing for generating spatial-temporally consistent 4D content. Moreover, we devise an anchor loss to enhance the appearance details and facilitate the learning of dynamic NeRF. Extensive qualitative and quantitative experiments demonstrate that our method achieves superior performance compared to previous methods.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Query Provenance Analysis for Robust and Efficient Query-based Black-box Attack Defense
Authors:
Shaofei Li,
Ziqi Zhang,
Haomin Jia,
Ding Li,
Yao Guo,
Xiangqun Chen
Abstract:
Query-based black-box attacks have emerged as a significant threat to machine learning systems, where adversaries can manipulate the input queries to generate adversarial examples that can cause misclassification of the model. To counter these attacks, researchers have proposed Stateful Defense Models (SDMs) for detecting adversarial query sequences and rejecting queries that are "similar" to the…
▽ More
Query-based black-box attacks have emerged as a significant threat to machine learning systems, where adversaries can manipulate the input queries to generate adversarial examples that can cause misclassification of the model. To counter these attacks, researchers have proposed Stateful Defense Models (SDMs) for detecting adversarial query sequences and rejecting queries that are "similar" to the history queries. Existing state-of-the-art (SOTA) SDMs (e.g., BlackLight and PIHA) have shown great effectiveness in defending against these attacks. However, recent studies have shown that they are vulnerable to Oracle-guided Adaptive Rejection Sampling (OARS) attacks, which is a stronger adaptive attack strategy. It can be easily integrated with existing attack algorithms to evade the SDMs by generating queries with fine-tuned direction and step size of perturbations utilizing the leaked decision information from the SDMs.
In this paper, we propose a novel approach, Query Provenance Analysis (QPA), for more robust and efficient SDMs. QPA encapsulates the historical relationships among queries as the sequence feature to capture the fundamental difference between benign and adversarial query sequences. To utilize the query provenance, we propose an efficient query provenance analysis algorithm with dynamic management. We evaluate QPA compared with two baselines, BlackLight and PIHA, on four widely used datasets with six query-based black-box attack algorithms. The results show that QPA outperforms the baselines in terms of defense effectiveness and efficiency on both non-adaptive and adaptive attacks. Specifically, QPA reduces the Attack Success Rate (ASR) of OARS to 4.08%, comparing to 77.63% and 87.72% for BlackLight and PIHA, respectively. Moreover, QPA also achieves 7.67x and 2.25x higher throughput than BlackLight and PIHA.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
EPIDetect: Video-based convulsive seizure detection in chronic epilepsy mouse model for anti-epilepsy drug screening
Authors:
Junming Ren,
Zhoujian Xiao,
Yujia Zhang,
Yujie Yang,
Ling He,
Ezra Yoon,
Stephen Temitayo Bello,
Xi Chen,
Dapeng Wu,
Micky Tortorella,
Jufang He
Abstract:
In the preclinical translational studies, drug candidates with remarkable anti-epileptic efficacy demonstrate long-term suppression of spontaneous recurrent seizures (SRSs), particularly convulsive seizures (CSs), in mouse models of chronic epilepsy. However, the current methods for monitoring CSs have limitations in terms of invasiveness, specific laboratory settings, high cost, and complex opera…
▽ More
In the preclinical translational studies, drug candidates with remarkable anti-epileptic efficacy demonstrate long-term suppression of spontaneous recurrent seizures (SRSs), particularly convulsive seizures (CSs), in mouse models of chronic epilepsy. However, the current methods for monitoring CSs have limitations in terms of invasiveness, specific laboratory settings, high cost, and complex operation, which hinder drug screening efforts. In this study, a camera-based system for automated detection of CSs in chronically epileptic mice is first established to screen potential anti-epilepsy drugs.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
Multi-label Class Incremental Emotion Decoding with Augmented Emotional Semantics Learning
Authors:
Kaicheng Fu,
Changde Du,
Xiaoyu Chen,
Jie Peng,
Huiguang He
Abstract:
Emotion decoding plays an important role in affective human-computer interaction. However, previous studies ignored the dynamic real-world scenario, where human experience a blend of multiple emotions which are incrementally integrated into the model, leading to the multi-label class incremental learning (MLCIL) problem. Existing methods have difficulty in solving MLCIL issue due to notorious cata…
▽ More
Emotion decoding plays an important role in affective human-computer interaction. However, previous studies ignored the dynamic real-world scenario, where human experience a blend of multiple emotions which are incrementally integrated into the model, leading to the multi-label class incremental learning (MLCIL) problem. Existing methods have difficulty in solving MLCIL issue due to notorious catastrophic forgetting caused by partial label problem and inadequate label semantics mining. In this paper, we propose an augmented emotional semantics learning framework for multi-label class incremental emotion decoding. Specifically, we design an augmented emotional relation graph module with label disambiguation to handle the past-missing partial label problem. Then, we leverage domain knowledge from affective dimension space to alleviate future-missing partial label problem by knowledge distillation. Besides, an emotional semantics learning module is constructed with a graph autoencoder to obtain emotion embeddings in order to guide the semantic-specific feature decoupling for better multi-label learning. Extensive experiments on three datasets show the superiority of our method for improving emotion decoding performance and mitigating forgetting on MLCIL problem.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Generalized Semi-Supervised Learning via Self-Supervised Feature Adaptation
Authors:
Jiachen Liang,
Ruibing Hou,
Hong Chang,
Bingpeng Ma,
Shiguang Shan,
Xilin Chen
Abstract:
Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under this setting, previous SSL methods tend to predict wr…
▽ More
Traditional semi-supervised learning (SSL) assumes that the feature distributions of labeled and unlabeled data are consistent which rarely holds in realistic scenarios. In this paper, we propose a novel SSL setting, where unlabeled samples are drawn from a mixed distribution that deviates from the feature distribution of labeled samples. Under this setting, previous SSL methods tend to predict wrong pseudo-labels with the model fitted on labeled data, resulting in noise accumulation. To tackle this issue, we propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions. SSFA decouples the prediction of pseudo-labels from the current model to improve the quality of pseudo-labels. Particularly, SSFA incorporates a self-supervised task into the SSL framework and uses it to adapt the feature extractor of the model to the unlabeled data. In this way, the extracted features better fit the distribution of unlabeled data, thereby generating high-quality pseudo-labels. Extensive experiments show that our proposed SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
1st Place Solution to Odyssey Emotion Recognition Challenge Task1: Tackling Class Imbalance Problem
Authors:
Mingjie Chen,
Hezhao Zhang,
Yuanchao Li,
Jiachen Luo,
Wen Wu,
Ziyang Ma,
Peter Bell,
Catherine Lai,
Joshua Reiss,
Lin Wang,
Philip C. Woodland,
Xie Chen,
Huy Phan,
Thomas Hain
Abstract:
Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for t…
▽ More
Speech emotion recognition is a challenging classification task with natural emotional speech, especially when the distribution of emotion types is imbalanced in the training and test data. In this case, it is more difficult for a model to learn to separate minority classes, resulting in those sometimes being ignored or frequently misclassified. Previous work has utilised class weighted loss for training, but problems remain as it sometimes causes over-fitting for minor classes or under-fitting for major classes. This paper presents the system developed by a multi-site team for the participation in the Odyssey 2024 Emotion Recognition Challenge Track-1. The challenge data has the aforementioned properties and therefore the presented systems aimed to tackle these issues, by introducing focal loss in optimisation when applying class weighted loss. Specifically, the focal loss is further weighted by prior-based class weights. Experimental results show that combining these two approaches brings better overall performance, by sacrificing performance on major classes. The system further employs a majority voting strategy to combine the outputs of an ensemble of 7 models. The models are trained independently, using different acoustic features and loss functions - with the aim to have different properties for different data. Hence these models show different performance preferences on major classes and minor classes. The ensemble system output obtained the best performance in the challenge, ranking top-1 among 68 submissions. It also outperformed all single models in our set. On the Odyssey 2024 Emotion Recognition Challenge Task-1 data the system obtained a Macro-F1 score of 35.69% and an accuracy of 37.32%.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Unbending strategies shepherd cooperation and suppress extortion in spatial populations
Authors:
Zijie Chen,
Yuxin Geng,
Xingru Chen,
Feng Fu
Abstract:
Evolutionary game dynamics on networks typically consider the competition among simple strategies such as cooperation and defection in the Prisoner's Dilemma and summarize the effect of population structure as network reciprocity. However, it remains largely unknown regarding the evolutionary dynamics involving multiple powerful strategies typically considered in repeated games, such as the zero-d…
▽ More
Evolutionary game dynamics on networks typically consider the competition among simple strategies such as cooperation and defection in the Prisoner's Dilemma and summarize the effect of population structure as network reciprocity. However, it remains largely unknown regarding the evolutionary dynamics involving multiple powerful strategies typically considered in repeated games, such as the zero-determinant (ZD) strategies that are able to enforce a linear payoff relationship between them and their co-players. Here, we consider the evolutionary dynamics of always cooperate (AllC), extortionate ZD (extortioners), and unbending players in lattice populations based on the commonly used death-birth updating. Out of the class of unbending strategies, we consider a particular candidate, PSO Gambler, a machine-learning-optimized memory-one strategy, which can foster reciprocal cooperation and fairness among extortionate players. We derive analytical results under weak selection and rare mutations, including pairwise fixation probabilities and long-term frequencies of strategies. In the absence of the third unbending type, extortioners can achieve a half-half split in equilibrium with unconditional cooperators for sufficiently large extortion factors. However, the presence of unbending players fundamentally changes the dynamics and tilts the system to favor unbending cooperation. Most surprisingly, extortioners cannot dominate at all regardless of how large their extortion factor is, and the long-term frequency of unbending players is maintained almost as a constant. Our analytical method is applicable to studying the evolutionary dynamics of multiple strategies in structured populations. Our work provides insights into the interplay between network reciprocity and direct reciprocity, revealing the role of unbending strategies in enforcing fairness and suppressing extortion.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Preference Learning Algorithms Do Not Learn Preference Rankings
Authors:
Angelica Chen,
Sadhika Malladi,
Lily H. Zhang,
Xinyi Chen,
Qiuyi Zhang,
Rajesh Ranganath,
Kyunghyun Cho
Abstract:
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via…
▽ More
Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via $\textit{ranking accuracy}$. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the $\textit{idealized ranking accuracy}$ that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant $\textit{alignment gap}$ -- $\textit{i.e.}$, a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Stochastic Optimization Algorithms for Instrumental Variable Regression with Streaming Data
Authors:
Xuxing Chen,
Abhishek Roy,
Yifan Hu,
Krishnakumar Balasubramanian
Abstract:
We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms neither require matrix inversions nor mini-batches and provides a fully online approach for performing instrumental variable regression with streaming data. When the true mode…
▽ More
We develop and analyze algorithms for instrumental variable regression by viewing the problem as a conditional stochastic optimization problem. In the context of least-squares instrumental variable regression, our algorithms neither require matrix inversions nor mini-batches and provides a fully online approach for performing instrumental variable regression with streaming data. When the true model is linear, we derive rates of convergence in expectation, that are of order $\mathcal{O}(\log T/T)$ and $\mathcal{O}(1/T^{1-ι})$ for any $ι>0$, respectively under the availability of two-sample and one-sample oracles, respectively, where $T$ is the number of iterations. Importantly, under the availability of the two-sample oracle, our procedure avoids explicitly modeling and estimating the relationship between confounder and the instrumental variables, demonstrating the benefit of the proposed approach over recent works based on reformulating the problem as minimax optimization problems. Numerical experiments are provided to corroborate the theoretical results.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Multi-modal Mood Reader: Pre-trained Model Empowers Cross-Subject Emotion Recognition
Authors:
Yihang Dong,
Xuhang Chen,
Yanyan Shen,
Michael Kwok-Po Ng,
Tao Qian,
Shuqiang Wang
Abstract:
Emotion recognition based on Electroencephalography (EEG) has gained significant attention and diversified development in fields such as neural signal processing and affective computing. However, the unique brain anatomy of individuals leads to non-negligible natural differences in EEG signals across subjects, posing challenges for cross-subject emotion recognition. While recent studies have attem…
▽ More
Emotion recognition based on Electroencephalography (EEG) has gained significant attention and diversified development in fields such as neural signal processing and affective computing. However, the unique brain anatomy of individuals leads to non-negligible natural differences in EEG signals across subjects, posing challenges for cross-subject emotion recognition. While recent studies have attempted to address these issues, they still face limitations in practical effectiveness and model framework unity. Current methods often struggle to capture the complex spatial-temporal dynamics of EEG signals and fail to effectively integrate multimodal information, resulting in suboptimal performance and limited generalizability across subjects. To overcome these limitations, we develop a Pre-trained model based Multimodal Mood Reader for cross-subject emotion recognition that utilizes masked brain signal modeling and interlinked spatial-temporal attention mechanism. The model learns universal latent representations of EEG signals through pre-training on large scale dataset, and employs Interlinked spatial-temporal attention mechanism to process Differential Entropy(DE) features extracted from EEG data. Subsequently, a multi-level fusion layer is proposed to integrate the discriminative features, maximizing the advantages of features across different dimensions and modalities. Extensive experiments on public datasets demonstrate Mood Reader's superior performance in cross-subject emotion recognition tasks, outperforming state-of-the-art methods. Additionally, the model is dissected from attention perspective, providing qualitative analysis of emotion-related brain areas, offering valuable insights for affective research in neural signal processing.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Authors:
Minghan Li,
Xilun Chen,
Ari Holtzman,
Beidi Chen,
Jimmy Lin,
Wen-tau Yih,
Xi Victoria Lin
Abstract:
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, w…
▽ More
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent texts. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
△ Less
Submitted 30 May, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Comparative Study of Neighbor-based Methods for Local Outlier Detection
Authors:
Zhuang Qi,
Junlin Zhang,
Xiaming Chen,
Xin Qi
Abstract:
The neighbor-based method has become a powerful tool to handle the outlier detection problem, which aims to infer the abnormal degree of the sample based on the compactness of the sample and its neighbors. However, the existing methods commonly focus on designing different processes to locate outliers in the dataset, while the contributions of different types neighbors to outlier detection has not…
▽ More
The neighbor-based method has become a powerful tool to handle the outlier detection problem, which aims to infer the abnormal degree of the sample based on the compactness of the sample and its neighbors. However, the existing methods commonly focus on designing different processes to locate outliers in the dataset, while the contributions of different types neighbors to outlier detection has not been well discussed. To this end, this paper studies the neighbor in the existing outlier detection algorithms and a taxonomy is introduced, which uses the three-level components of information, neighbor and methodology to define hybrid methods. This taxonomy can serve as a paradigm where a novel neighbor-based outlier detection method can be proposed by combining different components in this taxonomy. A large number of comparative experiments were conducted on synthetic and real-world datasets in terms of performance comparison and case study, and the results show that reverse K-nearest neighbor based methods achieve promising performance and dynamic selection method is suitable for working in high-dimensional space. Notably, it is verified that rationally selecting components from this taxonomy may create an algorithms superior to existing methods.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI
Authors:
Che Liu,
Changde Du,
Xiaoyu Chen,
Huiguang He
Abstract:
Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Leveraging non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach mimics the inverse pathway of auditory processing. Initially, we utili…
▽ More
Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Leveraging non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach mimics the inverse pathway of auditory processing. Initially, we utilize CLAP to decode fMRI data coarsely into a low-dimensional semantic space, followed by a fine-grained decoding into the high-dimensional AudioMAE latent space guided by semantic features. These fine-grained neural features serve as conditions for audio reconstruction through a Latent Diffusion Model (LDM). Validation on three public fMRI datasets-Brain2Sound, Brain2Music, and Brain2Speech-underscores the superiority of our coarse-to-fine decoding method over stand-alone fine-grained approaches, showcasing state-of-the-art performance in metrics like FD, FAD, and KL. Moreover, by employing semantic prompts during decoding, we enhance the quality of reconstructed audio when semantic features are suboptimal. The demonstrated versatility of our model across diverse stimuli highlights its potential as a universal brain-to-audio framework. This research contributes to the comprehension of the human auditory system, pushing boundaries in neural decoding and audio reconstruction methodologies.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Universal and Extensible Language-Vision Models for Organ Segmentation and Tumor Detection from Abdominal Computed Tomography
Authors:
Jie Liu,
Yixiao Zhang,
Kang Wang,
Mehmet Can Yavuz,
Xiaoxi Chen,
Yixuan Yuan,
Haoliang Li,
Yang Yang,
Alan Yuille,
Yucheng Tang,
Zongwei Zhou
Abstract:
The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and…
▽ More
The advancement of artificial intelligence (AI) for organ segmentation and tumor detection is propelled by the growing availability of computed tomography (CT) datasets with detailed, per-voxel annotations. However, these AI models often struggle with flexibility for partially annotated datasets and extensibility for new classes due to limitations in the one-hot encoding, architectural design, and learning scheme. To overcome these limitations, we propose a universal, extensible framework enabling a single model, termed Universal Model, to deal with multiple public datasets and adapt to new classes (e.g., organs/tumors). Firstly, we introduce a novel language-driven parameter generator that leverages language embeddings from large language models, enriching semantic encoding compared with one-hot encoding. Secondly, the conventional output layers are replaced with lightweight, class-specific heads, allowing Universal Model to simultaneously segment 25 organs and six types of tumors and ease the addition of new classes. We train our Universal Model on 3,410 CT volumes assembled from 14 publicly available datasets and then test it on 6,173 CT volumes from four external datasets. Universal Model achieves first place on six CT tasks in the Medical Segmentation Decathlon (MSD) public leaderboard and leading performance on the Beyond The Cranial Vault (BTCV) dataset. In summary, Universal Model exhibits remarkable computational efficiency (6x faster than other dataset-specific models), demonstrates strong generalization across different hospitals, transfers well to numerous downstream tasks, and more importantly, facilitates the extensibility to new classes while alleviating the catastrophic forgetting of previously learned classes. Codes, models, and datasets are available at https://github.com/ljwztc/CLIP-Driven-Universal-Model
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario
Authors:
Yuhan Li,
Hao Zhou,
Wenxiang Shang,
Ran Lin,
Xuanhong Chen,
Bingbing Ni
Abstract:
While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from issues of ill-fitted garment styles and quality degrading during the training process, not to mention the lack of support for various combinations of attire. Therefore, we first propose a ligh…
▽ More
While image-based virtual try-on has made significant strides, emerging approaches still fall short of delivering high-fidelity and robust fitting images across various scenarios, as their models suffer from issues of ill-fitted garment styles and quality degrading during the training process, not to mention the lack of support for various combinations of attire. Therefore, we first propose a lightweight, scalable, operator known as Hydra Block for attire combinations. This is achieved through a parallel attention mechanism that facilitates the feature injection of multiple garments from conditionally encoded branches into the main network. Secondly, to significantly enhance the model's robustness and expressiveness in real-world scenarios, we evolve its potential across diverse settings by synthesizing the residuals of multiple models, as well as implementing a mask region boost strategy to overcome the instability caused by information leakage in existing models. Equipped with the above design, AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large gap, excelling in producing well-fitting garments replete with photorealistic and rich details. Furthermore, AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image, paves a new path for future research within the fashion community.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.