-
4DHands: Reconstructing Interactive Hands in 4D with Transformers
Authors:
Dixuan Lin,
Yuxiang Zhang,
Mengcheng Li,
Yebin Liu,
Wei Jing,
Qi Yan,
Qianying Wang,
Hongwen Zhang
Abstract:
In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transforme…
▽ More
In this paper, we introduce 4DHands, a robust approach to recovering interactive hand meshes and their relative movement from monocular inputs. Our approach addresses two major limitations of previous methods: lacking a unified solution for handling various hand image inputs and neglecting the positional relationship of two hands within images. To overcome these challenges, we develop a transformer-based architecture with novel tokenization and feature fusion strategies. Specifically, we propose a Relation-aware Two-Hand Tokenization (RAT) method to embed positional relation information into the hand tokens. In this way, our network can handle both single-hand and two-hand inputs and explicitly leverage relative hand positions, facilitating the reconstruction of intricate hand interactions in real-world scenarios. As such tokenization indicates the relative relationship of two hands, it also supports more effective feature fusion. To this end, we further develop a Spatio-temporal Interaction Reasoning (SIR) module to fuse hand tokens in 4D with attention and decode them into 3D hand meshes and relative temporal movements. The efficacy of our approach is validated on several benchmark datasets. The results on in-the-wild videos and real-world scenarios demonstrate the superior performances of our approach for interactive hand reconstruction. More video results can be found on the project page: https://4dhands.github.io.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
GECO: Generative Image-to-3D within a SECOnd
Authors:
Chen Wang,
Jiatao Gu,
Xiaoxiao Long,
Yuan Liu,
Lingjie Liu
Abstract:
3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quali…
▽ More
3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in current methods through a two-stage approach. In the initial stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency from the multi-view prediction. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D generation with an unprecedented level of efficiency.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization
Authors:
Yuchi Liu,
Jaskirat Singh,
Gaowen Liu,
Ali Payani,
Liang Zheng
Abstract:
Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual…
▽ More
Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Structure Gaussian SLAM with Manhattan World Hypothesis
Authors:
Shuhong Liu,
Heng Zhou,
Liuzhuozheng Li,
Yun Liu,
Tianchen Deng,
Yiming Zhou,
Mingrui Li
Abstract:
Gaussian SLAM systems have made significant advancements in improving the efficiency and fidelity of real-time reconstructions. However, these systems often encounter incomplete reconstructions in complex indoor environments, characterized by substantial holes due to unobserved geometry caused by obstacles or limited view angles. To address this challenge, we present Manhattan Gaussian SLAM (MG-SL…
▽ More
Gaussian SLAM systems have made significant advancements in improving the efficiency and fidelity of real-time reconstructions. However, these systems often encounter incomplete reconstructions in complex indoor environments, characterized by substantial holes due to unobserved geometry caused by obstacles or limited view angles. To address this challenge, we present Manhattan Gaussian SLAM (MG-SLAM), an RGB-D system that leverages the Manhattan World hypothesis to enhance geometric accuracy and completeness. By seamlessly integrating fused line segments derived from structured scenes, MG-SLAM ensures robust tracking in textureless indoor areas. Moreover, The extracted lines and planar surface assumption allow strategic interpolation of new Gaussians in regions of missing geometry, enabling efficient scene completion. Extensive experiments conducted on both synthetic and real-world scenes demonstrate that these advancements enable our method to achieve state-of-the-art performance, marking a substantial improvement in the capabilities of Gaussian SLAM systems.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Multi-Aspect Controllable Text Generation with Disentangled Counterfactual Augmentation
Authors:
Yi Liu,
Xiangyu Liu,
Xiangrong Zhu,
Wei Hu
Abstract:
Multi-aspect controllable text generation aims to control the generated texts in attributes from multiple aspects (e.g., "positive" from sentiment and "sport" from topic). For ease of obtaining training samples, existing works neglect attribute correlations formed by the intertwining of different attributes. Particularly, the stereotype formed by imbalanced attribute correlations significantly aff…
▽ More
Multi-aspect controllable text generation aims to control the generated texts in attributes from multiple aspects (e.g., "positive" from sentiment and "sport" from topic). For ease of obtaining training samples, existing works neglect attribute correlations formed by the intertwining of different attributes. Particularly, the stereotype formed by imbalanced attribute correlations significantly affects multi-aspect control. In this paper, we propose MAGIC, a new multi-aspect controllable text generation method with disentangled counterfactual augmentation. We alleviate the issue of imbalanced attribute correlations during training using counterfactual feature vectors in the attribute latent space by disentanglement. During inference, we enhance attribute correlations by target-guided counterfactual augmentation to further improve multi-aspect control. Experiments show that MAGIC outperforms state-of-the-art baselines in both imbalanced and balanced attribute correlation scenarios. Our source code and data are available at https://github.com/nju-websoft/MAGIC.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Instruction-Guided Visual Masking
Authors:
Jinliang Zheng,
Jianxiong Li,
Sijie Cheng,
Yinan Zheng,
Jiaming Li,
Jihao Liu,
Yu Liu,
Jingjing Liu,
Xianyuan Zhan
Abstract:
Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with d…
▽ More
Instruction following is crucial in contemporary LLM. However, when extended to multimodal setting, it often suffers from misalignment between specific textual instruction and targeted local region of an image. To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMM and robot model. By constructing visual masks for instruction-irrelevant regions, IVM-enhanced multimodal models can effectively focus on task-relevant image regions to better align with complex instructions. Specifically, we design a visual masking data generation pipeline and create an IVM-Mix-1M dataset with 1 million image-instruction pairs. We further introduce a new learning technique, Discriminator Weighted Supervised Learning (DWSL) for preferential IVM training that prioritizes high-quality data samples. Experimental results on generic multimodal tasks such as VQA and embodied robotic control demonstrate the versatility of IVM, which as a plug-and-play tool, significantly boosts the performance of diverse multimodal models, yielding new state-of-the-art results across challenging multimodal benchmarks. Code is available at https://github.com/2toinf/IVM.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Learning Task-relevant Sequence Representations via Intrinsic Dynamics Characteristics in Reinforcement Learning
Authors:
Dayang Liang,
Jinyang Lai,
Yunlong Liu
Abstract:
Learning task-relevant state representations is crucial to solving the problem of scene generalization in visual deep reinforcement learning. Prior work typically establishes a self-supervised auxiliary learner, introducing elements (e.g., rewards and actions) to extract task-relevant state information from observations through behavioral similarity metrics. However, the methods often ignore the i…
▽ More
Learning task-relevant state representations is crucial to solving the problem of scene generalization in visual deep reinforcement learning. Prior work typically establishes a self-supervised auxiliary learner, introducing elements (e.g., rewards and actions) to extract task-relevant state information from observations through behavioral similarity metrics. However, the methods often ignore the inherent relationships between the elements (e.g., dynamics relationships) that are essential for learning accurate representations, and they are also limited to single-step metrics, which impedes the discrimination of short-term similar task/behavior information in long-term dynamics transitions. To solve the issues, we propose an intrinsic dynamic characteristics-driven sequence representation learning method (DSR) over a common DRL frame. Concretely, inspired by the fact of state transition in the underlying system, it constrains the optimization of the encoder via modeling the dynamics equations related to the state transition, which prompts the latent encoding information to satisfy the state transition process and thereby distinguishes state space and noise space. Further, to refine the ability of encoding similar tasks based on dynamics constraints, DSR also sequentially models inherent dynamics equation relationships from the perspective of sequence elements' frequency domain and multi-step prediction. Finally, experimental results show that DSR has achieved a significant performance boost in the Distracting DMControl Benchmark, with an average of 78.9% over the backbone baseline. Further results indicate that it also achieves the best performance in real-world autonomous driving tasks in the CARLA simulator. Moreover, the qualitative analysis results of t-SNE visualization validate that our method possesses superior representation ability on visual tasks.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
Research on Foundation Model for Spatial Data Intelligence: China's 2024 White Paper on Strategic Development of Spatial Data Intelligence
Authors:
Shaohua Wang,
Xing Xie,
Yong Li,
Danhuai Guo,
Zhi Cai,
Yu Liu,
Yang Yue,
Xiao Pan,
Feng Lu,
Huayi Wu,
Zhipeng Gui,
Zhiming Ding,
Bolong Zheng,
Fuzheng Zhang,
Tao Qin,
Jingyuan Wang,
Chuang Tao,
Zhengchao Chen,
Hao Lu,
Jiayi Li,
Hongyang Chen,
Peng Yue,
Wenhao Yu,
Yao Yao,
Leilei Sun
, et al. (9 additional authors not shown)
Abstract:
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial dat…
▽ More
This report focuses on spatial data intelligent large models, delving into the principles, methods, and cutting-edge applications of these models. It provides an in-depth discussion on the definition, development history, current status, and trends of spatial data intelligent large models, as well as the challenges they face. The report systematically elucidates the key technologies of spatial data intelligent large models and their applications in urban environments, aerospace remote sensing, geography, transportation, and other scenarios. Additionally, it summarizes the latest application cases of spatial data intelligent large models in themes such as urban development, multimodal systems, remote sensing, smart transportation, and resource environments. Finally, the report concludes with an overview and outlook on the development prospects of spatial data intelligent large models.
△ Less
Submitted 30 May, 2024;
originally announced May 2024.
-
CSANet: Channel Spatial Attention Network for Robust 3D Face Alignment and Reconstruction
Authors:
Yilin Liu,
Xuezhou Guo,
Xinqi Wang,
Fangzhou Du
Abstract:
Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built by Bottle-Neck structure via Depth-wise Separable Convolution. We integrate Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For more stable training process and better convergence, we jointly use Wing loss and the Weighted Pa…
▽ More
Our project proposes an end-to-end 3D face alignment and reconstruction network. The backbone of our model is built by Bottle-Neck structure via Depth-wise Separable Convolution. We integrate Coordinate Attention mechanism and Spatial Group-wise Enhancement to extract more representative features. For more stable training process and better convergence, we jointly use Wing loss and the Weighted Parameter Distance Cost to learn parameters for 3D Morphable model and 3D vertices. Our proposed model outperforms all baseline models both quantitatively and qualitatively.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Few for Many: Tchebycheff Set Scalarization for Many-Objective Optimization
Authors:
Xi Lin,
Yilu Liu,
Xiaoyuan Zhang,
Fei Liu,
Zhenkun Wang,
Qingfu Zhang
Abstract:
Multi-objective optimization can be found in many real-world applications where some conflicting objectives can not be optimized by a single solution. Existing optimization methods often focus on finding a set of Pareto solutions with different optimal trade-offs among the objectives. However, the required number of solutions to well approximate the whole Pareto optimal set could be exponentially…
▽ More
Multi-objective optimization can be found in many real-world applications where some conflicting objectives can not be optimized by a single solution. Existing optimization methods often focus on finding a set of Pareto solutions with different optimal trade-offs among the objectives. However, the required number of solutions to well approximate the whole Pareto optimal set could be exponentially large with respect to the number of objectives, which makes these methods unsuitable for handling many optimization objectives. In this work, instead of finding a dense set of Pareto solutions, we propose a novel Tchebycheff set scalarization method to find a few representative solutions (e.g., 5) to cover a large number of objectives (e.g., $>100$) in a collaborative and complementary manner. In this way, each objective can be well addressed by at least one solution in the small solution set. In addition, we further develop a smooth Tchebycheff set scalarization approach for efficient optimization with good theoretical guarantees. Experimental studies on different problems with many optimization objectives demonstrate the effectiveness of our proposed method.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Enabling Visual Recognition at Radio Frequency
Authors:
Haowen Lai,
Gaoxiang Luo,
Yifei Liu,
Mingmin Zhao
Abstract:
This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. Pano…
▽ More
This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar's robust performance across 12 buildings.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining
Authors:
Dong Li,
Yidi Liu,
Xueyang Fu,
Senyan Xu,
Zheng-Jun Zha
Abstract:
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, due to it acting as an effective frequency prior for capturing rain streaks. However, despite there exists dependency of low frequency and high frequency in images, these Fourier-based methods rarely…
▽ More
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, due to it acting as an effective frequency prior for capturing rain streaks. However, despite there exists dependency of low frequency and high frequency in images, these Fourier-based methods rarely exploit the correlation of different frequencies for conjuncting their learning procedures, limiting the full utilization of frequency information for image deraining. Alternatively, the recently emerged Mamba technique depicts its effectiveness and efficiency for modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into its unexplored Fourier spaces to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owning to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-high frequency order formats exhibit differently in the spatial dimension (unarranged in axis) and channel dimension (arranged in axis). Therefore, we design FourierMamba that correlates Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial dimension Fourier space, we introduce the zigzag coding to scan the frequencies to rearrange the orders from low to high frequencies, thereby orderly correlating the connections between frequencies; in the channel dimension Fourier space with arranged orders of frequencies in axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
$E^{3}$Gen: Efficient, Expressive and Editable Avatars Generation
Authors:
Weitian Zhang,
Yichao Yan,
Yunhui Liu,
Xingdong Sheng,
Xiaokang Yang
Abstract:
This paper aims to introduce 3D Gaussian for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: (1) The unstructured nature of 3D Gaussian makes it incompatible with current generation pipelines; (2) the expressive animation of 3D Gaussian in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we pr…
▽ More
This paper aims to introduce 3D Gaussian for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: (1) The unstructured nature of 3D Gaussian makes it incompatible with current generation pipelines; (2) the expressive animation of 3D Gaussian in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we propose a novel avatar generation method named $E^3$Gen, to effectively address these challenges. First, we propose a novel generative UV features plane representation that encodes unstructured 3D Gaussian onto a structured 2D UV space defined by the SMPL-X parametric model. This novel representation not only preserves the representation ability of the original 3D Gaussian but also introduces a shared structure among subjects to enable generative learning of the diffusion model. To tackle the second challenge, we propose a part-aware deformation module to achieve robust and accurate full-body expressive pose control. Extensive experiments demonstrate that our method achieves superior performance in avatar generation and enables expressive full-body pose control and editing. Our project page is https://olivia23333.github.io/E3Gen.
△ Less
Submitted 30 May, 2024; v1 submitted 29 May, 2024;
originally announced May 2024.
-
LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model
Authors:
Hongen Liu,
Yi Liu,
Di Sun,
Jiahao Wang,
Gang Pan
Abstract:
Video text spotting aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, tracking the zero-shot results of state-of-the-art image text spotters directly can achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajector…
▽ More
Video text spotting aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, tracking the zero-shot results of state-of-the-art image text spotters directly can achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajectories on extreme dataset. Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO to enhance the performance of conventional text spotters through the integration of a synergy module. To achieve this goal, a language synergy classifier (LSC) is designed to explicitly discern text instances from background noise in the recognition stage. Specially, the language synergy classifier can output text content or background code based on the legibility of text regions, thus computing language scores. Subsequently, fusion scores are computed by taking the average of detection scores and language scores, and are utilized to re-score the detection results before tracking. By the re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. Besides, the glyph supervision and visual position mixture module are proposed to enhance the recognition accuracy of noisy text regions, and acquire more discriminative tracking features, respectively. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning
Authors:
Dipam Goswami,
Albin Soutif--Cormerais,
Yuyang Liu,
Sandesh Kamath,
Bartłomiej Twardowski,
Joost van de Weijer
Abstract:
Continual learning methods are known to suffer from catastrophic forgetting, a phenomenon that is particularly hard to counter for methods that do not store exemplars of previous tasks. Therefore, to reduce potential drift in the feature extractor, existing exemplar-free methods are typically evaluated in settings where the first task is significantly larger than subsequent tasks. Their performanc…
▽ More
Continual learning methods are known to suffer from catastrophic forgetting, a phenomenon that is particularly hard to counter for methods that do not store exemplars of previous tasks. Therefore, to reduce potential drift in the feature extractor, existing exemplar-free methods are typically evaluated in settings where the first task is significantly larger than subsequent tasks. Their performance drops drastically in more challenging settings starting with a smaller first task. To address this problem of feature drift estimation for exemplar-free methods, we propose to adversarially perturb the current samples such that their embeddings are close to the old class prototypes in the old model embedding space. We then estimate the drift in the embedding space from the old to the new model using the perturbed images and compensate the prototypes accordingly. We exploit the fact that adversarial samples are transferable from the old to the new feature space in a continual learning setting. The generation of these images is simple and computationally cheap. We demonstrate in our experiments that the proposed approach better tracks the movement of prototypes in embedding space and outperforms existing methods on several standard continual learning benchmarks as well as on fine-grained datasets. Code is available at https://github.com/dipamgoswami/ADC.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Multiscale Spatio-Temporal Enhanced Short-term Load Forecasting of Electric Vehicle Charging Stations
Authors:
Zongbao Zhang,
Jiao Hao,
Wenmeng Zhao,
Yan Liu,
Yaohui Huang,
Xinhang Luo
Abstract:
The rapid expansion of electric vehicles (EVs) has rendered the load forecasting of electric vehicle charging stations (EVCS) increasingly critical. The primary challenge in achieving precise load forecasting for EVCS lies in accounting for the nonlinear of charging behaviors, the spatial interactions among different stations, and the intricate temporal variations in usage patterns. To address the…
▽ More
The rapid expansion of electric vehicles (EVs) has rendered the load forecasting of electric vehicle charging stations (EVCS) increasingly critical. The primary challenge in achieving precise load forecasting for EVCS lies in accounting for the nonlinear of charging behaviors, the spatial interactions among different stations, and the intricate temporal variations in usage patterns. To address these challenges, we propose a Multiscale Spatio-Temporal Enhanced Model (MSTEM) for effective load forecasting at EVCS. MSTEM incorporates a multiscale graph neural network to discern hierarchical nonlinear temporal dependencies across various time scales. Besides, it also integrates a recurrent learning component and a residual fusion mechanism, enhancing its capability to accurately capture spatial and temporal variations in charging patterns. The effectiveness of the proposed MSTEM has been validated through comparative analysis with six baseline models using three evaluation metrics. The case studies utilize real-world datasets for both fast and slow charging loads at EVCS in Perth, UK. The experimental results demonstrate the superiority of MSTEM in short-term continuous load forecasting for EVCS.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints
Authors:
Andrew Zhao,
Quentin Xu,
Matthieu Lin,
Shenzhi Wang,
Yong-jin Liu,
Zilong Zheng,
Gao Huang
Abstract:
Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximiz…
▽ More
Recent advances in large language models (LLMs) have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Project details and code can be found at https://andrewzh112.github.io/#diverct.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Evaluating the External and Parametric Knowledge Fusion of Large Language Models
Authors:
Hao Zhang,
Yuyang Zhang,
Xiaoguang Li,
Wenxuan Shi,
Haonan Xu,
Huanshuo Liu,
Yasheng Wang,
Lifeng Shang,
Qun Liu,
Yong Liu,
Ruiming Tang
Abstract:
Integrating external knowledge into large language models (LLMs) presents a promising solution to overcome the limitations imposed by their antiquated and static parametric memory. Prior studies, however, have tended to over-reliance on external knowledge, underestimating the valuable contributions of an LLMs' intrinsic parametric knowledge. The efficacy of LLMs in blending external and parametric…
▽ More
Integrating external knowledge into large language models (LLMs) presents a promising solution to overcome the limitations imposed by their antiquated and static parametric memory. Prior studies, however, have tended to over-reliance on external knowledge, underestimating the valuable contributions of an LLMs' intrinsic parametric knowledge. The efficacy of LLMs in blending external and parametric knowledge remains largely unexplored, especially in cases where external knowledge is incomplete and necessitates supplementation by their parametric knowledge. We propose to deconstruct knowledge fusion into four distinct scenarios, offering the first thorough investigation of LLM behavior across each. We develop a systematic pipeline for data construction and knowledge infusion to simulate these fusion scenarios, facilitating a series of controlled experiments. Our investigation reveals that enhancing parametric knowledge within LLMs can significantly bolster their capability for knowledge integration. Nonetheless, we identify persistent challenges in memorizing and eliciting parametric knowledge, and determining parametric knowledge boundaries. Our findings aim to steer future explorations on harmonizing external and parametric knowledge within LLMs.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Enhancing Vision-Language Model with Unmasked Token Alignment
Authors:
Jihao Liu,
Jinliang Zheng,
Boxiao Liu,
Yu Liu,
Hongsheng Li
Abstract:
Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer effici…
▽ More
Contrastive pre-training on image-text pairs, exemplified by CLIP, becomes a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance its vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Language Generation with Strictly Proper Scoring Rules
Authors:
Chenze Shao,
Fandong Meng,
Yijin Liu,
Jie Zhou
Abstract:
Language generation based on maximum likelihood estimation (MLE) has become the fundamental approach for text generation. Maximum likelihood estimation is typically performed by minimizing the log-likelihood loss, also known as the logarithmic score in statistical decision theory. The logarithmic score is strictly proper in the sense that it encourages honest forecasts, where the expected score is…
▽ More
Language generation based on maximum likelihood estimation (MLE) has become the fundamental approach for text generation. Maximum likelihood estimation is typically performed by minimizing the log-likelihood loss, also known as the logarithmic score in statistical decision theory. The logarithmic score is strictly proper in the sense that it encourages honest forecasts, where the expected score is maximized only when the model reports true probabilities. Although many strictly proper scoring rules exist, the logarithmic score is the only local scoring rule among them that depends exclusively on the probability of the observed sample, making it capable of handling the exponentially large sample space of natural text. In this work, we propose a straightforward strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules. Leveraging this strategy, we train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score. Experimental results indicate that simply substituting the loss function, without adjusting other hyperparameters, can yield substantial improvements in model's generation capabilities. Moreover, these improvements can scale up to large language models (LLMs) such as LLaMA-7B and LLaMA-13B. Source code: \url{https://github.com/shaochenze/ScoringRulesLM}.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics
Authors:
Zhangkai Ni,
Yue Liu,
Keyan Ding,
Wenhan Yang,
Hanli Wang,
Shiqi Wang
Abstract:
Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propos…
▽ More
Deep learning-based methods have significantly influenced the blind image quality assessment (BIQA) field, however, these methods often require training using large amounts of human rating data. In contrast, traditional knowledge-based methods are cost-effective for training but face challenges in effectively extracting features aligned with human visual perception. To bridge these gaps, we propose integrating deep features from pre-trained visual models with a statistical analysis model into a Multi-scale Deep Feature Statistics (MDFS) model for achieving opinion-unaware BIQA (OU-BIQA), thereby eliminating the reliance on human rating data and significantly improving training efficiency. Specifically, we extract patch-wise multi-scale features from pre-trained vision models, which are subsequently fitted into a multivariate Gaussian (MVG) model. The final quality score is determined by quantifying the distance between the MVG model derived from the test image and the benchmark MVG model derived from the high-quality image set. A comprehensive series of experiments conducted on various datasets show that our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models. Furthermore, it shows improved generalizability across diverse target-specific BIQA tasks. Our code is available at: https://github.com/eezkni/MDFS
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
Authors:
Huanshuo Liu,
Hao Zhang,
Zhijiang Guo,
Kuicai Dong,
Xiangyang Li,
Yi Quan Lee,
Cong Zhang,
Yong Liu
Abstract:
Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by dynamically assessing the retrieval necessity, aiming to balance external and internal knowledge usage. However, existing adaptive RAG methods primarily realize retrieval on demand by relying…
▽ More
Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by dynamically assessing the retrieval necessity, aiming to balance external and internal knowledge usage. However, existing adaptive RAG methods primarily realize retrieval on demand by relying on superficially verbalize-based or probability-based feedback of LLMs, or directly fine-tuning LLMs via carefully crafted datasets, resulting in unreliable retrieval necessity decisions, heavy extra costs, and sub-optimal response generation. We present the first attempts to delve into the internal states of LLMs to mitigate such issues by introducing an effective probe-guided adaptive RAG framework, termed CtrlA. Specifically, CtrlA employs an honesty probe to regulate the LLM's behavior by manipulating its representations for increased honesty, and a confidence probe to monitor the internal states of LLM and assess confidence levels, determining the retrieval necessity during generation. Experiments show that CtrlA is superior to existing adaptive RAG methods on a diverse set of tasks, the honesty control can effectively make LLMs more honest and confidence monitoring is proven to be a promising indicator of retrieval trigger. Our codes are available at https://github.com/HSLiu-Initial/CtrlA.git.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Phased Consistency Model
Authors:
Fu-Yun Wang,
Zhaoyang Huang,
Alexander William Bergman,
Dazhong Shen,
Peng Gao,
Michael Lingelbach,
Keqiang Sun,
Weikang Bian,
Guanglu Song,
Yu Liu,
Hongsheng Li,
Xiaogang Wang
Abstract:
The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phas…
▽ More
The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1--16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves even superior or comparable 1-step generation results to previously state-of-the-art specifically designed 1-step methods. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?
Authors:
Yifan Bai,
Dongming Wu,
Yingfei Liu,
Fan Jia,
Weixin Mao,
Ziheng Zhang,
Yucheng Zhao,
Jianbing Shen,
Xing Wei,
Tiancai Wang,
Xiangyu Zhang
Abstract:
Rapid advancements in Autonomous Driving (AD) tasks turned a significant shift toward end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car…
▽ More
Rapid advancements in Autonomous Driving (AD) tasks turned a significant shift toward end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, which lack 3D geometric priors as a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental caption suggests that the answer is, unfortunately, NO. In other words, 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect LLM with a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on nuScenes dataset, proving that 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Text Modality Oriented Image Feature Extraction for Detecting Diffusion-based DeepFake
Authors:
Di Yang,
Yihao Huang,
Qing Guo,
Felix Juefei-Xu,
Xiaojun Jia,
Run Wang,
Geguang Pu,
Yang Liu
Abstract:
The widespread use of diffusion methods enables the creation of highly realistic images on demand, thereby posing significant risks to the integrity and safety of online information and highlighting the necessity of DeepFake detection. Our analysis of features extracted by traditional image encoders reveals that both low-level and high-level features offer distinct advantages in identifying DeepFa…
▽ More
The widespread use of diffusion methods enables the creation of highly realistic images on demand, thereby posing significant risks to the integrity and safety of online information and highlighting the necessity of DeepFake detection. Our analysis of features extracted by traditional image encoders reveals that both low-level and high-level features offer distinct advantages in identifying DeepFake images produced by various diffusion methods. Inspired by this finding, we aim to develop an effective representation that captures both low-level and high-level features to detect diffusion-based DeepFakes. To address the problem, we propose a text modality-oriented feature extraction method, termed TOFE. Specifically, for a given target image, the representation we discovered is a corresponding text embedding that can guide the generation of the target image with a specific text-to-image model. Experiments conducted across ten diffusion types demonstrate the efficacy of our proposed method.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
An empirical study of bloated dependencies in CommonJS packages
Authors:
Yuxin Liu,
Deepika Tiwari,
Cristian Bogdan,
Benoit Baudry
Abstract:
JavaScript packages are notoriously prone to bloat, a factor that significantly impacts the performance and maintainability of web applications. While web bundlers and tree-shaking can mitigate this issue in client-side applications at the function level, they cannot effectively detect and remove bloat in server-side applications. In this paper, we conduct an empirical study to investigate the blo…
▽ More
JavaScript packages are notoriously prone to bloat, a factor that significantly impacts the performance and maintainability of web applications. While web bundlers and tree-shaking can mitigate this issue in client-side applications at the function level, they cannot effectively detect and remove bloat in server-side applications. In this paper, we conduct an empirical study to investigate the bloated dependencies that are entirely unused within server-side applications. Our study focuses on applications built with the widely used and highly dynamic CommonJS module system. We propose a trace-based dynamic analysis that monitors file access, to determine which dependencies are not accessed during runtime. To conduct our study, we curate an original dataset of 92 CommonJS packages with a median test coverage of 96.9% and a total of 50,661 dependencies. Our dynamic analysis identifies and successfully removes 50.7% of these dependencies while maintaining the correct build of all packages. Furthermore, we find that 14.9% of directly used dependencies and 51.3% of indirect dependencies are bloated. A key insight is that focusing on removing only the direct bloated dependencies by cleaning the package.json file, also removes a significant share of unnecessary bloated indirect dependencies. Compared to the state-of-the-art dynamic debloating technique, our analysis based on file accesses has fewer false positives, and demonstrates higher accuracy in detecting bloated dependencies. Our findings suggest that native support for dependency debloating in package managers could significantly alleviate the burden of maintaining dependencies.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Decentralized Directed Collaboration for Personalized Federated Learning
Authors:
Yingqi Liu,
Yifan Shi,
Qinglun Li,
Baoyuan Wu,
Xueqian Wang,
Li Shen
Abstract:
Personalized Federated Learning (PFL) is proposed to find the greatest personalized models for each client. To avoid the central failure and communication bottleneck in the server-based FL, we concentrate on the Decentralized Personalized Federated Learning (DPFL) that performs distributed model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL are based on undirected and sy…
▽ More
Personalized Federated Learning (PFL) is proposed to find the greatest personalized models for each client. To avoid the central failure and communication bottleneck in the server-based FL, we concentrate on the Decentralized Personalized Federated Learning (DPFL) that performs distributed model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL are based on undirected and symmetric topologies, however, the data, computation and communication resources heterogeneity result in large variances in the personalized models, which lead the undirected aggregation to suboptimal personalized performance and unguaranteed convergence. To address these issues, we propose a directed collaboration DPFL framework by incorporating stochastic gradient push and partial model personalized, called \textbf{D}ecentralized \textbf{Fed}erated \textbf{P}artial \textbf{G}radient \textbf{P}ush (\textbf{DFedPGP}). It personalizes the linear classifier in the modern deep model to customize the local solution and learns a consensus representation in a fully decentralized manner. Clients only share gradients with a subset of neighbors based on the directed and asymmetric topologies, which guarantees flexible choices for resource efficiency and better convergence. Theoretically, we show that the proposed DFedPGP achieves a superior convergence rate of $\mathcal{O}(\frac{1}{\sqrt{T}})$ in the general non-convex setting, and prove the tighter connectivity among clients will speed up the convergence. The proposed method achieves state-of-the-art (SOTA) accuracy in both data and computation heterogeneity scenarios, demonstrating the efficiency of the directed collaboration and partial gradient push.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
DanceGen: Supporting Choreography Ideation and Prototyping with Generative AI
Authors:
Yimeng Liu,
Misha Sra
Abstract:
Choreography creation requires high proficiency in artistic and technical skills. Choreographers typically go through four stages to create a dance piece: preparation, studio, performance, and reflection. This process is often individualized, complicated, and challenging due to multiple constraints at each stage. To assist choreographers, most prior work has focused on designing digital tools to s…
▽ More
Choreography creation requires high proficiency in artistic and technical skills. Choreographers typically go through four stages to create a dance piece: preparation, studio, performance, and reflection. This process is often individualized, complicated, and challenging due to multiple constraints at each stage. To assist choreographers, most prior work has focused on designing digital tools to support the last three stages of the choreography process, with the preparation stage being the least explored. To address this research gap, we introduce an AI-based approach to assist the preparation stage by supporting ideation, creating choreographic prototypes, and documenting creative attempts and outcomes. We address the limitations of existing AI-based motion generation methods for ideation by allowing generated sequences to be edited and modified in an interactive web interface. This capability is motivated by insights from a formative study we conducted with seven choreographers. We evaluated our system's functionality, benefits, and limitations with six expert choreographers. Results highlight the usability of our system, with users reporting increased efficiency, expanded creative possibilities, and an enhanced iterative process. We also identified areas for improvement, such as the relationship between user intent and AI outcome, intuitive and flexible user interaction design, and integration with existing physical choreography prototyping workflows. By reflecting on the evaluation results, we present three insights that aim to inform the development of future AI systems that can empower choreographers.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Authors:
Haogeng Liu,
Quanzeng You,
Xiaotian Han,
Yongfei Liu,
Huaibo Huang,
Ran He,
Hongxia Yang
Abstract:
In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low…
▽ More
In the realm of Multimodal Large Language Models (MLLMs), vision-language connector plays a crucial role to link the pre-trained vision encoders with Large Language Models (LLMs). Despite its importance, the vision-language connector has been relatively less explored. In this study, we aim to propose a strong vision-language connector that enables MLLMs to achieve high accuracy while maintain low computation cost. We first reveal the existence of the visual anchors in Vision Transformer and propose a cost-effective search algorithm to extract them. Building on these findings, we introduce the Anchor Former (AcFormer), a novel vision-language connector designed to leverage the rich prior knowledge obtained from these visual anchors during pretraining, guiding the aggregation of information. Through extensive experimentation, we demonstrate that the proposed method significantly reduces computational costs by nearly two-thirds compared with baseline, while simultaneously outperforming baseline methods. This highlights the effectiveness and efficiency of AcFormer.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
Dataset Regeneration for Sequential Recommendation
Authors:
Mingjia Yin,
Hao Wang,
Wei Guo,
Yong Liu,
Suojuan Zhang,
Sirui Zhao,
Defu Lian,
Enhong Chen
Abstract:
The sequential recommender (SR) system is a crucial component of modern recommender systems, as it aims to capture the evolving preferences of users. Significant efforts have been made to enhance the capabilities of SR systems. These methods typically follow the \textbf{model-centric} paradigm, which involves developing effective models based on fixed datasets. However, this approach often overloo…
▽ More
The sequential recommender (SR) system is a crucial component of modern recommender systems, as it aims to capture the evolving preferences of users. Significant efforts have been made to enhance the capabilities of SR systems. These methods typically follow the \textbf{model-centric} paradigm, which involves developing effective models based on fixed datasets. However, this approach often overlooks potential quality issues and flaws inherent in the data. Driven by the potential of \textbf{data-centric} AI, we propose a novel data-centric paradigm for developing an ideal training dataset using a model-agnostic dataset regeneration framework called DR4SR. This framework enables the regeneration of a dataset with exceptional cross-architecture generalizability. Additionally, we introduce the DR4SR+ framework, which incorporates a model-aware dataset personalizer to tailor the regenerated dataset specifically for a target model. To demonstrate the effectiveness of the data-centric paradigm, we integrate our framework with various model-centric methods and observe significant performance improvements across four widely adopted datasets. Furthermore, we conduct in-depth analyses to explore the potential of the data-centric paradigm and provide valuable insights. The code can be found at \textcolor{blue}{\url{https://anonymous.4open.science/r/KDD2024-86EA/}}
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Online Analytic Exemplar-Free Continual Learning with Large Models for Imbalanced Autonomous Driving Task
Authors:
Huiping Zhuang,
Di Fang,
Kai Tong,
Yuchen Liu,
Ziqian Zeng,
Xu Zhou,
Cen Chen
Abstract:
In the field of autonomous driving, even a meticulously trained model can encounter failures when faced with unfamiliar sceanrios. One of these scenarios can be formulated as an online continual learning (OCL) problem. That is, data come in an online fashion, and models are updated according to these streaming data. Two major OCL challenges are catastrophic forgetting and data imbalance. To addres…
▽ More
In the field of autonomous driving, even a meticulously trained model can encounter failures when faced with unfamiliar sceanrios. One of these scenarios can be formulated as an online continual learning (OCL) problem. That is, data come in an online fashion, and models are updated according to these streaming data. Two major OCL challenges are catastrophic forgetting and data imbalance. To address these challenges, in this paper, we propose an Analytic Exemplar-Free Online Continual Learning (AEF-OCL). The AEF-OCL leverages analytic continual learning principles and employs ridge regression as a classifier for features extracted by a large backbone network. It solves the OCL problem by recursively calculating the analytical solution, ensuring an equalization between the continual learning and its joint-learning counterpart, and works without the need to save any used samples (i.e., exemplar-free). Additionally, we introduce a Pseudo-Features Generator (PFG) module that recursively estimates the deviation of real features. The PFG generates offset pseudo-features following a normal distribution, thereby addressing the data imbalance issue. Experimental results demonstrate that despite being an exemplar-free strategy, our method outperforms various methods on the autonomous driving SODA10M dataset. Source code is available at https://github.com/ZHUANGHP/Analytic-continual-learning.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design
Authors:
Rui Kong,
Qiyang Li,
Xinyu Fang,
Qingtian Feng,
Qingfeng He,
Yazhu Dong,
Weijun Wang,
Yuanchun Li,
Linghe Kong,
Yunxin Liu
Abstract:
Recent literature has found that an effective method to customize or further improve large language models (LLMs) is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this p…
▽ More
Recent literature has found that an effective method to customize or further improve large language models (LLMs) is to add dynamic adapters, such as low-rank adapters (LoRA) with Mixture-of-Experts (MoE) structures. Though such dynamic adapters incur modest computational complexity, they surprisingly lead to huge inference latency overhead, slowing down the decoding speed by 2.5+ times. In this paper, we analyze the fine-grained costs of the dynamic adapters and find that the fragmented CUDA kernel calls are the root cause. Therefore, we propose LoRA-Switch, a system-algorithm co-designed architecture for efficient dynamic adapters. Unlike most existing dynamic structures that adopt layer-wise or block-wise dynamic routing, LoRA-Switch introduces a token-wise routing mechanism. It switches the LoRA adapters and weights for each token and merges them into the backbone for inference. For efficiency, this switching is implemented with an optimized CUDA kernel, which fuses the merging operations for all LoRA adapters at once. Based on experiments with popular open-source LLMs on common benchmarks, our approach has demonstrated similar accuracy improvement as existing dynamic adapters, while reducing the decoding latency by more than 2.4 times.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
C$^{3}$Bench: A Comprehensive Classical Chinese Understanding Benchmark for Large Language Models
Authors:
Jiahuan Cao,
Yongxin Shi,
Dezhi Peng,
Yang Liu,
Lianwen Jin
Abstract:
Classical Chinese Understanding (CCU) holds significant value in preserving and exploration of the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities…
▽ More
Classical Chinese Understanding (CCU) holds significant value in preserving and exploration of the outstanding traditional Chinese culture. Recently, researchers have attempted to leverage the potential of Large Language Models (LLMs) for CCU by capitalizing on their remarkable comprehension and semantic capabilities. However, no comprehensive benchmark is available to assess the CCU capabilities of LLMs. To fill this gap, this paper introduces C$^{3}$bench, a Comprehensive Classical Chinese understanding benchmark, which comprises 50,000 text pairs for five primary CCU tasks, including classification, retrieval, named entity recognition, punctuation, and translation. Furthermore, the data in C$^{3}$bench originates from ten different domains, covering most of the categories in classical Chinese. Leveraging the proposed C$^{3}$bench, we extensively evaluate the quantitative performance of 15 representative LLMs on all five CCU tasks. Our results not only establish a public leaderboard of LLMs' CCU capabilities but also gain some findings. Specifically, existing LLMs are struggle with CCU tasks and still inferior to supervised models. Additionally, the results indicate that CCU is a task that requires special attention. We believe this study could provide a standard benchmark, comprehensive baselines, and valuable insights for the future advancement of LLM-based CCU research. The evaluation pipeline and dataset are available at \url{https://github.com/SCUT-DLVCLab/C3bench}.
△ Less
Submitted 30 May, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
Clip Body and Tail Separately: High Probability Guarantees for DPSGD with Heavy Tails
Authors:
Haichao Sha,
Yang Cao,
Yong Liu,
Yuncheng Wu,
Ruixuan Liu,
Hong Chen
Abstract:
Differentially Private Stochastic Gradient Descent (DPSGD) is widely utilized to preserve training data privacy in deep learning, which first clips the gradients to a predefined norm and then injects calibrated noise into the training procedure. Existing DPSGD works typically assume the gradients follow sub-Gaussian distributions and design various clipping mechanisms to optimize training performa…
▽ More
Differentially Private Stochastic Gradient Descent (DPSGD) is widely utilized to preserve training data privacy in deep learning, which first clips the gradients to a predefined norm and then injects calibrated noise into the training procedure. Existing DPSGD works typically assume the gradients follow sub-Gaussian distributions and design various clipping mechanisms to optimize training performance. However, recent studies have shown that the gradients in deep learning exhibit a heavy-tail phenomenon, that is, the tails of the gradient have infinite variance, which may lead to excessive clipping loss to the gradients with existing DPSGD mechanisms. To address this problem, we propose a novel approach, Discriminative Clipping~(DC)-DPSGD, with two key designs. First, we introduce a subspace identification technique to distinguish between body and tail gradients. Second, we present a discriminative clipping mechanism that applies different clipping thresholds for body and tail gradients to reduce the clipping loss. Under the non-convex condition, \ourtech{} reduces the empirical gradient norm from {${\mathbb{O}\left(\log^{\max(0,θ-1)}(T/δ)\log^{2θ}(\sqrt{T})\right)}$} to {${\mathbb{O}\left(\log(\sqrt{T})\right)}$} with heavy-tailed index $θ\geq 1/2$, iterations $T$, and arbitrary probability $δ$. Extensive experiments on four real-world datasets demonstrate that our approach outperforms three baselines by up to 9.72\% in terms of accuracy.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Leveraging Machine Learning for Accurate IoT Device Identification in Dynamic Wireless Contexts
Authors:
Bhagyashri Tushir,
Vikram K Ramanna,
Yuhong Liu,
Behnam Dezfouli
Abstract:
Identifying IoT devices is crucial for network monitoring, security enforcement, and inventory tracking. However, most existing identification methods rely on deep packet inspection, which raises privacy concerns and adds computational complexity. More importantly, existing works overlook the impact of wireless channel dynamics on the accuracy of layer-2 features, thereby limiting their effectiven…
▽ More
Identifying IoT devices is crucial for network monitoring, security enforcement, and inventory tracking. However, most existing identification methods rely on deep packet inspection, which raises privacy concerns and adds computational complexity. More importantly, existing works overlook the impact of wireless channel dynamics on the accuracy of layer-2 features, thereby limiting their effectiveness in real-world scenarios. In this work, we define and use the latency of specific probe-response packet exchanges, referred to as "device latency," as the main feature for device identification. Additionally, we reveal the critical impact of wireless channel dynamics on the accuracy of device identification based on device latency. Specifically, this work introduces "accumulation score" as a novel approach to capturing fine-grained channel dynamics and their impact on device latency when training machine learning models. We implement the proposed methods and measure the accuracy and overhead of device identification in real-world scenarios. The results confirm that by incorporating the accumulation score for balanced data collection and training machine learning algorithms, we achieve an F1 score of over 97% for device identification, even amidst wireless channel dynamics, a significant improvement over the 75% F1 score achieved by disregarding the impact of channel dynamics on data collection and device latency.
△ Less
Submitted 15 May, 2024;
originally announced May 2024.
-
Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer
Authors:
Ruizhi Shao,
Youxin Pang,
Zerong Zheng,
Jingxiang Sun,
Yebin Liu
Abstract:
We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across vi…
▽ More
We present a novel approach for generating high-quality, spatio-temporally coherent human videos from a single image under arbitrary viewpoints. Our framework combines the strengths of U-Nets for accurate condition injection and diffusion transformers for capturing global correlations across viewpoints and time. The core is a cascaded 4D transformer architecture that factorizes attention across views, time, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we curate a multi-dimensional dataset spanning images, videos, multi-view data and 3D/4D scans, along with a multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on GAN or UNet-based diffusion models, which struggle with complex motions and viewpoint changes. Through extensive experiments, we demonstrate our method's ability to synthesize realistic, coherent and free-view human videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation. Our project website is https://human4dit.github.io.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Rethinking Transformers in Solving POMDPs
Authors:
Chenhao Lu,
Ruizhe Shi,
Yuyao Liu,
Kaizhe Hu,
Simon S. Du,
Huazhe Xu
Abstract:
Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical limitations. We establish that regular languages, which Transformers…
▽ More
Sequential decision-making algorithms such as reinforcement learning (RL) in real-world scenarios inevitably face environments with partial observability. This paper scrutinizes the effectiveness of a popular architecture, namely Transformers, in Partially Observable Markov Decision Processes (POMDPs) and reveals its theoretical limitations. We establish that regular languages, which Transformers struggle to model, are reducible to POMDPs. This poses a significant challenge for Transformers in learning POMDP-specific inductive biases, due to their lack of inherent recurrence found in other models like RNNs. This paper casts doubt on the prevalent belief in Transformers as sequence models for RL and proposes to introduce a point-wise recurrent structure. The Deep Linear Recurrent Unit (LRU) emerges as a well-suited alternative for Partially Observable RL, with empirical results highlighting the sub-optimal performance of the Transformer and considerable strength of LRU.
△ Less
Submitted 30 May, 2024; v1 submitted 27 May, 2024;
originally announced May 2024.
-
DreamMat: High-quality PBR Material Generation with Geometry- and Light-aware Diffusion Models
Authors:
Yuqing Zhang,
Yuan Liu,
Zhiyu Xie,
Lei Yang,
Zhongyuan Liu,
Mengzhou Yang,
Runze Zhang,
Qilong Kou,
Cheng Lin,
Wenping Wang,
Xiaogang Jin
Abstract:
2D diffusion model, which often contains unwanted baked-in shading effects and results in unrealistic rendering effects in the downstream applications. Generating Physically Based Rendering (PBR) materials instead of just RGB textures would be a promising solution. However, directly distilling the PBR material parameters from 2D diffusion models still suffers from incorrect material decomposition,…
▽ More
2D diffusion model, which often contains unwanted baked-in shading effects and results in unrealistic rendering effects in the downstream applications. Generating Physically Based Rendering (PBR) materials instead of just RGB textures would be a promising solution. However, directly distilling the PBR material parameters from 2D diffusion models still suffers from incorrect material decomposition, such as baked-in shading effects in albedo. We introduce DreamMat, an innovative approach to resolve the aforementioned problem, to generate high-quality PBR materials from text descriptions. We find out that the main reason for the incorrect material distillation is that large-scale 2D diffusion models are only trained to generate final shading colors, resulting in insufficient constraints on material decomposition during distillation. To tackle this problem, we first finetune a new light-aware 2D diffusion model to condition on a given lighting environment and generate the shading results on this specific lighting condition. Then, by applying the same environment lights in the material distillation, DreamMat can generate high-quality PBR materials that are not only consistent with the given geometry but also free from any baked-in shading effects in albedo. Extensive experiments demonstrate that the materials produced through our methods exhibit greater visual appeal to users and achieve significantly superior rendering quality compared to baseline methods, which are preferable for downstream tasks such as game and film production.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution
Authors:
Yong Liu,
Hang Dong,
Jinshan Pan,
Qingji Dong,
Kai Chen,
Rongxiang Zhang,
Xing Mei,
Lean Fu,
Fei Wang
Abstract:
Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods.Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image.Th…
▽ More
Diffusion models significantly improve the quality of super-resolved images with their impressive content generation capabilities. However, the huge computational costs limit the applications of these methods.Recent efforts have explored reasonable inference acceleration to reduce the number of sampling steps, but the computational cost remains high as each step is performed on the entire image.This paper introduces PatchScaler, a patch-independent diffusion-based single image super-resolution (SR) method, designed to enhance the efficiency of the inference process.The proposed method is motivated by the observation that not all the image patches within an image need the same sampling steps for reconstructing high-resolution images.Based on this observation, we thus develop a Patch-adaptive Group Sampling (PGS) to divide feature patches into different groups according to the patch-level reconstruction difficulty and dynamically assign an appropriate sampling configuration for each group so that the inference speed can be better accelerated.In addition, to improve the denoising ability at each step of the sampling, we develop a texture prompt to guide the estimations of the diffusion model by retrieving high-quality texture priors from a patch-independent reference texture memory.Experiments show that our PatchScaler achieves favorable performance in both quantitative and qualitative evaluations with fast inference speed.Our code and model are available at \url{https://github.com/yongliuy/PatchScaler}.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
Authors:
Guan Wang,
Zhimin Li,
Qingchao Chen,
Yang Liu
Abstract:
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent…
▽ More
Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines, which typically consist of object detection, temporal association, and multi-relation classification. However, these methods exhibit inherent limitations due to the separation of multiple stages, and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations, we propose a one-stage end-to-end framework, termed OED, which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover, another challenge of DSGG is capturing temporal dependencies, we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories, enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. The code and models are available at \url{https://github.com/guanw-pku/OED}.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Part123: Part-aware 3D Reconstruction from a Single-view Image
Authors:
Anran Liu,
Cheng Lin,
Yuan Liu,
Xiaoxiao Long,
Zhiyang Dou,
Hao-Xiang Guo,
Ping Luo,
Wenping Wang
Abstract:
Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure, which is crucial for many downstream applications, of the reconstructed shape. Moreover, the generated meshes usually suffer from lar…
▽ More
Recently, the emergence of diffusion models has opened up new opportunities for single-view reconstruction. However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure, which is crucial for many downstream applications, of the reconstructed shape. Moreover, the generated meshes usually suffer from large noises, unsmooth surfaces, and blurry textures, making it challenging to obtain satisfactory part segments using 3D segmentation techniques. In this paper, we present Part123, a novel framework for part-aware 3D reconstruction from a single-view image. We first use diffusion models to generate multiview-consistent images from a given image, and then leverage Segment Anything Model (SAM), which demonstrates powerful generalization ability on arbitrary objects, to generate multiview segmentation masks. To effectively incorporate 2D part-based information into 3D reconstruction and handle inconsistency, we introduce contrastive learning into a neural rendering framework to learn a part-aware feature space based on the multiview segmentation masks. A clustering-based algorithm is also developed to automatically derive 3D part segmentation results from the reconstructed models. Experiments show that our method can generate 3D models with high-quality segmented parts on various objects. Compared to existing unstructured reconstruction methods, the part-aware 3D models from our method benefit some important applications, including feature-preserving reconstruction, primitive fitting, and 3D shape editing.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing
Authors:
Xinyu Zhang,
Mengxue Kang,
Fei Wei,
Shuang Xu,
Yuhe Liu,
Lin Ma
Abstract:
As the field of image generation rapidly advances, traditional diffusion models and those integrated with multimodal large language models (LLMs) still encounter limitations in interpreting complex prompts and preserving image consistency pre and post-editing. To tackle these challenges, we present an innovative image editing framework that employs the robust Chain-of-Thought (CoT) reasoning and l…
▽ More
As the field of image generation rapidly advances, traditional diffusion models and those integrated with multimodal large language models (LLMs) still encounter limitations in interpreting complex prompts and preserving image consistency pre and post-editing. To tackle these challenges, we present an innovative image editing framework that employs the robust Chain-of-Thought (CoT) reasoning and localizing capabilities of multimodal LLMs to aid diffusion models in generating more refined images. We first meticulously design a CoT process comprising instruction decomposition, region localization, and detailed description. Subsequently, we fine-tune the LISA model, a lightweight multimodal LLM, using the CoT process of Multimodal LLMs and the mask of the edited image. By providing the diffusion models with knowledge of the generated prompt and image mask, our models generate images with a superior understanding of instructions. Through extensive experiments, our model has demonstrated superior performance in image generation, surpassing existing state-of-the-art models. Notably, our model exhibits an enhanced ability to understand complex prompts and generate corresponding images, while maintaining high fidelity and consistency in images before and after generation.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
ARC: A Generalist Graph Anomaly Detector with In-Context Learning
Authors:
Yixin Liu,
Shiyuan Li,
Yu Zheng,
Qingfeng Chen,
Chengqi Zhang,
Shirui Pan
Abstract:
Graph anomaly detection (GAD), which aims to identify abnormal nodes that differ from the majority within a graph, has garnered significant attention. However, current GAD methods necessitate training specific to each dataset, resulting in high training costs, substantial data requirements, and limited generalizability when being applied to new datasets and domains. To address these limitations, t…
▽ More
Graph anomaly detection (GAD), which aims to identify abnormal nodes that differ from the majority within a graph, has garnered significant attention. However, current GAD methods necessitate training specific to each dataset, resulting in high training costs, substantial data requirements, and limited generalizability when being applied to new datasets and domains. To address these limitations, this paper proposes ARC, a generalist GAD approach that enables a ``one-for-all'' GAD model to detect anomalies across various graph datasets on-the-fly. Equipped with in-context learning, ARC can directly extract dataset-specific patterns from the target dataset using few-shot normal samples at the inference stage, without the need for retraining or fine-tuning on the target dataset. ARC comprises three components that are well-crafted for capturing universal graph anomaly patterns: 1) smoothness-based feature Alignment module that unifies the features of different datasets into a common and anomaly-sensitive space; 2) ego-neighbor Residual graph encoder that learns abnormality-related node embeddings; and 3) cross-attentive in-Context anomaly scoring module that predicts node abnormality by leveraging few-shot normal samples. Extensive experiments on multiple benchmark datasets from various domains demonstrate the superior anomaly detection performance, efficiency, and generalizability of ARC.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Visualizing the Shadows: Unveiling Data Poisoning Behaviors in Federated Learning
Authors:
Xueqing Zhang,
Junkai Zhang,
Ka-Ho Chow,
Juntao Chen,
Ying Mao,
Mohamed Rahouti,
Xiang Li,
Yuchen Liu,
Wenqi Wei
Abstract:
This demo paper examines the susceptibility of Federated Learning (FL) systems to targeted data poisoning attacks, presenting a novel system for visualizing and mitigating such threats. We simulate targeted data poisoning attacks via label flipping and analyze the impact on model performance, employing a five-component system that includes Simulation and Data Generation, Data Collection and Upload…
▽ More
This demo paper examines the susceptibility of Federated Learning (FL) systems to targeted data poisoning attacks, presenting a novel system for visualizing and mitigating such threats. We simulate targeted data poisoning attacks via label flipping and analyze the impact on model performance, employing a five-component system that includes Simulation and Data Generation, Data Collection and Upload, User-friendly Interface, Analysis and Insight, and Advisory System. Observations from three demo modules: label manipulation, attack timing, and malicious attack availability, and two analysis components: utility and analytical behavior of local model updates highlight the risks to system integrity and offer insight into the resilience of FL systems. The demo is available at https://github.com/CathyXueqingZhang/DataPoisoningVis.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Fast and Globally Consistent Normal Orientation based on the Winding Number Normal Consistency
Authors:
Siyou Lin,
Zuoqiang Shi,
Yebin Liu
Abstract:
Estimating a consistently oriented normal vector field for an unoriented point cloud enables a number of important downstream applications in computer graphics. While normal estimation for a small patch of points can be done with simple techniques like principal component analysis (PCA), orienting these normals to be globally consistent has been a notoriously difficult problem. Some recent methods…
▽ More
Estimating a consistently oriented normal vector field for an unoriented point cloud enables a number of important downstream applications in computer graphics. While normal estimation for a small patch of points can be done with simple techniques like principal component analysis (PCA), orienting these normals to be globally consistent has been a notoriously difficult problem. Some recent methods exploit various properties of the winding number formula to achieve global consistency with state-of-the-art performance. Despite their exciting progress, these algorithms either have high space/time complexity, or do not produce accurate and consistently oriented normals for imperfect data. In this paper, we derive a novel property from the winding number formula to tackle this problem: the normal consistency property of the winding number formula. We refer to this property as the winding number normal consistency (WNNC). The derived property is based on the simple observation that the normals (negative gradients) sampled from the winding number field should be codirectional to the normals used to compute the winding number field. We further propose to turn the WNNC property into a normal update formula, which leads to an embarrassingly simple yet effective iterative algorithm that allows fast and high-quality convergence to a globally consistent normal vector field. Furthermore, our proposed algorithm only involves repeatedly evaluating the winding number formula and its derivatives, which can be accelerated and parallelized using treecode-based approximation algorithms due to their special structures. Exploiting this fact, we implement a GPU-accelerated treecode-based solver. Our GPU (and even CPU) implementation can be significantly faster than the recent state-of-the-art methods for normal orientation from raw points.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
vHeat: Building Vision Models upon Heat Conduction
Authors:
Zhaozhi Wang,
Yue Liu,
Yunfan Liu,
Hongtian Yu,
Yaowei Wang,
Qixiang Ye,
Yunjie Tian
Abstract:
A fundamental problem in learning robust and expressive visual representations lies in efficiently estimating the spatial relationships of visual semantics throughout the entire image. In this study, we propose vHeat, a novel vision backbone model that simultaneously achieves both high computational efficiency and global receptive field. The essential idea, inspired by the physical principle of he…
▽ More
A fundamental problem in learning robust and expressive visual representations lies in efficiently estimating the spatial relationships of visual semantics throughout the entire image. In this study, we propose vHeat, a novel vision backbone model that simultaneously achieves both high computational efficiency and global receptive field. The essential idea, inspired by the physical principle of heat conduction, is to conceptualize image patches as heat sources and model the calculation of their correlations as the diffusion of thermal energy. This mechanism is incorporated into deep models through the newly proposed module, the Heat Conduction Operator (HCO), which is physically plausible and can be efficiently implemented using DCT and IDCT operations with a complexity of $\mathcal{O}(N^{1.5})$. Extensive experiments demonstrate that vHeat surpasses Vision Transformers (ViTs) across various vision tasks, while also providing higher inference speeds, reduced FLOPs, and lower GPU memory usage for high-resolution images. The code will be released at https://github.com/MzeroMiko/vHeat.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Memory-efficient High-resolution OCT Volume Synthesis with Cascaded Amortized Latent Diffusion Models
Authors:
Kun Huang,
Xiao Ma,
Yuhan Zhang,
Na Su,
Songtao Yuan,
Yong Liu,
Qiang Chen,
Huazhu Fu
Abstract:
Optical coherence tomography (OCT) image analysis plays an important role in the field of ophthalmology. Current successful analysis models rely on available large datasets, which can be challenging to be obtained for certain tasks. The use of deep generative models to create realistic data emerges as a promising approach. However, due to limitations in hardware resources, it is still difficulty t…
▽ More
Optical coherence tomography (OCT) image analysis plays an important role in the field of ophthalmology. Current successful analysis models rely on available large datasets, which can be challenging to be obtained for certain tasks. The use of deep generative models to create realistic data emerges as a promising approach. However, due to limitations in hardware resources, it is still difficulty to synthesize high-resolution OCT volumes. In this paper, we introduce a cascaded amortized latent diffusion model (CA-LDM) that can synthesis high-resolution OCT volumes in a memory-efficient way. First, we propose non-holistic autoencoders to efficiently build a bidirectional mapping between high-resolution volume space and low-resolution latent space. In tandem with autoencoders, we propose cascaded diffusion processes to synthesize high-resolution OCT volumes with a global-to-local refinement process, amortizing the memory and computational demands. Experiments on a public high-resolution OCT dataset show that our synthetic data have realistic high-resolution and global features, surpassing the capabilities of existing methods. Moreover, performance gains on two down-stream fine-grained segmentation tasks demonstrate the benefit of the proposed method in training deep learning models for medical imaging tasks. The code is public available at: https://github.com/nicetomeetu21/CA-LDM.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
Meta-Task Planning for Language Agents
Authors:
Cong Zhang,
Derrick Goh Xin Deik,
Dexun Li,
Hao Zhang,
Yong Liu
Abstract:
The rapid advancement of neural language models has sparked a new surge of intelligent agent research. Unlike traditional agents, large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI) due to their superior reasoning and generalization capabilities. Effective planning is crucial for the success of LLM agents in real-w…
▽ More
The rapid advancement of neural language models has sparked a new surge of intelligent agent research. Unlike traditional agents, large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI) due to their superior reasoning and generalization capabilities. Effective planning is crucial for the success of LLM agents in real-world tasks, making it a highly pursued topic in the community. Current planning methods typically translate tasks into executable action sequences. However, determining a feasible or optimal sequence for complex tasks at fine granularity, which often requires compositing long chains of heterogeneous actions, remains challenging. This paper introduces Meta-Task Planning (MTP), a zero-shot methodology for collaborative LLM-based multi-agent systems that simplifies complex task planning by decomposing it into a hierarchy of subordinate tasks, or meta-tasks. Each meta-task is then mapped into executable actions. MTP was assessed on two rigorous benchmarks, TravelPlanner and API-Bank. Notably, MTP achieved an average $\sim40\%$ success rate on TravelPlanner, significantly higher than the state-of-the-art (SOTA) baseline ($2.92\%$), and outperforming $LLM_{api}$-4 with ReAct on API-Bank by $\sim14\%$, showing the immense potential of integrating LLM with multi-agent systems.
△ Less
Submitted 30 May, 2024; v1 submitted 26 May, 2024;
originally announced May 2024.
-
CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion
Authors:
Jiayi Yao,
Hanchen Li,
Yuhan Liu,
Siddhant Ray,
Yihua Cheng,
Qizheng Zhang,
Kuntai Du,
Shan Lu,
Junchen Jiang
Abstract:
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, and when they are not, their precomput…
▽ More
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, and when they are not, their precomputed KV caches cannot be directly used since they ignore the text's cross-attention with the preceding text in the LLM input. Thus, the benefits of reusing KV caches remain largely unrealized.
This paper tackles just one question: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? We present CacheBlend, a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. In the meantime,the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job,allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on three open-source LLMs of various sizes and four popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2-3.3X and increases the inference throughput by 2.8-5X, compared with full KV recompute, without compromising generation quality or incurring more storage cost.
△ Less
Submitted 26 May, 2024;
originally announced May 2024.
-
LLMs for User Interest Exploration: A Hybrid Approach
Authors:
Jianling Wang,
Haokai Lu,
Yifan Liu,
He Ma,
Yueqi Wang,
Yang Gu,
Shuzhou Zhang,
Ningren,
Han,
Shuchao Bi,
Lexi Baugher,
Ed Chi,
Minmin Chen
Abstract:
Traditional recommendation systems are subject to a strong feedback loop by learning from and reinforcing past user-item interactions, which in turn limits the discovery of novel user interests. To address this, we introduce a hybrid hierarchical framework combining Large Language Models (LLMs) and classic recommendation models for user interest exploration. The framework controls the interfacing…
▽ More
Traditional recommendation systems are subject to a strong feedback loop by learning from and reinforcing past user-item interactions, which in turn limits the discovery of novel user interests. To address this, we introduce a hybrid hierarchical framework combining Large Language Models (LLMs) and classic recommendation models for user interest exploration. The framework controls the interfacing between the LLMs and the classic recommendation models through "interest clusters", the granularity of which can be explicitly determined by algorithm designers. It recommends the next novel interests by first representing "interest clusters" using language, and employs a fine-tuned LLM to generate novel interest descriptions that are strictly within these predefined clusters. At the low level, it grounds these generated interests to an item-level policy by restricting classic recommendation models, in this case a transformer-based sequence recommender to return items that fall within the novel clusters generated at the high level. We showcase the efficacy of this approach on an industrial-scale commercial platform serving billions of users. Live experiments show a significant increase in both exploration of novel interests and overall user enjoyment of the platform.
△ Less
Submitted 25 May, 2024;
originally announced May 2024.