3 Hits in 0.86 sec

CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion [article]

Shoubin Yu, Jaehong Yoon, Mohit Bansal
2024 arXiv   pre-print
Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality.  ...  This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning.  ...  We employ plug-and-play frozen experts to extract diverse modality features, including depth map, optical flow, and surface normals, from raw videos.  ...
arXiv:2402.05889v1 fatcat:usau24isbvc27doelcw2sxnioe
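To make the design in the snippet concrete, here is a minimal sketch, assuming a PyTorch-style setup: one shared query transformer whose per-modality capacity lives in small parameter-efficient modules (LoRA-style adapters here), consuming features that frozen per-modality experts would produce. The module names, dimensions, fusion-by-concatenation, and the choice of LoRA adapters are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a modular query transformer: shared cross-attention,
# plus one small trainable adapter per modality. All sizes are toy values.
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank residual adapter: a common parameter-efficient module."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as the identity mapping

    def forward(self, x):
        return x + self.up(self.down(x))

class ModularQueryTransformer(nn.Module):
    """Shared learnable queries + shared cross-attention; per-modality adapters."""
    def __init__(self, dim=256, n_queries=32,
                 modalities=("rgb", "depth", "flow", "normals")):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.adapters = nn.ModuleDict({m: LoRAAdapter(dim) for m in modalities})

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, tokens, dim) features
        fused = []
        for name, f in feats.items():
            q = self.queries.unsqueeze(0).expand(f.size(0), -1, -1)
            out, _ = self.cross_attn(q, f, f)       # shared backbone
            fused.append(self.adapters[name](out))  # modality-specific module
        return torch.cat(fused, dim=1)              # naive token-level fusion

# Toy usage: stand-ins for features from frozen per-modality experts.
model = ModularQueryTransformer()
feats = {m: torch.randn(2, 64, 256) for m in ("rgb", "depth", "flow", "normals")}
print(model(feats).shape)  # torch.Size([2, 128, 256]): 32 queries x 4 modalities
```

Under this reading, injecting a new modality only requires registering one more small adapter while the shared backbone stays fixed, which matches the snippet's framing of efficient modular adaptation.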

Model Composition for Multimodal Large Language Models [article]

Chi Chen, Yiyang Du, Zheng Fang, Ziyue Wang, Fuwen Luo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Maosong Sun, Yang Liu
2024 arXiv   pre-print
However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities.  ...  Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities.  ...  LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852.  ...
arXiv:2402.12750v1 fatcat:fjmz3ddbjzfydmfw2iwszxkd2e
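The snippet names the goal, composing separately trained MLLMs without joint training, but not the mechanism. As an illustrative stand-in only (plain parameter averaging of weights shared by two checkpoints, which is not necessarily the paper's composition procedure), a sketch might look like this:

```python
# Hedged sketch: merge two models by averaging parameters that match in name
# and shape; parameters unique to one model (e.g. a modality encoder) are
# carried over unchanged. This is generic weight merging, labeled as such.
import torch
import torch.nn as nn

def average_shared_parameters(model_a, model_b):
    sa, sb = model_a.state_dict(), model_b.state_dict()
    merged = dict(sa)
    for name, wb in sb.items():
        if name in merged and merged[name].shape == wb.shape:
            merged[name] = (merged[name] + wb) / 2  # shared weights: average
        else:
            merged[name] = wb                       # model-specific: keep as-is
    return merged

# Toy usage with two identically shaped "backbones".
a, b = nn.Linear(8, 8), nn.Linear(8, 8)
a.load_state_dict(average_shared_parameters(a, b))
```

The appeal the snippet points at is that no paired multimodal instruction data is needed at composition time; only existing checkpoints are combined.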

User-LLM: Efficient LLM Contextualization with User Embeddings [article]

Lin Ning, Luyang Liu, Jiaxing Wu, Neo Wu, Devora Berlowitz, Sushant Prakash, Bradley Green, Shawn O'Banion, Jun Xie
2024 arXiv   pre-print
To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs.  ...  We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context.  ...  Recent works, such as NextGPT (Wu et al., 2023b), OneLLM (Han et al., 2023), and Anymal (Moon et al., 2023), explored unified frameworks for diverse input modalities.  ...
arXiv:2402.13598v1 fatcat:6lohpdh2tbb4rgckwkegtdqndi
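The snippet names two integration paths for user embeddings: soft-prompting and cross-attention. Here is a minimal sketch of both, assuming a PyTorch-style setup; the dimensions, the number of prefix tokens, and the residual-fusion placement are assumptions, not the paper's actual configuration.

```python
# Hedged sketch of injecting user embeddings into an LLM two ways:
# (1) soft-prompting: project one user embedding into prefix pseudo-tokens;
# (2) cross-attention: let text hidden states attend to user-event embeddings.
import torch
import torch.nn as nn

class UserContextualizer(nn.Module):
    def __init__(self, user_dim=64, llm_dim=512, n_prefix=4):
        super().__init__()
        self.prefix_proj = nn.Linear(user_dim, n_prefix * llm_dim)
        self.user_proj = nn.Linear(user_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8,
                                                batch_first=True)
        self.n_prefix, self.llm_dim = n_prefix, llm_dim

    def soft_prompt(self, user_emb, token_embs):
        # user_emb: (B, user_dim); token_embs: (B, T, llm_dim)
        prefix = self.prefix_proj(user_emb).view(-1, self.n_prefix, self.llm_dim)
        return torch.cat([prefix, token_embs], dim=1)  # prepend, feed to LLM

    def cross_attend(self, hidden, user_seq):
        # hidden: (B, T, llm_dim); user_seq: (B, U, user_dim), e.g. event history
        u = self.user_proj(user_seq)
        out, _ = self.cross_attn(hidden, u, u)
        return hidden + out  # residual fusion at an (assumed) LLM layer

# Toy shapes only; real embedding sizes are not given in the snippet.
m = UserContextualizer()
toks = torch.randn(2, 10, 512)
print(m.soft_prompt(torch.randn(2, 64), toks).shape)      # (2, 14, 512)
print(m.cross_attend(toks, torch.randn(2, 6, 64)).shape)  # (2, 10, 512)
```

Either path keeps the LLM itself unchanged and conditions it on a compact user representation, which is how the snippet frames efficient contextualization.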