CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
[article]
2024
arXiv
pre-print
Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. ...
This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. ...
We employ plug-and-play frozen experts to extract diverse modality features, including depth maps, optical flow, and surface normals, from raw videos. ...
arXiv:2402.05889v1
fatcat:usau24isbvc27doelcw2sxnioe
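The CREMA snippet describes modality-specific parameter-efficient modules attached to a shared query transformer, operating on features produced by frozen per-modality experts. Below is a minimal PyTorch sketch of that general design; the class and parameter names (ModalityAdapter, MultimodalQueryFusion, rank, num_queries) are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Low-rank adapter (hypothetical) trained per modality on top of a frozen backbone."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as identity, so the frozen path is unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

class MultimodalQueryFusion(nn.Module):
    """Shared query tokens attend to each modality's features through a frozen
    cross-attention block, then pass through that modality's own adapter."""
    def __init__(self, dim: int, num_queries: int, modalities: list[str]):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        for p in self.cross_attn.parameters():  # backbone stays frozen
            p.requires_grad = False
        self.adapters = nn.ModuleDict({m: ModalityAdapter(dim) for m in modalities})

    def forward(self, feats: dict[str, torch.Tensor]) -> torch.Tensor:
        batch = next(iter(feats.values())).shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        per_modality = []
        for name, f in feats.items():
            out, _ = self.cross_attn(q, f, f)               # frozen cross-attention
            per_modality.append(self.adapters[name](out))   # trainable, modality-specific
        return torch.stack(per_modality).mean(0)            # simple fusion across modalities

# Features would come from frozen experts (e.g., depth, optical flow, surface normals).
fusion = MultimodalQueryFusion(dim=256, num_queries=32, modalities=["rgb", "depth", "flow"])
feats = {m: torch.randn(2, 100, 256) for m in ["rgb", "depth", "flow"]}
print(fusion(feats).shape)  # torch.Size([2, 32, 256])
```

The property the abstract emphasizes is that only the small per-modality adapters are trainable, so injecting a new modality means adding one adapter rather than retraining the fusion backbone.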
Model Composition for Multimodal Large Language Models
[article]
2024
arXiv
pre-print
However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. ...
Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. ...
LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852. ...
arXiv:2402.12750v1
fatcat:fjmz3ddbjzfydmfw2iwszxkd2e
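This snippet contrasts joint training on paired multimodal instruction data with composing already-trained models. One common parameter-level realization of that idea is task-vector merging: take the delta each modality-specific model adds over a shared base LLM and sum the deltas. The paper's actual composition mechanism may differ; the function below is only an illustration under that assumption, with hypothetical names (compose_state_dicts, alpha).

```python
import torch

def compose_state_dicts(base: dict, experts: dict, alpha: float = 1.0) -> dict:
    """Merge modality-specific models into one, with no joint training.

    base    -- state_dict of the shared base LLM all experts were tuned from
    experts -- maps modality name -> state_dict of a model fine-tuned from base
    alpha   -- scaling applied to each expert's delta (hypothetical knob)
    """
    composed = {k: v.clone() for k, v in base.items()}
    for name, sd in experts.items():
        for k in composed:
            if k in sd:
                # add this expert's "task vector": its deviation from the base
                composed[k] += alpha * (sd[k] - base[k])
    return composed

# Usage: compose an image-tuned and an audio-tuned MLLM into one model.
# merged = compose_state_dicts(base_llm.state_dict(),
#                              {"image": image_mllm.state_dict(),
#                               "audio": audio_mllm.state_dict()})
# model.load_state_dict(merged)
```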
User-LLM: Efficient LLM Contextualization with User Embeddings
[article]
2024
arXiv
pre-print
To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. ...
We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. ...
Recent works, such as NextGPT (Wu et al., 2023b), OneLLM (Han et al., 2023), and Anymal (Moon et al., 2023), explored unified frameworks for diverse input modalities. ...
arXiv:2402.13598v1
fatcat:6lohpdh2tbb4rgckwkegtdqndi
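The User-LLM snippet names two concrete integration mechanisms: cross-attention over user embeddings and soft-prompting. The sketch below shows one plausible way those could combine in a single block; the class and parameter names (UserContextBlock, num_soft_tokens) are hypothetical, not the paper's API.

```python
import torch
import torch.nn as nn

class UserContextBlock(nn.Module):
    """Illustrative: inject a user embedding into LLM hidden states via
    cross-attention, and prepend soft-prompt tokens derived from it."""
    def __init__(self, dim: int, num_soft_tokens: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_soft_prompt = nn.Linear(dim, num_soft_tokens * dim)
        self.num_soft_tokens = num_soft_tokens
        self.dim = dim

    def forward(self, hidden: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
        # user_emb: (batch, dim) summary of the user's activity sequence
        u = user_emb.unsqueeze(1)                    # (batch, 1, dim)
        attended, _ = self.cross_attn(hidden, u, u)  # tokens attend to user context
        hidden = hidden + attended                   # residual injection
        soft = self.to_soft_prompt(user_emb).view(-1, self.num_soft_tokens, self.dim)
        return torch.cat([soft, hidden], dim=1)      # soft-prompt prefix + tokens

block = UserContextBlock(dim=512)
h = torch.randn(2, 16, 512)  # token hidden states from the LLM
u = torch.randn(2, 512)      # user embedding from an activity encoder
print(block(h, u).shape)     # torch.Size([2, 20, 512])
```

Both paths keep the LLM itself unchanged, which matches the snippet's framing of dynamically adapting a model to user context rather than fine-tuning it per user.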