CREMA: Multimodal Compositional Video Reasoning via Efficient Modular Adaptation and Fusion
[article]
2024
arXiv
pre-print
Next, we introduce a query transformer with multiple parameter-efficient modules associated with each accessible modality. ...
This paper tackles these critical challenges and proposes CREMA, an efficient and modular modality-fusion framework for injecting any new modality into video reasoning. ...
We employ plug-and-play frozen experts to extract diverse modality features, including depth maps, optical flow, and surface normals, from raw videos. ...
arXiv:2402.05889v1
fatcat:usau24isbvc27doelcw2sxnioe
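The CREMA snippet describes modality-specific parameter-efficient modules attached to a shared query transformer, operating on features produced by frozen per-modality experts. Below is a minimal PyTorch sketch of that general design; the class and parameter names (ModalityAdapter, MultimodalQueryFusion, rank, num_queries) are hypothetical and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Low-rank adapter (hypothetical) trained per modality on top of a frozen backbone."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # starts as identity, so the frozen path is unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))

class MultimodalQueryFusion(nn.Module):
    """Shared query tokens attend to each modality's features through a frozen
    cross-attention block, then pass through that modality's own adapter."""
    def __init__(self, dim: int, num_queries: int, modalities: list[str]):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        for p in self.cross_attn.parameters():  # backbone stays frozen
            p.requires_grad = False
        self.adapters = nn.ModuleDict({m: ModalityAdapter(dim) for m in modalities})

    def forward(self, feats: dict[str, torch.Tensor]) -> torch.Tensor:
        batch = next(iter(feats.values())).shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        per_modality = []
        for name, f in feats.items():
            out, _ = self.cross_attn(q, f, f)               # frozen cross-attention
            per_modality.append(self.adapters[name](out))   # trainable, modality-specific
        return torch.stack(per_modality).mean(0)            # simple fusion across modalities

# Features would come from frozen experts (e.g., depth, optical flow, surface normals).
fusion = MultimodalQueryFusion(dim=256, num_queries=32, modalities=["rgb", "depth", "flow"])
feats = {m: torch.randn(2, 100, 256) for m in ["rgb", "depth", "flow"]}
print(fusion(feats).shape)  # torch.Size([2, 32, 256])
```

The property the abstract emphasizes is that only the small per-modality adapters are trainable, so injecting a new modality means adding one adapter rather than retraining the fusion backbone.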
Model Composition for Multimodal Large Language Models
[article]
2024
arXiv
pre-print
However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. ...
Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. ...
LanguageBind: Extending video-language pretraining to N-modality by language-based semantic alignment. arXiv preprint arXiv:2310.01852. ...
arXiv:2402.12750v1
fatcat:fjmz3ddbjzfydmfw2iwszxkd2e
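This snippet contrasts joint training on paired multimodal instruction data with composing already-trained models. One common parameter-level realization of that idea is task-vector merging: take the delta each modality-specific model adds over a shared base LLM and sum the deltas. The paper's actual composition mechanism may differ; the function below is only an illustration under that assumption, with hypothetical names (compose_state_dicts, alpha).

```python
import torch

def compose_state_dicts(base: dict, experts: dict, alpha: float = 1.0) -> dict:
    """Merge modality-specific models into one, with no joint training.

    base    -- state_dict of the shared base LLM all experts were tuned from
    experts -- maps modality name -> state_dict of a model fine-tuned from base
    alpha   -- scaling applied to each expert's delta (hypothetical knob)
    """
    composed = {k: v.clone() for k, v in base.items()}
    for name, sd in experts.items():
        for k in composed:
            if k in sd:
                # add this expert's "task vector": its deviation from the base
                composed[k] += alpha * (sd[k] - base[k])
    return composed

# Usage: compose an image-tuned and an audio-tuned MLLM into one model.
# merged = compose_state_dicts(base_llm.state_dict(),
#                              {"image": image_mllm.state_dict(),
#                               "audio": audio_mllm.state_dict()})
# model.load_state_dict(merged)
```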
User-LLM: Efficient LLM Contextualization with User Embeddings
[article]
2024
arXiv
pre-print
To address this, we propose User-LLM, a novel framework that leverages user embeddings to contextualize LLMs. ...
We integrate these user embeddings with LLMs through cross-attention and soft-prompting, enabling LLMs to dynamically adapt to user context. ...
Recent works, such as NextGPT (Wu et al., 2023b), OneLLM (Han et al., 2023), and Anymal (Moon et al., 2023), explored unified frameworks for diverse input modalities. ...
arXiv:2402.13598v1
fatcat:6lohpdh2tbb4rgckwkegtdqndi
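The User-LLM snippet names two concrete integration mechanisms: cross-attention over user embeddings and soft-prompting. The sketch below shows one plausible way those could combine in a single block; the class and parameter names (UserContextBlock, num_soft_tokens) are hypothetical, not the paper's API.

```python
import torch
import torch.nn as nn

class UserContextBlock(nn.Module):
    """Illustrative: inject a user embedding into LLM hidden states via
    cross-attention, and prepend soft-prompt tokens derived from it."""
    def __init__(self, dim: int, num_soft_tokens: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_soft_prompt = nn.Linear(dim, num_soft_tokens * dim)
        self.num_soft_tokens = num_soft_tokens
        self.dim = dim

    def forward(self, hidden: torch.Tensor, user_emb: torch.Tensor) -> torch.Tensor:
        # user_emb: (batch, dim) summary of the user's activity sequence
        u = user_emb.unsqueeze(1)                    # (batch, 1, dim)
        attended, _ = self.cross_attn(hidden, u, u)  # tokens attend to user context
        hidden = hidden + attended                   # residual injection
        soft = self.to_soft_prompt(user_emb).view(-1, self.num_soft_tokens, self.dim)
        return torch.cat([soft, hidden], dim=1)      # soft-prompt prefix + tokens

block = UserContextBlock(dim=512)
h = torch.randn(2, 16, 512)  # token hidden states from the LLM
u = torch.randn(2, 512)      # user embedding from an activity encoder
print(block(h, u).shape)     # torch.Size([2, 20, 512])
```

Both paths keep the LLM itself unchanged, which matches the snippet's framing of dynamically adapting a model to user context rather than fine-tuning it per user.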