Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
[article]
2023
arXiv
pre-print
In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain ...
We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. ...
Conclusion: In this paper, we study temporal modeling in CLIP-based image-to-video knowledge transferring. ...
arXiv:2301.11116v1
fatcat:bi56l76tmfentmjqtxi5tjunh4
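To make the comparison concrete, here is a minimal PyTorch sketch (not this paper's actual model) of the two temporal-modeling baselines such studies revisit when transferring CLIP to video: parameter-free mean pooling of per-frame CLIP embeddings versus a small temporal transformer. Dimensions, names, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MeanPoolTemporal(nn.Module):
    """Parameter-free baseline: average per-frame CLIP features over time."""
    def forward(self, frame_feats):             # frame_feats: (B, T, D)
        return frame_feats.mean(dim=1)          # (B, D) video embedding

class TransformerTemporal(nn.Module):
    """Late fusion: self-attention over the frame sequence, then mean pooling."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, frame_feats):             # frame_feats: (B, T, D)
        return self.encoder(frame_feats).mean(dim=1)

# Usage: frame_feats would come from a (frozen or fine-tuned) CLIP image encoder.
frame_feats = torch.randn(4, 12, 512)           # 4 videos, 12 frames, ViT-B/32 dim
video_emb = TransformerTemporal()(frame_feats)  # (4, 512)
```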
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
[article]
2021
arXiv
pre-print
We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. ...
Different from them, we leverage a pretrained image-language model, simplifying it into a two-stage framework that co-learns image-text alignment and enhances temporal relations between video frames and video-text ...
Inspired by the success of transferring image-text pre-training knowledge into video-text learning [18], we directly adopt CLIP [30] for initialization to extend its capability to text-to-video retrieval ...
arXiv:2106.11097v1
fatcat:rsy5ezan6nfljajefczxp5pejy
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
[article]
2023
arXiv
pre-print
Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. ...
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. ...
... e.g., VideoMAE, since they cannot perform text-to-video retrieval. ...
arXiv:2209.15280v3
fatcat:qnbxccziizakzl7bioog6pctuu
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
[article]
2021
arXiv
pre-print
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. ...
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. ...
Recently, CLIP transfers image-text similarity to many image classification tasks, where the text branch serves as supervision for learning a general image representation and subsequently serves as a hyper ...
arXiv:2109.14084v2
fatcat:bbv6j5ekcfhg3c5ladvx5ytdae
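As an illustration of the objective described above, here is a hedged sketch of a symmetric InfoNCE loss over paired video and text embeddings. The construction of temporally overlapping positives and of hard negatives mined by nearest-neighbor retrieval is assumed to happen upstream in batch assembly; the temperature value is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs."""
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Contrast each video against every text in the batch, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```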
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
[article]
2021
arXiv
pre-print
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. ...
Several questions are investigated via empirical studies: 1) Is the image feature alone enough for video-text retrieval? ...
We mainly investigate how to transfer the knowledge from the image-text pretrained model CLIP (Radford et al., 2021) to video-text retrieval in this paper. ...
arXiv:2104.08860v2
fatcat:umh5tyixgfbell2mulkvwkhnb4
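The simplest similarity calculator such an empirical study starts from is parameter-free: cosine similarity between the text embedding and the mean of the per-frame CLIP embeddings. A minimal sketch with illustrative names and shapes, not the paper's code:

```python
import torch
import torch.nn.functional as F

def text_video_similarity(text_emb, frame_embs):
    """text_emb: (N, D) query embeddings; frame_embs: (M, T, D) frame features."""
    video_emb = F.normalize(frame_embs.mean(dim=1), dim=-1)   # (M, D) mean pooling
    text_emb = F.normalize(text_emb, dim=-1)                  # (N, D)
    return text_emb @ video_emb.t()                           # (N, M) cosine matrix

# Retrieval: rank all videos for each text query by descending similarity.
sims = text_video_similarity(torch.randn(5, 512), torch.randn(100, 12, 512))
ranking = sims.argsort(dim=1, descending=True)                # (5, 100)
```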
Learning Audio-Video Modalities from Image Captions
[article]
2022
arXiv
pre-print
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. ...
A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image captioning, where datasets are on the order of millions of samples. ...
Hence we explore the link between audio and text transferred via image similarity to videos that all have audio, and show this improves text-audio retrieval. ...
arXiv:2204.00679v1
fatcat:j75krdoimjaolmfiyziiz6ulpy
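The mining pipeline can be pictured as nearest-neighbor caption transfer: a video clip inherits the caption of a captioned image whose embedding lies close to one of its frames. A rough sketch under an assumed shared image encoder and an assumed similarity threshold (neither is specified here):

```python
import torch
import torch.nn.functional as F

def mine_captions(frame_embs, image_embs, captions, threshold=0.3):
    """frame_embs: (V, T, D) per-video frame embeddings;
    image_embs: (N, D) captioned-image embeddings; captions: list of N strings."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    mined = []
    for v in range(frame_embs.size(0)):
        sims = frame_embs[v] @ image_embs.t()        # (T, N) frame-image matches
        _, n = divmod(sims.argmax().item(), sims.size(1))
        if sims.max() >= threshold:
            mined.append((v, captions[n]))           # transfer the caption
    return mined
```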
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
[article]
2021
arXiv
pre-print
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. ...
Recently, vokenization (Tan and Bansal, 2020) has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. ...
Visualization: Text-to-Video Retrieval. ...
arXiv:2107.02681v2
fatcat:etpkmmnjpjbkzkfawritb6brhm
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
[article]
2022
arXiv
pre-print
Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations ...
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. ...
Based on CLIP, recent methods [25, 13, 34, 6] aim to transfer well-pretrained image-text knowledge from the image-text domain to the video-text domain. ...
arXiv:2204.03382v8
fatcat:mkv5pr27zbdfll4cjn7nmbmzee
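A speculative sketch of the frame-to-clip-to-video hierarchy described above. The fixed-size chunking is a deliberate simplification of HCMI's adaptive clustering, and all names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, clip_len=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.clip_len = clip_len

    def forward(self, frames):       # frames: (B, T, D), T divisible by clip_len
        frames = self.attn(frames)   # frame-level correlations via self-attention
        B, T, D = frames.shape
        clips = frames.view(B, T // self.clip_len, self.clip_len, D).mean(dim=2)
        video = clips.mean(dim=1)    # (B, D) video-level representation
        return frames, clips, video  # three granularities for cross-modal matching
```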
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[article]
2022
arXiv
pre-print
... complementary frameworks in a learnable manner to boost various video applications. ...
To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. ...
It is able to deliver state-of-the-art performance on around 40 datasets, and is capable of action discrimination, video-language alignment, and open understanding. ...
arXiv:2212.03191v2
fatcat:zl2pbyusczbjlf5qmzjj53yhhe
Deep Learning for Video-Text Retrieval: a Review
[article]
2023
arXiv
pre-print
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence, and vice versa. ...
... directions, with the expectation of providing some insights for researchers in the field of video-text retrieval. ...
The operator [x]+ = max(x, 0) is adopted twice to ensure that matching samples are close under both video-to-text and text-to-video retrieval (see the sketch after this entry). ...
arXiv:2302.12552v1
fatcat:zsvohxstdbg2nno2fv6pbhmfwq
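The loss this snippet refers to can be written out directly: the hinge [x]+ = max(x, 0) is applied once per retrieval direction over a batch similarity matrix whose diagonal holds the matching pairs. A sketch with an assumed margin value:

```python
import torch

def bidirectional_ranking_loss(sims, margin=0.2):
    """sims: (B, B) with sims[i, j] = similarity(text_i, video_j);
    diagonal entries are the matching text-video pairs."""
    pos = sims.diag().view(-1, 1)                      # (B, 1) positive scores
    # [x]+ applied twice: once for text-to-video, once for video-to-text.
    cost_t2v = (margin + sims - pos).clamp(min=0)      # each text vs. all videos
    cost_v2t = (margin + sims - pos.t()).clamp(min=0)  # each video vs. all texts
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return (cost_t2v.masked_fill(mask, 0) +
            cost_v2t.masked_fill(mask, 0)).mean()
```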
Retrieval-Augmented Egocentric Video Captioning
[article]
2024
arXiv
pre-print
... (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions ...
In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video ...
We report clip/video-to-text retrieval results on YouCook-Clip/YouCook-Video. † indicates different test data (still unavailable), so those results are not fairly comparable. ...
arXiv:2401.00789v3
fatcat:2t2md27ogrhy7a7myyzuvdab2m
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[article]
2024
arXiv
pre-print
Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. ...
They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. ...
Similarly, based on CC3M, Nagrani et al. proposed VideoCC3M [20] by transferring captions from image-text datasets to video ones. ...
arXiv:2307.06942v2
fatcat:5uumuasiczgxdhvstvjsvgwghq
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
[article]
2021
arXiv
pre-print
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting ...
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision models and text features ...
... state of the art on text-to-video retrieval and video question answering tasks. ...
arXiv:2102.06183v1
fatcat:n5yabezujbg27eosmpb23s4hlm
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
[article]
2022
arXiv
pre-print
To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. ...
Video-text retrieval has been a crucial and fundamental task in multi-modal research. ...
This may be because the temporal encoder is used to model the temporal relation of different frames in a video. ...
arXiv:2207.07285v2
fatcat:lpta52dyorhkdljjxtyg3n5gci
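One reading of "multi-grained" is that similarity is scored at video-sentence, video-word, frame-sentence, and frame-word granularity and the scores are then fused. The mean fusion below is a simplification (X-CLIP itself aggregates the fine-grained similarity matrices with an attention mechanism); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def multi_grained_score(frames, video, words, sent):
    """frames: (T, D) frame embeddings, video: (D,) pooled video embedding,
    words: (L, D) word embeddings, sent: (D,) pooled sentence embedding."""
    frames, video = F.normalize(frames, dim=-1), F.normalize(video, dim=-1)
    words, sent = F.normalize(words, dim=-1), F.normalize(sent, dim=-1)
    s_vs = video @ sent                     # video-sentence: coarsest grain
    s_vw = (words @ video).mean()           # video-word
    s_fs = (frames @ sent).mean()           # frame-sentence
    s_fw = (frames @ words.t()).mean()      # frame-word: finest grain
    return (s_vs + s_vw + s_fs + s_fw) / 4  # fused retrieval score
```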
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
[article]
2023
arXiv
pre-print
This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. ...
Diffusion models have achieved significant success in image and video generation. ...
Overview of the training stage, showing how to transfer a T2I model to V2V translation tasks: (a) text-to-image stage, (b) text-to-video stage, (c) video-to-video stage. ...
arXiv:2311.18837v1
fatcat:ve5ufdc4tfg7ncbzu6t4si6obe
Showing results 1 — 15 out of 5,043 results