Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
[article]
2023
arXiv
pre-print
In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain ...
We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. ...
Conclusion: In this paper, we study temporal modeling in CLIP-based image-to-video knowledge transferring. ...
arXiv:2301.11116v1
fatcat:bi56l76tmfentmjqtxi5tjunh4
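To make the comparison concrete, here is a minimal PyTorch sketch (not this paper's actual model) of the two temporal-modeling baselines such studies revisit when transferring CLIP to video: parameter-free mean pooling of per-frame CLIP embeddings versus a small temporal transformer. Dimensions, names, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MeanPoolTemporal(nn.Module):
    """Parameter-free baseline: average per-frame CLIP features over time."""
    def forward(self, frame_feats):             # frame_feats: (B, T, D)
        return frame_feats.mean(dim=1)          # (B, D) video embedding

class TransformerTemporal(nn.Module):
    """Late fusion: self-attention over the frame sequence, then mean pooling."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, frame_feats):             # frame_feats: (B, T, D)
        return self.encoder(frame_feats).mean(dim=1)

# Usage: frame_feats would come from a (frozen or fine-tuned) CLIP image encoder.
frame_feats = torch.randn(4, 12, 512)           # 4 videos, 12 frames, ViT-B/32 dim
video_emb = TransformerTemporal()(frame_feats)  # (4, 512)
```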
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
[article]
2021
arXiv
pre-print
We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. ...
Different from them, we leverage a pretrained image-language model, simplifying it into a two-stage framework that co-learns image-text alignment and enhances temporal relations between video frames and video-text ...
Inspired by the success of transferring image-text pre-training knowledge into video-text learning [18], we directly adopt CLIP [30] for initialization to extend its capability to text-to-video retrieval ...
arXiv:2106.11097v1
fatcat:rsy5ezan6nfljajefczxp5pejy
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
[article]
2023
arXiv
pre-print
Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. ...
Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. ...
... e.g., VideoMAE, since they cannot perform text-to-video retrieval. ...
arXiv:2209.15280v3
fatcat:qnbxccziizakzl7bioog6pctuu
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
[article]
2021
arXiv
pre-print
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. ...
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. ...
Recently, CLIP transfers image-text similarity to many image classification tasks, where the text branch serves as supervision for learning a general image representation and subsequently serves as a hyper ...
arXiv:2109.14084v2
fatcat:bbv6j5ekcfhg3c5ladvx5ytdae
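As an illustration of the objective described above, here is a hedged sketch of a symmetric InfoNCE loss over paired video and text embeddings. The construction of temporally overlapping positives and of hard negatives mined by nearest-neighbor retrieval is assumed to happen upstream in batch assembly; the temperature value is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) L2-normalized embeddings of matched pairs."""
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Contrast each video against every text in the batch, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```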
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
[article]
2021
arXiv
pre-print
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. ...
Several questions are investigated via empirical studies: 1) Is the image feature alone enough for video-text retrieval? ...
We mainly investigate how to transfer the knowledge from the image-text pretrained model CLIP (Radford et al., 2021) to video-text retrieval in this paper. ...
arXiv:2104.08860v2
fatcat:umh5tyixgfbell2mulkvwkhnb4
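The simplest similarity calculator such an empirical study starts from is parameter-free: cosine similarity between the text embedding and the mean of the per-frame CLIP embeddings. A minimal sketch with illustrative names and shapes, not the paper's code:

```python
import torch
import torch.nn.functional as F

def text_video_similarity(text_emb, frame_embs):
    """text_emb: (N, D) query embeddings; frame_embs: (M, T, D) frame features."""
    video_emb = F.normalize(frame_embs.mean(dim=1), dim=-1)   # (M, D) mean pooling
    text_emb = F.normalize(text_emb, dim=-1)                  # (N, D)
    return text_emb @ video_emb.t()                           # (N, M) cosine matrix

# Retrieval: rank all videos for each text query by descending similarity.
sims = text_video_similarity(torch.randn(5, 512), torch.randn(100, 12, 512))
ranking = sims.argsort(dim=1, descending=True)                # (5, 100)
```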
Learning Audio-Video Modalities from Image Captions
[article]
2022
arXiv
pre-print
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. ...
A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image captioning, where datasets are on the order of millions of samples. ...
Hence we explore the link between audio and text transferred via image similarity to videos that all have audio, and show this improves text-audio retrieval. ...
arXiv:2204.00679v1
fatcat:j75krdoimjaolmfiyziiz6ulpy
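The mining pipeline can be pictured as nearest-neighbor caption transfer: a video clip inherits the caption of a captioned image whose embedding lies close to one of its frames. A rough sketch under an assumed shared image encoder and an assumed similarity threshold (neither is specified here):

```python
import torch
import torch.nn.functional as F

def mine_captions(frame_embs, image_embs, captions, threshold=0.3):
    """frame_embs: (V, T, D) per-video frame embeddings;
    image_embs: (N, D) captioned-image embeddings; captions: list of N strings."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    mined = []
    for v in range(frame_embs.size(0)):
        sims = frame_embs[v] @ image_embs.t()        # (T, N) frame-image matches
        _, n = divmod(sims.argmax().item(), sims.size(1))
        if sims.max() >= threshold:
            mined.append((v, captions[n]))           # transfer the caption
    return mined
```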
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
[article]
2021
arXiv
pre-print
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. ...
Recently, vokenization (Tan and Bansal, 2020) has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. ...
Visualization: Text-to-Video Retrieval. ...
arXiv:2107.02681v2
fatcat:etpkmmnjpjbkzkfawritb6brhm
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
[article]
2022
arXiv
pre-print
Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations ...
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. ...
Based on CLIP, recent methods [25, 13, 34, 6] aim to transfer well-pretrained image-text knowledge from the image-text domain to the video-text domain. ...
arXiv:2204.03382v8
fatcat:mkv5pr27zbdfll4cjn7nmbmzee
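A speculative sketch of the frame-to-clip-to-video hierarchy described above. The fixed-size chunking is a deliberate simplification of HCMI's adaptive clustering, and all names and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class HierarchicalVideoEncoder(nn.Module):
    def __init__(self, dim=512, heads=8, clip_len=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.clip_len = clip_len

    def forward(self, frames):       # frames: (B, T, D), T divisible by clip_len
        frames = self.attn(frames)   # frame-level correlations via self-attention
        B, T, D = frames.shape
        clips = frames.view(B, T // self.clip_len, self.clip_len, D).mean(dim=2)
        video = clips.mean(dim=1)    # (B, D) video-level representation
        return frames, clips, video  # three granularities for cross-modal matching
```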
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
[article]
2022
arXiv
pre-print
... complementary frameworks in a learnable manner to boost various video applications. ...
To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. ...
It is able to deliver state-of-the-art performance on around 40 datasets, and is capable of action discrimination, video-language alignment, and open understanding. ...
arXiv:2212.03191v2
fatcat:zl2pbyusczbjlf5qmzjj53yhhe
Deep Learning for Video-Text Retrieval: a Review
[article]
2023
arXiv
pre-print
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence, and vice versa. ...
... directions, with the expectation of providing some insights for researchers in the field of video-text retrieval. ...
The operator [x]+ = max(x, 0) is adopted twice to ensure that matching samples are close under both video-to-text and text-to-video retrieval (see the sketch after this entry). ...
arXiv:2302.12552v1
fatcat:zsvohxstdbg2nno2fv6pbhmfwq
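The loss this snippet refers to can be written out directly: the hinge [x]+ = max(x, 0) is applied once per retrieval direction over a batch similarity matrix whose diagonal holds the matching pairs. A sketch with an assumed margin value:

```python
import torch

def bidirectional_ranking_loss(sims, margin=0.2):
    """sims: (B, B) with sims[i, j] = similarity(text_i, video_j);
    diagonal entries are the matching text-video pairs."""
    pos = sims.diag().view(-1, 1)                      # (B, 1) positive scores
    # [x]+ applied twice: once for text-to-video, once for video-to-text.
    cost_t2v = (margin + sims - pos).clamp(min=0)      # each text vs. all videos
    cost_v2t = (margin + sims - pos.t()).clamp(min=0)  # each video vs. all texts
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return (cost_t2v.masked_fill(mask, 0) +
            cost_v2t.masked_fill(mask, 0)).mean()
```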
Retrieval-Augmented Egocentric Video Captioning
[article]
2024
arXiv
pre-print
... (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions ...
In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video ...
We report clip/video-to-text retrieval results on YouCook-Clip/YouCook-Video. † indicates different test data (still unavailable), so those results are not fairly comparable. ...
arXiv:2401.00789v3
fatcat:2t2md27ogrhy7a7myyzuvdab2m
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
[article]
2024
arXiv
pre-print
Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. ...
They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research. ...
Similarly, based on CC3M, Nagrani et al. proposed VideoCC3M [20] by transferring captions from image-text datasets to video ones. ...
arXiv:2307.06942v2
fatcat:5uumuasiczgxdhvstvjsvgwghq
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling
[article]
2021
arXiv
pre-print
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting ...
The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision models and text features ...
... state of the art on text-to-video retrieval and video question answering tasks. ...
arXiv:2102.06183v1
fatcat:n5yabezujbg27eosmpb23s4hlm
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
[article]
2022
arXiv
pre-print
To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval. ...
Video-text retrieval has been a crucial and fundamental task in multi-modal research. ...
This may be because the temporal encoder is used to model the temporal relation of different frames in a video. ...
arXiv:2207.07285v2
fatcat:lpta52dyorhkdljjxtyg3n5gci
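One reading of "multi-grained" is that similarity is scored at video-sentence, video-word, frame-sentence, and frame-word granularity and the scores are then fused. The mean fusion below is a simplification (X-CLIP itself aggregates the fine-grained similarity matrices with an attention mechanism); shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def multi_grained_score(frames, video, words, sent):
    """frames: (T, D) frame embeddings, video: (D,) pooled video embedding,
    words: (L, D) word embeddings, sent: (D,) pooled sentence embedding."""
    frames, video = F.normalize(frames, dim=-1), F.normalize(video, dim=-1)
    words, sent = F.normalize(words, dim=-1), F.normalize(sent, dim=-1)
    s_vs = video @ sent                     # video-sentence: coarsest grain
    s_vw = (words @ video).mean()           # video-word
    s_fs = (frames @ sent).mean()           # frame-sentence
    s_fw = (frames @ words.t()).mean()      # frame-word: finest grain
    return (s_vs + s_vw + s_fs + s_fw) / 4  # fused retrieval score
```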
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
[article]
2023
arXiv
pre-print
This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. ...
Diffusion models have achieved significant success in image and video generation. ...
Overview of the training stage, showing how to transfer a T2I model to V2V translation tasks: (a) text-to-image stage, (b) text-to-video stage, (c) video-to-video stage. ...
arXiv:2311.18837v1
fatcat:ve5ufdc4tfg7ncbzu6t4si6obe
Showing results 1 — 15 out of 5,043 results