5,043 Hits

Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [article]

Ruyang Liu and Jingjia Huang and Ge Li and Jiashi Feng and Xinglong Wu and Thomas H. Li
2023 arXiv   pre-print
In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video  ...  We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition.  ...  Conclusion: In this paper, we study temporal modeling in CLIP-based image-to-video knowledge transferring.  ... 
arXiv:2301.11116v1 fatcat:bi56l76tmfentmjqtxi5tjunh4

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [article]

Han Fang, Pengfei Xiong, Luhui Xu, Yu Chen
2021 arXiv   pre-print
We present the CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner.  ...  Different from them, we leverage a pretrained image-language model, simplifying it into a two-stage framework with co-learning of image and text and enhanced temporal relations between video frames and video-text  ...  Inspired by the success of transferring image-text pre-training knowledge into video-text learning [18], we directly adopt CLIP [30] for initialization to extend its ability to text-to-video retrieval  ... 
arXiv:2106.11097v1 fatcat:rsy5ezan6nfljajefczxp5pejy

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [article]

Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge
2023 arXiv   pre-print
Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning.  ...  Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.  ...  e.g., VideoMAE, since they cannot perform text-to-video retrieval.  ... 
arXiv:2209.15280v3 fatcat:qnbxccziizakzl7bioog6pctuu

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [article]

Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, Christoph Feichtenhofer
2021 arXiv   pre-print
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.  ...  We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.  ...  Recently, CLIP transfers image-text similarity to many image classification tasks, where the text branch serves as supervision for learning a general image representation and subsequently serves as a hyper  ... 
arXiv:2109.14084v2 fatcat:bbv6j5ekcfhg3c5ladvx5ytdae
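As a concrete illustration of the contrastive objective sketched in this abstract, here is a minimal symmetric InfoNCE loss over paired video/text embeddings in PyTorch; the function name, tensor shapes, and the note on how retrieved hard negatives would enter the batch are illustrative assumptions, not VideoCLIP's released code.

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    video_emb, text_emb: (B, D) tensors; row i of each is a positive pair.
    In a VideoCLIP-style setup, some rows of the batch would be hard
    negatives mined by nearest-neighbor retrieval (an assumption here).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: video->text and text->video.
    loss_v2t = F.cross_entropy(logits, labels)
    loss_t2v = F.cross_entropy(logits.T, labels)
    return (loss_v2t + loss_t2v) / 2
```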

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [article]

Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
2021 arXiv   pre-print
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.  ...  Several questions are investigated via empirical studies: 1) Whether image features are enough for video-text retrieval?  ...  We mainly investigate how to transfer the knowledge from the image-text pretrained model CLIP (Radford et al., 2021) to video-text retrieval in this paper.  ... 
arXiv:2104.08860v2 fatcat:umh5tyixgfbell2mulkvwkhnb4
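One of the similarity calculators the CLIP4Clip study compares is parameter-free: mean-pool the per-frame CLIP features and take cosine similarity with the text feature. A minimal sketch of that idea, with shapes chosen for clarity rather than taken from the released code:

```python
import torch
import torch.nn.functional as F

def mean_pooled_similarity(frame_emb, text_emb):
    """Parameter-free video-text similarity in the spirit of CLIP4Clip's
    simplest variant: mean-pool per-frame CLIP features into one video
    vector, then take cosine similarity with the text feature.

    frame_emb: (num_frames, D) CLIP image features for sampled frames.
    text_emb:  (D,) CLIP text feature for the query sentence.
    """
    video_emb = F.normalize(frame_emb.mean(dim=0), dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return torch.dot(video_emb, text_emb)  # scalar cosine similarity
```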

Learning Audio-Video Modalities from Image Captions [article]

Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
2022 arXiv   pre-print
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.  ...  A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image captioning, where datasets are on the order of millions of samples.  ...  Hence we explore the link between audio and text, transferred via image similarity, to videos that all have audio, and show this improves text-audio retrieval.  ... 
arXiv:2204.00679v1 fatcat:j75krdoimjaolmfiyziiz6ulpy
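A hedged sketch of the caption-transfer idea described above: match candidate video clips to captioned seed images by embedding similarity and transfer the caption of the best match. The threshold value and the single-best-match rule are illustrative assumptions, not the paper's exact mining pipeline.

```python
import torch
import torch.nn.functional as F

def mine_captions_for_clips(image_emb, captions, clip_emb, threshold=0.8):
    """Transfer captions from seed images to video clips by similarity.

    image_emb: (N, D) embeddings of captioned seed images.
    captions:  list of N caption strings.
    clip_emb:  (M, D) embeddings of candidate video clips (or keyframes).
    Returns a list of (clip_index, caption) pairs above the threshold.
    """
    sims = F.normalize(clip_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
    best_sim, best_idx = sims.max(dim=1)  # best seed image per clip, (M,)
    return [
        (clip_id, captions[img_id])
        for clip_id, (img_id, s) in enumerate(
            zip(best_idx.tolist(), best_sim.tolist()))
        if s >= threshold
    ]
```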

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [article]

Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
2021 arXiv   pre-print
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.  ...  Recently, vokenization (Tan and Bansal, 2020) has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision.  ...  Visualization: Text-to-Video Retrieval.  ... 
arXiv:2107.02681v2 fatcat:etpkmmnjpjbkzkfawritb6brhm
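The teacher-to-student transfer described above is a knowledge-distillation setup. As one common instantiation (the paper also studies feature-based objectives), a soft-label distillation loss might look like this sketch:

```python
import torch
import torch.nn.functional as F

def distill_from_teacher(student_logits, teacher_logits, T=2.0):
    """Generic soft-label knowledge distillation, as a sketch of the
    teacher-to-student transfer VidLanKD describes; this is one common
    choice, not necessarily the paper's exact objective.
    """
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between softened distributions, scaled by T^2 as usual.
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```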

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations [article]

Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu
2022 arXiv   pre-print
Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations  ...  Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years.  ...  Based on CLIP, recent methods [25, 13, 34, 6] aim to transfer well-pretrained image-text knowledge from the text-image domain to the text-video domain.  ... 
arXiv:2204.03382v8 fatcat:mkv5pr27zbdfll4cjn7nmbmzee
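A rough sketch of the hierarchy this abstract describes: self-attention over frame features, then pooling correlated frames into clip-level and video-level representations. The layer sizes and the fixed-window stand-in for adaptive clustering are assumptions for illustration:

```python
import torch
import torch.nn as nn

class HierarchicalFrameEncoder(nn.Module):
    """Frame -> clip -> video hierarchy in the spirit of the HCMI snippet.

    Self-attention models frame-level correlations; fixed windows of
    `clip_len` frames stand in for adaptive clustering (an assumption).
    """

    def __init__(self, dim=512, num_heads=8, clip_len=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.clip_len = clip_len

    def forward(self, frames):
        # frames: (B, T, D); assumes T is divisible by clip_len.
        frames = self.attn(frames)  # frame-level correlations
        B, T, D = frames.shape
        clips = frames.view(B, T // self.clip_len, self.clip_len, D).mean(2)
        video = clips.mean(dim=1)   # (B, D) video-level representation
        return frames, clips, video
```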

InternVideo: General Video Foundation Models via Generative and Discriminative Learning [article]

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen (+5 others)
2022 arXiv   pre-print
... complementary frameworks in a learnable manner to boost various video applications.  ...  To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning.  ...  It is able to deliver state-of-the-art performance on around 40 datasets, and is capable of action discrimination, video-language alignment, and open understanding.  ... 
arXiv:2212.03191v2 fatcat:zl2pbyusczbjlf5qmzjj53yhhe

Deep Learning for Video-Text Retrieval: a Review [article]

Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu
2023 arXiv   pre-print
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence, and vice versa.  ...  directions, with the expectation to provide some insights for researchers in the field of video-text retrieval.  ...  The operator [x]_+ = max(x, 0) is adopted twice to ensure that matching samples are close for both video-to-text and text-to-video retrieval.  ... 
arXiv:2302.12552v1 fatcat:zsvohxstdbg2nno2fv6pbhmfwq
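The twice-adopted hinge operator [x]_+ corresponds to the bidirectional max-margin ranking loss common in this literature. A generic PyTorch sketch (not code from any specific surveyed paper):

```python
import torch

def bidirectional_ranking_loss(sim, margin=0.2):
    """Max-margin ranking loss where [x]_+ = max(x, 0) appears twice:
    once for video-to-text and once for text-to-video retrieval.

    sim: (B, B) similarity matrix; sim[i, i] are the matching pairs.
    """
    pos = sim.diag().unsqueeze(1)                     # (B, 1) positives
    # [margin - s(v, t+) + s(v, t-)]_+ : rank texts for each video.
    cost_v2t = (margin - pos + sim).clamp(min=0)
    # [margin - s(v+, t) + s(v-, t)]_+ : rank videos for each text.
    cost_t2v = (margin - pos.T + sim).clamp(min=0)
    # Zero out the diagonal so positives do not penalize themselves.
    eye = torch.eye(sim.size(0), device=sim.device).bool()
    cost_v2t = cost_v2t.masked_fill(eye, 0)
    cost_t2v = cost_t2v.masked_fill(eye, 0)
    return cost_v2t.mean() + cost_t2v.mean()
```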

Retrieval-Augmented Egocentric Video Captioning [article]

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
2024 arXiv   pre-print
(3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions  ...  In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video  ...  We report clip/video-to-text retrieval results on YouCook-Clip/YouCook-Video. † indicates different test data (still unavailable), which is not fairly comparable.  ... 
arXiv:2401.00789v3 fatcat:2t2md27ogrhy7a7myyzuvdab2m
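A hedged sketch of the idea behind an EgoExoNCE-style objective as the snippet describes it: align egocentric and exocentric video features to the same text feature, so the two views are pulled together through their shared caption. The exact positive/negative construction in the paper may differ:

```python
import torch
import torch.nn.functional as F

def shared_text_nce(ego_emb, exo_emb, text_emb, temperature=0.07):
    """Align ego and exo video features to a shared text feature.

    ego_emb, exo_emb, text_emb: (B, D); row i describes the same action.
    Both views are contrasted against the same text anchors, which pulls
    the two video views together indirectly (an illustrative assumption).
    """
    ego = F.normalize(ego_emb, dim=-1)
    exo = F.normalize(exo_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    labels = torch.arange(ego.size(0), device=ego.device)
    loss_ego = F.cross_entropy(ego @ txt.T / temperature, labels)
    loss_exo = F.cross_entropy(exo @ txt.T / temperature, labels)
    return (loss_ego + loss_exo) / 2
```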

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [article]

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo (+4 others)
2024 arXiv   pre-print
Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L.  ...  They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system, advancing video-to-text and text-to-video generation research.  ...  Similarly, based on CC3M, Nagrani et al. proposed VideoCC3M [20] by transferring captions from image-text datasets to video ones.  ... 
arXiv:2307.06942v2 fatcat:5uumuasiczgxdhvstvjsvgwghq

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [article]

Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu
2021 arXiv   pre-print
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting  ...  The canonical approach to video-and-language learning (e.g., video question answering) dictates that a neural model learn from offline-extracted dense video features from vision models and text features  ...  the art on text-to-video retrieval and video question answering tasks.  ... 
arXiv:2102.06183v1 fatcat:n5yabezujbg27eosmpb23s4hlm
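The sparse-sampling recipe the abstract contrasts with dense offline features can be illustrated in a few lines: sample a handful of short clips per video at each training step and let their predictions stand in for the full-length video. This is a simplified sketch, not ClipBERT's dataloader:

```python
import torch

def sparse_sample_clips(video_frames, num_clips=2, frames_per_clip=2):
    """Randomly pick a few short clips from one video per training step.

    video_frames: (T, C, H, W) decoded frames; assumes T >= frames_per_clip.
    Returns: (num_clips, frames_per_clip, C, H, W)
    """
    T = video_frames.size(0)
    clips = []
    for _ in range(num_clips):
        start = torch.randint(0, max(T - frames_per_clip, 1), (1,)).item()
        clips.append(video_frames[start:start + frames_per_clip])
    return torch.stack(clips)
```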

X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [article]

Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, Rongrong Ji
2022 arXiv   pre-print
To this end, this paper presents a novel multi-grained contrastive model, namely X-CLIP, for video-text retrieval.  ...  Video-text retrieval has been a crucial and fundamental task in multi-modal research.  ...  This may be because the temporal encoder is used to model the temporal relations among different frames in a video.  ... 
arXiv:2207.07285v2 fatcat:lpta52dyorhkdljjxtyg3n5gci
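A simplified sketch of what "multi-grained" contrast can mean here: scoring video-sentence, video-word, frame-sentence, and frame-word similarities and aggregating them. X-CLIP's actual aggregation uses attention over instance-level scores; the plain mean below is an assumption for brevity:

```python
import torch
import torch.nn.functional as F

def multi_grained_score(frame_emb, word_emb):
    """Compare video/frame representations against sentence/word ones.

    frame_emb: (F, D) per-frame features; word_emb: (W, D) per-word features.
    """
    frame_emb = F.normalize(frame_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    video_emb = F.normalize(frame_emb.mean(dim=0), dim=-1)  # video-level
    sent_emb = F.normalize(word_emb.mean(dim=0), dim=-1)    # sentence-level
    sims = {
        "video-sentence": video_emb @ sent_emb,  # coarsest grain, scalar
        "video-word": word_emb @ video_emb,      # (W,)
        "frame-sentence": frame_emb @ sent_emb,  # (F,)
        "frame-word": frame_emb @ word_emb.T,    # (F, W), finest grain
    }
    # Average all grains into a single retrieval score (a simplification).
    return sum(s.mean() for s in sims.values()) / len(sims)
```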

VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [article]

Zhen Xing and Qi Dai and Zihao Zhang and Hui Zhang and Han Hu and Zuxuan Wu and Yu-Gang Jiang
2023 arXiv   pre-print
This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions.  ...  Diffusion models have achieved significant success in image and video generation.  ...  Overview of the training stages, showing how to transfer a T2I model to V2V translation tasks: (a) Text-to-Image stage, (b) Text-to-Video stage, (c) Video-to-Video stage.  ... 
arXiv:2311.18837v1 fatcat:ve5ufdc4tfg7ncbzu6t4si6obe
Showing results 1 — 15 out of 5,043 results