
Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [article]

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
2020 arXiv   pre-print
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video.  ...  Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal  ...  This makes the task of collecting temporal annotations difficult and non-trivial. In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video (WSTG).  ... 
arXiv:2001.09308v1 fatcat:z2h6kx2j4rg2xbfq4sopn6dmy4
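The grounding task described in this first result (and in most of the entries below) is typically scored with temporal intersection-over-union (IoU) between the predicted segment and the annotated one, reported as "R@1, IoU >= m". The following is a minimal sketch of that protocol, assuming segments are (start, end) pairs in seconds; it is illustrative and not taken from any of the listed papers.

```python
# Minimal temporal-IoU evaluation sketch (illustrative; not from any listed paper).
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds: List[Segment], gts: List[Segment], thresh: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches IoU >= thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

if __name__ == "__main__":
    preds = [(2.0, 8.0), (10.0, 15.0)]
    gts = [(3.0, 9.0), (20.0, 25.0)]
    print(recall_at_1(preds, gts))  # 0.5 -- only the first query overlaps enough
```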

Self-supervised Learning for Semi-supervised Temporal Language Grounding [article]

Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
2021 arXiv   pre-print
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.  ...  Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance  ...  A closer look at temporal sentence grounding in videos: Datasets and metrics.  ... 
arXiv:2109.11475v2 fatcat:2qmfaum4off4dmxzbvgpgj2hty

A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [article]

Yitian Yuan, Xiaohan Lan, Long Chen, Wei Liu, Xin Wang, Wenwu Zhu
2021 arXiv   pre-print
Although Temporal Sentence Grounding in Videos (TSGV) has achieved impressive progress over the last few years, current TSGV models tend to capture the moment annotation biases and fail to take full advantage  ...  In this paper, we first take a closer look at the existing evaluation protocol, and argue that both the prevailing datasets and metrics are responsible for the unreliable benchmarking.  ...  Conclusion In this paper, we took a closer look at the existing evaluation protocol of the temporal sentence grounding in videos (TSGV) task, and we found that both the prevailing datasets and metrics  ... 
arXiv:2101.09028v2 fatcat:tlvfxoxr4nfcbetnaqc4adxs5a
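The annotation bias this paper analyzes can be probed directly: if ground-truth moments cluster around a few normalized locations, a query-agnostic prior already scores well under the IoU metric sketched above. Below is a hedged sketch of such a check; the `annotations` input and its (start, end, duration) layout are assumed placeholders, not the paper's code.

```python
# Hedged sketch: probe moment-annotation bias with a query-agnostic prior.
import numpy as np

def normalized_moments(annotations):
    """Map (start_sec, end_sec, video_duration_sec) triples to [0, 1] coordinates."""
    return np.array([(s / d, e / d) for s, e, d in annotations])

def prior_only_recall(train_moments, test_moments, thresh=0.5):
    """Score a 'model' that ignores the query and always predicts the mean training moment."""
    prior = train_moments.mean(axis=0)
    inter = np.maximum(0.0, np.minimum(prior[1], test_moments[:, 1])
                       - np.maximum(prior[0], test_moments[:, 0]))
    union = (prior[1] - prior[0]) + (test_moments[:, 1] - test_moments[:, 0]) - inter
    iou = np.where(union > 0, inter / union, 0.0)
    return float((iou >= thresh).mean())

# A high prior-only recall means the moment distribution, not language understanding,
# drives much of the benchmark score -- the kind of bias the paper argues against.
```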

What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [article]

Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
2023 arXiv   pre-print
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.  ...  To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos  ...  One of the early works studied this task in the context of weakly supervised learning scenarios where we learn grounding with human-annotated captions of the video [52].  ... 
arXiv:2303.16990v1 fatcat:becdz5s6bbhdnowfe46hmf2jme

Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities

Ping Wang, Li Sun, Liuan Wang, Jun Sun
2022 Sustainability  
In this paper, a novel atom-based zero-shot video grounding (AZVG) method is proposed to retrieve the segments in the video that correspond to a given input sentence.  ...  Although it is training-free, the performance of AZVG is competitive with weakly supervised methods and better than unsupervised SOTA methods on the Charades-STA dataset.  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/su15010153 fatcat:gtgnid7xanbjhef5hjvzaqbhne
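The snippet does not spell out AZVG's atom-based procedure, but the general idea of training-free grounding can be illustrated with a much simpler baseline: embed the sentence and each frame with a pretrained vision-language model, then return the contiguous span whose similarity to the sentence is highest above the video-level average. This is an assumed illustration, not the paper's method; `frame_scores` stands in for per-frame cosine similarities from, e.g., a CLIP-style encoder.

```python
# Hedged sketch of a training-free grounding baseline (not the AZVG algorithm).
import numpy as np

def best_span(frame_scores: np.ndarray, min_len: int = 1) -> tuple:
    """Return (start, end) frame indices of the span maximizing total above-average similarity."""
    centered = frame_scores - frame_scores.mean()       # favor spans above the video-level mean
    prefix = np.concatenate([[0.0], np.cumsum(centered)])
    n = len(frame_scores)
    best, best_val = (0, min_len), -np.inf
    for i in range(n):
        for j in range(i + min_len, n + 1):             # O(n^2) search, fine for short clips
            val = prefix[j] - prefix[i]
            if val > best_val:
                best, best_val = (i, j), val
    return best

# frame_scores would be cosine similarities between the query embedding and
# per-frame embeddings from a pretrained vision-language model.
print(best_span(np.array([0.1, 0.1, 0.8, 0.9, 0.7, 0.2])))  # (2, 5)
```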

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [article]

Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang
2022 arXiv   pre-print
To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised  ...  In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos  ...  Results We compare our approach to the state-of-the-art WSAL methods, video-subtitle moment retrieval, and weakly-supervised temporal sentence grounding. We summarize the results in Table 1.  ... 
arXiv:2208.01954v1 fatcat:yesdc65rkzbsbjn7umhuyo5xly

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu
2022 arXiv   pre-print
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years.  ...  In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding.  ...  CONCLUSION In this paper, we take a closer look at mainstream benchmark datasets for temporal sentence grounding in videos and find that there exists significant annotation bias, resulting in highly untrustworthy  ... 
arXiv:2203.05243v1 fatcat:lkyv5znigvdedfsffmnsslxq2e

A Survey on Video Moment Localization

Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, Yong Rui
2022 ACM Computing Surveys  
In this survey paper, we aim to present a comprehensive review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones.  ...  Beyond the task of temporal action localization whereby the target actions are pre-defined, video moment retrieval can query arbitrary complex activities.  ...  To better model the fine-grained video-text local correspondences, Yang et al. [103] presented a Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding.  ... 
doi:10.1145/3556537 fatcat:3s6cqyebnjfg3pvwvk3db7d6ra

Cross-task weakly supervised learning from instructional videos [article]

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
2019 arXiv   pre-print
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps.  ...  In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal  ...  This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V.  ... 
arXiv:1903.08225v2 fatcat:w3s73pstvzccrb7zqvybbg5j7y

Cross-Task Weakly Supervised Learning From Instructional Videos

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps.  ...  In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal  ...  This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V.  ... 
doi:10.1109/cvpr.2019.00365 dblp:conf/cvpr/ZhukovACFLS19 fatcat:dz3tpgbim5bwxhri4hkj57bcfu
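In both records above, the "temporal constraints from narration and the list of steps" boil down to one requirement: the ordered steps must be assigned to frames in that same order. A standard way to impose this is a monotonic dynamic-programming alignment over per-frame step scores. The sketch below is a generic order-preserving alignment under that assumption, not the authors' exact training procedure; `scores` is a hypothetical (frames x steps) confidence matrix.

```python
# Hedged sketch: order-preserving alignment of an ordered step list to frames.
import numpy as np

def align_steps(scores: np.ndarray) -> np.ndarray:
    """scores[t, k]: confidence that frame t shows step k (steps listed in order).
    Returns one frame index per step, with indices strictly increasing."""
    T, K = scores.shape
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[:, 0] = scores[:, 0]
    for k in range(1, K):
        best_prev, best_t = -np.inf, 0
        for t in range(1, T):
            if dp[t - 1, k - 1] > best_prev:            # best placement of step k-1 before frame t
                best_prev, best_t = dp[t - 1, k - 1], t - 1
            dp[t, k] = best_prev + scores[t, k]
            back[t, k] = best_t
    frames = np.zeros(K, dtype=int)
    frames[-1] = int(np.argmax(dp[:, -1]))
    for k in range(K - 1, 0, -1):                       # trace the chosen frames back
        frames[k - 1] = back[frames[k], k]
    return frames

# Example: 6 frames, 3 ordered steps; the returned frames respect the given step order.
rng = np.random.default_rng(0)
print(align_steps(rng.random((6, 3))))
```

Under this constraint the only supervision needed at training time is the ordered step list itself, which is exactly what makes the setting weakly supervised.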

Temporal Sentence Grounding in Videos: A Survey and Future Directions [article]

Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
2023 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language  ...  As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment  ...  Spatio-Temporal Sentence Grounding in Videos Spatio-temporal sentence grounding in videos (STSGV) is another extension of TSGV.  ... 
arXiv:2201.08071v3 fatcat:ktw4onlakzfedok42mmj6r5zwq
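As a companion to this survey entry, the "common structure of functional components" it describes (feature extraction for video and query, cross-modal interaction, then prediction of the target moment) can be caricatured in a few lines. The skeleton below uses random features and weights purely as stand-ins; every function here is a hypothetical placeholder, not an API from the survey.

```python
# Hedged skeleton of a generic TSGV pipeline: encode -> fuse -> predict a moment.
import numpy as np

rng = np.random.default_rng(0)

def encode_video(num_clips=64, dim=256):
    return rng.normal(size=(num_clips, dim))             # clip-level visual features (stand-in)

def encode_query(num_words=12, dim=256):
    return rng.normal(size=(num_words, dim))              # token-level query features (stand-in)

def cross_modal_fusion(video, query):
    attn = video @ query.T                                # clip-to-word affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over query words
    return video + attn @ query                           # query-aware clip features

def predict_moment(fused):
    start_logits = fused @ rng.normal(size=fused.shape[1])
    end_logits = fused @ rng.normal(size=fused.shape[1])
    s = int(np.argmax(start_logits))
    e = s + int(np.argmax(end_logits[s:]))                # enforce end >= start
    return s, e

video, query = encode_video(), encode_query()
print(predict_moment(cross_modal_fusion(video, query)))   # (start_clip, end_clip)
```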

Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions [article]

Fadime Sener, Rishabh Saraf, Angela Yao
2022 arXiv   pre-print
Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language.  ...  To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation.  ...  This step is supervised as it requires video segments of each step that are temporally aligned with the corresponding sentences.  ... 
arXiv:2106.03158v2 fatcat:2ojuj7hayrh5xj5yjkdj7o3dwe

Vision+X: A Survey on Multimodal Learning in the Light of Data [article]

Ye Zhu, Yu Wu, Nicu Sebe, Yan Yan
2022 arXiv   pre-print
The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve  ...  a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system.  ...  Similar to other metrics, a SPICE value closer to 1 implies better quality of the generated textual sentences with respect to ground truth captions and given images.  ... 
arXiv:2210.02884v1 fatcat:g44t3rxvqrf5ti4pfggjlygeda

Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled Videos [article]

Reza Ghoddoosian, Saif Sayed, Vassilis Athitsos
2020 arXiv   pre-print
We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time with a level of granularity  ...  This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training.  ...  Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the National Science Foundation.  ... 
arXiv:2011.10190v1 fatcat:pelqv65k5vabpnfxpmjgthwd2q

Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection

Jie Chen, Zhiheng Li, Jiebo Luo, Chenliang Xu
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors.  ...  In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task.  ...  This work was supported in part by NSF 1741472, 1813709, and 1909912. The article solely reflects the opinions and conclusions of its authors but not the funding agents.  ... 
doi:10.1109/cvpr42600.2020.00992 dblp:conf/cvpr/ChenLLX20 fatcat:64pqhejqwvc7lnytfh34h7rz4y
Showing results 1–15 of 821