
Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [article]

Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
2020 arXiv   pre-print
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video.  ...  Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal  ...  This makes the task of collecting temporal annotations difficult and non-trivial. In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video (WSTG).  ... 
arXiv:2001.09308v1 fatcat:z2h6kx2j4rg2xbfq4sopn6dmy4
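The grounding task described in this first result (and in most of the entries below) is typically scored with temporal intersection-over-union (IoU) between the predicted segment and the annotated one, reported as "R@1, IoU >= m". The following is a minimal sketch of that protocol, assuming segments are (start, end) pairs in seconds; it is illustrative and not taken from any of the listed papers.

```python
# Minimal temporal-IoU evaluation sketch (illustrative; not from any listed paper).
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds: List[Segment], gts: List[Segment], thresh: float = 0.5) -> float:
    """Fraction of queries whose top-1 prediction reaches IoU >= thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

if __name__ == "__main__":
    preds = [(2.0, 8.0), (10.0, 15.0)]
    gts = [(3.0, 9.0), (20.0, 25.0)]
    print(recall_at_1(preds, gts))  # 0.5 -- only the first query overlaps enough
```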

Self-supervised Learning for Semi-supervised Temporal Language Grounding [article]

Fan Luo, Shaoxiang Chen, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang
2021 arXiv   pre-print
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.  ...  Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance  ...  A closer look at temporal sentence grounding in videos: Datasets and metrics.  ... 
arXiv:2109.11475v2 fatcat:2qmfaum4off4dmxzbvgpgj2hty

A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics [article]

Yitian Yuan, Xiaohan Lan, Long Chen, Wei Liu, Xin Wang, Wenwu Zhu
2021 arXiv   pre-print
Although Temporal Sentence Grounding in Videos (TSGV) has achieved impressive progress over the last few years, current TSGV models tend to capture the moment annotation biases and fail to take full advantage  ...  In this paper, we first take a closer look at the existing evaluation protocol, and argue that both the prevailing datasets and metrics are responsible for the unreliable benchmarking.  ...  Conclusion In this paper, we took a closer look at the existing evaluation protocol of the temporal sentence grounding in videos (TSGV) task, and we found that both the prevailing datasets and metrics  ... 
arXiv:2101.09028v2 fatcat:tlvfxoxr4nfcbetnaqc4adxs5a
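The annotation bias this paper analyzes can be probed directly: if ground-truth moments cluster around a few normalized locations, a query-agnostic prior already scores well under the IoU metric sketched above. Below is a hedged sketch of such a check; the `annotations` input and its (start, end, duration) layout are assumed placeholders, not the paper's code.

```python
# Hedged sketch: probe moment-annotation bias with a query-agnostic prior.
import numpy as np

def normalized_moments(annotations):
    """Map (start_sec, end_sec, video_duration_sec) triples to [0, 1] coordinates."""
    return np.array([(s / d, e / d) for s, e, d in annotations])

def prior_only_recall(train_moments, test_moments, thresh=0.5):
    """Score a 'model' that ignores the query and always predicts the mean training moment."""
    prior = train_moments.mean(axis=0)
    inter = np.maximum(0.0, np.minimum(prior[1], test_moments[:, 1])
                       - np.maximum(prior[0], test_moments[:, 0]))
    union = (prior[1] - prior[0]) + (test_moments[:, 1] - test_moments[:, 0]) - inter
    iou = np.where(union > 0, inter / union, 0.0)
    return float((iou >= thresh).mean())

# A high prior-only recall means the moment distribution, not language understanding,
# drives much of the benchmark score -- the kind of bias the paper argues against.
```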

What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [article]

Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne
2023 arXiv   pre-print
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only.  ...  To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos  ...  One of the early works studied this task in the context of weakly supervised learning scenarios where we learn grounding with human-annotated captions of the video [52].  ... 
arXiv:2303.16990v1 fatcat:becdz5s6bbhdnowfe46hmf2jme

Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities

Ping Wang, Li Sun, Liuan Wang, Jun Sun
2022 Sustainability  
In this paper, a novel atom-based zero-shot video grounding (AZVG) method is proposed to retrieve the segments in the video that correspond to a given input sentence.  ...  Although it is training-free, the performance of AZVG is competitive with weakly supervised methods and better than unsupervised SOTA methods on the Charades-STA dataset.  ...  Conflicts of Interest: The authors declare no conflict of interest.  ... 
doi:10.3390/su15010153 fatcat:gtgnid7xanbjhef5hjvzaqbhne
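The snippet does not spell out AZVG's atom-based procedure, but the general idea of training-free grounding can be illustrated with a much simpler baseline: embed the sentence and each frame with a pretrained vision-language model, then return the contiguous span whose similarity to the sentence is highest above the video-level average. This is an assumed illustration, not the paper's method; `frame_scores` stands in for per-frame cosine similarities from, e.g., a CLIP-style encoder.

```python
# Hedged sketch of a training-free grounding baseline (not the AZVG algorithm).
import numpy as np

def best_span(frame_scores: np.ndarray, min_len: int = 1) -> tuple:
    """Return (start, end) frame indices of the span maximizing total above-average similarity."""
    centered = frame_scores - frame_scores.mean()       # favor spans above the video-level mean
    prefix = np.concatenate([[0.0], np.cumsum(centered)])
    n = len(frame_scores)
    best, best_val = (0, min_len), -np.inf
    for i in range(n):
        for j in range(i + min_len, n + 1):             # O(n^2) search, fine for short clips
            val = prefix[j] - prefix[i]
            if val > best_val:
                best, best_val = (i, j), val
    return best

# frame_scores would be cosine similarities between the query embedding and
# per-frame embeddings from a pretrained vision-language model.
print(best_span(np.array([0.1, 0.1, 0.8, 0.9, 0.7, 0.2])))  # (2, 5)
```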

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [article]

Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang
2022 arXiv   pre-print
To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised  ...  In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos  ...  Results We compare our approach to the state-of-the-art WSAL methods, video-subtitle moment retrieval, and weakly-supervised temporal sentence grounding. We summarize the results in Table 1.  ... 
arXiv:2208.01954v1 fatcat:yesdc65rkzbsbjn7umhuyo5xly

A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [article]

Xiaohan Lan, Yitian Yuan, Xin Wang, Long Chen, Zhi Wang, Lin Ma, Wenwu Zhu
2022 arXiv   pre-print
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years.  ...  In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding.  ...  CONCLUSION In this paper, we take a closer look at mainstream benchmark datasets for temporal sentence grounding in videos and find that there exists significant annotation bias, resulting in highly untrustworthy  ... 
arXiv:2203.05243v1 fatcat:lkyv5znigvdedfsffmnsslxq2e

A Survey on Video Moment Localization

Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, Yong Rui
2022 ACM Computing Surveys  
In this survey paper, we aim to present a comprehensive review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones.  ...  Beyond the task of temporal action localization whereby the target actions are pre-defined, video moment retrieval can query arbitrary complex activities.  ...  To better model the fine-grained video-text local correspondences, Yang et al. [103] presented a Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding.  ... 
doi:10.1145/3556537 fatcat:3s6cqyebnjfg3pvwvk3db7d6ra

Cross-task weakly supervised learning from instructional videos [article]

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
2019 arXiv   pre-print
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps.  ...  In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal  ...  This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V.  ... 
arXiv:1903.08225v2 fatcat:w3s73pstvzccrb7zqvybbg5j7y

Cross-Task Weakly Supervised Learning From Instructional Videos

Dimitri Zhukov, Jean-Baptiste Alayrac, Ramazan Gokberk Cinbis, David Fouhey, Ivan Laptev, Josef Sivic
2019 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps.  ...  In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal  ...  This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V.  ... 
doi:10.1109/cvpr.2019.00365 dblp:conf/cvpr/ZhukovACFLS19 fatcat:dz3tpgbim5bwxhri4hkj57bcfu
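In both records above, the "temporal constraints from narration and the list of steps" boil down to one requirement: the ordered steps must be assigned to frames in that same order. A standard way to impose this is a monotonic dynamic-programming alignment over per-frame step scores. The sketch below is a generic order-preserving alignment under that assumption, not the authors' exact training procedure; `scores` is a hypothetical (frames x steps) confidence matrix.

```python
# Hedged sketch: order-preserving alignment of an ordered step list to frames.
import numpy as np

def align_steps(scores: np.ndarray) -> np.ndarray:
    """scores[t, k]: confidence that frame t shows step k (steps listed in order).
    Returns one frame index per step, with indices strictly increasing."""
    T, K = scores.shape
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[:, 0] = scores[:, 0]
    for k in range(1, K):
        best_prev, best_t = -np.inf, 0
        for t in range(1, T):
            if dp[t - 1, k - 1] > best_prev:            # best placement of step k-1 before frame t
                best_prev, best_t = dp[t - 1, k - 1], t - 1
            dp[t, k] = best_prev + scores[t, k]
            back[t, k] = best_t
    frames = np.zeros(K, dtype=int)
    frames[-1] = int(np.argmax(dp[:, -1]))
    for k in range(K - 1, 0, -1):                       # trace the chosen frames back
        frames[k - 1] = back[frames[k], k]
    return frames

# Example: 6 frames, 3 ordered steps; the returned frames respect the given step order.
rng = np.random.default_rng(0)
print(align_steps(rng.random((6, 3))))
```

Under this constraint the only supervision needed at training time is the ordered step list itself, which is exactly what makes the setting weakly supervised.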

Temporal Sentence Grounding in Videos: A Survey and Future Directions [article]

Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
2023 arXiv   pre-print
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language  ...  As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment  ...  Spatio-Temporal Sentence Grounding in Videos Spatio-temporal sentence grounding in videos (STSGV) is another extension of TSGV.  ... 
arXiv:2201.08071v3 fatcat:ktw4onlakzfedok42mmj6r5zwq
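As a companion to this survey entry, the "common structure of functional components" it describes (feature extraction for video and query, cross-modal interaction, then prediction of the target moment) can be caricatured in a few lines. The skeleton below uses random features and weights purely as stand-ins; every function here is a hypothetical placeholder, not an API from the survey.

```python
# Hedged skeleton of a generic TSGV pipeline: encode -> fuse -> predict a moment.
import numpy as np

rng = np.random.default_rng(0)

def encode_video(num_clips=64, dim=256):
    return rng.normal(size=(num_clips, dim))             # clip-level visual features (stand-in)

def encode_query(num_words=12, dim=256):
    return rng.normal(size=(num_words, dim))              # token-level query features (stand-in)

def cross_modal_fusion(video, query):
    attn = video @ query.T                                # clip-to-word affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over query words
    return video + attn @ query                           # query-aware clip features

def predict_moment(fused):
    start_logits = fused @ rng.normal(size=fused.shape[1])
    end_logits = fused @ rng.normal(size=fused.shape[1])
    s = int(np.argmax(start_logits))
    e = s + int(np.argmax(end_logits[s:]))                # enforce end >= start
    return s, e

video, query = encode_video(), encode_query()
print(predict_moment(cross_modal_fusion(video, query)))   # (start_clip, end_clip)
```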

Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions [article]

Fadime Sener, Rishabh Saraf, Angela Yao
2022 arXiv   pre-print
Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language.  ...  To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation.  ...  This step is supervised as it requires video segments of each step that are temporally aligned with the corresponding sentences.  ... 
arXiv:2106.03158v2 fatcat:2ojuj7hayrh5xj5yjkdj7o3dwe

Vision+X: A Survey on Multimodal Learning in the Light of Data [article]

Ye Zhu, Yu Wu, Nicu Sebe, Yan Yan
2022 arXiv   pre-print
The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve  ...  a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system.  ...  Similar to other metrics, a SPICE value closer to 1 implies better quality of the generated textual sentences with respect to ground truth captions and given images.  ... 
arXiv:2210.02884v1 fatcat:g44t3rxvqrf5ti4pfggjlygeda

Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled Videos [article]

Reza Ghoddoosian, Saif Sayed, Vassilis Athitsos
2020 arXiv   pre-print
We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time with a level of granularity  ...  This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training.  ...  Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the National Science Foundation.  ... 
arXiv:2011.10190v1 fatcat:pelqv65k5vabpnfxpmjgthwd2q

Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection

Jie Chen, Zhiheng Li, Jiebo Luo, Chenliang Xu
2020 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors.  ...  In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task.  ...  This work was supported in part by NSF 1741472, 1813709, and 1909912. The article solely reflects the opinions and conclusions of its authors but not the funding agents.  ... 
doi:10.1109/cvpr42600.2020.00992 dblp:conf/cvpr/ChenLLX20 fatcat:64pqhejqwvc7lnytfh34h7rz4y
Showing results 1–15 of 821