A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2020; you can also visit the original URL. The file type is application/pdf.
Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video
[article]
2020
arXiv
pre-print
In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. ...
Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal ...
This makes the task of collecting temporal annotations difficult and non-trivial. In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video (WSTG). ...
arXiv:2001.09308v1
fatcat:z2h6kx2j4rg2xbfq4sopn6dmy4
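Work in this area is conventionally evaluated with temporal Intersection-over-Union between the predicted and ground-truth segments (e.g. Recall@1 at IoU ≥ 0.5). A minimal sketch of that metric, with segments given as hypothetical `(start, end)` pairs in seconds:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 of 15 union seconds, giving an IoU of 1/3 and failing the usual 0.5 threshold.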
Self-supervised Learning for Semi-supervised Temporal Language Grounding
[article]
2021
arXiv
pre-print
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video. ...
Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance ...
In CVPR, pages 1049–1058, 2016. A closer look at temporal sentence grounding in videos: Datasets and metrics. ...
arXiv:2109.11475v2
fatcat:2qmfaum4off4dmxzbvgpgj2hty
A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics
[article]
2021
arXiv
pre-print
Although Temporal Sentence Grounding in Videos (TSGV) has realized impressive progress over the last few years, current TSGV models tend to capture the moment annotation biases and fail to take full advantage ...
In this paper, we first take a closer look at the existing evaluation protocol, and argue that both the prevailing datasets and metrics are the culprits behind the unreliable benchmarking. ...
Conclusion In this paper, we took a closer look at the existing evaluation protocol of the temporal sentence grounding in videos (TSGV) task, and we found that both the prevailing datasets and metrics ...
arXiv:2101.09028v2
fatcat:tlvfxoxr4nfcbetnaqc4adxs5a
What, when, and where? – Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
[article]
2023
arXiv
pre-print
Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. ...
To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos ...
One of the early works studied this task in the context of weakly supervised learning scenarios where we learn grounding with human-annotated captions of the video [52]. ...
arXiv:2303.16990v1
fatcat:becdz5s6bbhdnowfe46hmf2jme
Zero-Shot Video Grounding for Automatic Video Understanding in Sustainable Smart Cities
2022
Sustainability
In this paper, a novel atom-based zero-shot video grounding (AZVG) method is proposed to retrieve the segments in the video that correspond to a given input sentence. ...
Although it is training-free, the performance of AZVG is competitive with weakly supervised methods and better than unsupervised SOTA methods on the Charades-STA dataset. ...
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/su15010153
fatcat:gtgnid7xanbjhef5hjvzaqbhne
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
[article]
2022
arXiv
pre-print
To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitle to achieve weakly-supervised ...
In this paper, we introduce a new task, named Temporal Emotion Localization in videos~(TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos ...
Results We compare our approach to the state-of-the-art WSAL methods, video-subtitle moment retrieval, and weakly-supervised temporal sentence grounding. We summarize the results in Table 1. ...
arXiv:2208.01954v1
fatcat:yesdc65rkzbsbjn7umhuyo5xly
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach
[article]
2022
arXiv
pre-print
Temporal Sentence Grounding in Videos (TSGV), which aims to ground a natural language sentence in an untrimmed video, has drawn widespread attention over the past few years. ...
In order to help the model better align the semantics between sentence queries and video moments, we enhance the representations during feature encoding. ...
CONCLUSION In this paper, we take a closer look at mainstream benchmark datasets for temporal sentence grounding in videos and find that there exists significant annotation bias, resulting in highly untrustworthy ...
arXiv:2203.05243v1
fatcat:lkyv5znigvdedfsffmnsslxq2e
A Survey on Video Moment Localization
2022
ACM Computing Surveys
In this survey paper, we aim to present a comprehensive review of existing video moment localization techniques, including supervised, weakly supervised, and unsupervised ones. ...
Beyond the task of temporal action localization whereby the target actions are pre-defined, video moment retrieval can query arbitrary complex activities. ...
To better model the fine-grained video-text local correspondences, Yang et al. [103] presented a Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. ...
doi:10.1145/3556537
fatcat:3s6cqyebnjfg3pvwvk3db7d6ra
Cross-task weakly supervised learning from instructional videos
[article]
2019
arXiv
pre-print
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. ...
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal ...
This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V. ...
arXiv:1903.08225v2
fatcat:w3s73pstvzccrb7zqvybbg5j7y
Cross-Task Weakly Supervised Learning From Instructional Videos
2019
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. ...
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal ...
This is weakly supervised since we provide only the list of steps, but not their temporal locations in training videos. Problem formulation. We denote the set of narrated instructional videos V. ...
doi:10.1109/cvpr.2019.00365
dblp:conf/cvpr/ZhukovACFLS19
fatcat:dz3tpgbim5bwxhri4hkj57bcfu
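The weak supervision described above provides only the ordered list of steps, not their temporal locations. One standard way to exploit such an ordering constraint (a simplified dynamic-programming alignment, not the paper's exact component model) is to assign each step one frame, with assignments strictly increasing in time, maximizing total step-frame affinity:

```python
def align_steps(scores):
    """scores[k][t]: affinity of ordered step k with frame t.
    Returns one frame index per step, strictly increasing in time,
    maximizing the total affinity via dynamic programming."""
    K, T = len(scores), len(scores[0])
    NEG = float("-inf")
    # best[k][t]: best total using steps 0..k with step k placed at frame t
    best = [[NEG] * T for _ in range(K)]
    back = [[-1] * T for _ in range(K)]
    for t in range(T):
        best[0][t] = scores[0][t]
    for k in range(1, K):
        # running prefix-max over the previous row enforces temporal order
        run, arg = NEG, -1
        for t in range(T):
            if t >= 1 and best[k - 1][t - 1] > run:
                run, arg = best[k - 1][t - 1], t - 1
            if run > NEG:
                best[k][t] = run + scores[k][t]
                back[k][t] = arg
    # backtrack from the best placement of the final step
    t = max(range(T), key=lambda t: best[K - 1][t])
    path = [t]
    for k in range(K - 1, 0, -1):
        t = back[k][t]
        path.append(t)
    return path[::-1]
```

With three steps and affinities `[[5,1,0,0],[0,4,0,3],[0,0,2,9]]`, the DP picks frames `[0, 1, 3]`, respecting the step order even when a greedy per-step choice would not.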
Temporal Sentence Grounding in Videos: A Survey and Future Directions
[article]
2023
arXiv
pre-print
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that semantically corresponds to a language ...
As the background, we present a common structure of functional components in TSGV, in a tutorial style: from feature extraction from raw video and language query, to answer prediction of the target moment ...
Spatio-Temporal Sentence Grounding in Videos Spatio-temporal sentence grounding in videos (STSGV) is another extension of TSGV. ...
arXiv:2201.08071v3
fatcat:ktw4onlakzfedok42mmj6r5zwq
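The survey's common pipeline (feature extraction from video and query, then answer prediction for the target moment) can be sketched in its simplest proposal-based form: score sliding-window candidates against the sentence embedding and return the best window. The feature shapes and cosine scorer here are illustrative assumptions, one of many designs the survey covers:

```python
from math import sqrt

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def ground_query(clip_feats, query_feat, win_sizes=(2, 4, 8)):
    """clip_feats: list of T per-clip feature vectors; query_feat: sentence
    embedding. Returns the best-scoring (start_clip, end_clip) window."""
    T, D = len(clip_feats), len(query_feat)
    best, best_score = None, float("-inf")
    for w in win_sizes:
        for s in range(0, T - w + 1):
            # mean-pool clip features over the candidate window
            seg = [sum(clip_feats[t][d] for t in range(s, s + w)) / w
                   for d in range(D)]
            score = cosine(seg, query_feat)
            if score > best_score:
                best, best_score = (s, s + w), score
    return best
```

Supervised methods learn the features and the scorer jointly from annotated moments; weakly supervised methods must recover this alignment from video-level pairing alone.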
Transferring Knowledge from Text to Video: Zero-Shot Anticipation for Procedural Actions
[article]
2022
arXiv
pre-print
Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. ...
To demonstrate the capabilities of our model, we introduce the Tasty Videos Dataset V2, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. ...
This step is supervised as it requires video segments of each step that are temporally aligned with the corresponding sentences. ...
arXiv:2106.03158v2
fatcat:2ojuj7hayrh5xj5yjkdj7o3dwe
Vision+X: A Survey on Multimodal Learning in the Light of Data
[article]
2022
arXiv
pre-print
The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system. ...
Similar to other metrics, a SPICE value closer to 1 implies better quality of the generated textual sentences with respect to ground truth captions and given images. ...
arXiv:2210.02884v1
fatcat:g44t3rxvqrf5ti4pfggjlygeda
Action Duration Prediction for Segment-Level Alignment of Weakly-Labeled Videos
[article]
2020
arXiv
pre-print
We propose a novel Duration Network, which captures a short temporal window of the video and learns to predict the remaining duration of a given action at any point in time with a level of granularity ...
This paper focuses on weakly-supervised action alignment, where only the ordered sequence of video-level actions is available for training. ...
Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors, and do not necessarily reflect the views of the National Science Foundation. ...
arXiv:2011.10190v1
fatcat:pelqv65k5vabpnfxpmjgthwd2q
Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We address weakly-supervised video actor-action segmentation (VAAS), which extends general video object segmentation (VOS) to additionally consider action labels of the actors. ...
In addition, a 3D-Conv GCAM is devised to adapt to the VAAS task. ...
This work was supported in part by NSF 1741472, 1813709, and 1909912. The article solely reflects the opinions and conclusions of its authors but not the funding agents. ...
doi:10.1109/cvpr42600.2020.00992
dblp:conf/cvpr/ChenLLX20
fatcat:64pqhejqwvc7lnytfh34h7rz4y
Showing results 1 — 15 out of 821 results