7,368 Hits in 6.2 sec

Exploring multi-modality structure for cross domain adaptation in video concept annotation

Shaoxi Xu, Sheng Tang, Yongdong Zhang, Jintao Li, Yan-Tao Zheng
2012 Neurocomputing  
classifiers in the source domains to assist multi-graph optimization (a graph-based semi-supervised learning method) in the target domain for video concept annotation.  ...  Domain adaptive video concept detection and annotation has recently received significant attention, but in existing video adaptation processes, all the features are treated as one modality, while multimodalities  ...  Fig. 1. An example of multi-modality-based domain adaptive video concept detection and annotation in the one-to-one adaptation case.  ...
doi:10.1016/j.neucom.2011.05.041 fatcat:dog64kb5qvafxfxzgs6f2y3weq

Survey: Transformer based Video-Language Pre-training [article]

Ludan Ruan, Qin Jin
2021 arXiv   pre-print
Next, we categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations, and compare their performances.  ...  We then describe the typical paradigm of pre-training & fine-tuning on Video-Language processing in terms of proxy tasks, downstream tasks, and commonly used video datasets.  ...  It follows the single-stream structure, porting the original BERT structure to the multi-modal domain as illustrated in Fig. 2-(1).  ...
arXiv:2109.09920v1 fatcat:ixysz5k4vrbktmf6cqftttls7m

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions [article]

Anil Rahate, Rahee Walambe, Sheela Ramanna, Ketan Kotecha
2021 arXiv   pre-print
Multimodal deep learning systems, which employ multiple modalities like text, image, audio, and video, show better performance than individual-modality (i.e., unimodal) systems.  ...  domain.  ...
arXiv:2107.13782v2 fatcat:s4spofwxjndb7leqbcqnwbifq4

Extracting Semantics from Multimedia Content: Challenges and Solutions [chapter]

Lexing Xie, Rong Yan
2008 Signals and Communication Technology  
correspondence across modalities, learning structured (generative) models to account for natural data dependency or model hidden topics, handling rare classes, leveraging unlabeled data, scaling to large  ...  Multimedia content accounts for over 60% of traffic on the current internet [74].  ...  Cross-modal association and image annotation: Viewing semantic concepts as binary detection on low-level multi-modal features is not the only way to extract multimedia semantics.  ...
doi:10.1007/978-0-387-76569-3_2 fatcat:jul6fw7esfaurct6erjnvpcq6q

Multi-modal Deep Analysis for Multimedia

Wenwu Zhu, Xin Wang, Hongzhi Li
2019 IEEE Transactions on Circuits and Systems for Video Technology
In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia.  ...  These data are heterogeneous and multi-modal in nature, imposing great challenges for processing and analyzing them.  ...
doi:10.1109/tcsvt.2019.2940647 fatcat:l4tchrkgrnaeradvc4nhfan2w4

Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey [article]

Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang, Jiaoyan Chen, Yushan Zhu, Jiaqi Li, Xiaoze Liu (+3 others)
2024 arXiv   pre-print
Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation.  ...  In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal  ...  Third, exploring unique pre-training paradigms suited for (MM)KGs to fully harness the value of structured knowledge in multi-modal pre-training.  ... 
arXiv:2402.05391v3 fatcat:vbrdoocaozfptlwbdf77irxame

Multi-modal Misinformation Detection: Approaches, Challenges and Opportunities [article]

Sara Abdali, Sina Shaham, Bhaskar Krishnamachari
2024 arXiv   pre-print
Hence, many researchers have developed automatic techniques for detecting possible cross-modal discordance in web-based content.  ...  We analyze, categorize, and identify existing approaches, in addition to the challenges and shortcomings they face, in order to unearth new research opportunities in the field of multi-modal misinformation detection  ...  In the next section, we introduce some of the cross-modal clues for detecting them in multi-modal settings.  ...
arXiv:2203.13883v6 fatcat:ppocrlukmnbhrggs6xui4p2vgi

A Survey on Video Moment Localization

Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, Yong Rui
2022 ACM Computing Surveys  
In addition, we discuss promising future directions for this field, in particular large-scale datasets and interpretable video moment localization models.  ...  We also review the datasets available for video moment localization and group the results of related work.  ...  [31] for the dense video captioning task, which contains annotations from the open domain.  ...
doi:10.1145/3556537 fatcat:3s6cqyebnjfg3pvwvk3db7d6ra

Video Unsupervised Domain Adaptation with Deep Learning: A Comprehensive Survey [article]

Yuecong Xu, Haozhi Cao, Zhenghua Chen, Xiaoli Li, Lihua Xie, Jianfei Yang
2022 arXiv   pre-print
To tackle performance degradation and address concerns over high video annotation cost in a unified manner, video unsupervised domain adaptation (VUDA) is introduced to adapt video models from the labeled source  ...  Further, with the high cost of video annotation, it is more practical to use unlabeled videos for training.  ...  We summarize the recent progress in VUDA research while providing insights into future VUDA research from the perspectives of leveraging multi-modal information, investigating reconstruction-based methods  ...
arXiv:2211.10412v2 fatcat:zatg7vnuunboxlxnm3he62yq2q

Leveraging Motion Priors in Videos for Improving Human Segmentation [article]

Yu-Ting Chen, Wen-Yen Chang, Hai-Lun Lu, Tingfan Wu, Min Sun
2018 arXiv   pre-print
In this work, we propose to leverage "motion prior" in videos for improving human segmentation in a weakly-supervised active learning setting.  ...  approach with domain adaptation approaches.  ...  To demonstrate severe domain shift, we evaluate our method mainly on cross-modality (RGB to IR) domain adaptation for human segmentation.  ... 
arXiv:1807.11436v1 fatcat:jdiwvoqas5hkhohburdpw2wegu

Machine Learning Techniques for Sensor-based Human Activity Recognition with Data Heterogeneity – A Review [article]

Xiaozhou Ye, Kouichi Sakurai, Nirmal Nair, Kevin I-Kai Wang
2024 arXiv   pre-print
Addressing data heterogeneity issues can improve performance, reduce computational costs, and aid in developing personalized, adaptive models with less annotated data.  ...  Sensor-based Human Activity Recognition (HAR) is crucial in ubiquitous computing, analysing behaviours through multi-dimensional observations.  ...  [10] adapted the non-local block for heterogeneous data fusion of video and smart-glove.  ... 
arXiv:2403.15422v1 fatcat:f5c5frgvavdc7mwuhfoic3aguy

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [article]

Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu
2022 arXiv   pre-print
cross-modal semantic interaction.  ...  We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue contexts, and knowledge-enhanced  ...
arXiv:2207.00782v1 fatcat:a57laj75xfa43gg4hjvxdh4c4i

Cross-media cloud computing

Zhongzhi Shi, Guang Jiang, Bo Zhang, Jinpeng Yue, Xiaofei Zhao
2012 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems  
In cognitive modeling, a new model named CAM is proposed that is suitable for cross-media semantic understanding. Moreover, a Cross-Media Intelligent Retrieval System (CMIRS) is illustrated.  ...  Furthermore, we propose a framework for cross-media semantic understanding that contains discriminative modeling, generative modeling, and cognitive modeling.  ...  The concepts provide model entities of interest in the domain and are typically organized into a taxonomy tree, where each node represents a concept and each concept is a specialization of its parent.  ...
doi:10.1109/ccis.2012.6664429 dblp:conf/ccis/ShiJZYZ12 fatcat:wefyscguf5bmjj2uzjafdjc2ty

Transformer for Object Re-Identification: A Survey [article]

Mang Ye, Shuoyi Chen, Chenyue Li, Wei-Shi Zheng, David Crandall, Bo Du
2024 arXiv   pre-print
In categorizing existing works into Image/Video-Based Re-ID, Re-ID with limited data/annotations, Cross-Modal Re-ID, and Special Re-ID Scenarios, we thoroughly elucidate the advantages demonstrated by the Transformer in addressing a multitude of challenges across these domains.  ...  Consequently, the Transformer is particularly well-suited for establishing inter-modal associations and facilitating the fusion of multi-modal information in cross-modal Re-ID tasks. 4) Transformer in Special  ...
arXiv:2401.06960v1 fatcat:t5gsdon4grh55kkkzo3tnkdb5e

Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation [article]

Jie Jiang, Zhimin Li, Jiangfeng Xiong, Rongwei Quan, Qinglin Lu, Wei Liu
2022 arXiv   pre-print
To fill this gap, we construct the Tencent 'Ads Video Segmentation' (TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level.  ...  TAVS is organized hierarchically in semantic aspects for comprehensive temporal video segmentation, with three levels of categories for multi-label classification, e.g., 'place' - 'working place' - 'office  ...  To explore a good representative domain for such high-level intelligence, we believe that ads video, whose rich plot developments and multi-modal nature make it just like a 'short movie', is  ...
arXiv:2212.04700v1 fatcat:hllj75yoejh5rk7mdefvaa4yf4
Showing results 1 — 15 out of 7,368 results