162,453 Hits in 4.6 sec

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference [article]

Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang
2021 arXiv   pre-print
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.  ...  Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structures of the statement.  ...  Then, we compute the context gate $\gamma_i^{(n)}$ based on the visual guidance, the subtitle guidance, and the semantic query, which controls the fusion of visual and linguistic information: $\gamma_i^{(n)} = \sigma(W_3[g_v,$  ...  (see the gating sketch below)
arXiv:2107.12270v2 fatcat:jzfz6lwztrfpxpnocx4dx72eoq
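
The gate in this snippet is a sigmoid over a linear projection of the concatenated guidance vectors, used to interpolate visual and linguistic features. Below is a minimal sketch of that pattern in PyTorch; the tensor names (g_v for visual guidance, g_s for subtitle guidance, q for the semantic query), the shared dimension, and the interpolation form are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ContextGateFusion(nn.Module):
    """Sigmoid context gate over concatenated guidance vectors (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        # Plays the role of W_3 in the snippet: projects [g_v; g_s; q] to a gate.
        self.w3 = nn.Linear(3 * dim, dim)

    def forward(self, g_v, g_s, q):
        # gamma in [0, 1] controls how much visual vs. linguistic signal passes.
        gamma = torch.sigmoid(self.w3(torch.cat([g_v, g_s, q], dim=-1)))
        return gamma * g_v + (1.0 - gamma) * g_s

# Usage: fuse = ContextGateFusion(256); out = fuse(g_v, g_s, q)
# with all input tensors of shape (batch, 256).
```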

A Multi-level Alignment Training Scheme for Video-and-Language Grounding [article]

Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
2022 arXiv   pre-print
Global and segment levels of video-language alignment pairs were designed, based on information similarity ranging from high-level context to fine-grained semantics.  ...  For a pair of video and language description, their semantic relation is reflected by the similarity of their encodings.  ...  By contrasting the embedding similarity of modality pairs that are more relevant against the ones that have less semantic connection, the network is able to ground the similar information closer in the shared  ...  (see the contrastive-loss sketch below)
arXiv:2204.10938v2 fatcat:bp2kuuhecbhfhd3qse6rcgh3gm
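
The contrast this snippet describes (relevant pairs pulled together, less related pairs pushed apart) is commonly implemented as a symmetric InfoNCE objective. The sketch below is a generic version of that loss, not the paper's exact multi-level scheme; the temperature value and L2 normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Matched video/text pairs sit on the diagonal; the other batch
    # entries act as negatives, contrasted in both directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```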

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [article]

Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu
2022 arXiv   pre-print
Consequently, Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or  ...  The intelligent dialogue system, aiming to communicate with humans harmoniously in natural language, holds great promise for advancing human-machine interaction in the era of artificial  ...  In addition, VAD requires a cross-modal fusion and reasoning module to understand the semantic interaction between visual context and textual language information.  ... 
arXiv:2207.00782v1 fatcat:a57laj75xfa43gg4hjvxdh4c4i

Like a bilingual baby: The advantage of visually grounding a bilingual language model [article]

Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly
2023 arXiv   pre-print
We find that the visual grounding improves the model's understanding of semantic similarity both within and across languages and improves perplexity.  ...  Our results provide additional evidence of the advantages of visually grounded language models and point to the need for more naturalistic language data from multilingual speakers and multilingual datasets  ...  We find that the use of visual information lowers perplexity and improves correlation with human judgements of semantic similarity both between and within languages.  ... 
arXiv:2210.05487v2 fatcat:gmncu4ezc5bibgdfnvo5h5kkye

If Sentences Could See: Investigating Visual Information for Semantic Textual Similarity

Goran Glavaš, Ivan Vulić, Simone Paolo Ponzetto
2017 International Conference on Computational Semantics  
We investigate the effects of incorporating visual signal from images into unsupervised Semantic Textual Similarity (STS) measures.  ...  We also show that selective inclusion of visual information may further boost performance in the multi-modal setup.  ...  Han et al. (2013) use LSA-based and WordNet-based measures of word similarity to find pairs of semantically aligned words (see the alignment sketch below).  ... 
dblp:conf/iwcs/GlavasVP17 fatcat:fzli3w7uardixnszp4dyjs4nei
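
The word-alignment step this snippet attributes to Han et al. (2013) can be sketched as a greedy pairing over a word-similarity function. The sim callable below is a stand-in (e.g. cosine over word vectors); the original work used LSA- and WordNet-based measures, and the mean-of-alignments score is one common aggregation, not necessarily theirs.

```python
from itertools import product

def greedy_align(words_a, words_b, sim):
    """Greedily pair the most similar unused words of two sentences."""
    pairs = sorted(product(range(len(words_a)), range(len(words_b))),
                   key=lambda ij: sim(words_a[ij[0]], words_b[ij[1]]),
                   reverse=True)
    used_a, used_b, alignments = set(), set(), []
    for i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            alignments.append((words_a[i], words_b[j],
                               sim(words_a[i], words_b[j])))
    return alignments

def sts_score(words_a, words_b, sim):
    # One common aggregation: mean similarity of the aligned word pairs.
    aligned = greedy_align(words_a, words_b, sim)
    return sum(s for _, _, s in aligned) / max(len(aligned), 1)
```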

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [article]

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
2021 arXiv   pre-print
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics of paired natural language.  ...  VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM) (see the codebook sketch below).  ...  This is reasonable, as VD is designed to aggregate similar visual semantics into the same image feature.  ... 
arXiv:2104.03135v2 fatcat:ipergnpirzhblnwg2epmptmasa
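
The visual dictionary (VD) in this snippet acts as a codebook: each dense visual feature is mapped to its nearest entry, so similar semantics share one abstraction, and entries are updated on-the-fly. The sketch below illustrates that idea with a simple moving-average update; the codebook size, dimension, and update rule are assumptions, not SOHO's exact procedure.

```python
import torch

class VisualDictionary:
    """Codebook mapping dense visual features to shared abstractions (sketch)."""
    def __init__(self, num_entries=2048, dim=768, momentum=0.99):
        self.codebook = torch.randn(num_entries, dim)
        self.momentum = momentum

    def lookup(self, features):
        # features: (N, dim). Assign each feature to its nearest codebook entry.
        idx = torch.cdist(features, self.codebook).argmin(dim=1)  # (N,)
        # On-the-fly update: nudge each selected entry toward the mean of
        # the features assigned to it (simple moving-average rule).
        for k in idx.unique():
            mean_feat = features[idx == k].mean(dim=0)
            self.codebook[k] = (self.momentum * self.codebook[k] +
                                (1 - self.momentum) * mean_feat)
        return self.codebook[idx], idx  # quantized features and their indices
```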

On the Conceptualization of a Modeling Language for Semantic Model Annotations [chapter]

Hans-Georg Fill
2011 Lecture Notes in Business Information Processing  
annotation of visual models.  ...  In this paper we describe the theoretical foundations, formal considerations, and technical characteristics that were taken into account for the conceptualization of a modeling language for the semantic  ...  Thereby, a user can edit the annotations in a visual form without having to deal with a formal specification language.  ... 
doi:10.1007/978-3-642-22056-2_14 fatcat:vaq5tg6cijd7rdshnlinpcqare

Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing

Hamid Izadinia, Fereshteh Sadeghi, Santosh K. Divvala, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi
2015 IEEE International Conference on Computer Vision (ICCV)  
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments.  ...  Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases.  ...  Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks.  ... 
doi:10.1109/iccv.2015.10 dblp:conf/iccv/IzadiniaSDHCF15 fatcat:5vnr2nu4fvarvgru2a2pzvtzju

A Literature Review of Image Retrieval based On Semantic Concept

Alaa M. Riad, Hamdy K. Elminir, Sameh Abd-Elghany
2012 International Journal of Computer Applications  
This paper attempts to provide a comprehensive review of and characterize the semantic gap, the key problem of content-based image retrieval, and the current attempts in high-level semantic-based  ...  Finally, based on existing technologies and the demand from real-world applications, a few promising future research directions are suggested.  ...  Ontology Reasoning: Ontological reasoning is the cornerstone of the semantic web, a vision of a future where machines are able to reason about various aspects of available information to produce more comprehensive  ... 
doi:10.5120/5008-7327 fatcat:blhmwfs46zfh7as7di44wwdgkm

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [article]

Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang
2022 arXiv   pre-print
semantics and high-level category information that are crucial for the segmentation task.  ...  Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that  ...  information from natural language supervision.  ... 
arXiv:2207.08455v3 fatcat:ahp4ijh4njeezpxhibrgg346ru

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [article]

Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
2024 arXiv   pre-print
Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.  ...  In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.  ...  Thus, during masked modeling, the model only reasons over visual information to recover masked regions, without direct vision-language interaction.  ... 
arXiv:2403.00249v1 fatcat:rhgtmvjhm5ghnffk62lc66oxw4

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
2021 Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)  
ERNIE-ViL tries to build detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal  ...  Thus, ERNIE-ViL can learn joint representations characterizing the alignments of the detailed semantics across vision and language.  ...  detailed information in the visual scenes at a more fine-grained level.  ... 
doi:10.1609/aaai.v35i4.16431 fatcat:ex46bttn7zhdtpkhegkduipa2i

Natural Language Question Answering with Goal-directed Answer Set Programming

Kinjal Basu, Gopal Gupta
2021 International Conference on Logic Programming  
To do so, a knowledge-driven, generalized semantic representation of English text is of utmost importance for any NLU application.  ...  Ideally, for any realistic (human-like) NLU system, commonsense reasoning must be an integral part of it, and goal-directed answer set programming (ASP) is indispensable for commonsense reasoning.  ...  Answering questions about a given picture, or Visual Question Answering (VQA), can be processed similarly to textual QA.  ... 
dblp:conf/iclp/0002G21 fatcat:lrvytk3lyff6rekw7utm6vfy4i

Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing [article]

Hamid Izadinia, Fereshteh Sadeghi, Santosh Kumar Divvala, Yejin Choi, Ali Farhadi
2015 arXiv   pre-print
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments.  ...  Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases.  ...  Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks.  ... 
arXiv:1509.08075v1 fatcat:shim4l4bzvgw3luduyujaud564

Computer Vision and Natural Language Processing

Peratham Wiriyathammabhum, Douglas Summers-Stay, Cornelia Fermüller, Yiannis Aloimonos
2016 ACM Computing Surveys  
We also emphasize strategies to integrate computer vision and natural language processing models under the unified theme of distributional semantics.  ...  We draw an analogy between distributional semantics in computer vision and natural language processing via image embeddings and word embeddings, respectively.  ...  Joseph Tighe for a useful informal discussion.  ... 
doi:10.1145/3009906 fatcat:bdgaeoz4w5djhd5spab4lrc4au
Showing results 1 — 15 out of 162,453 results