162,453 Hits in 4.6 sec

Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference [article]

Juncheng Li, Siliang Tang, Linchao Zhu, Haochen Shi, Xuanwen Huang, Fei Wu, Yi Yang, Yueting Zhuang
2021 arXiv   pre-print
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding.  ...  Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structures of the statement.  ...  Then, we compute the context gate $\gamma_i^{(n)}$ based on the visual guidance, the subtitle guidance, and the semantic query, which controls the fusion of visual and linguistic information: $\gamma_i^{(n)} = \sigma(W_3[g_v,$  ...  (see the gating sketch below)
arXiv:2107.12270v2 fatcat:jzfz6lwztrfpxpnocx4dx72eoq
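
The gate in this snippet is a sigmoid over a linear projection of the concatenated guidance vectors, used to interpolate visual and linguistic features. Below is a minimal sketch of that pattern in PyTorch; the tensor names (g_v for visual guidance, g_s for subtitle guidance, q for the semantic query), the shared dimension, and the interpolation form are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ContextGateFusion(nn.Module):
    """Sigmoid context gate over concatenated guidance vectors (sketch)."""
    def __init__(self, dim: int):
        super().__init__()
        # Plays the role of W_3 in the snippet: projects [g_v; g_s; q] to a gate.
        self.w3 = nn.Linear(3 * dim, dim)

    def forward(self, g_v, g_s, q):
        # gamma in [0, 1] controls how much visual vs. linguistic signal passes.
        gamma = torch.sigmoid(self.w3(torch.cat([g_v, g_s, q], dim=-1)))
        return gamma * g_v + (1.0 - gamma) * g_s

# Usage: fuse = ContextGateFusion(256); out = fuse(g_v, g_s, q)
# with all input tensors of shape (batch, 256).
```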

A Multi-level Alignment Training Scheme for Video-and-Language Grounding [article]

Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai
2022 arXiv   pre-print
Global and segment levels of video-language alignment pairs were designed, based on information similarity ranging from high-level context to fine-grained semantics.  ...  For a pair of video and language description, their semantic relation is reflected by the similarity of their encodings.  ...  By contrasting the embedding similarity of modality pairs that are more relevant against the ones that have less semantic connection, the network is able to ground the similar information closer in the shared  ...  (see the contrastive-loss sketch below)
arXiv:2204.10938v2 fatcat:bp2kuuhecbhfhd3qse6rcgh3gm
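
The contrast this snippet describes (relevant pairs pulled together, less related pairs pushed apart) is commonly implemented as a symmetric InfoNCE objective. The sketch below is a generic version of that loss, not the paper's exact multi-level scheme; the temperature value and L2 normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)    # (B, D)
    logits = v @ t.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    # Matched video/text pairs sit on the diagonal; the other batch
    # entries act as negatives, contrasted in both directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```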

Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review [article]

Hao Wang, Bin Guo, Yating Zeng, Yasan Ding, Chen Qiu, Ying Zhang, Lina Yao, Zhiwen Yu
2022 arXiv   pre-print
Consequently, Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or  ...  The intelligent dialogue system, aiming to communicate with humans harmoniously in natural language, holds great promise for advancing human-machine interaction in the era of artificial  ...  In addition, VAD requires a cross-modal fusion and reasoning module to understand the semantic interaction between visual context and textual language information.  ... 
arXiv:2207.00782v1 fatcat:a57laj75xfa43gg4hjvxdh4c4i

Like a bilingual baby: The advantage of visually grounding a bilingual language model [article]

Khai-Nguyen Nguyen, Zixin Tang, Ankur Mali, Alex Kelly
2023 arXiv   pre-print
We find that the visual grounding improves the model's understanding of semantic similarity both within and across languages and improves perplexity.  ...  Our results provide additional evidence of the advantages of visually grounded language models and point to the need for more naturalistic language data from multilingual speakers and multilingual datasets  ...  We find that the use of visual information lowers perplexity and improves correlation with human judgements of semantic similarity both between and within languages.  ... 
arXiv:2210.05487v2 fatcat:gmncu4ezc5bibgdfnvo5h5kkye

If Sentences Could See: Investigating Visual Information for Semantic Textual Similarity

Goran Glavaš, Ivan Vulić, Simone Paolo Ponzetto
2017 International Conference on Computational Semantics  
We investigate the effects of incorporating visual signal from images into unsupervised Semantic Textual Similarity (STS) measures.  ...  We also show that selective inclusion of visual information may further boost performance in the multi-modal setup.  ...  Han et al. (2013) use LSA-based and WordNet-based measures of word similarity to find pairs of semantically aligned words (see the alignment sketch below).  ... 
dblp:conf/iwcs/GlavasVP17 fatcat:fzli3w7uardixnszp4dyjs4nei
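
The word-alignment step this snippet attributes to Han et al. (2013) can be sketched as a greedy pairing over a word-similarity function. The sim callable below is a stand-in (e.g. cosine over word vectors); the original work used LSA- and WordNet-based measures, and the mean-of-alignments score is one common aggregation, not necessarily theirs.

```python
from itertools import product

def greedy_align(words_a, words_b, sim):
    """Greedily pair the most similar unused words of two sentences."""
    pairs = sorted(product(range(len(words_a)), range(len(words_b))),
                   key=lambda ij: sim(words_a[ij[0]], words_b[ij[1]]),
                   reverse=True)
    used_a, used_b, alignments = set(), set(), []
    for i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            alignments.append((words_a[i], words_b[j],
                               sim(words_a[i], words_b[j])))
    return alignments

def sts_score(words_a, words_b, sim):
    # One common aggregation: mean similarity of the aligned word pairs.
    aligned = greedy_align(words_a, words_b, sim)
    return sum(s for _, _, s in aligned) / max(len(aligned), 1)
```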

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [article]

Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu
2021 arXiv   pre-print
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics of paired natural language.  ...  VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM) (see the codebook sketch below).  ...  This is reasonable, as VD is designed to aggregate similar visual semantics into the same image feature.  ... 
arXiv:2104.03135v2 fatcat:ipergnpirzhblnwg2epmptmasa
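
The visual dictionary (VD) in this snippet acts as a codebook: each dense visual feature is mapped to its nearest entry, so similar semantics share one abstraction, and entries are updated on-the-fly. The sketch below illustrates that idea with a simple moving-average update; the codebook size, dimension, and update rule are assumptions, not SOHO's exact procedure.

```python
import torch

class VisualDictionary:
    """Codebook mapping dense visual features to shared abstractions (sketch)."""
    def __init__(self, num_entries=2048, dim=768, momentum=0.99):
        self.codebook = torch.randn(num_entries, dim)
        self.momentum = momentum

    def lookup(self, features):
        # features: (N, dim). Assign each feature to its nearest codebook entry.
        idx = torch.cdist(features, self.codebook).argmin(dim=1)  # (N,)
        # On-the-fly update: nudge each selected entry toward the mean of
        # the features assigned to it (simple moving-average rule).
        for k in idx.unique():
            mean_feat = features[idx == k].mean(dim=0)
            self.codebook[k] = (self.momentum * self.codebook[k] +
                                (1 - self.momentum) * mean_feat)
        return self.codebook[idx], idx  # quantized features and their indices
```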

On the Conceptualization of a Modeling Language for Semantic Model Annotations [chapter]

Hans-Georg Fill
2011 Lecture Notes in Business Information Processing  
annotation of visual models.  ...  In this paper we describe the theoretical foundations, formal considerations, and technical characteristics that were taken into account for the conceptualization of a modeling language for the semantic  ...  Thereby, a user can edit the annotations in a visual form without having to deal with a formal specification language.  ... 
doi:10.1007/978-3-642-22056-2_14 fatcat:vaq5tg6cijd7rdshnlinpcqare

Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing

Hamid Izadinia, Fereshteh Sadeghi, Santosh K. Divvala, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi
2015 IEEE International Conference on Computer Vision (ICCV)  
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments.  ...  Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases.  ...  Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks.  ... 
doi:10.1109/iccv.2015.10 dblp:conf/iccv/IzadiniaSDHCF15 fatcat:5vnr2nu4fvarvgru2a2pzvtzju

A Literature Review of Image Retrieval based On Semantic Concept

Alaa M. Riad, Hamdy K. Elminir, Sameh Abd-Elghany
2012 International Journal of Computer Applications  
This paper attempts to provide a comprehensive review of and characterize the semantic gap, the key problem of content-based image retrieval, and the current attempts in high-level semantic-based  ...  Finally, based on existing technologies and the demand from real-world applications, a few promising future research directions are suggested.  ...  Ontology Reasoning: Ontological reasoning is the cornerstone of the semantic web, a vision of a future where machines are able to reason about various aspects of available information to produce more comprehensive  ... 
doi:10.5120/5008-7327 fatcat:blhmwfs46zfh7as7di44wwdgkm

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [article]

Quande Liu, Youpeng Wen, Jianhua Han, Chunjing Xu, Hang Xu, Xiaodan Liang
2022 arXiv   pre-print
semantics and high-level category information that are crucial for the segmentation task.  ...  Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that  ...  information from natural language supervision.  ... 
arXiv:2207.08455v3 fatcat:ahp4ijh4njeezpxhibrgg346ru

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training [article]

Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
2024 arXiv   pre-print
Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks.  ...  In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning.  ...  Thus, during masked modeling, the model only reasons over visual information to recover masked regions, without direct vision-language interaction.  ... 
arXiv:2403.00249v1 fatcat:rhgtmvjhm5ghnffk62lc66oxw4

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
2021 Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)  
ERNIE-ViL tries to build detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal  ...  Thus, ERNIE-ViL can learn joint representations characterizing the alignments of the detailed semantics across vision and language.  ...  detailed information in the visual scenes at a more fine-grained level.  ... 
doi:10.1609/aaai.v35i4.16431 fatcat:ex46bttn7zhdtpkhegkduipa2i

Natural Language Question Answering with Goal-directed Answer Set Programming

Kinjal Basu, Gopal Gupta
2021 International Conference on Logic Programming  
To do so, a knowledge-driven, generalized semantic representation of English text is of utmost importance for any NLU application.  ...  Ideally, for any realistic (human-like) NLU system, commonsense reasoning must be an integral part of it, and goal-directed answer set programming (ASP) is indispensable for commonsense reasoning.  ...  Answering questions about a given picture, or Visual Question Answering (VQA), can be processed similarly to textual QA.  ... 
dblp:conf/iclp/0002G21 fatcat:lrvytk3lyff6rekw7utm6vfy4i

Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing [article]

Hamid Izadinia, Fereshteh Sadeghi, Santosh Kumar Divvala, Yejin Choi, Ali Farhadi
2015 arXiv   pre-print
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments.  ...  Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases.  ...  Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks.  ... 
arXiv:1509.08075v1 fatcat:shim4l4bzvgw3luduyujaud564

Computer Vision and Natural Language Processing

Peratham Wiriyathammabhum, Douglas Summers-Stay, Cornelia Fermüller, Yiannis Aloimonos
2016 ACM Computing Surveys  
We also emphasize strategies to integrate computer vision and natural language processing models under the unified theme of distributional semantics.  ...  We draw an analogy between distributional semantics in computer vision and natural language processing via image embeddings and word embeddings, respectively.  ...  Joseph Tighe for a useful informal discussion.  ... 
doi:10.1145/3009906 fatcat:bdgaeoz4w5djhd5spab4lrc4au
Showing results 1 — 15 out of 162,453 results