Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference
[article]
2021
arXiv
pre-print
Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. ...
Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structures of the statement. ...
Then, we compute the context gate γ_i^(n) based on the visual guidance, subtitle guidance, and the semantic query, which controls the fusion of visual and linguistic information: γ_i^(n) = σ(W_3[g_v, ...
arXiv:2107.12270v2
fatcat:jzfz6lwztrfpxpnocx4dx72eoq
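The context-gate formula in the snippet above can be sketched as a simple gated fusion of visual and subtitle guidance. This is a minimal illustration, not the paper's implementation: the vector dimensions, the use of a single weight matrix W_3 over the concatenated inputs, and the convex blend of g_v and g_s are all assumptions for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(g_v, g_s, q, W3):
    """Compute a context gate gamma = sigma(W3 [g_v; g_s; q]) and use it
    to blend visual guidance g_v with subtitle guidance g_s."""
    z = np.concatenate([g_v, g_s, q])   # joint guidance/query vector
    gamma = sigmoid(W3 @ z)             # per-dimension gate in (0, 1)
    return gamma * g_v + (1.0 - gamma) * g_s

d = 4
rng = np.random.default_rng(0)
g_v, g_s, q = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
W3 = rng.normal(size=(d, 3 * d))
fused = gated_fusion(g_v, g_s, q, W3)
```

Because the gate is a convex combination per dimension, each fused component lies between the corresponding visual and subtitle components.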
A Multi-level Alignment Training Scheme for Video-and-Language Grounding
[article]
2022
arXiv
pre-print
Global and segment levels of video-language alignment pairs were designed, based on the information similarity ranging from high-level context to fine-grained semantics. ...
For a pair of video and language description, their semantic relation is reflected by their encodings' similarity. ...
By contrasting the embedding similarity of modality pairs that are more relevant against that of pairs with less semantic connection, the network is able to ground similar information closer together in the shared ...
arXiv:2204.10938v2
fatcat:bp2kuuhecbhfhd3qse6rcgh3gm
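The contrastive idea described in this abstract, pulling relevant video-language pairs together while pushing less related pairs apart in a shared embedding space, can be sketched with a symmetric InfoNCE-style loss. The function name, temperature value, and batch construction below are illustrative assumptions, not the paper's training scheme.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss: matched (video_i, text_i) pairs are
    pulled together, mismatched pairs pushed apart in the shared space."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature      # scaled cosine similarities
    labels = np.arange(len(v))

    def xent(l):
        # cross-entropy with the matched pair as the positive class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))
```

As a sanity check, a batch whose video and text embeddings match index-for-index should score a lower loss than the same batch with the text side shuffled.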
Enabling Harmonious Human-Machine Interaction with Visual-Context Augmented Dialogue System: A Review
[article]
2022
arXiv
pre-print
Consequently, Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or ...
The intelligent dialogue system, which aims to communicate with humans harmoniously in natural language, holds great promise for advancing human-machine interaction in the era of artificial ...
In addition, VAD requires a cross-modal fusion and reasoning module to understand the semantic interaction between visual context and textual language information. ...
arXiv:2207.00782v1
fatcat:a57laj75xfa43gg4hjvxdh4c4i
Like a bilingual baby: The advantage of visually grounding a bilingual language model
[article]
2023
arXiv
pre-print
We find that the visual grounding improves the model's understanding of semantic similarity both within and across languages and improves perplexity. ...
Our results provide additional evidence of the advantages of visually grounded language models and point to the need for more naturalistic language data from multilingual speakers and multilingual datasets ...
We find that the use of visual information lowers perplexity and improves correlation to human judgements of semantic similarity both between and within languages. ...
arXiv:2210.05487v2
fatcat:gmncu4ezc5bibgdfnvo5h5kkye
If Sentences Could See: Investigating Visual Information for Semantic Textual Similarity
2017
International Conference on Computational Semantics
We investigate the effects of incorporating visual signal from images into unsupervised Semantic Textual Similarity (STS) measures. ...
We also show that selective inclusion of visual information may further boost performance in the multi-modal setup. ...
Han et al. (2013) use LSA-based and WordNet-based measures of word similarity to find the pairs of semantically aligned words. ...
dblp:conf/iwcs/GlavasVP17
fatcat:fzli3w7uardixnszp4dyjs4nei
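The Han et al. (2013) approach mentioned in the snippet finds pairs of semantically aligned words via word-similarity measures. A toy version of similarity-based word alignment can be sketched with greedy one-to-one matching over a cosine-similarity matrix; the greedy strategy and the use of generic word embeddings here are illustrative assumptions rather than the cited LSA/WordNet measures.

```python
import numpy as np

def align_words(emb_a, emb_b):
    """Greedily align words of two sentences one-to-one, picking the
    highest-similarity pair remaining at each step."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                       # cosine similarity matrix
    pairs = []
    while len(pairs) < min(len(a), len(b)):
        i, j = np.unravel_index(sim.argmax(), sim.shape)
        pairs.append((int(i), int(j), float(sim[i, j])))
        sim[i, :] = -np.inf             # each word aligns at most once
        sim[:, j] = -np.inf
    return pairs
```

With orthonormal toy embeddings, each word aligns to the position holding its identical counterpart in the other sentence.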
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
[article]
2021
arXiv
pre-print
As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics from paired natural languages. ...
VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task Masked Visual Modeling (MVM). ...
This is reasonable as VD is designed to aggregate similar visual semantics into the same image feature. ...
arXiv:2104.03135v2
fatcat:ipergnpirzhblnwg2epmptmasa
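The visual dictionary (VD) described above aggregates similar visual semantics into shared entries that are updated on the fly. A minimal sketch of that idea: quantize each image feature to its nearest dictionary entry and refresh entries with an exponential moving average. The dictionary size, momentum value, and update rule are assumptions for illustration, not the paper's exact MVM procedure.

```python
import numpy as np

class VisualDictionary:
    """Toy visual dictionary: map features to their nearest entry and
    update entries on-the-fly via an exponential moving average."""
    def __init__(self, num_entries, dim, momentum=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.entries = rng.normal(size=(num_entries, dim))
        self.momentum = momentum

    def quantize(self, feats):
        # index of the closest dictionary entry for each feature
        d2 = ((feats[:, None, :] - self.entries[None]) ** 2).sum(-1)
        return d2.argmin(axis=1)

    def update(self, feats):
        # moving-average refresh of every entry that received features
        idx = self.quantize(feats)
        for k in np.unique(idx):
            mean_f = feats[idx == k].mean(axis=0)
            self.entries[k] = (self.momentum * self.entries[k]
                               + (1 - self.momentum) * mean_f)
        return idx
```

Features identical to a given entry quantize to that entry and leave it unchanged, which matches the intent of aggregating similar visual semantics into the same slot.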
On the Conceptualization of a Modeling Language for Semantic Model Annotations
[chapter]
2011
Lecture Notes in Business Information Processing
annotation of visual models. ...
In this paper we describe the theoretical foundations, formal considerations, and technical characteristics that were taken into account for the conceptualization of a modeling language for the semantic ...
Thereby, a user can edit the annotations in a visual form without having to deal with a formal specification language. ...
doi:10.1007/978-3-642-22056-2_14
fatcat:vaq5tg6cijd7rdshnlinpcqare
Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing
2015
2015 IEEE International Conference on Computer Vision (ICCV)
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments. ...
Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases. ...
Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks. ...
doi:10.1109/iccv.2015.10
dblp:conf/iccv/IzadiniaSDHCF15
fatcat:5vnr2nu4fvarvgru2a2pzvtzju
A Literature Review of Image Retrieval based On Semantic Concept
2012
International Journal of Computer Applications
This paper attempts to provide a comprehensive review and characterize the problem of the semantic gap that is the key problem of content-based image retrieval and the current attempts in high-level semantic-based ...
Finally, based on existing technologies and the demand from real-world applications, a few promising future research directions are suggested. ...
Ontology Reasoning: Ontological reasoning is the cornerstone of the semantic web, a vision of a future where machines are able to reason about various aspects of available information to produce more comprehensive ...
doi:10.5120/5008-7327
fatcat:blhmwfs46zfh7as7di44wwdgkm
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
[article]
2022
arXiv
pre-print
semantics and high-level category information that are crucial for the segmentation task. ...
Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that ...
information from natural language supervision. ...
arXiv:2207.08455v3
fatcat:ahp4ijh4njeezpxhibrgg346ru
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
[article]
2024
arXiv
pre-print
Compared to previous VLP models with similar model size and data scale, our SemMIM model achieves state-of-the-art or competitive performance on multiple downstream vision-language tasks. ...
In this work, we propose a semantics-enhanced cross-modal MIM framework (SemMIM) for vision-language representation learning. ...
Thus, during masked modeling, the model reasons over visual information alone to recover masked regions, without direct vision-language interaction. ...
arXiv:2403.00249v1
fatcat:rhgtmvjhm5ghnffk62lc66oxw4
ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs
2021
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence and the Twenty-Eighth Innovative Applications of Artificial Intelligence Conference
ERNIE-ViL tries to build the detailed semantic connections (objects, attributes of objects and relationships between objects) across vision and language, which are essential to vision-language cross-modal ...
Thus, ERNIE-ViL can learn the joint representations characterizing the alignments of the detailed semantics across vision and language. ...
detailed information in the visual scenes at a more fine-grained level. ...
doi:10.1609/aaai.v35i4.16431
fatcat:ex46bttn7zhdtpkhegkduipa2i
Natural Language Question Answering with Goal-directed Answer Set Programming
2021
International Conference on Logic Programming
To do so, a knowledge-driven, generalized semantic representation of English text is of utmost importance for any NLU application. ...
Ideally, for any realistic (human-like) NLU system, commonsense reasoning must be an integral part, and goal-directed answer set programming (ASP) is indispensable for such reasoning. ...
Answering questions about a given picture, or Visual Question Answering (VQA), can be processed similarly to textual QA. ...
dblp:conf/iclp/0002G21
fatcat:lrvytk3lyff6rekw7utm6vfy4i
Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing
[article]
2015
arXiv
pre-print
First, we show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments. ...
Next, we show that the association of high-quality segmentations to textual phrases aids in richer semantic understanding and reasoning of these textual phrases. ...
Relative Visual Similarity Baselines. We use language-based and detection baselines similar to the two previous tasks. ...
arXiv:1509.08075v1
fatcat:shim4l4bzvgw3luduyujaud564
Computer Vision and Natural Language Processing
2016
ACM Computing Surveys
We also emphasize strategies to integrate computer vision and natural language processing models as a unified theme of distributional semantics. ...
We make an analog of distributional semantics in computer vision and natural language processing as image embedding and word embedding, respectively. ...
Joseph Tighe for a useful informal discussion. ...
doi:10.1145/3009906
fatcat:bdgaeoz4w5djhd5spab4lrc4au
Showing results 1 — 15 out of 162,453 results