A copy of this work was available on the public web and has been preserved in the Wayback Machine. The capture dates from 2023; you can also visit the original URL.
The file type is application/pdf.
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
[article]
2023
arXiv
pre-print
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid ...
Current datasets used for VL pre-training only contain a limited amount of visual and linguistic knowledge, thus significantly limiting the generalization capabilities of many VL models. ...
Acknowledgments The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537). ...
arXiv:2303.02411v1
fatcat:5pg342btxzfqjg3khj2ycifg2e
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
[article]
2021
arXiv
pre-print
Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). ...
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods. ...
We trained a powerful alignment model, DXR, for performing retrieval over external knowledge sources. ...
arXiv:2104.08108v1
fatcat:rqbuv7jzejdedaq2oey7wk7nb4
Describing Natural Images Containing Novel Objects with Knowledge Guided Assistance
[article]
2017
arXiv
pre-print
Evaluations show that our models outperform most of the prior work for out-of-domain captioning on MSCOCO and are useful for integration of knowledge and vision in general. ...
semantic attention and constrained inference in the caption generation model for describing images that depict unseen/novel objects. ...
DESCRIBING IMAGES WITH NOVEL OBJECTS USING KNOWLEDGE GUIDED ASSISTANCE (KGA) In this section, we present our caption generation model for generating captions for unseen/novel image objects with support ...
arXiv:1710.06303v1
fatcat:mu6zbevjbvd2jfl6sjd6yqmisy
A survey on knowledge-enhanced multimodal learning
[article]
2024
arXiv
pre-print
At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. ...
VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other. ...
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537). ...
arXiv:2211.12328v3
fatcat:7qzsr6yrwfeeppsijcqgfblzzq
A Simple Baseline for Knowledge-Based Visual Question Answering
[article]
2023
arXiv
pre-print
Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA ...
Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively ...
information (bounding box coordinates), retrieved external and implicit knowledge (using GPT-3) into a transformer-based question-answering model. ...
arXiv:2310.13570v2
fatcat:bn77vebkjnc2pgricketkd5nbu
VLSP 2021 - VieCap4H Challenge: Automatic Image Caption Generation for Healthcare Domain in Vietnamese
2022
VNU Journal of Science Computer Science and Communication Engineering
This paper presents VieCap4H, a grand data challenge on automatic image caption generation for the healthcare domain in Vietnamese. ...
The task is formulated as an image captioning task. ...
the provided dataset and contribute their knowledge to advance the field, enabling potential applications of the task in both healthcare and general settings (e.g. virtual assistants for blind ...
doi:10.25073/2588-1086/vnucsce.341
fatcat:alfadfcahrda3adldzjah5yii4
Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
[article]
2021
arXiv
pre-print
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation. ...
In this paper, we first explore the use of constituency parse trees using a Transformer-based recurrent architecture for encoding structured input. ...
Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback. ...
arXiv:2110.10834v1
fatcat:odfsk6jmqbha3hbcjqphjoozqm
Textually Enriched Neural Module Networks for Visual Question Answering
[article]
2018
arXiv
pre-print
In this paper, we enrich the image information with textual data using image captions and external knowledge bases to generate more coherent answers. ...
We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset. ...
Tadas Baltrusaitis, Amir Zadeh and Chaitanya Ahuja for providing us the guidance, constant timely feedback and resources required for this project. ...
arXiv:1809.08697v1
fatcat:4hyda5agtfcydhvlpdnvho4khq
News image-text matching with news knowledge graph
2021
IEEE Access
Our approach improves the results of cross-modal retrieval on four datasets compared to five baselines. 7: Transformer + RoBERTa [45], an end-to-end model for news image captioning with a novel combination ...
For example, in [6] - [8] , these methods generate the news image caption with named entities. These methods first generate a template caption with placeholders for named entities. ...
doi:10.1109/access.2021.3093650
fatcat:vytqrwtombfy7mizmehpcpovaa
Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis
[article]
2023
arXiv
pre-print
the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis. ...
To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. ...
Moreover, the authors thank Aditya Chattopadhyay, Tianjiao Ding, Ryan Pilgrim, Tianyuan Zhang, and Bowen Li for their insightful feedback that improves this work. ...
arXiv:2311.17898v2
fatcat:3me7yqv7vjhd3mnxhrzzdp2qlq
Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement
[article]
2022
arXiv
pre-print
Moreover, they neglected the rich information contained in external knowledge, e.g., image captions. ...
In addition, we exploit the effect of various knowledge resources for sarcasm detection. ...
To address this limitation, we propose to generate image captions as the external knowledge to assist sarcasm detection. We further compare the effect of each knowledge form in the experiments. ...
arXiv:2210.03501v2
fatcat:g6uw6jwpbnhb7bahe6qpavlmny
Graph Neural Networks in Vision-Language Image Understanding: A Survey
[article]
2023
arXiv
pre-print
To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of ...
Solutions to this problem form the underpinning of a range of tasks, including image captioning, Visual Question Answering (VQA), and image retrieval. ...
He regularly reviews for major computer vision conferences (CVPR, ICCV, ECCV, and NeurIPS) and related journals (TPAMI, IJCV and TIP). ...
arXiv:2303.03761v1
fatcat:plipklmukjfapn57w4wo6fcogi
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
[article]
2024
arXiv
pre-print
However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. ...
Unlike tasks like image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable ...
To promote more diverse stories, newer works have also used knowledge graphs to assist the storytelling process, allowing for richer stories capable of expressing imaginative concepts that are not explicitly ...
arXiv:2402.00319v1
fatcat:mekja4vs6jhmjmqegcn4esntaq
Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources
[article]
2022
arXiv
pre-print
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking, and significantly outperforms previous baselines that did not leverage external evidence. ...
vs. visual evidence, and the image vs. caption. ...
We also thank Rebecca Weil for helpful advice and feedback. ...
arXiv:2112.00061v3
fatcat:7w5ndinlbjht7b5e7elzyozycy
Memory-Augmented Image Captioning
2021
AAAI Conference on Artificial Intelligence
Current deep learning-based image captioning systems have been proven to store practical knowledge with their parameters and achieve competitive performances in the public datasets. ...
Towards this goal, we introduce a memory-augmented method, which extends an existing image caption model by incorporating extra explicit knowledge from a memory bank. ...
In practice, for an LSTM-based captioning architecture, f_M(C_i) could be derived from attended image features or a context vector; for a Transformer-based captioning architecture, f_M(C_i) could be obtained from ...
dblp:conf/aaai/Fei21a
fatcat:dk56hnhhyvgpdnbcg3thmv44f4
Showing results 1 — 15 out of 10,425 results