10,425 Hits in 4.3 sec

The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges [article]

Maria Lymperaiou, Giorgos Stamou
2023 arXiv   pre-print
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid  ...  Current datasets used for VL pre-training only contain a limited amount of visual and linguistic knowledge, thus significantly limiting the generalization capabilities of many VL models.  ...  Acknowledgments The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).  ... 
arXiv:2303.02411v1 fatcat:5pg342btxzfqjg3khj2ycifg2e

Cross-Modal Retrieval Augmentation for Multi-Modal Classification [article]

Shir Gur, Natalia Neverova, Chris Stauffer, Ser-Nam Lim, Douwe Kiela, Austin Reiter
2021 arXiv   pre-print
Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA).  ...  First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods.  ...  We trained a powerful alignment model, DXR, for performing retrieval over external knowledge sources.  ... 
arXiv:2104.08108v1 fatcat:rqbuv7jzejdedaq2oey7wk7nb4
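The snippet above does not give the architecture of the paper's DXR alignment model; as a rough illustration of training a model that embeds images and captions in the same space, here is a minimal CLIP-style contrastive objective. All names, shapes, and the temperature value are illustrative assumptions, not the paper's design:

```python
# Minimal sketch of a contrastive image-caption alignment objective.
import torch
import torch.nn.functional as F

def alignment_loss(image_emb: torch.Tensor, caption_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss pulling matching image/caption pairs together."""
    image_emb = F.normalize(image_emb, dim=-1)      # (B, D)
    caption_emb = F.normalize(caption_emb, dim=-1)  # (B, D)
    logits = image_emb @ caption_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: image->caption and caption->image retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```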

Describing Natural Images Containing Novel Objects with Knowledge Guided Assistance [article]

Aditya Mogadala, Umanga Bista, Lexing Xie, Achim Rettinger
2017 arXiv   pre-print
Evaluations show that our models outperform most of the prior work for out-of-domain captioning on MSCOCO and are useful for the integration of knowledge and vision in general.  ...  semantic attention and constrained inference in the caption generation model for describing images that depict unseen/novel objects.  ...  Describing images with novel objects using knowledge guided assistance (KGA): in this section, we present our caption generation model for generating captions for unseen/novel image objects with support  ... 
arXiv:1710.06303v1 fatcat:mu6zbevjbvd2jfl6sjd6yqmisy

A survey on knowledge-enhanced multimodal learning [article]

Maria Lymperaiou, Giorgos Stamou
2024 arXiv   pre-print
At the same time, knowledge graphs enhance the explainability, fairness, and validity of decision making, issues of utmost importance for such complex implementations.  ...  VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other.  ...  The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).  ... 
arXiv:2211.12328v3 fatcat:7qzsr6yrwfeeppsijcqgfblzzq

A Simple Baseline for Knowledge-Based Visual Question Answering [article]

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos
2023 arXiv   pre-print
Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA  ...  Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to effectively answer questions that require external knowledge  ...  information (bounding box coordinates), retrieved external and implicit knowledge (using GPT-3) into a transformer-based question-answering model.  ... 
arXiv:2310.13570v2 fatcat:bn77vebkjnc2pgricketkd5nbu
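As a hedged illustration of the pattern this snippet describes (feeding image information, such as a caption and detected objects, together with retrieved knowledge into a language model for question answering), here is a toy prompt builder. Every field name and the prompt format are hypothetical; the paper's actual input construction may differ:

```python
# Hypothetical sketch of the "retrieve then prompt" pattern: visual context
# plus retrieved knowledge assembled into a question-answering prompt.
def build_kbvqa_prompt(caption: str, objects: list[str],
                       retrieved_facts: list[str], question: str) -> str:
    context = [f"Image caption: {caption}",
               "Detected objects: " + ", ".join(objects)]
    context += [f"Fact: {fact}" for fact in retrieved_facts]
    return "\n".join(context) + f"\nQuestion: {question}\nAnswer:"

prompt = build_kbvqa_prompt(
    caption="a man holding a trophy on a tennis court",
    objects=["man", "trophy", "tennis racket"],
    retrieved_facts=["Wimbledon is a tennis tournament held in London."],
    question="What tournament might this be?")
```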

VLSP 2021 - VieCap4H Challenge: Automatic Image Caption Generation for Healthcare Domain in Vietnamese

Thao Minh Le, Long Hoang Dang, Thanh-Son Nguyen, Huyen Nguyen, Xuan-Son Vu
2022 VNU Journal of Science Computer Science and Communication Engineering  
This paper presents VieCap4H, a grand data challenge on automatic image caption generation for the healthcare domain in Vietnamese.  ...  The task is framed as an image captioning task.  ...  the provided dataset and contribute their knowledge to advance the field, enabling potential applications of the task in both healthcare and general settings (e.g. virtual assistants for blind  ... 
doi:10.25073/2588-1086/vnucsce.341 fatcat:alfadfcahrda3adldzjah5yii4

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [article]

Adyasha Maharana, Mohit Bansal
2021 arXiv   pre-print
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.  ...  In this paper, we first explore the use of constituency parse trees with a Transformer-based recurrent architecture for encoding the structured input.  ...  Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback.  ... 
arXiv:2110.10834v1 fatcat:odfsk6jmqbha3hbcjqphjoozqm

Textually Enriched Neural Module Networks for Visual Question Answering [article]

Khyathi Raghavi Chandu, Mary Arpita Pyreddy, Matthieu Felix, Narendra Nath Joshi
2018 arXiv   pre-print
In this paper, we enrich the image information with textual data using image captions and external knowledge bases to generate more coherent answers.  ...  We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset.  ...  Tadas Baltrusaitis, Amir Zadeh and Chaitanya Ahuja for providing guidance, constant and timely feedback, and the resources required for this project.  ... 
arXiv:1809.08697v1 fatcat:4hyda5agtfcydhvlpdnvho4khq

News image-text matching with news knowledge graph

Zhao Yumeng, Yun Jing, Gao Shuo, Liu Limin
2021 IEEE Access  
Our approach improves cross-modal retrieval results on four datasets compared to five baselines. 7: Transformer + RoBERTa [45], an end-to-end model for news image captioning with a novel combination  ...  For example, the methods in [6]–[8] generate news image captions with named entities. These methods first generate a template caption with placeholders for the named entities.  ... 
doi:10.1109/access.2021.3093650 fatcat:vytqrwtombfy7mizmehpcpovaa
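The two-stage scheme mentioned in this snippet (generate a template caption with placeholders, then fill the placeholders with named entities, e.g., drawn from the article or a news knowledge graph) can be illustrated with a toy example; the placeholder tags and entity source here are assumptions, not the cited papers' exact format:

```python
# Toy illustration of template-caption filling with named entities.
def fill_template(template: str, entities: dict[str, str]) -> str:
    caption = template
    for placeholder, entity in entities.items():
        caption = caption.replace(placeholder, entity)
    return caption

print(fill_template("<PERSON> speaks at a rally in <GPE>.",
                    {"<PERSON>": "Angela Merkel", "<GPE>": "Berlin"}))
# -> "Angela Merkel speaks at a rally in Berlin."
```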

Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis [article]

Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal
2023 arXiv   pre-print
the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis.  ...  To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content.  ...  Moreover, the authors thank Aditya Chattopadhyay, Tianjiao Ding, Ryan Pilgrim, Tianyuan Zhang, and Bowen Li for their insightful feedback that improves this work.  ... 
arXiv:2311.17898v2 fatcat:3me7yqv7vjhd3mnxhrzzdp2qlq
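As a sketch of the iterative loop this snippet describes (retrieve external knowledge, refine the prompt, then hand the result to a text-driven generator), assuming a hypothetical retriever stub; this is not the paper's implementation of KPP:

```python
# Hedged sketch of an iterative knowledge-pursuit loop: each round retrieves
# a fact conditioned on the prompt plus facts gathered so far, then the
# refined prompt would go to a text-to-image generator.
def retrieve_fact(query: str) -> str:
    # Stand-in for a retriever over an external knowledge base.
    return f"(fact relevant to: {query[:30]}...)"

def knowledge_pursuit_prompt(prompt: str, rounds: int = 3) -> str:
    facts: list[str] = []
    for _ in range(rounds):
        # Condition retrieval on the current prompt context.
        facts.append(retrieve_fact(prompt + " " + " ".join(facts)))
    return prompt + " Context: " + " ".join(facts)

refined = knowledge_pursuit_prompt("a red panda climbing a snowy pine tree")
# `refined` would then be passed to a text-driven generator.
```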

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement [article]

Hui Liu, Wenya Wang, Haoliang Li
2022 arXiv   pre-print
Moreover, they neglected the rich information contained in external knowledge, e.g., image captions.  ...  In addition, we examine the effect of various knowledge resources for sarcasm detection.  ...  To address this limitation, we propose to generate image captions as external knowledge to assist sarcasm detection. We further compare the effect of each knowledge form in the experiments.  ... 
arXiv:2210.03501v2 fatcat:g6uw6jwpbnhb7bahe6qpavlmny

Graph Neural Networks in Vision-Language Image Understanding: A Survey [article]

Henry Senior, Gregory Slabaugh, Shanxin Yuan, Luca Rossi
2023 arXiv   pre-print
To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of  ...  Solutions to this problem form the underpinning of a range of tasks, including image captioning, Visual Question Answering (VQA), and image retrieval.  ...  He regularly reviews for major computer vision conferences (CVPR, ICCV, ECCV, and NeurIPS) and related journals (TPAMI, IJCV and TIP).  ... 
arXiv:2303.03761v1 fatcat:plipklmukjfapn57w4wo6fcogi

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [article]

Eileen Wang, Soyeon Caren Han, Josiah Poon
2024 arXiv   pre-print
However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories.  ...  Unlike tasks such as image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable  ... 
arXiv:2402.00319v1 fatcat:mekja4vs6jhmjmqegcn4esntaq

Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources [article]

Sahar Abdelnabi, Rakibul Hasan, Mario Fritz
2022 arXiv   pre-print
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking, and significantly outperforms previous baselines that did not leverage external evidence.  ...  vs. visual evidence, and the image vs. caption.  ...  We also thank Rebecca Weil for helpful advice and feedback.  ... 
arXiv:2112.00061v3 fatcat:7w5ndinlbjht7b5e7elzyozycy

Memory-Augmented Image Captioning

Zhengcong Fei
2021 AAAI Conference on Artificial Intelligence  
Current deep learning-based image captioning systems have been shown to store practical knowledge in their parameters and achieve competitive performance on public datasets.  ...  Towards this goal, we introduce a memory-augmented method, which extends an existing image caption model by incorporating extra explicit knowledge from a memory bank.  ...  In practice, for an LSTM-based captioning architecture, f_M(C_i) could result from attended image features or a context vector; for a Transformer-based captioning architecture, f_M(C_i) could be obtained from  ... 
dblp:conf/aaai/Fei21a fatcat:dk56hnhhyvgpdnbcg3thmv44f4
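The snippet sketches a memory lookup keyed on a context representation f_M(C_i); a minimal version of such a memory-bank readout, with the shapes, top-k attention, and residual fusion rule as illustrative assumptions rather than the paper's exact design, could look like:

```python
# Minimal sketch of a memory-bank readout: the context representation
# f_M(C_i) queries stored key/value pairs and the retrieved values are
# blended back in via a residual connection.
import torch

def memory_augment(query: torch.Tensor,      # f_M(C_i), shape (D,)
                   mem_keys: torch.Tensor,   # (N, D) memory bank keys
                   mem_vals: torch.Tensor,   # (N, D) memory bank values
                   k: int = 5) -> torch.Tensor:
    sims = mem_keys @ query                      # (N,) similarity scores
    topk = sims.topk(k)                          # k nearest memory entries
    weights = torch.softmax(topk.values, dim=0)  # attention over neighbors
    retrieved = weights @ mem_vals[topk.indices] # weighted memory readout
    return query + retrieved                     # residual fusion
```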
Showing results 1 — 15 out of 10,425 results