10,425 Hits in 4.3 sec

The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges [article]

Maria Lymperaiou, Giorgos Stamou
2023 arXiv   pre-print
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps by filling in missing knowledge, resulting in the emergence of hybrid  ...  Current datasets used for VL pre-training only contain a limited amount of visual and linguistic knowledge, thus significantly limiting the generalization capabilities of many VL models.  ...  Acknowledgments The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).  ... 
arXiv:2303.02411v1 fatcat:5pg342btxzfqjg3khj2ycifg2e

Cross-Modal Retrieval Augmentation for Multi-Modal Classification [article]

Shir Gur, Natalia Neverova, Chris Stauffer, Ser-Nam Lim, Douwe Kiela, Austin Reiter
2021 arXiv   pre-print
Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA).  ...  First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods.  ...  We trained a powerful alignment model, DXR, for performing retrieval over external knowledge sources.  ... 
arXiv:2104.08108v1 fatcat:rqbuv7jzejdedaq2oey7wk7nb4
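The snippet above does not give the architecture of the paper's DXR alignment model; as a rough illustration of training a model that embeds images and captions in the same space, here is a minimal CLIP-style contrastive objective. All names, shapes, and the temperature value are illustrative assumptions, not the paper's design:

```python
# Minimal sketch of a contrastive image-caption alignment objective.
import torch
import torch.nn.functional as F

def alignment_loss(image_emb: torch.Tensor, caption_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss pulling matching image/caption pairs together."""
    image_emb = F.normalize(image_emb, dim=-1)      # (B, D)
    caption_emb = F.normalize(caption_emb, dim=-1)  # (B, D)
    logits = image_emb @ caption_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: image->caption and caption->image retrieval.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```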

Describing Natural Images Containing Novel Objects with Knowledge Guided Assistance [article]

Aditya Mogadala, Umanga Bista, Lexing Xie, Achim Rettinger
2017 arXiv   pre-print
Evaluations show that our models outperform most of the prior work for out-of-domain captioning on MSCOCO and are useful for the integration of knowledge and vision in general.  ...  semantic attention and constrained inference in the caption generation model for describing images that depict unseen/novel objects.  ...  Describing images with novel objects using knowledge guided assistance (KGA): in this section, we present our caption generation model for generating captions for unseen/novel image objects with support  ... 
arXiv:1710.06303v1 fatcat:mu6zbevjbvd2jfl6sjd6yqmisy

A survey on knowledge-enhanced multimodal learning [article]

Maria Lymperaiou, Giorgos Stamou
2024 arXiv   pre-print
At the same time, knowledge graphs enhance the explainability, fairness, and validity of decision making, issues of utmost importance for such complex implementations.  ...  VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other.  ...  The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI PhD Fellowships (Fellowship Number 5537).  ... 
arXiv:2211.12328v3 fatcat:7qzsr6yrwfeeppsijcqgfblzzq

A Simple Baseline for Knowledge-Based Visual Question Answering [article]

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos
2023 arXiv   pre-print
Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA  ...  Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to effectively answer questions that require external knowledge  ...  information (bounding box coordinates), retrieved external and implicit knowledge (using GPT-3) into a transformer-based question-answering model.  ... 
arXiv:2310.13570v2 fatcat:bn77vebkjnc2pgricketkd5nbu
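As a hedged illustration of the pattern this snippet describes (feeding image information, such as a caption and detected objects, together with retrieved knowledge into a language model for question answering), here is a toy prompt builder. Every field name and the prompt format are hypothetical; the paper's actual input construction may differ:

```python
# Hypothetical sketch of the "retrieve then prompt" pattern: visual context
# plus retrieved knowledge assembled into a question-answering prompt.
def build_kbvqa_prompt(caption: str, objects: list[str],
                       retrieved_facts: list[str], question: str) -> str:
    context = [f"Image caption: {caption}",
               "Detected objects: " + ", ".join(objects)]
    context += [f"Fact: {fact}" for fact in retrieved_facts]
    return "\n".join(context) + f"\nQuestion: {question}\nAnswer:"

prompt = build_kbvqa_prompt(
    caption="a man holding a trophy on a tennis court",
    objects=["man", "trophy", "tennis racket"],
    retrieved_facts=["Wimbledon is a tennis tournament held in London."],
    question="What tournament might this be?")
```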

VLSP 2021 - VieCap4H Challenge: Automatic Image Caption Generation for Healthcare Domain in Vietnamese

Thao Minh Le, Long Hoang Dang, Thanh-Son Nguyen, Huyen Nguyen, Xuan-Son Vu
2022 VNU Journal of Science Computer Science and Communication Engineering  
This paper presents VieCap4H, a grand data challenge on automatic image caption generation for the healthcare domain in Vietnamese.  ...  The task is framed as an image captioning task.  ...  the provided dataset and contribute their knowledge to advance the field, enabling potential applications of the task in both healthcare and general settings (e.g. virtual assistants for blind  ... 
doi:10.25073/2588-1086/vnucsce.341 fatcat:alfadfcahrda3adldzjah5yii4

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [article]

Adyasha Maharana, Mohit Bansal
2021 arXiv   pre-print
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.  ...  In this paper, we first explore the use of constituency parse trees with a Transformer-based recurrent architecture for encoding the structured input.  ...  Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback.  ... 
arXiv:2110.10834v1 fatcat:odfsk6jmqbha3hbcjqphjoozqm

Textually Enriched Neural Module Networks for Visual Question Answering [article]

Khyathi Raghavi Chandu, Mary Arpita Pyreddy, Matthieu Felix, Narendra Nath Joshi
2018 arXiv   pre-print
In this paper, we enrich the image information with textual data using image captions and external knowledge bases to generate more coherent answers.  ...  We achieve 57.1% overall accuracy on the test-dev open-ended questions from the visual question answering (VQA 1.0) real image dataset.  ...  Tadas Baltrusaitis, Amir Zadeh and Chaitanya Ahuja for providing guidance, constant and timely feedback, and the resources required for this project.  ... 
arXiv:1809.08697v1 fatcat:4hyda5agtfcydhvlpdnvho4khq

News image-text matching with news knowledge graph

Zhao Yumeng, Yun Jing, Gao Shuo, Liu Limin
2021 IEEE Access  
Our approach improves cross-modal retrieval results on four datasets compared to five baselines. 7: Transformer + RoBERTa [45], an end-to-end model for news image captioning with a novel combination  ...  For example, the methods in [6]–[8] generate news image captions with named entities. These methods first generate a template caption with placeholders for the named entities.  ... 
doi:10.1109/access.2021.3093650 fatcat:vytqrwtombfy7mizmehpcpovaa
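The two-stage scheme mentioned in this snippet (generate a template caption with placeholders, then fill the placeholders with named entities, e.g., drawn from the article or a news knowledge graph) can be illustrated with a toy example; the placeholder tags and entity source here are assumptions, not the cited papers' exact format:

```python
# Toy illustration of template-caption filling with named entities.
def fill_template(template: str, entities: dict[str, str]) -> str:
    caption = template
    for placeholder, entity in entities.items():
        caption = caption.replace(placeholder, entity)
    return caption

print(fill_template("<PERSON> speaks at a rally in <GPE>.",
                    {"<PERSON>": "Angela Merkel", "<GPE>": "Berlin"}))
# -> "Angela Merkel speaks at a rally in Berlin."
```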

Knowledge Pursuit Prompting for Zero-Shot Multimodal Synthesis [article]

Jinqi Luo, Kwan Ho Ryan Chan, Dimitris Dimos, René Vidal
2023 arXiv   pre-print
the acquired knowledge for prompt refinement, and utilizes text-driven generators for visual synthesis.  ...  To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content.  ...  Moreover, the authors thank Aditya Chattopadhyay, Tianjiao Ding, Ryan Pilgrim, Tianyuan Zhang, and Bowen Li for their insightful feedback that improves this work.  ... 
arXiv:2311.17898v2 fatcat:3me7yqv7vjhd3mnxhrzzdp2qlq
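As a sketch of the iterative loop this snippet describes (retrieve external knowledge, refine the prompt, then hand the result to a text-driven generator), assuming a hypothetical retriever stub; this is not the paper's implementation of KPP:

```python
# Hedged sketch of an iterative knowledge-pursuit loop: each round retrieves
# a fact conditioned on the prompt plus facts gathered so far, then the
# refined prompt would go to a text-to-image generator.
def retrieve_fact(query: str) -> str:
    # Stand-in for a retriever over an external knowledge base.
    return f"(fact relevant to: {query[:30]}...)"

def knowledge_pursuit_prompt(prompt: str, rounds: int = 3) -> str:
    facts: list[str] = []
    for _ in range(rounds):
        # Condition retrieval on the current prompt context.
        facts.append(retrieve_fact(prompt + " " + " ".join(facts)))
    return prompt + " Context: " + " ".join(facts)

refined = knowledge_pursuit_prompt("a red panda climbing a snowy pine tree")
# `refined` would then be passed to a text-driven generator.
```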

Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement [article]

Hui Liu, Wenya Wang, Haoliang Li
2022 arXiv   pre-print
Moreover, they neglected the rich information contained in external knowledge, e.g., image captions.  ...  In addition, we examine the effect of various knowledge resources for sarcasm detection.  ...  To address this limitation, we propose to generate image captions as external knowledge to assist sarcasm detection. We further compare the effect of each knowledge form in the experiments.  ... 
arXiv:2210.03501v2 fatcat:g6uw6jwpbnhb7bahe6qpavlmny

Graph Neural Networks in Vision-Language Image Understanding: A Survey [article]

Henry Senior, Gregory Slabaugh, Shanxin Yuan, Luca Rossi
2023 arXiv   pre-print
To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of  ...  Solutions to this problem form the underpinning of a range of tasks, including image captioning, Visual Question Answering (VQA), and image retrieval.  ...  He regularly reviews for major computer vision conferences (CVPR, ICCV, ECCV, and NeurIPS) and related journals (TPAMI, IJCV and TIP).  ... 
arXiv:2303.03761v1 fatcat:plipklmukjfapn57w4wo6fcogi

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling [article]

Eileen Wang, Soyeon Caren Han, Josiah Poon
2024 arXiv   pre-print
However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories.  ...  Unlike tasks such as image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable  ... 
arXiv:2402.00319v1 fatcat:mekja4vs6jhmjmqegcn4esntaq

Open-Domain, Content-based, Multi-modal Fact-checking of Out-of-Context Images via Online Resources [article]

Sahar Abdelnabi, Rakibul Hasan, Mario Fritz
2022 arXiv   pre-print
Our work offers the first step and benchmark for open-domain, content-based, multi-modal fact-checking, and significantly outperforms previous baselines that did not leverage external evidence.  ...  vs. visual evidence, and the image vs. caption.  ...  We also thank Rebecca Weil for helpful advice and feedback.  ... 
arXiv:2112.00061v3 fatcat:7w5ndinlbjht7b5e7elzyozycy

Memory-Augmented Image Captioning

Zhengcong Fei
2021 AAAI Conference on Artificial Intelligence  
Current deep learning-based image captioning systems have been shown to store practical knowledge in their parameters and achieve competitive performance on public datasets.  ...  Towards this goal, we introduce a memory-augmented method, which extends an existing image caption model by incorporating extra explicit knowledge from a memory bank.  ...  In practice, for an LSTM-based captioning architecture, f_M(C_i) could result from attended image features or a context vector; for a Transformer-based captioning architecture, f_M(C_i) could be obtained from  ... 
dblp:conf/aaai/Fei21a fatcat:dk56hnhhyvgpdnbcg3thmv44f4
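The snippet sketches a memory lookup keyed on a context representation f_M(C_i); a minimal version of such a memory-bank readout, with the shapes, top-k attention, and residual fusion rule as illustrative assumptions rather than the paper's exact design, could look like:

```python
# Minimal sketch of a memory-bank readout: the context representation
# f_M(C_i) queries stored key/value pairs and the retrieved values are
# blended back in via a residual connection.
import torch

def memory_augment(query: torch.Tensor,      # f_M(C_i), shape (D,)
                   mem_keys: torch.Tensor,   # (N, D) memory bank keys
                   mem_vals: torch.Tensor,   # (N, D) memory bank values
                   k: int = 5) -> torch.Tensor:
    sims = mem_keys @ query                      # (N,) similarity scores
    topk = sims.topk(k)                          # k nearest memory entries
    weights = torch.softmax(topk.values, dim=0)  # attention over neighbors
    retrieved = weights @ mem_vals[topk.indices] # weighted memory readout
    return query + retrieved                     # residual fusion
```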
Showing results 1 — 15 out of 10,425 results