9,538 Hits in 3.0 sec

Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection

Tuan-Vinh La, Minh-Son Dao, Duy-Dong Le, Kim-Phung Thai, Quoc-Hung Nguyen, Thuy-Kieu Phan-Thi
2022 Algorithms  
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption  ...  In this paper, to address these issues, we aimed to leverage textual semantic understanding from a large corpus and integrate it with different combinations of text-image matching and image captioning  ...  Illustration of boosting with the image-caption matching method.  ... 
doi:10.3390/a15110423 fatcat:xp5ln5ex7zag5outjxlpm2atvu
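The matching signal this approach builds on can be sketched briefly. A minimal sketch in Python, assuming a CLIP backbone from Hugging Face transformers (the paper's exact matchers and boosting setup are not reproduced here); the resulting similarity scores would serve as features for the boosting classifier:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_scores(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Cosine similarity between one image and each candidate caption."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # one score per caption

# An out-of-context pair may have two captions that both match the image
# while contradicting each other; such scores become classifier features.
```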

Boosted Attention: Leveraging Human Attention for Image Captioning [article]

Shi Chen, Qi Zhao
2019 arXiv   pre-print
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest.  ...  In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning.  ...  Boosted Attention Method As mentioned in section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that  ... 
arXiv:1904.00767v1 fatcat:frusslaprbaa7nok4isqccf6li
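The snippet suggests how stimulus-based (saliency) attention could complement the captioner's top-down attention. A minimal sketch of one plausible integration, assuming precomputed saliency scores and flattened spatial features (not the paper's exact formulation):

```python
import torch

def boost_features(features: torch.Tensor, saliency: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Weight flattened spatial features (B, HW, D) by a stimulus-based
    saliency map (B, HW) before top-down caption attention is applied."""
    weights = 1.0 + lam * saliency           # boost salient regions...
    return features * weights.unsqueeze(-1)  # ...without zeroing the rest
```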

Boosted Attention: Leveraging Human Attention for Image Captioning [chapter]

Shi Chen, Qi Zhao
2018 Lecture Notes in Computer Science  
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest.  ...  In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning.  ...  Boosted Attention Method As mentioned in section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that  ... 
doi:10.1007/978-3-030-01252-6_5 fatcat:hgigkthiczcbfk5ovhxvorr4xq

The Motivation of Using English Language in Instagram Captions

Rahayu Atila, Septhia Irnanda
2021 English LAnguage Study and TEaching  
The results show that the reason students of Serambi Mekkah University used English in their Instagram captions is mainly their need to improve their English proficiency, specifically their spelling  ...  The questionnaire measured the responses in terms of English learning and self-image factors.  ...  of the respondents also indicate that the language can boost their self-image on social media.  ... 
doi:10.32672/elaste.v2i2.3693 fatcat:lsydgrcrdrhztlj6oqfqjb77nm

A Simple Baseline for Knowledge-Based Visual Question Answering [article]

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos
2023 arXiv   pre-print
This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA).  ...  Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively  ...  We rank the image captions per image v_i according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example.  ... 
arXiv:2310.13570v2 fatcat:bn77vebkjnc2pgricketkd5nbu
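The ranking step quoted above is concrete enough to sketch: embed the image and candidate captions in a shared space, then keep the top-m captions by cosine similarity. A minimal sketch, assuming embeddings from a joint vision-language encoder such as CLIP:

```python
import numpy as np

def top_m_captions(image_emb: np.ndarray, caption_embs: np.ndarray,
                   captions: list[str], m: int = 5) -> list[str]:
    """Keep the m captions most similar to the image v_i (cosine similarity)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img              # one cosine score per caption
    order = np.argsort(-sims)[:m]  # indices of the top-m scores
    return [captions[i] for i in order]
```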

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
2022 Proceedings of the AAAI Conference on Artificial Intelligence  
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction.  ...  To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.  ...  Based on this, joint reasoning over the retrieved knowledge and the image-question pair is performed  ...  (a) Previous methods adopt a two-step approach, which first retrieves the external knowledge, then reasons  ... 
doi:10.1609/aaai.v36i3.20215 fatcat:3isaftn7kbafdpr5pbaro4qum4

Joint Commonsense and Relation Reasoning for Image and Video Captioning

Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors.  ...  Exploiting relationships between objects for image and video captioning has received increasing attention.  ...  For image captioning, the spatial image regions are first densely sampled  ...  space with knowledge embedding vectors of prior knowledge.  ... 
doi:10.1609/aaai.v34i07.6731 fatcat:ynlfbar46fd4xmcqgl4yido4zu

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [article]

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
2022 arXiv   pre-print
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction.  ...  To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.  ...  We represent images either as captions with VinVL (Zhang et al. 2021), or enhance captions with tags predicted by the public Microsoft Azure tagging API.  ... 
arXiv:2109.05014v2 fatcat:vmvtn64kgzblnh5whqvqgelihm
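PICa's central idea, representing the image to GPT-3 purely as text, amounts to prompt assembly. A sketch under the assumption of a PICa-like template (the exact wording differs, and the VinVL captions and Azure tags are produced upstream):

```python
def build_prompt(caption: str, tags: list[str], question: str,
                 shots: list[dict]) -> str:
    """Assemble a few-shot, PICa-style prompt in which the image appears
    only as a caption plus predicted tags (approximate template)."""
    def block(cap, tgs, q, a=""):
        return (f"Context: {cap} {', '.join(tgs)}\n"
                f"Question: {q}\nAnswer: {a}").rstrip()
    header = "Please answer the question according to the context."
    examples = [block(s["caption"], s["tags"], s["question"], s["answer"])
                for s in shots]
    return "\n===\n".join([header] + examples + [block(caption, tags, question)])
```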

Understanding captions in biomedical publications

William W. Cohen, Richard Wang, Robert F. Murphy
2003 Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03  
We propose a scheme for "understanding" captions in biomedical publications by extracting and classifying "image pointers" (references to the accompanying image).  ...  From the standpoint of the automated extraction of scientific knowledge, an important but little-studied part of scientific publications are the figures and accompanying captions.  ...  Since the main purpose of caption text is to comment on an image, captions are often littered with references to the image, and these "image pointers" are interspersed with grammatical text in a variety  ... 
doi:10.1145/956750.956809 dblp:conf/kdd/CohenWM03 fatcat:vuty64csr5g2lgp6bawc3hcy5i
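The "image pointers" in question are caption substrings such as "(A)" or "(B,C)" that refer to panels of the accompanying figure. A hypothetical first-pass candidate extractor (the paper additionally classifies the extracted pointers, which this regex does not attempt):

```python
import re

# Candidate image pointers: parenthesized panel labels such as (A), (B,C), (a-c)
IMAGE_POINTER = re.compile(r"\(\s*([A-Za-z](?:\s*[,\-]\s*[A-Za-z])*)\s*\)")

def extract_image_pointers(caption: str) -> list[str]:
    """Return candidate panel references found in a figure caption."""
    return IMAGE_POINTER.findall(caption)

print(extract_image_pointers(
    "Localization of the protein: (A) nucleus; (B,C) cytoplasm."))
# -> ['A', 'B,C']
```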

DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps [article]

Dongsheng Xu, Qingbao Huang, Feng Shuang, Yi Cai
2023 arXiv   pre-print
Text-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text.  ...  Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities.  ...  The TextCaps dataset [4] is constructed for text-based image captioning, which contains 28408 images with 5 captions per image.  ... 
arXiv:2302.01540v3 fatcat:zkowzmdtpvcjhhknm6covcfkte

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics [article]

Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei
2021 arXiv   pre-print
This way naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development  ...  across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning  ...  (e.g., COCO, MSVD, and MSR-VTT), we demonstrate that our X-modaler provides state-of-the-art solutions for image/video captioning tasks.  ... 
arXiv:2108.08217v1 fatcat:bqgpmnijcfd2li5xo2xc2gy4mu

An Interpretable Approach to Hateful Meme Detection [article]

Tanvi Deshpande, Nitya Mani
2021 arXiv   pre-print
Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message.  ...  In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art  ...  Text and image embedding For our gradient-boosted decision tree, we use a captioning model [23] to capture the relevant content of an image in text format.  ... 
arXiv:2108.10069v1 fatcat:2fmmd4nhjrbixoj5zpp5l66oii
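The pipeline in the last snippet, captioning the image and feeding the text to a gradient-boosted decision tree, can be sketched with scikit-learn. The toy data and TF-IDF features are placeholders; the paper's captioner and feature set differ:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy examples: meme text joined with an (assumed) auto-generated image caption.
texts = [
    "love everyone [CAP] group of friends hugging",
    "get out of my country [CAP] crowd holding signs",
    "have a great day [CAP] smiling child with balloons",
    "they don't belong here [CAP] people waiting in line",
]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = hateful

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    GradientBoostingClassifier(random_state=0))
clf.fit(texts, labels)
print(clf.predict(["be kind to each other [CAP] two people shaking hands"]))
```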

MuMUR : Multilingual Multimodal Universal Retrieval [article]

Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
2023 arXiv   pre-print
In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.  ...  Multi-modal retrieval has seen tremendous progress with the development of vision-language models.  ...  One could argue that the reason might be that video retrieval datasets are smaller, which makes creating meaningful structural knowledge a challenging task.  ... 
arXiv:2208.11553v7 fatcat:zpvt5pw7wnextptwzzrppar4ge

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [article]

Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec
2023 arXiv   pre-print
However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of  ...  To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations  ...  to facilitate joint reasoning with image and background knowledge.  ... 
arXiv:2205.11501v2 fatcat:kxnygb36abbbbclba4ngxusvoa
Showing results 1 — 15 out of 9,538 results