Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection
2022
Algorithms
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption ...
In this paper, to address these issues, we aim to leverage textual semantic understanding from a large corpus and integrate it with different combinations of text-image matching and image captioning ...
Illustration of the boosting with image-caption matching method. ...
doi:10.3390/a15110423
fatcat:xp5ln5ex7zag5outjxlpm2atvu
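As a rough illustration of the boosting idea in this abstract, here is a minimal sketch that feeds two hypothetical model outputs, a text-image matching score and a generated-caption similarity score, into a gradient-boosted classifier. The feature set, labels, and data are toy placeholders, not the paper's actual pipeline.

```python
# Sketch: gradient boosting over text-image matching and captioning features
# for cheapfakes (out-of-context image-caption) detection.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 200
match_scores = rng.uniform(0, 1, n)   # placeholder: text-image matching score
caption_sims = rng.uniform(0, 1, n)   # placeholder: similarity(generated caption, given caption)
X = np.stack([match_scores, caption_sims], axis=1)
y = (match_scores + caption_sims > 1.0).astype(int)  # toy labels: 1 = genuine pair

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))  # per-pair probability of being genuine
```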
Boosted Attention: Leveraging Human Attention for Image Captioning
[article]
2019
arXiv
pre-print
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest. ...
In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning. ...
Boosted Attention Method: As mentioned in Section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that ...
arXiv:1904.00767v1
fatcat:frusslaprbaa7nok4isqccf6li
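The snippet above describes integrating stimulus-based (human saliency) attention with the caption model's task-specific attention. A minimal sketch of one way such an integration could look, assuming a simple elementwise combination; the paper's actual formulation may differ.

```python
import numpy as np

def boosted_attention(saliency, task_attention, eps=1e-8):
    """Combine a stimulus-based saliency map with a caption model's
    task-specific attention map (both H x W, non-negative) by elementwise
    product, then renormalize to a distribution. Illustrative only."""
    combined = saliency * task_attention
    return combined / (combined.sum() + eps)

saliency = np.random.rand(7, 7)   # e.g. from a human-attention (saliency) model
task_att = np.random.rand(7, 7)   # e.g. from the captioning decoder
att = boosted_attention(saliency, task_att)
print(att.sum())  # ~1.0
```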
Boosted Attention: Leveraging Human Attention for Image Captioning
[chapter]
2018
Lecture Notes in Computer Science
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest. ...
In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning. ...
Boosted Attention Method: As mentioned in Section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that ...
doi:10.1007/978-3-030-01252-6_5
fatcat:hgigkthiczcbfk5ovhxvorr4xq
The Motivation of Using English Language in Instagram Captions
2021
English Language Study and Teaching
The results show that the reason students of Serambi Mekkah University used English in their Instagram captions is mainly their need to improve their English proficiency, specifically the spelling ...
The questionnaire measured the responses in terms of English learning and self-image factors. ...
... of the respondents also indicate that the language can boost their self-image on social media. ...
doi:10.32672/elaste.v2i2.3693
fatcat:lsydgrcrdrhztlj6oqfqjb77nm
A Simple Baseline for Knowledge-Based Visual Question Answering
[article]
2023
arXiv
pre-print
This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA). ...
Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively ...
We rank the image captions per image v_i according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example. ...
arXiv:2310.13570v2
fatcat:bn77vebkjnc2pgricketkd5nbu
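The last snippet describes a concrete ranking step. A minimal sketch, assuming the image v_i and the candidate captions are already embedded in a shared space by some encoder (the random embeddings below stand in for that assumption):

```python
import numpy as np

def top_m_captions(image_emb, caption_embs, captions, m=5):
    """Rank candidate captions by cosine similarity with the image embedding
    and keep the top-m most similar, as described in the snippet above."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img                  # cosine similarity per caption
    order = np.argsort(-sims)[:m]      # indices of the m most similar
    return [captions[i] for i in order]

# Toy usage with random embeddings standing in for a real encoder.
image_emb = np.random.rand(512)
caption_embs = np.random.rand(20, 512)
captions = [f"caption {i}" for i in range(20)]
print(top_m_captions(image_emb, caption_embs, captions, m=3))
```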
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
2022
Proceedings of the AAAI Conference on Artificial Intelligence
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and the question for answer prediction. ...
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. ...
Based on this, joint reasoning over the retrieved knowledge and the image-question pair is performed. (a) Previous methods adopt a two-step approach, which first retrieves the external knowledge, then reasons ...
doi:10.1609/aaai.v36i3.20215
fatcat:3isaftn7kbafdpr5pbaro4qum4
Joint Commonsense and Relation Reasoning for Image and Video Captioning
2020
Proceedings of the AAAI Conference on Artificial Intelligence
In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors. ...
Exploiting relationships between objects for image and video captioning has received increasing attention. ...
For image captioning, the spatial image regions are first densely sampled. ...
doi:10.1609/aaai.v34i07.6731
fatcat:ynlfbar46fd4xmcqgl4yido4zu
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
[article]
2022
arXiv
pre-print
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and the question for answer prediction. ...
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA. ...
We represent images either as captions with VinVL (Zhang et al. 2021), or enhance captions with tags predicted by the public Microsoft Azure tagging API. ...
arXiv:2109.05014v2
fatcat:vmvtn64kgzblnh5whqvqgelihm
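Both PICa entries describe representing the image as a caption (optionally enhanced with tags) and prompting GPT-3 with few-shot question-answer examples. A minimal sketch of such prompt construction follows; the template and field names are assumptions for illustration, not PICa's exact format.

```python
def build_pica_prompt(examples, caption, tags, question):
    """Build a text prompt from in-context VQA examples plus the test image's
    caption (and optional tags), following the idea in the snippets above.
    `examples` is a list of (caption, question, answer) triples."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n"
        for c, q, a in examples
    )
    context = caption if not tags else f"{caption} Tags: {', '.join(tags)}"
    return header + shots + f"Context: {context}\nQuestion: {question}\nAnswer:"

prompt = build_pica_prompt(
    examples=[("A dog on a beach.", "What animal is shown?", "dog")],
    caption="A man riding a wave on a surfboard.",
    tags=["surfing", "ocean"],
    question="What sport is this?",
)
print(prompt)  # send to an LLM completion endpoint to obtain the answer
```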
Understanding captions in biomedical publications
2003
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03
We propose a scheme for "understanding" captions in biomedical publications by extracting and classifying "image pointers" (references to the accompanying image). ...
From the standpoint of the automated extraction of scientific knowledge, an important but little-studied part of scientific publications are the figures and accompanying captions. ...
Since the main purpose of caption text is to comment on an image, captions are often littered with references to the image, and these "image pointers" are interspersed with grammatical text in a variety ...
doi:10.1145/956750.956809
dblp:conf/kdd/CohenWM03
fatcat:vuty64csr5g2lgp6bawc3hcy5i
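Since the snippets describe extracting "image pointers" (references to the accompanying image) from caption text, here is a minimal regex-based sketch; the pattern is illustrative and far simpler than the extraction-and-classification scheme the paper proposes.

```python
import re

# Illustrative pattern for common image-pointer forms such as "(A)",
# "(B, C)", "panel A", or "Fig. 2"; real captions need a richer grammar.
POINTER_RE = re.compile(
    r"\(([A-Za-z](?:\s*,\s*[A-Za-z])*)\)|panel\s+[A-Za-z]|Fig\.?\s*\d+",
    re.IGNORECASE,
)

caption = "(A) Wild-type cells. (B, C) Mutant cells; compare with panel A."
print([m.group(0) for m in POINTER_RE.finditer(caption)])
# ['(A)', '(B, C)', 'panel A']
```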
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps
[article]
2023
arXiv
pre-print
Text-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text. ...
Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities. ...
The TextCaps dataset [4] is constructed for text-based image captioning, and contains 28,408 images with 5 captions per image. ...
arXiv:2302.01540v3
fatcat:zkowzmdtpvcjhhknm6covcfkte
X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics
[article]
2021
arXiv
pre-print
This way naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development ...
... across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning ...
... (e.g., COCO, MSVD, and MSR-VTT), we demonstrate that our X-modaler provides state-of-the-art solutions for the image/video captioning task. ...
arXiv:2108.08217v1
fatcat:bqgpmnijcfd2li5xo2xc2gy4mu
An Interpretable Approach to Hateful Meme Detection
[article]
2021
arXiv
pre-print
Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message. ...
In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art ...
Text and image embedding: For our gradient-boosted decision tree, we use a captioning model [23] to capture the relevant content of an image in text format. ...
arXiv:2108.10069v1
fatcat:2fmmd4nhjrbixoj5zpp5l66oii
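The last snippet describes the gradient-boosted branch: a captioning model turns the image into text, and text-derived features feed a decision-tree ensemble. A minimal sketch, with random placeholder embeddings standing in for real text encoders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder embeddings standing in for real sentence-embedding models:
# `meme_text_emb` embeds the overlaid meme text, `caption_emb` embeds the
# caption a captioning model generated for the image.
rng = np.random.default_rng(1)
n, d = 500, 64
meme_text_emb = rng.normal(size=(n, d))
caption_emb = rng.normal(size=(n, d))
X = np.concatenate([meme_text_emb, caption_emb], axis=1)
y = rng.integers(0, 2, size=n)   # toy labels: 1 = hateful, 0 = benign

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3)
clf.fit(X, y)
print(clf.score(X, y))
```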
MuMUR: Multilingual Multimodal Universal Retrieval
[article]
2023
arXiv
pre-print
In this paper, we propose a framework MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. ...
Multi-modal retrieval has seen tremendous progress with the development of vision-language models. ...
One could argue that the reason might be that video retrieval datasets are smaller, which makes creating meaningful structural knowledge a challenging task. ...
arXiv:2208.11553v7
fatcat:zpvt5pw7wnextptwzzrppar4ge
VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
[article]
2023
arXiv
pre-print
However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of ...
To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations ...
... to facilitate joint reasoning with image and background knowledge. ...
arXiv:2205.11501v2
fatcat:kxnygb36abbbbclba4ngxusvoa
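The abstract describes bidirectional fusion between unstructured (image) and structured (knowledge-graph) representations. A minimal sketch of attention-weighted message passing in both directions, as an illustration of the idea rather than VQA-GNN's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_fuse(img_nodes, kg_nodes, steps=2):
    """Attention-weighted messages flow image->knowledge and knowledge->image
    each step, so both node sets end up in a unified representation space."""
    for _ in range(steps):
        att_ik = softmax(img_nodes @ kg_nodes.T)   # image nodes attend to KG nodes
        att_ki = softmax(kg_nodes @ img_nodes.T)   # KG nodes attend to image nodes
        img_nodes = img_nodes + att_ik @ kg_nodes  # fuse KG info into image nodes
        kg_nodes = kg_nodes + att_ki @ img_nodes   # fuse image info into KG nodes
    return img_nodes, kg_nodes

img = np.random.rand(5, 32)   # e.g. object nodes from a scene graph
kg = np.random.rand(8, 32)    # e.g. retrieved concept nodes
img2, kg2 = bidirectional_fuse(img, kg)
print(img2.shape, kg2.shape)  # (5, 32) (8, 32)
```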