9,538 Hits in 3.0 sec

Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection

Tuan-Vinh La, Minh-Son Dao, Duy-Dong Le, Kim-Phung Thai, Quoc-Hung Nguyen, Thuy-Kieu Phan-Thi
2022 Algorithms  
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption  ...  In this paper, to address these issues, we aimed to leverage textual semantic understanding from a large corpus and integrate it with different combinations of text-image matching and image captioning  ...  Illustration of boosting with the image-caption matching method.  ... 
doi:10.3390/a15110423 fatcat:xp5ln5ex7zag5outjxlpm2atvu
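The matching signal this approach builds on can be sketched briefly. A minimal sketch in Python, assuming a CLIP backbone from Hugging Face transformers (the paper's exact matchers and boosting setup are not reproduced here); the resulting similarity scores would serve as features for the boosting classifier:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def match_scores(image: Image.Image, captions: list[str]) -> torch.Tensor:
    """Cosine similarity between one image and each candidate caption."""
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # one score per caption

# An out-of-context pair may have two captions that both match the image
# while contradicting each other; such scores become classifier features.
```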

Boosted Attention: Leveraging Human Attention for Image Captioning [article]

Shi Chen, Qi Zhao
2019 arXiv   pre-print
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest.  ...  In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning.  ...  Boosted Attention Method As mentioned in section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that  ... 
arXiv:1904.00767v1 fatcat:frusslaprbaa7nok4isqccf6li
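The snippet suggests how stimulus-based (saliency) attention could complement the captioner's top-down attention. A minimal sketch of one plausible integration, assuming precomputed saliency scores and flattened spatial features (not the paper's exact formulation):

```python
import torch

def boost_features(features: torch.Tensor, saliency: torch.Tensor,
                   lam: float = 1.0) -> torch.Tensor:
    """Weight flattened spatial features (B, HW, D) by a stimulus-based
    saliency map (B, HW) before top-down caption attention is applied."""
    weights = 1.0 + lam * saliency           # boost salient regions...
    return features * weights.unsqueeze(-1)  # ...without zeroing the rest
```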

Boosted Attention: Leveraging Human Attention for Image Captioning [chapter]

Shi Chen, Qi Zhao
2018 Lecture Notes in Computer Science  
Visual attention has shown usefulness in image captioning, with the goal of enabling a caption model to selectively focus on regions of interest.  ...  In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) to integrate them for image captioning.  ...  Boosted Attention Method As mentioned in section 3, on the one hand, objects of interest in stimulus-based attention are reasonably consistent with objects of interest in image captioning, suggesting that  ... 
doi:10.1007/978-3-030-01252-6_5 fatcat:hgigkthiczcbfk5ovhxvorr4xq

The Motivation of Using English Language in Instagram Captions

Rahayu Atila, Septhia Irnanda
2021 English LAnguage Study and TEaching  
The results show that the reason students of Serambi Mekkah University used English in their Instagram captions is mainly their need to improve their English proficiency, specifically their spelling  ...  The questionnaire measured the responses in terms of English learning and self-image factors.  ...  of the respondents also indicate that the language can boost their self-image on social media.  ... 
doi:10.32672/elaste.v2i2.3693 fatcat:lsydgrcrdrhztlj6oqfqjb77nm

A Simple Baseline for Knowledge-Based Visual Question Answering [article]

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos
2023 arXiv   pre-print
This paper is on the problem of Knowledge-Based Visual Question Answering (KB-VQA).  ...  Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively  ...  We rank the image captions per image v_i according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example.  ... 
arXiv:2310.13570v2 fatcat:bn77vebkjnc2pgricketkd5nbu
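The ranking step quoted above is concrete enough to sketch: embed the image and candidate captions in a shared space, then keep the top-m captions by cosine similarity. A minimal sketch, assuming embeddings from a joint vision-language encoder such as CLIP:

```python
import numpy as np

def top_m_captions(image_emb: np.ndarray, caption_embs: np.ndarray,
                   captions: list[str], m: int = 5) -> list[str]:
    """Keep the m captions most similar to the image v_i (cosine similarity)."""
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = caps @ img              # one cosine score per caption
    order = np.argsort(-sims)[:m]  # indices of the top-m scores
    return [captions[i] for i in order]
```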

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
2022 Proceedings of the AAAI Conference on Artificial Intelligence  
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction.  ...  To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.  ...  Based on this, joint reasoning over the retrieved knowledge and the image-question pair is performed  ...  (a) Previous methods adopt a two-step approach, which first retrieves the external knowledge, then reasons  ... 
doi:10.1609/aaai.v36i3.20215 fatcat:3isaftn7kbafdpr5pbaro4qum4

Joint Commonsense and Relation Reasoning for Image and Video Captioning

Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo
2020 Proceedings of the AAAI Conference on Artificial Intelligence  
In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors.  ...  Exploiting relationships between objects for image and video captioning has received increasing attention.  ...  For image captioning, the spatial image regions are first densely sampled  ...  space with knowledge embedding vectors of prior knowledge.  ... 
doi:10.1609/aaai.v34i07.6731 fatcat:ynlfbar46fd4xmcqgl4yido4zu

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [article]

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
2022 arXiv   pre-print
Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction.  ...  To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.  ...  We represent images either as captions with VinVL (Zhang et al. 2021), or enhance captions with tags predicted by the public Microsoft Azure tagging API.  ... 
arXiv:2109.05014v2 fatcat:vmvtn64kgzblnh5whqvqgelihm
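PICa's central idea, representing the image to GPT-3 purely as text, amounts to prompt assembly. A sketch under the assumption of a PICa-like template (the exact wording differs, and the VinVL captions and Azure tags are produced upstream):

```python
def build_prompt(caption: str, tags: list[str], question: str,
                 shots: list[dict]) -> str:
    """Assemble a few-shot, PICa-style prompt in which the image appears
    only as a caption plus predicted tags (approximate template)."""
    def block(cap, tgs, q, a=""):
        return (f"Context: {cap} {', '.join(tgs)}\n"
                f"Question: {q}\nAnswer: {a}").rstrip()
    header = "Please answer the question according to the context."
    examples = [block(s["caption"], s["tags"], s["question"], s["answer"])
                for s in shots]
    return "\n===\n".join([header] + examples + [block(caption, tags, question)])
```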

Understanding captions in biomedical publications

William W. Cohen, Richard Wang, Robert F. Murphy
2003 Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '03  
We propose a scheme for "understanding" captions in biomedical publications by extracting and classifying "image pointers" (references to the accompanying image).  ...  From the standpoint of the automated extraction of scientific knowledge, an important but little-studied part of scientific publications are the figures and accompanying captions.  ...  Since the main purpose of caption text is to comment on an image, captions are often littered with references to the image, and these "image pointers" are interspersed with grammatical text in a variety  ... 
doi:10.1145/956750.956809 dblp:conf/kdd/CohenWM03 fatcat:vuty64csr5g2lgp6bawc3hcy5i
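The "image pointers" in question are caption substrings such as "(A)" or "(B,C)" that refer to panels of the accompanying figure. A hypothetical first-pass candidate extractor (the paper additionally classifies the extracted pointers, which this regex does not attempt):

```python
import re

# Candidate image pointers: parenthesized panel labels such as (A), (B,C), (a-c)
IMAGE_POINTER = re.compile(r"\(\s*([A-Za-z](?:\s*[,\-]\s*[A-Za-z])*)\s*\)")

def extract_image_pointers(caption: str) -> list[str]:
    """Return candidate panel references found in a figure caption."""
    return IMAGE_POINTER.findall(caption)

print(extract_image_pointers(
    "Localization of the protein: (A) nucleus; (B,C) cytoplasm."))
# -> ['A', 'B,C']
```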

DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps [article]

Dongsheng Xu, Qingbao Huang, Feng Shuang, Yi Cai
2023 arXiv   pre-print
Text-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text.  ...  Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities.  ...  The TextCaps dataset [4] is constructed for text-based image captioning, which contains 28408 images with 5 captions per image.  ... 
arXiv:2302.01540v3 fatcat:zkowzmdtpvcjhhknm6covcfkte

X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics [article]

Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei
2021 arXiv   pre-print
This way naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development  ...  across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning  ...  (e.g., COCO, MSVD, and MSR-VTT), we demonstrate that our X-modaler provides state-of-the-art solutions for image/video captioning tasks.  ... 
arXiv:2108.08217v1 fatcat:bqgpmnijcfd2li5xo2xc2gy4mu

An Interpretable Approach to Hateful Meme Detection [article]

Tanvi Deshpande, Nitya Mani
2021 arXiv   pre-print
Hateful memes are an emerging method of spreading hate on the internet, relying on both images and text to convey a hateful message.  ...  In the process, we build a gradient-boosted decision tree and an LSTM-based model that achieve comparable performance (73.8 validation and 72.7 test auROC) to the gold standard of humans and state-of-the-art  ...  Text and image embedding For our gradient-boosted decision tree, we use a captioning model [23] to capture the relevant content of an image in text format.  ... 
arXiv:2108.10069v1 fatcat:2fmmd4nhjrbixoj5zpp5l66oii
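The pipeline in the last snippet, captioning the image and feeding the text to a gradient-boosted decision tree, can be sketched with scikit-learn. The toy data and TF-IDF features are placeholders; the paper's captioner and feature set differ:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy examples: meme text joined with an (assumed) auto-generated image caption.
texts = [
    "love everyone [CAP] group of friends hugging",
    "get out of my country [CAP] crowd holding signs",
    "have a great day [CAP] smiling child with balloons",
    "they don't belong here [CAP] people waiting in line",
]
labels = [0, 1, 0, 1]  # 0 = benign, 1 = hateful

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    GradientBoostingClassifier(random_state=0))
clf.fit(texts, labels)
print(clf.predict(["be kind to each other [CAP] two people shaking hands"]))
```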

MuMUR : Multilingual Multimodal Universal Retrieval [article]

Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
2023 arXiv   pre-print
In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.  ...  Multi-modal retrieval has seen tremendous progress with the development of vision-language models.  ...  One could argue that the reason might be that video retrieval datasets are smaller, which makes creating meaningful structural knowledge a challenging task.  ... 
arXiv:2208.11553v7 fatcat:zpvt5pw7wnextptwzzrppar4ge

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering [article]

Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec
2023 arXiv   pre-print
However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of  ...  To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations  ...  to facilitate joint reasoning with image and background knowledge.  ... 
arXiv:2205.11501v2 fatcat:kxnygb36abbbbclba4ngxusvoa
Showing results 1 — 15 out of 9,538 results