19,224 Hits in 4.2 sec

Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection

Tuan-Vinh La, Minh-Son Dao, Duy-Dong Le, Kim-Phung Thai, Quoc-Hung Nguyen, Thuy-Kieu Phan-Thi
2022 Algorithms  
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption  ...  methods via an ANN/Transformer boosting schema to classify a triple of (image, caption1, caption2) into OOC (out-of-context) and NOOC (not out-of-context) labels.  ...  Acknowledgments: We acknowledge the University of Economics Ho Chi Minh City (UEH) for funding this research. Conflicts of Interest: The authors declare no conflict of interest.  ...
doi:10.3390/a15110423 fatcat:xp5ln5ex7zag5outjxlpm2atvu
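
The entry above classifies an (image, caption1, caption2) triple into OOC/NOOC based on text-image matching. As a minimal sketch of the matching signal such a classifier can build on, the following computes CLIP similarity features for a triple; the CLIP checkpoint and the small MLP head are illustrative stand-ins, not the paper's actual ANN/Transformer boosting pipeline.

```python
# Hedged sketch: CLIP similarities as features for OOC/NOOC triples.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def triple_features(image, caption1, caption2):
    """Return [sim(img, c1), sim(img, c2), sim(c1, c2)] for a triple."""
    inputs = processor(text=[caption1, caption2], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return torch.tensor([(img @ txt[0:1].T).item(),   # image vs. caption1
                         (img @ txt[1:2].T).item(),   # image vs. caption2
                         (txt[0:1] @ txt[1:2].T).item()])  # caption1 vs. caption2

# Placeholder head standing in for the paper's boosting schema:
# maps the 3 similarity features to the two labels (OOC, NOOC).
classifier = torch.nn.Sequential(
    torch.nn.Linear(3, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
```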

CapsFusion: Rethinking Image-Text Data at Scale [article]

Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu
2024 arXiv   pre-print
Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions.  ...  These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training.  ...  For random mixing [16, 39], we set the mixing ratio of the two types of captions to 1:1 [16] and do not tune this ratio as in [39].  ...
arXiv:2310.20550v3 fatcat:ohbdmnw2onfyvlyjtedlg5mgoq
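
The 1:1 random-mixing baseline mentioned in the snippet is easy to make concrete. A minimal sketch, assuming paired lists of raw web captions and synthetic captions (all names here are illustrative):

```python
# Hedged sketch of 1:1 random caption mixing: for each image, pick
# the raw web caption or the synthetic caption with equal probability.
import random

def mix_captions(raw_captions, synthetic_captions, ratio=0.5, seed=0):
    """`ratio` is the probability of keeping the raw caption;
    0.5 gives the 1:1 mix described in the snippet."""
    rng = random.Random(seed)
    return [raw if rng.random() < ratio else synth
            for raw, synth in zip(raw_captions, synthetic_captions)]
```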

Retrieval-Augmented Multimodal Language Modeling [article]

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
2022 arXiv   pre-print
We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E).  ...  Acknowledgements We greatly thank members of the Meta AI team, Stanford P-Lambda and SNAP groups for providing valuable feedback.  ...
arXiv:2211.12561v1 fatcat:34jk34vn3jeshihpix243ezgiq
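
The retrieval step in a retrieval-augmented multimodal model can be sketched independently of the generator. Below is a minimal kNN retrieval over dense embeddings using FAISS; the embedding dimensionality, the random placeholder corpus, and the "prepend to context" note are assumptions for illustration, not RA-CM3's exact retriever.

```python
# Hedged sketch: dense kNN retrieval to fetch multimodal context
# documents, which would then be prepended to the generator's input.
import numpy as np
import faiss

d = 512                                                 # e.g. CLIP embedding size
corpus_embs = np.random.randn(10_000, d).astype("float32")  # placeholder corpus
faiss.normalize_L2(corpus_embs)

index = faiss.IndexFlatIP(d)   # inner product == cosine after L2 normalization
index.add(corpus_embs)

def retrieve(query_emb: np.ndarray, k: int = 2):
    """Return indices of the k most similar corpus documents."""
    q = query_emb.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    _, idx = index.search(q, k)
    return idx[0]

# The retrieved documents (caption text + image tokens) would be
# serialized and prepended to the decoder's input sequence.
```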

Image Captioning In the Transformer Age [article]

Yang Xu, Li Li, Haiyang Xu, Songfang Huang, Fei Huang, Jianfei Cai
2022 arXiv   pre-print
Due to the page limitation, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning.  ...  Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture.  ...  For example, the modular network can disentangle the mixed up concepts by various modules for addressing different concepts.  ... 
arXiv:2204.07374v1 fatcat:ftsoam2ei5da5fkygq4pztzxda

Imageability- and Length-controllable Image Captioning

Marc A. Kastner, Kazuki Umemura, Ichiro Ide, Yasutomo Kawanishi, Takatsugu Hirayama, Keisuke Doman, Daisuke Deguchi, Hiroshi Murase, Shin'ichi Satoh
2021 IEEE Access  
Image captioning can show great performance for generating captions for general purposes, but it remains difficult to adjust the generated captions for different applications.  ...  In this paper, we propose an image captioning method which can generate both imageability- and length-controllable captions.  ...  Related Work in Image Captioning. The related work is split into general-purpose and affective image captioning.  ...
doi:10.1109/access.2021.3131393 fatcat:y3zvkdpeufbixl7lsctak5nnkm

Retrieval-Augmented Transformer for Image Captioning [article]

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
2022 arXiv   pre-print
Our work opens up new avenues for improving image captioning models at larger scale.  ...  In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.  ...  ACKNOWLEDGMENTS We thank CINECA, the Italian Supercomputing Center, for providing computational resources.  ... 
arXiv:2207.13162v2 fatcat:hsyrylwbaveo3gp2dkf3qsudcy
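
One way to picture the kNN-memory idea in the entry above is to inject retrieved neighbor captions into the generator's prompt, so rare words can be copied from the external corpus. The sketch below assumes the neighbors have already been retrieved (e.g. via the FAISS index shown earlier) and uses GPT-2 as a stand-in decoder; it illustrates the general idea, not the paper's architecture.

```python
# Hedged sketch: prompt a text decoder with retrieved neighbor captions.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")

def knn_prompt(neighbor_captions):
    """Serialize retrieved captions into a generation prompt."""
    context = " ".join(f"- {c}" for c in neighbor_captions)
    return f"Similar image captions: {context}\nCaption:"

neighbors = ["a brown dog catches a frisbee", "a dog leaps on the grass"]
ids = tokenizer(knn_prompt(neighbors), return_tensors="pt").input_ids
out = generator.generate(ids, max_new_tokens=20, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```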

Transform and Tell: Entity-Aware News Image Captioning

Alasdair Tran, Alexander Mathews, Lexing Xie
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)  
We propose an end-to-end model which generates captions for images embedded in news articles.  ...  News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words.  ...  We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program.  ... 
doi:10.1109/cvpr42600.2020.01305 dblp:conf/cvpr/TranMX20 fatcat:7mbw6c6ckrbj5a2zvpwglpklwa

Transform and Tell: Entity-Aware News Image Captioning [article]

Alasdair Tran, Alexander Mathews, Lexing Xie
2020 arXiv   pre-print
We propose an end-to-end model which generates captions for images embedded in news articles.  ...  News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words.  ...  We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program.  ... 
arXiv:2004.08070v2 fatcat:3at2ydmeebgwraejiurtd2ze4u

Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates

Nicholas Moratelli, Manuele Barraco, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
2023 Sensors  
The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.  ...  Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the  ...  Acknowledgments: The authors would like to thank CINECA, the Italian Supercomputing Center, for providing computational resources. Conflicts of Interest: The authors declare no conflict of interest.  ...
doi:10.3390/s23031286 pmid:36772326 pmcid:PMC9921965 fatcat:xhgz3wkjffesppomzsgch2bl4q

See Your Heart: Psychological states Interpretation through Visual Creations [article]

Likun Yang, Xiaokun Feng, Xiaotang Chen, Shiyu Zhang, Kaiqi Huang
2023 arXiv   pre-print
Building on SpyIn, we conduct experiments with several image captioning methods, and propose a visual-semantic combined model which obtains a SOTA result on SpyIn.  ...  The results indicate that VEIT is a more challenging task requiring scene graph information and psychological knowledge.  ...  The second is a state-of-the-art image captioning model, the Meshed-Memory Transformer (M²) (Cornia et al. 2020), a transformer-based image captioning approach that relies on separately computed object-bounding-box  ...
arXiv:2302.10276v2 fatcat:wdszjo5vvzdnbme5reiudwtagq

What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations

Nikolai Ilinykh, Simon Dobnik
2021 Frontiers in Artificial Intelligence  
In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream.  ...  Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks.  ...  While more focus on a single object might be beneficial for image classification, a more sophisticated multi-modal task (e.g., image captioning) requires scene-level knowledge about objects and relations  ...
doi:10.3389/frai.2021.767971 pmid:34927063 pmcid:PMC8679841 fatcat:6cwzazualvg55gcdmdz3gnrxrm
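
Extracting self-attention patterns of the kind analyzed above is mechanically simple with any transformer that exposes attention weights. The sketch below uses a plain ViT image encoder as an accessible stand-in for the paper's language-and-vision captioning model; the checkpoint and the random input batch are illustrative.

```python
# Hedged sketch: pull per-layer self-attention maps from a vision
# transformer and inspect how the CLS token attends over patches.
import torch
from transformers import ViTModel

model = ViTModel.from_pretrained("google/vit-base-patch16-224")

pixel_values = torch.randn(1, 3, 224, 224)   # placeholder image batch
with torch.no_grad():
    out = model(pixel_values=pixel_values, output_attentions=True)

# out.attentions: one (batch, heads, tokens, tokens) tensor per layer.
for layer, attn in enumerate(out.attentions):
    cls_to_patches = attn[0, :, 0, 1:]       # CLS attention over patches
    print(f"layer {layer}: mean CLS->patch attention "
          f"{cls_to_patches.mean().item():.4f}")
```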

Topic Scene Graph Generation by Attention Distillation from Caption [article]

W. Wang, R. Wang, X. Chen
2021 arXiv   pre-print
In addition, as this attention distillation process provides an opportunity to combine image caption and scene graph generation, we further transform the scene graph into linguistic  ...  If an image tells a story, the image caption is the briefest narrator.  ...  As for the Transformer, mixed training has little impact on performance.  ...
arXiv:2110.05731v1 fatcat:dqc6n2utf5ds5loy74qfylshem

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [article]

Yoad Tewel, Yoav Shalev, Idan Schwartz, Lior Wolf
2022 arXiv   pre-print
While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image.  ...  The resulting captions are much less restrictive than those obtained by supervised captioning methods.  ...  Finally, combining multi-modal encoders with our method allows knowledge to be extracted in a new way that mixes text and images.  ...
arXiv:2111.14447v2 fatcat:vfowvq2qofbt3aef4l5swdanum
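
ZeroCap itself steers a language model with CLIP gradients at inference time. As a much simpler illustration of the underlying idea in the snippet — CLIP as a powerful image-text matching score — the sketch below just ranks candidate captions against an image; it is not the paper's generation procedure.

```python
# Hedged sketch: use CLIP's image-text matching score to rank captions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_captions(image, candidates):
    """Return the candidate captions sorted by CLIP match score."""
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # (num_candidates,)
    order = scores.argsort(descending=True)
    return [candidates[i] for i in order]
```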

Impact of visual assistance for automated audio captioning [article]

Wim Boes, Hugo Van hamme
2023 arXiv   pre-print
We study the impact of visual assistance for automated audio captioning.  ...  Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating  ...  The images are fed into VGG16 [25], a convolutional network for image classification, pretrained on the ImageNet data set [26].  ...
arXiv:2211.10539v2 fatcat:bkwxhpj7grgqfl2io5gisehwky
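
The visual-feature step described in the snippet — ImageNet-pretrained VGG16 as an image encoder — can be sketched with torchvision by truncating the classifier head so the penultimate activations serve as features. The frame batch below is a placeholder.

```python
# Hedged sketch: extract 4096-d VGG16 penultimate features as the
# vision-side input for a multi-encoder captioning model.
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
# Drop the final 1000-way classification layer, keep the rest.
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

frames = torch.randn(8, 3, 224, 224)   # placeholder video frames
with torch.no_grad():
    features = vgg(frames)              # (8, 4096) visual embeddings
```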

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning [article]

Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang
2022 arXiv   pre-print
Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space.  ...  They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs).  ...  For image retrieval, each image is transformed into 100 image regions and the object labels are detected for all regions by Faster R-CNN.  ...
arXiv:2012.15409v4 fatcat:woa3moustzc6nexs3ggg3acsdm
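
The region-extraction step mentioned in the snippet can be approximated with an off-the-shelf detector: run it on an image and keep the top-scoring boxes with their labels. UNIMO uses its own Faster R-CNN setup for 100 regions per image; torchvision's COCO-pretrained model below is an accessible stand-in.

```python
# Hedged sketch: detect up to 100 labeled regions per image.
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()

image = torch.rand(3, 480, 640)         # placeholder RGB image in [0, 1]
with torch.no_grad():
    pred = detector([image])[0]         # dict of boxes, labels, scores

top = pred["scores"].argsort(descending=True)[:100]   # keep <= 100 regions
boxes = pred["boxes"][top]
labels = [weights.meta["categories"][int(i)] for i in pred["labels"][top]]
```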
Showing results 1 — 15 out of 19,224 results