Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection
2022
Algorithms
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption ...
methods via ANN/Transformer boosting schema to classify a triple of (image, caption1, caption2) into OOC (out-of-context) and NOOC (no out-of-context) labels. ...
Acknowledgments: We acknowledge the University of Economic Ho Chi Minh City (UEH) for funding this research.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/a15110423
fatcat:xp5ln5ex7zag5outjxlpm2atvu
CapsFusion: Rethinking Image-Text Data at Scale
[article]
2024
arXiv
pre-print
Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. ...
These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training. ...
For random mixing [16, 39], we set the mixing ratio of two types of captions as 1:1 [16] and do not tune this ratio as in [39]. ...
arXiv:2310.20550v3
fatcat:ohbdmnw2onfyvlyjtedlg5mgoq
Retrieval-Augmented Multimodal Language Modeling
[article]
2022
arXiv
pre-print
We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). ...
Acknowledgements We greatly thank members of the Meta AI team, Stanford P-Lambda and SNAP groups for providing valuable feedback. ...
arXiv:2211.12561v1
fatcat:34jk34vn3jeshihpix243ezgiq
Image Captioning in the Transformer Age
[article]
2022
arXiv
pre-print
Due to the page limitation, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning. ...
Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. ...
For example, the modular network can disentangle the mixed-up concepts by various modules for addressing different concepts. ...
arXiv:2204.07374v1
fatcat:ftsoam2ei5da5fkygq4pztzxda
Imageability- and Length-controllable Image Captioning
2021
IEEE Access
Image captioning can show great performance for generating captions for general purposes, but it remains difficult to adjust the generated captions for different applications. ...
In this paper, we propose an image captioning method which can generate both imageability- and length-controllable captions. ...
Related Work in Image Captioning. The related work is split into general-purpose and affective image captioning. ...
doi:10.1109/access.2021.3131393
fatcat:y3zvkdpeufbixl7lsctak5nnkm
Retrieval-Augmented Transformer for Image Captioning
[article]
2022
arXiv
pre-print
Our work opens up new avenues for improving image captioning models at larger scale. ...
In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. ...
ACKNOWLEDGMENTS We thank CINECA, the Italian Supercomputing Center, for providing computational resources. ...
arXiv:2207.13162v2
fatcat:hsyrylwbaveo3gp2dkf3qsudcy
Transform and Tell: Entity-Aware News Image Captioning
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose an end-to-end model which generates captions for images embedded in news articles. ...
News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. ...
We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program. ...
doi:10.1109/cvpr42600.2020.01305
dblp:conf/cvpr/TranMX20
fatcat:7mbw6c6ckrbj5a2zvpwglpklwa
Transform and Tell: Entity-Aware News Image Captioning
[article]
2020
arXiv
pre-print
We propose an end-to-end model which generates captions for images embedded in news articles. ...
News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. ...
We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program. ...
arXiv:2004.08070v2
fatcat:3at2ydmeebgwraejiurtd2ze4u
Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
2023
Sensors
The presented method constantly outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning. ...
Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the ...
Acknowledgments: The authors would like to thank CINECA, the Italian Supercomputing Center, for providing computational resources.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/s23031286
pmid:36772326
pmcid:PMC9921965
fatcat:xhgz3wkjffesppomzsgch2bl4q
See Your Heart: Psychological states Interpretation through Visual Creations
[article]
2023
arXiv
pre-print
Building on SpyIn, we conduct experiments with several image captioning methods, and propose a visual-semantic combined model which obtains a SOTA result on SpyIn. ...
The results indicate that VEIT is a more challenging task requiring scene graph information and psychological knowledge. ...
The second is a state-of-the-art image captioning model, the Meshed-Memory Transformer (M²) (Cornia et al. 2020), which is a transformer-based image captioning approach that relies on separately computed object-bounding-box ...
arXiv:2302.10276v2
fatcat:wdszjo5vvzdnbme5reiudwtagq
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
2021
Frontiers in Artificial Intelligence
In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. ...
Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. ...
While more focus on a single object might be beneficial for image classification, a more sophisticated multi-modal task (e.g., image captioning) requires scene-level knowledge about objects and relations ...
doi:10.3389/frai.2021.767971
pmid:34927063
pmcid:PMC8679841
fatcat:6cwzazualvg55gcdmdz3gnrxrm
Topic Scene Graph Generation by Attention Distillation from Caption
[article]
2021
arXiv
pre-print
In addition, as this attention distillation process provides an opportunity for combining the generation of image caption and scene graph together, we further transform the scene graph into linguistic ...
If an image tells a story, the image caption is the briefest narrator. ...
As for the Transformer, mixed training has little impact on the performances. ...
arXiv:2110.05731v1
fatcat:dqc6n2utf5ds5loy74qfylshem
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
[article]
2022
arXiv
pre-print
While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. ...
The resulting captions are much less restrictive than those obtained by supervised captioning methods. ...
Finally, combining multi-modal encoders with our method allows knowledge to be extracted in a new way that mixes between text and images. ...
arXiv:2111.14447v2
fatcat:vfowvq2qofbt3aef4l5swdanum
Impact of visual assistance for automated audio captioning
[article]
2023
arXiv
pre-print
We study the impact of visual assistance for automated audio captioning. ...
Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating ...
The images are fed into VGG16 [25], a convolutional network for image classification, pretrained on the ImageNet dataset [26]. ...
arXiv:2211.10539v2
fatcat:bkwxhpj7grgqfl2io5gisehwky
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
[article]
2022
arXiv
pre-print
Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. ...
They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). ...
For image-retrieval, each image is transformed into 100 image regions and the object labels are detected for all regions by Faster R-CNN. ...
arXiv:2012.15409v4
fatcat:woa3moustzc6nexs3ggg3acsdm
Showing results 1 — 15 out of 19,224 results