Leverage Boosting and Transformer on Text-Image Matching for Cheap Fakes Detection
2022
Algorithms
It is challenging to detect misused/false/out-of-context pairs of images and captions, even with human effort, because of the complex correlation between the attached image and the veracity of the caption ...
methods via ANN/Transformer boosting schema to classify a triple of (image, caption1, caption2) into OOC (out-of-context) and NOOC (no out-of-context) labels. ...
Acknowledgments: We acknowledge the University of Economic Ho Chi Minh City (UEH) for funding this research.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/a15110423
fatcat:xp5ln5ex7zag5outjxlpm2atvu
CapsFusion: Rethinking Image-Text Data at Scale
[article]
2024
arXiv
pre-print
Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. ...
These effectiveness, efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training. ...
For random mixing [16, 39], we set the mixing ratio of two types of captions as 1:1 [16] and do not tune this ratio as in [39]. ...
arXiv:2310.20550v3
fatcat:ohbdmnw2onfyvlyjtedlg5mgoq
Retrieval-Augmented Multimodal Language Modeling
[article]
2022
arXiv
pre-print
We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). ...
Acknowledgements We greatly thank members of the Meta AI team, Stanford P-Lambda and SNAP groups for providing valuable feedback. ...
arXiv:2211.12561v1
fatcat:34jk34vn3jeshihpix243ezgiq
Image Captioning in the Transformer Age
[article]
2022
arXiv
pre-print
Due to the page limitation, we only refer to highly important papers in this short survey and more related works can be found at https://github.com/SjokerLily/awesome-image-captioning. ...
Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture. ...
For example, the modular network can disentangle the mixed-up concepts by various modules for addressing different concepts. ...
arXiv:2204.07374v1
fatcat:ftsoam2ei5da5fkygq4pztzxda
Imageability- and Length-controllable Image Captioning
2021
IEEE Access
Image captioning can show great performance for generating captions for general purposes, but it remains difficult to adjust the generated captions for different applications. ...
In this paper, we propose an image captioning method which can generate both imageability- and length-controllable captions. ...
Related Work in Image Captioning. The related work is split into general-purpose and affective image captioning. ...
doi:10.1109/access.2021.3131393
fatcat:y3zvkdpeufbixl7lsctak5nnkm
Retrieval-Augmented Transformer for Image Captioning
[article]
2022
arXiv
pre-print
Our work opens up new avenues for improving image captioning models at larger scale. ...
In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. ...
ACKNOWLEDGMENTS We thank CINECA, the Italian Supercomputing Center, for providing computational resources. ...
arXiv:2207.13162v2
fatcat:hsyrylwbaveo3gp2dkf3qsudcy
Transform and Tell: Entity-Aware News Image Captioning
2020
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We propose an end-to-end model which generates captions for images embedded in news articles. ...
News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. ...
We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program. ...
doi:10.1109/cvpr42600.2020.01305
dblp:conf/cvpr/TranMX20
fatcat:7mbw6c6ckrbj5a2zvpwglpklwa
Transform and Tell: Entity-Aware News Image Captioning
[article]
2020
arXiv
pre-print
We propose an end-to-end model which generates captions for images embedded in news articles. ...
News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. ...
We thank NVIDIA for providing us with Titan V GPUs through their GPU Grant Program. ...
arXiv:2004.08070v2
fatcat:3at2ydmeebgwraejiurtd2ze4u
Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
2023
Sensors
The presented method constantly outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning. ...
Experimental analyses were carried out on the fashion captioning dataset (FACAD) for fashion image captioning, which contains more than 130k fine-grained descriptions, validating the effectiveness of the ...
Acknowledgments: The authors would like to thank CINECA, the Italian Supercomputing Center, for providing computational resources.
Conflicts of Interest: The authors declare no conflict of interest. ...
doi:10.3390/s23031286
pmid:36772326
pmcid:PMC9921965
fatcat:xhgz3wkjffesppomzsgch2bl4q
See Your Heart: Psychological states Interpretation through Visual Creations
[article]
2023
arXiv
pre-print
Building on SpyIn, we conduct experiments with several image captioning methods, and propose a visual-semantic combined model which obtains a SOTA result on SpyIn. ...
The results indicate that VEIT is a more challenging task requiring scene graph information and psychological knowledge. ...
The second is a state-of-the-art image captioning model, the Meshed-Memory Transformer (M²) (Cornia et al. 2020), which is a transformer-based image captioning approach that relies on separately computed object-bounding-box ...
arXiv:2302.10276v2
fatcat:wdszjo5vvzdnbme5reiudwtagq
What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations
2021
Frontiers in Artificial Intelligence
In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. ...
Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. ...
While more focus on a single object might be beneficial for image classification, a more sophisticated multi-modal task (e.g., image captioning) requires scene-level knowledge about objects and relations ...
doi:10.3389/frai.2021.767971
pmid:34927063
pmcid:PMC8679841
fatcat:6cwzazualvg55gcdmdz3gnrxrm
Topic Scene Graph Generation by Attention Distillation from Caption
[article]
2021
arXiv
pre-print
In addition, as this attention distillation process provides an opportunity for combining the generation of image caption and scene graph together, we further transform the scene graph into linguistic ...
If an image tells a story, the image caption is the briefest narrator. ...
As for the Transformer, mixed training has little impact on the performances. ...
arXiv:2110.05731v1
fatcat:dqc6n2utf5ds5loy74qfylshem
ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
[article]
2022
arXiv
pre-print
While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. ...
The resulting captions are much less restrictive than those obtained by supervised captioning methods. ...
Finally, combining multi-modal encoders with our method allows knowledge to be extracted in a new way that mixes between text and images. ...
arXiv:2111.14447v2
fatcat:vfowvq2qofbt3aef4l5swdanum
Impact of visual assistance for automated audio captioning
[article]
2023
arXiv
pre-print
We study the impact of visual assistance for automated audio captioning. ...
Utilizing multi-encoder transformer architectures, which have previously been employed to introduce vision-related information in the context of sound event detection, we analyze the usefulness of incorporating ...
The images are fed into VGG16 [25], a convolutional network for image classification, pretrained on the ImageNet dataset [26]. ...
arXiv:2211.10539v2
fatcat:bkwxhpj7grgqfl2io5gisehwky
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
[article]
2022
arXiv
pre-print
Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. ...
They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). ...
For image-retrieval, each image is transformed into 100 image regions and the object labels are detected for all regions by Faster R-CNN. ...
arXiv:2012.15409v4
fatcat:woa3moustzc6nexs3ggg3acsdm
Showing results 1 — 15 out of 19,224 results