Self-Supervised Relationship Probing
2020
Neural Information Processing Systems
Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications. ...
By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, our method learns better object features as well as implicit visual relationships. ...
The second row presents the visual relationship distance graphs for the corresponding images. The bottom rows show the distance graphs and dependency trees for augmented captions. ...
dblp:conf/nips/GuKJ0MZS20
fatcat:d6hjosn2zbf3rfdqnrp3buqray
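The abstract above names dependency tree distances as one of the self-supervision signals. A minimal sketch of computing pairwise tree distances from a parsed caption, assuming spaCy's en_core_web_sm model and networkx (the paper's own pipeline may differ):

```python
# Hedged sketch: pairwise dependency-tree distances for a caption.
# Assumes: pip install spacy networkx && python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def dependency_distances(caption: str):
    """Shortest-path distances between all token pairs in the caption's
    dependency tree, treated as an undirected graph."""
    doc = nlp(caption)
    graph = nx.Graph()
    for token in doc:
        for child in token.children:
            graph.add_edge(token.i, child.i)  # head-child edge
    return dict(nx.all_pairs_shortest_path_length(graph))

dist = dependency_distances("a man rides a horse on the beach")
# dist[i][j] is the tree distance between tokens i and j
```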
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
[article]
2024
arXiv
pre-print
The process of transforming input images into corresponding textual explanations stands as a crucial and complex endeavor within the domains of computer vision and natural language processing. ...
In this paper, we propose an innovative ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining models. ...
For images, we apply random transformations such as rotations, flips, and crops to generate augmented versions of the original images. ...
arXiv:2401.06167v1
fatcat:rortkgokurdyzjxewzqap3zpgu
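The abstract above describes generating augmented image views via random rotations, flips, and crops. A minimal sketch with standard torchvision transforms; the specific parameters (rotation degrees, crop size) are illustrative assumptions, not the paper's settings:

```python
# Hedged sketch: random image augmentations (rotate, flip, crop).
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # small random rotation (assumed range)
    transforms.RandomHorizontalFlip(p=0.5),   # random flip
    transforms.RandomResizedCrop(224),        # random crop + resize (assumed size)
    transforms.ToTensor(),
])

image = Image.open("example.jpg").convert("RGB")
views = [augment(image) for _ in range(4)]    # four augmented views of one image
```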
Text Augmentation Using BERT for Image Captioning
2020
Applied Sciences
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. ...
The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. ...
However, augmentation is rarely used in vision-language tasks, such as image captioning, visual question answering, or visual dialog. ...
doi:10.3390/app10175978
fatcat:lndvsjxi5fhy7kr3fqmp6xolji
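This entry's core idea, using BERT to produce caption variants, can be sketched with the Hugging Face fill-mask pipeline: mask a word and keep the top predictions. The masking strategy below is a simplification, not necessarily the paper's exact procedure:

```python
# Hedged sketch: masked-LM caption augmentation with BERT.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption: str, position: int, top_k: int = 3):
    """Replace the word at `position` with BERT's top-k candidates."""
    words = caption.split()
    words[position] = fill.tokenizer.mask_token  # "[MASK]"
    masked = " ".join(words)
    return [r["sequence"] for r in fill(masked, top_k=top_k)]

print(augment_caption("a dog runs across the grassy field", position=1))
# e.g. variants where "dog" is swapped for other plausible tokens
```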
A Picture May Be Worth a Hundred Words for Visual Question Answering
[article]
2021
arXiv
pre-print
How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. ...
This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. ...
Even though they leverage captions of images for pre-training, they rely on deep visual features for image representation. ...
arXiv:2106.13445v1
fatcat:57wl3n2hznbz7hbrim4x4efvdm
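The question this abstract poses, how far purely textual image representations can go, can be illustrated by answering a "visual" question from a description alone with an off-the-shelf extractive QA model; the description, checkpoint, and question below are assumptions for illustration, not the paper's setup:

```python
# Hedged sketch: answering a visual question from text alone.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

description = ("Two children play soccer on a grass field; "
               "a brown dog watches from a red bench.")
answer = qa(question="What color is the bench?", context=description)
print(answer["answer"])  # -> "red"
```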
Exploring Explicit and Implicit Visual Relationships for Image Captioning
[article]
2021
arXiv
pre-print
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. ...
Recent methods for image captioning follow an encoder-decoder framework that transforms the sequence of salient regions in an image into natural language descriptions. ...
APPROACH. In this section, we introduce how to explicitly and implicitly explore visual relationships for image captioning. ...
arXiv:2105.02391v1
fatcat:qmbtqobhgjfilli47tbq72mbdu
Multimodal Learning for Hateful Memes Detection
[article]
2020
arXiv
pre-print
In this paper, we focus on multimodal hateful memes detection and propose a novel method that incorporates the image captioning process into the memes detection process. ...
Memes are used for spreading ideas through social networks. Although most memes are created for humor, some memes become hateful under the combination of pictures and text. ...
'OCR Text (Back-Translation)' means we augment the OCR sentences in the training set through trained back-translators. 'Image Caption' means using captions for meme detection. ...
arXiv:2011.12870v3
fatcat:nly2jznbhzhzlfehij7wghxjam
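The 'OCR Text (Back-Translation)' setting quoted above can be sketched with public MarianMT checkpoints (English to German and back); the paper's trained back-translators are not necessarily these models:

```python
# Hedged sketch: back-translation augmentation of OCR text (en -> de -> en).
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

tok_fwd, mt_fwd = load("Helsinki-NLP/opus-mt-en-de")
tok_bwd, mt_bwd = load("Helsinki-NLP/opus-mt-de-en")

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt", padding=True)
    out = model.generate(**batch)
    return tok.decode(out[0], skip_special_tokens=True)

ocr_text = "some men just want to watch the world burn"
paraphrase = translate(translate(ocr_text, tok_fwd, mt_fwd), tok_bwd, mt_bwd)
```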
EAES: Effective Augmented Embedding Spaces for Text-based Image Captioning
2022
IEEE Access
INDEX TERMS: image captioning, text-based image captioning, bottom-up top-down, grid feature, multimodal transformer, M4C ...
Based on the M4C-Captioner model, this paper proposes the simple but effective EAES embedding module for effectively embedding images and scene texts into the multimodal Transformer layers. ...
Therefore, we augment the representation with a grid feature that adds global contextual information to the image. ...
doi:10.1109/access.2022.3158763
fatcat:ukcjnnaaazedbmap3dg22bp5ki
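The grid feature mentioned in the EAES abstract can be illustrated by pooling a CNN feature map into a fixed grid of tokens; this is only the generic idea, with an assumed ResNet-50 backbone and grid size, not the EAES module itself:

```python
# Hedged sketch: global grid features from a CNN feature map.
import torch
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V2")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # keep conv trunk

image = torch.randn(1, 3, 448, 448)          # dummy image batch
feat_map = backbone(image)                   # (1, 2048, 14, 14)
grid = torch.nn.functional.adaptive_avg_pool2d(feat_map, (7, 7))
tokens = grid.flatten(2).transpose(1, 2)     # (1, 49, 2048) grid tokens
```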
Understanding News Text and Images Connection with Context-enriched Multimodal Transformers
2022
Proceedings of the 30th ACM International Conference on Multimedia
The connection between news and the images that illustrate them goes beyond matching visual concepts to natural language. ...
Instead, the open-domain and event-reporting nature of news leads to semantically complex texts, in which images are used as a contextualizing element. ...
CONCLUSIONS In this paper, we proposed NewsLXMERT, a multimodal transformer leveraging news contextual elements to model the complex relationships between news and images. ...
doi:10.1145/3503161.3548430
fatcat:f7men5dygvfrhc3mw6foqzacc4
BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset
[article]
2022
arXiv
pre-print
models for Bangla image captioning. ...
As computers have become efficient at understanding visual information and transforming it into a written representation, research interest in tasks like automatic image captioning has seen a significant ...
For example, an image captioning system can be used in human-computer interaction, to develop a hearing-aid system for visually impaired people, or to perform concept-based image indexing for information retrieval. ...
arXiv:2205.14462v1
fatcat:5krju3v7xba2vcarjfkurezfeu
Contextual Modeling for 3D Dense Captioning on Point Clouds
[article]
2022
arXiv
pre-print
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and generate a distinctive natural language sentence describing each located object. ...
Specifically, the GCM module captures the inter-object relationship among all objects with global contextual information to obtain more complete scene information of the whole point clouds. ...
Image captioning aims at generating a single sentence for an image, while dense image captioning is the task of localizing multiple objects in a given image and describing each object by natural language ...
arXiv:2210.03925v1
fatcat:ar7ucs6mprdd7jib2grsdr6ote
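The GCM module's role, capturing inter-object relationships with global context, is in the spirit of self-attention over per-object features. A minimal sketch with assumed dimensions (the actual module is more involved):

```python
# Hedged sketch: global context via self-attention over object features.
import torch
import torch.nn as nn

num_objects, dim = 32, 256
obj_feats = torch.randn(1, num_objects, dim)   # per-object features (assumed shape)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
context, _ = attn(obj_feats, obj_feats, obj_feats)  # inter-object messages
obj_feats = obj_feats + context                      # context-enriched objects
```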
Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning
2022
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Video captioning has been an emerging research topic in computer vision, which aims to generate a natural sentence to correctly reflect the visual content of a video. ...
In this paper, we uniquely introduce a Retrieval Augmentation Mechanism (RAM) that enables the explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. ...
In [64], a Conditional Random Field is exploited to model the relationships between different visual entities of the input video and to generate descriptions for it. Guadarrama et al. ...
doi:10.1145/3539225
fatcat:na34xvi25bcnfes7p43kdaqjge
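The Retrieval Augmentation Mechanism's first step, looking up existing video-sentence pairs, can be sketched as nearest-neighbor search over video embeddings; the names, dimensions, and cosine-similarity choice below are assumptions:

```python
# Hedged sketch: retrieve sentences paired with the most similar videos.
import numpy as np

def retrieve(query_emb, bank_embs, sentences, k=3):
    """Return the k sentences whose paired video embeddings are most
    cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    scores = b @ q
    top = np.argsort(-scores)[:k]
    return [sentences[i] for i in top]

bank = np.random.randn(1000, 512)              # embeddings of stored videos
caps = [f"caption {i}" for i in range(1000)]   # their paired sentences
hints = retrieve(np.random.randn(512), bank, caps)
```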
Scene Context-Aware Salient Object Detection
2021
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Salient object detection identifies objects in an image that grab visual attention. Although contextual features are considered in recent literature, they often fail in real-world complex scenarios. ...
To our knowledge, such high-level semantic contextual information of image scenes is underexplored for saliency detection in the literature. ...
We apply random cropping, flipping and multi-scale image training for data augmentation. ...
doi:10.1109/iccv48922.2021.00412
fatcat:z6yw3znqvzbodohlh7k5uh74ki
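The augmentation recipe quoted above (random cropping, flipping, multi-scale training) can be sketched by sampling a training scale per batch; the scale list and crop policy are assumptions:

```python
# Hedged sketch: multi-scale training augmentation.
import random
from torchvision import transforms

SCALES = [256, 320, 384]  # assumed training scales

def sample_transform():
    size = random.choice(SCALES)       # new scale each call / batch
    return transforms.Compose([
        transforms.Resize(size),       # rescale shorter side
        transforms.RandomCrop(size, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
```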
Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
[article]
2021
arXiv
pre-print
Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). ...
We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. ...
Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback. ...
arXiv:2110.10834v1
fatcat:odfsk6jmqbha3hbcjqphjoozqm
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
[article]
2022
arXiv
pre-print
In this work, we discuss that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information, which we call contextual ...
CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot transfer capabilities in cross-modal correlation tasks such as visual classification and image retrieval. ...
Conclusion In this paper, we propose Anchor-augmented Vision-Language Space Alignment for zero-shot image captioning. ...
arXiv:2211.07275v1
fatcat:l4n6bilhs5chlgqdi2rqdlstdu
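The zero-shot transfer capability attributed to CLIP in this abstract can be reproduced in a few lines with the Hugging Face CLIP port; this demonstrates the capability the paper builds on, not its anchor-augmented alignment:

```python
# Hedged sketch: CLIP zero-shot image-text scoring.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # zero-shot label scores
```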
In Defense of Scene Graphs for Image Captioning
[article]
2021
arXiv
pre-print
Recently, image scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. ...
SG2Caps outperforms existing scene graph-only captioning models by a large margin, indicating scene graphs as a promising representation for image captioning. ...
Unlike high dimensional (2048-D) region-level visual features, we utilize low-dimensional (256-D) image-level visual features in our visual-feature augmented SG2Caps framework. ...
arXiv:2102.04990v3
fatcat:ccsotsmzvrhlrkjakzyvh4bzkq
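For readers unfamiliar with the scene-graph input that SG2Caps-style models consume, a minimal sketch of its structure (objects, attributes, relationship triples) and one naive linearization; the contents below are invented for illustration:

```python
# Hedged sketch: a toy scene graph and a naive linearization for captioning.
scene_graph = {
    "objects": ["man", "horse", "beach"],
    "attributes": {"horse": ["brown"], "beach": ["sandy"]},
    "relations": [("man", "riding", "horse"), ("horse", "on", "beach")],
}

# Flatten entities, attribute pairs, and relation triples into tokens
# that a captioning encoder could embed.
tokens = (scene_graph["objects"]
          + [f"{o} {a}" for o, attrs in scene_graph["attributes"].items()
             for a in attrs]
          + [f"{s} {r} {o}" for s, r, o in scene_graph["relations"]])
```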
Showing results 1 — 15 out of 3,649 results