Self-Supervised Relationship Probing
2020
Neural Information Processing Systems
Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications. ...
By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, our method learns better object features as well as implicit visual relationships. ...
The second row presents the visual relationship distance graphs for the corresponding images. The bottom rows show the distance graphs and dependency trees for augmented captions. ...
dblp:conf/nips/GuKJ0MZS20
fatcat:d6hjosn2zbf3rfdqnrp3buqray
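The abstract above names dependency tree distances as one of the self-supervision signals. A minimal sketch of computing pairwise tree distances from a parsed caption, assuming spaCy's en_core_web_sm model and networkx (the paper's own pipeline may differ):

```python
# Hedged sketch: pairwise dependency-tree distances for a caption.
# Assumes: pip install spacy networkx && python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def dependency_distances(caption: str):
    """Shortest-path distances between all token pairs in the caption's
    dependency tree, treated as an undirected graph."""
    doc = nlp(caption)
    graph = nx.Graph()
    for token in doc:
        for child in token.children:
            graph.add_edge(token.i, child.i)  # head-child edge
    return dict(nx.all_pairs_shortest_path_length(graph))

dist = dependency_distances("a man rides a horse on the beach")
# dist[i][j] is the tree distance between tokens i and j
```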
Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation
[article]
2024
arXiv
pre-print
The process of transforming input images into corresponding textual explanations stands as a crucial and complex endeavor within the domains of computer vision and natural language processing. ...
In this paper, we propose an innovative ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining models. ...
For images, we apply random transformations such as rotations, flips, and crops to generate augmented versions of the original images. ...
arXiv:2401.06167v1
fatcat:rortkgokurdyzjxewzqap3zpgu
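The abstract above describes generating augmented image views via random rotations, flips, and crops. A minimal sketch with standard torchvision transforms; the specific parameters (rotation degrees, crop size) are illustrative assumptions, not the paper's settings:

```python
# Hedged sketch: random image augmentations (rotate, flip, crop).
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # small random rotation (assumed range)
    transforms.RandomHorizontalFlip(p=0.5),   # random flip
    transforms.RandomResizedCrop(224),        # random crop + resize (assumed size)
    transforms.ToTensor(),
])

image = Image.open("example.jpg").convert("RGB")
views = [augment(image) for _ in range(4)]    # four augmented views of one image
```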
Text Augmentation Using BERT for Image Captioning
2020
Applied Sciences
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans. ...
The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. ...
However, augmentation is rarely used in vision-language tasks, such as image captioning, visual question answering, or visual dialog. ...
doi:10.3390/app10175978
fatcat:lndvsjxi5fhy7kr3fqmp6xolji
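This entry's core idea, using BERT to produce caption variants, can be sketched with the Hugging Face fill-mask pipeline: mask a word and keep the top predictions. The masking strategy below is a simplification, not necessarily the paper's exact procedure:

```python
# Hedged sketch: masked-LM caption augmentation with BERT.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption: str, position: int, top_k: int = 3):
    """Replace the word at `position` with BERT's top-k candidates."""
    words = caption.split()
    words[position] = fill.tokenizer.mask_token  # "[MASK]"
    masked = " ".join(words)
    return [r["sequence"] for r in fill(masked, top_k=top_k)]

print(augment_caption("a dog runs across the grassy field", position=1))
# e.g. variants where "dog" is swapped for other plausible tokens
```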
A Picture May Be Worth a Hundred Words for Visual Question Answering
[article]
2021
arXiv
pre-print
How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. ...
This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. ...
Even though they leverage captions of images for pre-training, they rely on deep visual features for image representation. ...
arXiv:2106.13445v1
fatcat:57wl3n2hznbz7hbrim4x4efvdm
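The question this abstract poses, how far purely textual image representations can go, can be illustrated by answering a "visual" question from a description alone with an off-the-shelf extractive QA model; the description, checkpoint, and question below are assumptions for illustration, not the paper's setup:

```python
# Hedged sketch: answering a visual question from text alone.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

description = ("Two children play soccer on a grass field; "
               "a brown dog watches from a red bench.")
answer = qa(question="What color is the bench?", context=description)
print(answer["answer"])  # -> "red"
```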
Exploring Explicit and Implicit Visual Relationships for Image Captioning
[article]
2021
arXiv
pre-print
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. ...
Recent methods for image captioning follow an encoder-decoder framework that transforms the sequence of salient regions in an image into natural language descriptions. ...
APPROACH. In this section, we introduce how to explicitly and implicitly explore visual relationships for image captioning. ...
arXiv:2105.02391v1
fatcat:qmbtqobhgjfilli47tbq72mbdu
Multimodal Learning for Hateful Memes Detection
[article]
2020
arXiv
pre-print
In this paper, we focus on multimodal hateful memes detection and propose a novel method that incorporates the image captioning process into the memes detection process. ...
Memes are used for spreading ideas through social networks. Although most memes are created for humor, some memes become hateful under the combination of pictures and text. ...
'OCR Text (Back-Translation)' means we augment the OCR sentences in the training set through trained back-translators. 'Image Caption' means using captions for meme detection. ...
arXiv:2011.12870v3
fatcat:nly2jznbhzhzlfehij7wghxjam
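The 'OCR Text (Back-Translation)' setting quoted above can be sketched with public MarianMT checkpoints (English to German and back); the paper's trained back-translators are not necessarily these models:

```python
# Hedged sketch: back-translation augmentation of OCR text (en -> de -> en).
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

tok_fwd, mt_fwd = load("Helsinki-NLP/opus-mt-en-de")
tok_bwd, mt_bwd = load("Helsinki-NLP/opus-mt-de-en")

def translate(text, tok, model):
    batch = tok([text], return_tensors="pt", padding=True)
    out = model.generate(**batch)
    return tok.decode(out[0], skip_special_tokens=True)

ocr_text = "some men just want to watch the world burn"
paraphrase = translate(translate(ocr_text, tok_fwd, mt_fwd), tok_bwd, mt_bwd)
```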
EAES: Effective Augmented Embedding Spaces for Text-based Image Captioning
2022
IEEE Access
INDEX TERMS: image captioning, text-based image captioning, bottom-up top-down, grid feature, multimodal transformer, M4C ...
Based on the M4C-Captioner model, this paper proposes the simple but effective EAES embedding module for effectively embedding images and scene texts into the multimodal Transformer layers. ...
Therefore, we augment the representation with a grid feature that adds global contextual information to the image. ...
doi:10.1109/access.2022.3158763
fatcat:ukcjnnaaazedbmap3dg22bp5ki
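The grid feature mentioned in the EAES abstract can be illustrated by pooling a CNN feature map into a fixed grid of tokens; this is only the generic idea, with an assumed ResNet-50 backbone and grid size, not the EAES module itself:

```python
# Hedged sketch: global grid features from a CNN feature map.
import torch
from torchvision.models import resnet50

backbone = resnet50(weights="IMAGENET1K_V2")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # keep conv trunk

image = torch.randn(1, 3, 448, 448)          # dummy image batch
feat_map = backbone(image)                   # (1, 2048, 14, 14)
grid = torch.nn.functional.adaptive_avg_pool2d(feat_map, (7, 7))
tokens = grid.flatten(2).transpose(1, 2)     # (1, 49, 2048) grid tokens
```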
Understanding News Text and Images Connection with Context-enriched Multimodal Transformers
2022
Proceedings of the 30th ACM International Conference on Multimedia
The connection between news and the images that illustrate them goes beyond matching visual concepts to natural language. ...
Instead, the open-domain and event-reporting nature of news leads to semantically complex texts, in which images are used as a contextualizing element. ...
CONCLUSIONS In this paper, we proposed NewsLXMERT, a multimodal transformer leveraging news contextual elements to model the complex relationships between news and images. ...
doi:10.1145/3503161.3548430
fatcat:f7men5dygvfrhc3mw6foqzacc4
BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset
[article]
2022
arXiv
pre-print
models for Bangla image captioning. ...
As computers have become efficient at understanding visual information and transforming it into a written representation, research interest in tasks like automatic image captioning has seen a significant ...
For example, an image captioning system can be used in human-computer interaction, to develop a hearing-aid system for visually impaired people, or to perform concept-based image indexing for information retrieval. ...
arXiv:2205.14462v1
fatcat:5krju3v7xba2vcarjfkurezfeu
Contextual Modeling for 3D Dense Captioning on Point Clouds
[article]
2022
arXiv
pre-print
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and generate a distinctive natural language sentence describing each located object. ...
Specifically, the GCM module captures the inter-object relationship among all objects with global contextual information to obtain more complete scene information of the whole point clouds. ...
Image captioning aims at generating a single sentence for an image, while dense image captioning is the task of localizing multiple objects in a given image and describing each object by natural language ...
arXiv:2210.03925v1
fatcat:ar7ucs6mprdd7jib2grsdr6ote
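The GCM module's role, capturing inter-object relationships with global context, is in the spirit of self-attention over per-object features. A minimal sketch with assumed dimensions (the actual module is more involved):

```python
# Hedged sketch: global context via self-attention over object features.
import torch
import torch.nn as nn

num_objects, dim = 32, 256
obj_feats = torch.randn(1, num_objects, dim)   # per-object features (assumed shape)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
context, _ = attn(obj_feats, obj_feats, obj_feats)  # inter-object messages
obj_feats = obj_feats + context                      # context-enriched objects
```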
Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning
2022
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)
Video captioning has been an emerging research topic in computer vision, which aims to generate a natural sentence to correctly reflect the visual content of a video. ...
In this paper, we uniquely introduce a Retrieval Augmentation Mechanism (RAM) that enables the explicit reference to existing video-sentence pairs within any encoder-decoder captioning model. ...
In [64], a Conditional Random Field is exploited to model the relationships between different visual entities of the input video and to generate descriptions for it. Guadarrama et al. ...
doi:10.1145/3539225
fatcat:na34xvi25bcnfes7p43kdaqjge
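The Retrieval Augmentation Mechanism's first step, looking up existing video-sentence pairs, can be sketched as nearest-neighbor search over video embeddings; the names, dimensions, and cosine-similarity choice below are assumptions:

```python
# Hedged sketch: retrieve sentences paired with the most similar videos.
import numpy as np

def retrieve(query_emb, bank_embs, sentences, k=3):
    """Return the k sentences whose paired video embeddings are most
    cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    scores = b @ q
    top = np.argsort(-scores)[:k]
    return [sentences[i] for i in top]

bank = np.random.randn(1000, 512)              # embeddings of stored videos
caps = [f"caption {i}" for i in range(1000)]   # their paired sentences
hints = retrieve(np.random.randn(512), bank, caps)
```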
Scene Context-Aware Salient Object Detection
2021
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Salient object detection identifies objects in an image that grab visual attention. Although contextual features are considered in recent literature, they often fail in real-world complex scenarios. ...
To our knowledge, such high-level semantic contextual information of image scenes is underexplored for saliency detection in the literature. ...
We apply random cropping, flipping and multi-scale image training for data augmentation. ...
doi:10.1109/iccv48922.2021.00412
fatcat:z6yw3znqvzbodohlh7k5uh74ki
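The augmentation recipe quoted above (random cropping, flipping, multi-scale training) can be sketched by sampling a training scale per batch; the scale list and crop policy are assumptions:

```python
# Hedged sketch: multi-scale training augmentation.
import random
from torchvision import transforms

SCALES = [256, 320, 384]  # assumed training scales

def sample_transform():
    size = random.choice(SCALES)       # new scale each call / batch
    return transforms.Compose([
        transforms.Resize(size),       # rescale shorter side
        transforms.RandomCrop(size, pad_if_needed=True),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
```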
Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
[article]
2021
arXiv
pre-print
Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story). ...
We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning. ...
Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback. ...
arXiv:2110.10834v1
fatcat:odfsk6jmqbha3hbcjqphjoozqm
Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
[article]
2022
arXiv
pre-print
In this work, we discuss that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information, which we call contextual ...
CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot transfer capabilities in cross-modal correlation tasks such as visual classification and image retrieval. ...
Conclusion In this paper, we propose Anchor-augmented Vision-Language Space Alignment for zero-shot image captioning. ...
arXiv:2211.07275v1
fatcat:l4n6bilhs5chlgqdi2rqdlstdu
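The zero-shot transfer capability attributed to CLIP in this abstract can be reproduced in a few lines with the Hugging Face CLIP port; this demonstrates the capability the paper builds on, not its anchor-augmented alignment:

```python
# Hedged sketch: CLIP zero-shot image-text scoring.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
texts = ["a photo of a dog", "a photo of a cat"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
probs = out.logits_per_image.softmax(dim=-1)  # zero-shot label scores
```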
In Defense of Scene Graphs for Image Captioning
[article]
2021
arXiv
pre-print
Recently, image scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes. ...
SG2Caps outperforms existing scene graph-only captioning models by a large margin, indicating scene graphs as a promising representation for image captioning. ...
Unlike high dimensional (2048-D) region-level visual features, we utilize low-dimensional (256-D) image-level visual features in our visual-feature augmented SG2Caps framework. ...
arXiv:2102.04990v3
fatcat:ccsotsmzvrhlrkjakzyvh4bzkq
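For readers unfamiliar with the scene-graph input that SG2Caps-style models consume, a minimal sketch of its structure (objects, attributes, relationship triples) and one naive linearization; the contents below are invented for illustration:

```python
# Hedged sketch: a toy scene graph and a naive linearization for captioning.
scene_graph = {
    "objects": ["man", "horse", "beach"],
    "attributes": {"horse": ["brown"], "beach": ["sandy"]},
    "relations": [("man", "riding", "horse"), ("horse", "on", "beach")],
}

# Flatten entities, attribute pairs, and relation triples into tokens
# that a captioning encoder could embed.
tokens = (scene_graph["objects"]
          + [f"{o} {a}" for o, attrs in scene_graph["attributes"].items()
             for a in attrs]
          + [f"{s} {r} {o}" for s, r, o in scene_graph["relations"]])
```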
Showing results 1 — 15 out of 3,649 results