3,649 Hits in 2.1 sec

Self-Supervised Relationship Probing

Jiuxiang Gu, Jason Kuen, Shafiq R. Joty, Jianfei Cai, Vlad I. Morariu, Handong Zhao, Tong Sun
2020 Neural Information Processing Systems  
Structured representations of images that model visual relationships are beneficial for many vision and vision-language applications.  ...  By leveraging masked language modeling, contrastive learning, and dependency tree distances for self-supervision, our method learns better object features as well as implicit visual relationships.  ...  The second row presents the visual relationship distance graphs for the corresponding images. The bottom rows show the distance graphs and dependency trees for augmented captions.  ... 
dblp:conf/nips/GuKJ0MZS20 fatcat:d6hjosn2zbf3rfdqnrp3buqray
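
The "contrastive learning" ingredient this snippet mentions is usually an instance-discrimination objective. A minimal InfoNCE-style loss in PyTorch, sketched below under assumed conventions (paired views as positives, an illustrative temperature of 0.07), gives the general shape; it is a generic formulation, not this paper's exact loss.

```python
# Generic InfoNCE contrastive loss: matched rows of z1/z2 are positive pairs,
# all other rows in the batch serve as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two views of the same N instances."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                             # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)
```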

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation [article]

Chang Che, Qunwei Lin, Xinyu Zhao, Jiaxin Huang, Liqiang Yu
2024 arXiv   pre-print
The process of transforming input images into corresponding textual explanations stands as a crucial and complex endeavor within the domains of computer vision and natural language processing.  ...  In this paper, we propose an innovative ensemble approach that harnesses the capabilities of Contrastive Language-Image Pretraining models.  ...  For images, we apply random transformations such as rotations, flips, and crops to generate augmented versions of the original images.  ... 
arXiv:2401.06167v1 fatcat:rortkgokurdyzjxewzqap3zpgu
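
The image-side augmentation this snippet describes (random rotations, flips, and crops of each input image) can be sketched with torchvision; the specific parameter values below are illustrative assumptions, not the paper's settings.

```python
# Random rotation/flip/crop pipeline producing augmented views of an image.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),    # random rotation (angle is an assumption)
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    transforms.RandomResizedCrop(size=224),   # random crop, resized to 224x224
    transforms.ToTensor(),
])

# Each call yields a different augmented view of the same PIL image:
# view = augment(pil_image)
```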

Text Augmentation Using BERT for Image Captioning

Viktar Atliha, Dmitrij Šešok
2020 Applied Sciences  
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying image description by humans.  ...  The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image.  ...  However, augmentation is rarely used in vision-language tasks, such as image captioning, visual question answering, or visual dialog.  ... 
doi:10.3390/app10175978 fatcat:lndvsjxi5fhy7kr3fqmp6xolji
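
Text augmentation of the kind this title describes can be sketched with a masked-language-model fill: mask a word in an existing caption and keep BERT's top replacements as new caption variants. The masking strategy and checkpoint below are assumptions for illustration, not the authors' exact procedure.

```python
# Caption augmentation via masked language modeling (Hugging Face fill-mask).
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment_caption(caption: str, n_variants: int = 3) -> list:
    tokens = caption.split()
    i = random.randrange(len(tokens))            # mask one randomly chosen word
    tokens[i] = fill_mask.tokenizer.mask_token   # "[MASK]" for BERT
    predictions = fill_mask(" ".join(tokens), top_k=n_variants)
    return [p["sequence"] for p in predictions]  # full captions with the mask filled

# augment_caption("a man riding a horse on the beach")
```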

A Picture May Be Worth a Hundred Words for Visual Question Answering [article]

Yusuke Hirota, Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima, Ittetsu Taniguchi, Takao Onoye
2021 arXiv   pre-print
How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations.  ...  This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA.  ...  Even though they leverage captions of images for pre-training, they rely on deep visual features for image representation.  ... 
arXiv:2106.13445v1 fatcat:57wl3n2hznbz7hbrim4x4efvdm

Exploring Explicit and Implicit Visual Relationships for Image Captioning [article]

Zeliang Song, Xiaofei Zhou
2021 arXiv   pre-print
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.  ...  Recent methods for image captioning follow an encoder-decoder framework that transforms the sequence of salient regions in an image into natural language descriptions.  ...  In this section, we introduce how to explicitly and implicitly explore visual relationships for image captioning.  ... 
arXiv:2105.02391v1 fatcat:qmbtqobhgjfilli47tbq72mbdu

Multimodal Learning for Hateful Memes Detection [article]

Yi Zhou, Zhenhao Chen
2020 arXiv   pre-print
In this paper, we focus on multimodal hateful memes detection and propose a novel method that incorporates the image captioning process into the memes detection process.  ...  Memes are used for spreading ideas through social networks. Although most memes are created for humor, some memes become hateful through the combination of pictures and text.  ...  'OCR Text (Back-Translation)' means we augment the OCR sentences in the training set through trained back-translators. 'Image Caption' means using captions for meme detection.  ... 
arXiv:2011.12870v3 fatcat:nly2jznbhzhzlfehij7wghxjam
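
The back-translation augmentation named in the snippet ('OCR Text (Back-Translation)') round-trips a sentence through another language to obtain a paraphrase. Below is a hedged sketch using public MarianMT checkpoints; the pivot language and models are assumptions, since the paper trains its own back-translators.

```python
# Back-translation: English -> German -> English yields a paraphrased sentence.
from transformers import pipeline

en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text: str) -> str:
    german = en_de(text)[0]["translation_text"]
    return de_en(german)[0]["translation_text"]

# back_translate("when you finally find the remote")  # augmented OCR sentence
```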

EAES: Effective Augmented Embedding Spaces for Text-based Image Captioning

Khang Nguyen, Doanh C. Bui, Truc Trinh, Nguyen D. Vo
2022 IEEE Access  
Index Terms: image captioning, text-based image captioning, bottom-up top-down, grid feature, multimodal transformer, M4C  ...  Based on the M4C-Captioner model, this paper proposes a simple but effective EAES embedding module for embedding images and scene texts into the multimodal Transformer layers.  ...  Therefore, we augment a grid feature that adds global contextual information to the image.  ... 
doi:10.1109/access.2022.3158763 fatcat:ukcjnnaaazedbmap3dg22bp5ki

Understanding News Text and Images Connection with Context-enriched Multimodal Transformers

Cláudio Bartolomeu, Rui Nóbrega, David Semedo
2022 Proceedings of the 30th ACM International Conference on Multimedia  
The connection between news and the images that illustrate them goes beyond matching visual concepts to natural language.  ...  Instead, the open-domain and event-reporting nature of news leads to semantically complex texts, in which images are used as a contextualizing element.  ...  In this paper, we proposed NewsLXMERT, a multimodal transformer leveraging news contextual elements to model the complex relationships between news and images.  ... 
doi:10.1145/3503161.3548430 fatcat:f7men5dygvfrhc3mw6foqzacc4

BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [article]

Mohammad Faiyaz Khan, S.M. Sadiq-Ur-Rahman Shifath, Md Saiful Islam
2022 arXiv   pre-print
models for Bangla image captioning.  ...  As computers have become efficient at understanding visual information and transforming it into a written representation, research interest in tasks like automatic image captioning has seen a significant increase.  ...  For example, an image captioning system can be used in human-computer interaction, to develop hearing-aid systems for visually impaired people, or to perform concept-based image indexing for information retrieval.  ... 
arXiv:2205.14462v1 fatcat:5krju3v7xba2vcarjfkurezfeu

Contextual Modeling for 3D Dense Captioning on Point Clouds [article]

Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma
2022 arXiv   pre-print
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object in a set of point clouds and generate a distinctive natural language sentence describing each located object.  ...  Specifically, the GCM module captures the inter-object relationships among all objects with global contextual information to obtain more complete scene information about the whole point cloud.  ...  Image captioning aims at generating a single sentence for an image, while dense image captioning is the task of localizing multiple objects in a given image and describing each object in natural language.  ... 
arXiv:2210.03925v1 fatcat:ar7ucs6mprdd7jib2grsdr6ote

Retrieval Augmented Convolutional Encoder-Decoder Networks for Video Captioning

Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei
2022 ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP)  
Video captioning has been an emerging research topic in computer vision, which aims to generate a natural sentence that correctly reflects the visual content of a video.  ...  In this paper, we introduce a Retrieval Augmentation Mechanism (RAM) that enables explicit reference to existing video-sentence pairs within any encoder-decoder captioning model.  ...  In [64], a Conditional Random Field is exploited to model the relationships between different visual entities of the input video and generate descriptions for the video. Guadarrama et al.  ... 
doi:10.1145/3539225 fatcat:na34xvi25bcnfes7p43kdaqjge
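
The retrieval step behind a mechanism like RAM can be pictured as a nearest-neighbor search over a bank of training video embeddings, returning the sentences paired with the closest videos. The brute-force cosine search below is a stand-in sketch under that assumption, not the paper's module.

```python
# Retrieve the sentences paired with the k most similar stored videos.
import torch
import torch.nn.functional as F

def retrieve_sentences(query, bank, sentences, k=3):
    """query: (D,) video embedding; bank: (N, D); sentences: N paired captions."""
    sims = F.cosine_similarity(query.unsqueeze(0), bank)  # (N,) similarities
    topk = sims.topk(min(k, len(sentences))).indices
    return [sentences[i] for i in topk]
```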

Scene Context-Aware Salient Object Detection

Avishek Siris, Jianbo Jiao, Gary K.L. Tam, Xianghua Xie, Rynson W.H. Lau
2021 2021 IEEE/CVF International Conference on Computer Vision (ICCV)  
Salient object detection identifies objects in an image that grab visual attention. Although contextual features are considered in recent literature, they often fail in real-world complex scenarios.  ...  To our knowledge, such high-level semantic contextual information of image scenes is underexplored for saliency detection in the literature.  ...  We apply random cropping, flipping and multi-scale image training for data augmentation.  ... 
doi:10.1109/iccv48922.2021.00412 fatcat:z6yw3znqvzbodohlh7k5uh74ki

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [article]

Adyasha Maharana, Mohit Bansal
2021 arXiv   pre-print
Such information is even more important for story visualization since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story).  ...  We show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without needing fine-tuning.  ...  Acknowledgments We thank Darryl Hannan, Hanna Tischer, Hyounghun Kim, Jaemin Cho, and the reviewers for their useful feedback.  ... 
arXiv:2110.10834v1 fatcat:odfsk6jmqbha3hbcjqphjoozqm

Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment [article]

Junyang Wang, Yi Zhang, Ming Yan, Ji Zhang, Jitao Sang
2022 arXiv   pre-print
In this work, we discuss that directly employing CLIP for zero-shot image captioning relies more on the textual modality in context and largely ignores the visual information, which we call contextual  ...  CLIP (Contrastive Language-Image Pre-Training) has shown remarkable zero-shot transfer capabilities in cross-modal correlation tasks such as visual classification and image retrieval.  ...  Conclusion In this paper, we propose Anchor-augmented Vision-Language Space Alignment for zero-shot image captioning.  ... 
arXiv:2211.07275v1 fatcat:l4n6bilhs5chlgqdi2rqdlstdu
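
The cross-modal scoring that zero-shot CLIP captioning builds on can be reproduced with the public CLIP checkpoint: embed the image and candidate texts, then rank texts by similarity. This is plain CLIP scoring, not the paper's anchor-augmented alignment; the file path and captions are placeholders.

```python
# Rank candidate captions for an image with CLIP image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path
captions = ["a dog on the grass", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # (1, num_captions) similarity scores
best = captions[logits.argmax().item()]        # highest-scoring caption
```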

In Defense of Scene Graphs for Image Captioning [article]

Kien Nguyen and Subarna Tripathi and Bang Du and Tanaya Guha and Truong Q. Nguyen
2021 arXiv   pre-print
Recently, image scene graphs have been used to augment captioning models so as to leverage their structural semantics, such as object entities, relationships and attributes.  ...  SG2Caps outperforms existing scene graph-only captioning models by a large margin, indicating scene graphs as a promising representation for image captioning.  ...  Unlike high dimensional (2048-D) region-level visual features, we utilize low-dimensional (256-D) image-level visual features in our visual-feature augmented SG2Caps framework.  ... 
arXiv:2102.04990v3 fatcat:ccsotsmzvrhlrkjakzyvh4bzkq
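
A scene graph of the kind this entry leverages is simply a set of object nodes (with attributes) plus subject-predicate-object relationship triples. A toy structure follows, with field names that are illustrative assumptions rather than the SG2Caps format.

```python
# Minimal scene-graph containers: nodes carry labels and attributes,
# edges are (subject_index, predicate, object_index) relationship triples.
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # e.g. "horse"
    attributes: list = field(default_factory=list)  # e.g. ["brown"]

@dataclass
class SceneGraph:
    nodes: list
    edges: list  # [(0, "riding", 1), ...]

graph = SceneGraph(nodes=[Node("man"), Node("horse", ["brown"])],
                   edges=[(0, "riding", 1)])
```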
Showing results 1–15 of 3,649