The synergy of double attention: Combine sentence-level and word-level attention for image captioning.

Long Short-Term Memory (LSTM) combined with an attention mechanism is extensively used to generate semantic sentences of images in image captioning models. ... First, the Synergy-Gated Attention (SGA) method is proposed, which can process the spatial features and the salient region features of given images simultaneously. ... Introduction Image captioning is a task that makes a sentence from reading an image. The sentence should be fluent and hold semantic consistency with the image. ...

doi:10.3837/tiis.2022.10.010 fatcat:aqspiix37fcwbhcrr57tisd3ea

In this study, a dual-attention learning network with word and sentence embedding (WSDAN) is proposed. ... We design a module, transformer with sentence embedding (TSE), to extract a double embedding representation of questions containing keywords and medical information. ... double embedding of words and sentences, but only uses image-guided attention. (4) WSDAN(NP) only with F(V,Q) has no pretrained model that uses double embedding but only uses question-guided attention ...

arXiv:2210.00220v2 fatcat:sjxdelnvwzakleawzdf5tttm4q

An intuitive way to search for images is to use queries composed of an example image and a complementary text. ... While the first provides rich and implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired ... to produce a sentence-level feature. ...

arXiv:2203.08101v2 fatcat:uwqqwuwazvbl7ajowvdj4uknze

NLG tasks and datasets, and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; (c) highlight some future emphasis and relatively recent ... research issues that arise due to the increasing synergy between NLG and other artificial intelligence areas, such as computer vision, text and computational creativity. ... Then the sentence-level and word-level dynamic attention are formulated as $u^{\top} W_1 h$ and $h^{\top} W_2 h$ (43). Finally, the static and dynamic attentions are combined to reweight the article ...

arXiv:2112.11739v1 fatcat:ygrpp6f25ja4vfbhcr5ycfpxhy

In this paper, we present an overview of the major advances achieved in VLPMs for producing joint representations of vision and language. ... As the preliminaries, we briefly describe the general task definition and generic architecture of VLPMs. ... such as visual regions/patches and textual words of the aligned image-text pairs. ...

arXiv:2204.07356v5 fatcat:uesrj6kfkffvzeycvx7hredl3e

recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating ... This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures adopted in which such tasks are organised; (b) highlight a number of relatively ... Acknowledgements We thank the four reviewers for their detailed and constructive comments. ...

arXiv:1703.09902v4 fatcat:owx2fgo2bjej3b27ve2f3ledoe

topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar ... challenges faced in other areas of NLP, with an emphasis on different evaluation methods and the relationships between them. ... Acknowledgments We thank the four reviewers for their detailed and constructive comments. ...

doi:10.1613/jair.5477 fatcat:ycuteghjzncn7nx6pzkhzd6mn4

Notably, Product1M contains over 1 million image-caption pairs and consists of two sample types, i.e., single-product and multi-product samples, which encompass a wide variety of cosmetics brands. ... To promote the study of this challenging task, we contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. ... For the 'CAPTURE-1Inst' model, we feed the whole image and an image-level bounding box, which is of the same size as the image, to CAPTURE for inference. ...

arXiv:2107.14572v2 fatcat:cemydi2wojbyvcmh44flggtjem

When Calvin Klein uses "we are one," "one for all" and "for all for ever" combined with the image of the unified group of young people walking arm in arm to sell perfume, these terms and image have moved ... In the ck one ad, this contextual framing is completed with very few words: The caption at the top of the ad reads "we are one" and at the bottom reads "for all for ever." ...

doi:10.5070/b3242011670 fatcat:j6cwrxmnsjg4ziogrk72wpzxma

of information in sketches and image captions, as well as the potential benefit of combining the two modalities. ... Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. ... [63] is one of the first popular works to use the attention mechanism with an LSTM for image captioning. ...

arXiv:2203.02113v3 fatcat:zpt353655vejzi7j3fpbp3jq5i

Thus, there is a demand for a systematic review of paradigms to organize knowledge structures beyond data-level mentions. ... The survey concludes with discussions on the challenges and possible directions for future exploration. ... the word-level attention and Multi-level CNN [169] developing an input attention mechanism with attention-based pooling. ...

arXiv:2302.05019v1 fatcat:7in54wjwyzhfnkx755izrkzr3y

At present, there is a lack of research work that sorts out the overall progress of BMs and guides the follow-up research. ... With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm. ... However, even if the social bias is eliminated at the word level, the sentence-level bias can still exist due to the imbalanced combination of words. ...

arXiv:2203.14101v4 fatcat:rdikzudoezak5b36cf6hhne5u4

Citation

Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, Lingxiao Huang, Zheng Liang, Huawei Shen, Hui Zhang, Quanshi Zhang, Qingxiu Dong, Zhixing Tan, Mingxuan Wang, Shuo Wang, Long Zhou, Haoran Li, Junwei Bao, Yingwei Pan, Weinan Zhang, Zhou Yu, Rui Yan, Chence Shi, Minghao Xu, Zuobai Zhang, Guoqiang Wang, Xiang Pan, Mengjie Li, Xiaoyu Chu, Zijun Yao, Fangwei Zhu, Shulin Cao, Weicheng Xue, Zixuan Ma, Zhengyan Zhang, Shengding Hu, Yujia Qin, Chaojun Xiao, Zheni Zeng, Ganqu Cui, Weize Chen, Weilin Zhao, Yuan Yao, Peng Li, Wenzhao Zheng, Wenliang Zhao, Ziyi Wang, Borui Zhang, Nanyi Fei, Anwen Hu, Zenan Ling, Haoyang Li, Boxi Cao, Xianpei Han, Weidong Zhan, Baobao Chang, Hao Sun, Jiawen Deng, Chujie Zheng, Juanzi Li, Lei Hou, Xigang Cao, Jidong Zhai, Zhiyuan Liu, Maosong Sun, Jiwen Lu, Zhiwu Lu, Qin Jin, Ruihua Song, Ji-Rong Wen, Zhouchen Lin, Liwei Wang, Hang Su, Jun Zhu, Zhifang Sui, Jiajun Zhang, Yang Liu, Xiaodong He, Minlie Huang, Jian Tang, Jie Tang. "A Roadmap for Big Model." arXiv (2022)

Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and craving for a large training batch size. ... The synergy between two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. ... For each attention head from the last layer, we extract the attention map with [CLS] token as the query. ...

arXiv:2210.09304v1 fatcat:tkphueek3be2vas5ygplk2m2zq

For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still ... The present paper surveys neural approaches to conversational AI that have been developed in the last few years. ... al., 2015a), a sentence pair of different languages in machine translation (Gao et al., 2014a), and an image-text pair in image captioning (Fang et al., 2015) and so on. ...

arXiv:1809.08267v3 fatcat:j57xlm4ogferdnrpfs4f2jporq

Writing good English must be one of the most difficult jobs in the world. The tracking of a developing language that is rich, diverse, and constantly evolving in use and meaning is not an easy task. ... Today's rules and uses quickly become outdated, but this book captures English as it should be used now. ... For the illustration 18.1, Diabetes UK is the copyright holder for the website and image. Chinese artist Feng Feng provided oriental influence for WPP's Acknowledgements. ...

doi:10.5281/zenodo.5345141 fatcat:hd4nud5fyrg7rk3l36nrzbwrgi

Image Captioning with Synergy-Gated Attention and Recurrent Fusion LSTM

A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering [article]

ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [article]

A Survey of Natural Language Generation [article]

Vision-and-Language Pretrained Models: A Survey [article]

Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation [article]

Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining [article]

Message in a Bottle: An Advertising Campaign's Appropriation of Obama's Inclusive Rhetoric, and What This Reveals About National Identity

FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context [article]

A Comprehensive Survey on Automatic Knowledge Graph Construction [article]

A Roadmap for Big Model [article]

Non-Contrastive Learning Meets Language-Image Pre-Training [article]

Neural Approaches to Conversational AI [article]

Effective writing skills for Public Relations
