Abstract
In this paper, we propose BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation) for image-text matching. It improves image and text embedding in two ways. First, for modality conversion, we build a bidirectional image-text generation network to exploit the benefit that mutual conversion between the two modalities brings to image-text feature embedding. Second, for relational dependency, we build a bidirectional graph convolutional network that models the dependencies between objects, introducing non-Euclidean data into fine-grained image-text matching to exploit the benefit of these dependencies for fine-grained embedding. Experiments on two public datasets show that our method significantly outperforms many state-of-the-art models.
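To make the relational-dependency idea concrete, the following is a minimal sketch (not the authors' BiKA implementation) of one bidirectional graph-convolution step over detected-region features: messages are propagated along the relation graph in both edge directions and fused. The class name `BiGraphConv`, the feature dimension, and the adjacency construction are all illustrative assumptions.

```python
# Illustrative sketch only: a single bidirectional graph-convolution layer
# over object (region) features. Not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiGraphConv(nn.Module):
    """One graph-convolution step run along both edge directions and fused."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_fwd = nn.Linear(dim, dim)  # transform for messages along A
        self.w_bwd = nn.Linear(dim, dim)  # transform for messages along A^T

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (n_objects, dim) region embeddings, e.g. from an object detector
        # adj: (n_objects, n_objects) directed relation weights between objects
        def row_normalize(a: torch.Tensor) -> torch.Tensor:
            deg = a.sum(-1, keepdim=True).clamp(min=1e-6)
            return a / deg  # row-normalized propagation matrix

        fwd = row_normalize(adj) @ self.w_fwd(x)       # subject -> object messages
        bwd = row_normalize(adj.t()) @ self.w_bwd(x)   # object -> subject messages
        return F.relu(x + fwd + bwd)                   # residual fusion of both directions

# Toy usage: 5 detected regions with 256-d features and a random relation graph.
x = torch.randn(5, 256)
adj = torch.rand(5, 5)
out = BiGraphConv(256)(x, adj)  # (5, 256) relation-aware region embeddings
```

The bidirectional pass matters because object relations are directed (e.g. "person riding horse"), so aggregating along both the graph and its transpose lets each node receive context from both its subjects and its objects.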
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Nos. 61966004 and 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Z., Zhu, J., Wei, J., Zeng, Y. (2023). Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching. In: Amini, M.R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol. 13715. Springer, Cham. https://doi.org/10.1007/978-3-031-26409-2_24
DOI: https://doi.org/10.1007/978-3-031-26409-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26408-5
Online ISBN: 978-3-031-26409-2
eBook Packages: Computer Science (R0)