Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13715)

  • 773 Accesses

Abstract

In this paper, we propose a method called BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation) for the task of image-text matching. It improves the embedding of images and texts in two ways. First, for modality conversion, we build a bidirectional image-text generation network to explore the positive effect that mutual conversion between modalities has on image-text feature embedding. Second, for relational dependency, we build a bidirectional graph convolutional network that establishes dependencies between objects, introducing non-Euclidean data into fine-grained image-text matching to explore the positive effect of these dependencies on the fine-grained embedding of images and texts. Experiments on two public datasets show that our method significantly outperforms many state-of-the-art models.
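The relational-dependency component described above centers on a graph convolution over detected objects. As a rough, hypothetical sketch only (the paper's actual architecture, feature dimensions, and adjacency construction are not given here), the following PyTorch snippet shows how a single graph-convolution layer, in the spirit of Kipf and Welling, can make region embeddings relation-aware. The class name RegionGCN, the 36-region/1024-dimension shapes, and the similarity-based adjacency are illustrative assumptions, not the authors' method.

    # Hypothetical sketch, not the authors' released code: one graph-convolution
    # layer applied to detected object regions, so each region's embedding
    # absorbs information from related objects.
    import torch
    import torch.nn as nn

    class RegionGCN(nn.Module):
        def __init__(self, dim: int = 1024):
            super().__init__()
            self.weight = nn.Linear(dim, dim, bias=False)

        def forward(self, regions: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # regions: (n, dim) object features, e.g. from a Faster R-CNN detector
            # adj:     (n, n) pairwise object affinities
            adj = adj + torch.eye(adj.size(0))        # add self-loops
            adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize by degree
            return torch.relu(adj @ self.weight(regions))  # aggregate neighbors

    # Usage: 36 detected regions with 1024-d features; a simple similarity-based
    # adjacency stands in for whatever dependency graph the paper constructs.
    regions = torch.randn(36, 1024)
    adj = torch.softmax(regions @ regions.t(), dim=-1)
    relation_aware = RegionGCN()(regions, adj)        # (36, 1024)

Stacking such layers for both modalities (object regions on the image side, word nodes on the text side) would mirror the bidirectional design the abstract describes.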

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

Author information

Corresponding author

Correspondence to Zhixin Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Z., Zhu, J., Wei, J., Zeng, Y. (2023). Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching. In: Amini, M.R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science (LNAI), vol 13715. Springer, Cham. https://doi.org/10.1007/978-3-031-26409-2_24

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-26409-2_24

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26408-5

  • Online ISBN: 978-3-031-26409-2

  • eBook Packages: Computer Science, Computer Science (R0)
