Abstract
To tackle drone-based cross-view geo-localization, we address the problem of matching drone-view images with satellite-view images, which is extremely challenging due to large variations in view angle and view distance. Inspired by how humans recognize aerial images, we propose an effective Attention-guided Segment Transformer (AST). First, a novel segmentation strategy is introduced to cope with the large variations between aerial views: the segmentation is adaptive and non-uniform, so it can delineate regions that remain in correspondence even after significant viewpoint changes. Second, a new segment token module is designed to generate segment tokens, which are concatenated with the original class token to supplement local information. Compared with CNN-based methods, AST fully exploits the self-attention mechanism to establish global context correlations, while the segment token module also enables AST to extract local features effectively, a capability absent from the vanilla vision transformer. Notably, AST remains robust to viewpoint changes as long as overlapping regions exist. This property is confirmed by experiments on the University-1652 dataset, where AST achieves competitive performance on both drone-view target localization and drone navigation.
This work was supported by NSFC under grants U19A2071 and 61860206007, by the Sichuan Science and Technology Program under grant 2023YFG0334, and by Sichuan University under grant 2020SCUNG205.
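To make the segment-token idea in the abstract concrete, the sketch below shows one plausible way to pool ViT patch tokens into a few segment tokens and concatenate them with the class token. This is an illustrative reconstruction, not the authors' released code: the module name `SegmentTokenModule`, the choice of `num_segments = 4`, and the learned soft assignment of patches to segments are all assumptions standing in for the paper's attention-guided, adaptive segmentation.

```python
import torch
import torch.nn as nn

class SegmentTokenModule(nn.Module):
    """Illustrative sketch (not the authors' code): pool ViT patch tokens
    into a small number of segment tokens via a learned soft assignment,
    approximating AST's attention-guided, non-uniform segmentation."""

    def __init__(self, dim: int, num_segments: int = 4):  # num_segments is an assumption
        super().__init__()
        # Scores each patch against every segment (learned; an assumption).
        self.assign = nn.Linear(dim, num_segments)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) -- all output tokens except the class token.
        # Softmax over the patch axis so each segment's weights sum to 1.
        weights = self.assign(patch_tokens).softmax(dim=1)        # (B, N, S)
        # Each segment token is the weighted mean of its assigned patches.
        return torch.einsum("bns,bnd->bsd", weights, patch_tokens)  # (B, S, D)

# Usage: concatenate segment tokens with the class token, so the final
# descriptor carries both global context and supplementary local cues.
B, N, D = 2, 196, 768                       # batch, patches, embedding dim
tokens = torch.randn(B, N + 1, D)           # ViT output incl. class token
cls, patches = tokens[:, :1], tokens[:, 1:]
seg = SegmentTokenModule(D)(patches)        # (B, 4, D) segment tokens
descriptor = torch.cat([cls, seg], dim=1)   # (B, 5, D) global + local
print(descriptor.shape)
```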
References
Ali, A., et al.: XCiT: cross-covariance image transformers. Adv. Neural Inf. Process. Syst. 34, 20014–20027 (2021)
Brown, T., et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
Bui, D.V., Kubo, M., Sato, H.: A part-aware attention neural network for cross-view geo-localization between UAV and satellite. J. Robot. Netw. Artif. Life 9(3), 275–284 (2022)
Cao, H., et al.: Swin-Unet: Unet-like pure transformer for medical image segmentation. arXiv:2105.05537 (2021)
Chen, C.F.R., Fan, Q., Panda, R.: CrossViT: cross-attention multi-scale vision transformer for image classification. In: Proceedings of IEEE/CVF International Conference on Computer Vision, pp. 357–366 (2021)
Chopra, S., Hadsell, R., LeCun, Y.: Learning a similarity metric discriminatively, with application to face verification. In: IEEE Computer Society Conference on Computer Vision Pattern Recognition, vol. 1, pp. 539–546. IEEE (2005)
Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 34, 9355–9366 (2021)
Dai, M., Hu, J., Zhuang, J., Zheng, E.: A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 32(7), 4376–4389 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018)
Ding, L., Zhou, J., Meng, L., Long, Z.: A practical cross-view image matching method between UAV and satellite for UAV-based geo-localization. Remote Sens. 13(1), 47 (2020)
Dong, X., et al.: CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 12124–12134 (2022)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv:2010.11929 (2020)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. Adv. Neural Inf. Process. Syst. 34, 15908–15919 (2021)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Heo, B., Yun, S., Han, D., Chun, S., Choe, J., Oh, S.J.: Rethinking spatial dimensions of vision transformers. In: Proceedings of IEEE/CVF International Conference on Computer Vision, pp. 11936–11945 (2021)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
Hu, S., Feng, M., Nguyen, R.M., Lee, G.H.: CVM-Net: cross-view matching network for image-based ground-to-aerial geo-localization. In: Proceedings of IEEE Conference on Computer Vision Pattern Recognition, pp. 7258–7267 (2018)
Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
Lin, J., et al.: Joint representation learning and keypoint detection for cross-view geo-localization. IEEE Trans. Image Process. 31, 3780–3792 (2022)
Lin, T.Y., Cui, Y., Belongie, S., Hays, J.: Learning deep representations for ground-to-aerial geolocalization. In: Proceedings of IEEE Conference on Computer Vision Pattern Recognition, pp. 5007–5015 (2015)
Liu, L., Li, H.: Lending orientation to neural networks for cross-view geo-localization. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 5624–5633 (2019)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 10012–10022 (2021)
Lu, Z., Pu, T., Chen, T., Lin, L.: Content-aware hierarchical representation selection for cross-view geo-localization. In: Proceedings of Asian Conference on Computer Vision, pp. 4211–4224 (2022)
Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 29 (2016)
Pan, Z., Zhuang, B., Liu, J., He, H., Cai, J.: Scalable vision transformers with hierarchical pooling. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 377–386 (2021)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019)
Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668 (2018)
Shi, Y., Liu, L., Yu, X., Li, H.: Spatial-aware feature aggregation for image based cross-view geo-localization. Adv. Neural Inf. Process. Syst. 32 (2019)
Tian, X., Shao, J., Ouyang, D., Shen, H.T.: UAV-satellite view synthesis for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 32(7), 4804–4815 (2021)
Toker, A., Zhou, Q., Maximov, M., Leal-Taixé, L.: Coming down to earth: satellite-to-street view synthesis for geo-localization. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 6488–6497 (2021)
Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
Wang, P., et al.: KVT: k-NN attention for boosting vision transformers. In: Avidan, S., Brostow, G., Cisse, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13684, pp. 285–302. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20053-3_17
Wang, T., et al.: Each part matters: local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 32(2), 867–879 (2021)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: a general u-shaped transformer for image restoration. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 17683–17693 (2022)
Yang, H., Lu, X., Zhu, Y.: Cross-view geo-localization with layer-to-layer transformer. Adv. Neural Inf. Process. Syst. 34, 29009–29020 (2021)
Zhai, M., Bessinger, Z., Workman, S., Jacobs, N.: Predicting ground-level scene layout from aerial imagery. In: Proceedings of IEEE Conference on Computer Vision Pattern Recognition, pp. 867–875 (2017)
Zheng, Z., Wei, Y., Yang, Y.: University-1652: a multi-view multi-source benchmark for drone-based geo-localization. In: Proceedings of 28th ACM International Conference on Multimedia, pp. 1395–1403 (2020)
Zhou, D., et al.: DeepViT: towards deeper vision transformer. arXiv:2103.11886 (2021)
Zhu, S., Shah, M., Chen, C.: TransGeo: transformer is all you need for cross-view image geo-localization. In: Proceedings of IEEE/CVF Conference on Computer Vision Pattern Recognition, pp. 1162–1171 (2022)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhao, Z., Tang, T., Chen, J., Shi, X., Liu, Y. (2024). AST: An Attention-Guided Segment Transformer for Drone-Based Cross-View Geo-Localization. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14593. Springer, Singapore. https://doi.org/10.1007/978-981-97-2092-7_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2091-0
Online ISBN: 978-981-97-2092-7
eBook Packages: Computer Science, Computer Science (R0)