DOI: 10.1145/3581783.3611827
research-article

Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks

Published: 27 October 2023

ABSTRACT

Video object segmentation (VOS) is a fundamental task in computer vision and multimedia. Despite significant recent progress in VOS models, there has been little research on their adversarial robustness, which poses serious security risks in practical applications (e.g., autonomous driving and video surveillance). Adversarial robustness refers to a model's ability to resist malicious attacks that use adversarial examples. To address this gap, we propose a one-shot adversarial robustness evaluation framework (i.e., the adversary perturbs only the first frame) for VOS models, covering both white-box and black-box attacks. For white-box attacks, we introduce Objective Attention (OA) and Boundary Attention (BA) mechanisms to focus the attack on objects at both the pixel and object levels while mitigating issues such as multi-object attack imbalance, attack bias towards the background, and boundary reservation. For black-box attacks, we propose the Video Diverse Input (VDI) module, which uses data augmentation to simulate historical information and thereby improves our method's black-box transferability. We conduct extensive experiments to evaluate the adversarial robustness of VOS models with different structures. Our experimental results reveal that existing VOS models are more vulnerable to our attacks (both white-box and black-box) than to other state-of-the-art attacks. We further analyze the influence of different designs (e.g., memory and matching mechanisms) on adversarial robustness. Finally, we provide insights for designing more secure VOS models in the future.
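
The one-shot setting described in the abstract (only the first frame is perturbed, later frames are left untouched) can be illustrated with a toy sketch. This is not the paper's method: it omits OA, BA, and VDI, and swaps in a plain iterative sign-gradient attack under an L-infinity budget against a hypothetical template-matching "model". The function names, the 1-D frames, and the values of `eps`, `alpha`, and `steps` are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Frame = List[float]  # flattened grayscale frame, pixel values in [0, 1]
Mask = List[int]     # 1 = object, 0 = background


def templates(first: Frame, mask: Mask) -> Tuple[float, float, int, int]:
    """Toy 'memory': mean object and background appearance in frame 0."""
    n_fg = sum(mask)
    n_bg = len(mask) - n_fg
    t = sum(x for x, m in zip(first, mask) if m) / n_fg
    b = sum(x for x, m in zip(first, mask) if not m) / n_bg
    return t, b, n_fg, n_bg


def segment(first: Frame, mask: Mask, frame: Frame) -> Mask:
    """Label a pixel as object if it is closer to the object template."""
    t, b, _, _ = templates(first, mask)
    return [1 if (y - b) ** 2 > (y - t) ** 2 else 0 for y in frame]


def iou(pred: Mask, gt: Mask) -> float:
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0


def margin_grad(first, mask, later_frames, later_masks) -> List[float]:
    """Gradient w.r.t. frame 0 of sum_j sign_j * s_j, where
    s_j = (y_j - b)^2 - (y_j - t)^2 and sign_j = +1 (object) or -1."""
    t, b, n_fg, n_bg = templates(first, mask)
    d_t = d_b = 0.0
    for frame, gt in zip(later_frames, later_masks):
        for y, g in zip(frame, gt):
            sign = 1.0 if g else -1.0
            d_t += sign * 2.0 * (y - t)
            d_b += sign * -2.0 * (y - b)
    # t and b are means, so each pixel receives an equal share.
    return [d_t / n_fg if m else d_b / n_bg for m in mask]


def one_shot_attack(first, mask, later_frames, later_masks,
                    eps=0.03, alpha=0.01, steps=10) -> Frame:
    """Perturb ONLY frame 0 to lower the correct-label margin later on."""
    adv = list(first)
    for _ in range(steps):
        grad = margin_grad(adv, mask, later_frames, later_masks)
        for i, g in enumerate(grad):
            step = -alpha if g > 0 else (alpha if g < 0 else 0.0)
            x = adv[i] + step
            x = min(max(x, first[i] - eps), first[i] + eps)  # L-inf budget
            adv[i] = min(max(x, 0.0), 1.0)                   # valid pixel range
    return adv


# Hypothetical low-contrast clip: the object (0.52) moves one pixel left.
MASK = [0, 0, 1, 1, 1, 1, 0, 0]
FIRST = [0.48, 0.48, 0.52, 0.52, 0.52, 0.52, 0.48, 0.48]
LATER = [[0.48, 0.52, 0.52, 0.52, 0.52, 0.48, 0.48, 0.48]]
LATER_GT = [[0, 1, 1, 1, 1, 0, 0, 0]]

if __name__ == "__main__":
    adv_first = one_shot_attack(FIRST, MASK, LATER, LATER_GT)
    print("clean IoU:", iou(segment(FIRST, MASK, LATER[0]), LATER_GT[0]))
    print("attacked IoU:", iou(segment(adv_first, MASK, LATER[0]), LATER_GT[0]))
```

The toy clip is deliberately low contrast so that a small first-frame perturbation is enough to corrupt the templates that every later frame is matched against. Real VOS models and the attacks evaluated in the paper are far more complex, so this only conveys the threat model.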


    • Published in

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN: 9798400701085
      DOI: 10.1145/3581783

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 995 of 4,171 submissions, 24%
