Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks

Authors:
Kaixun Jiang, Fudan University, Shanghai, China (ORCID: 0000-0002-2878-0497)
Lingyi Hong, Fudan University, Shanghai, China (ORCID: 0000-0002-2749-5133)
Zhaoyu Chen, Fudan University, Shanghai, China (ORCID: 0000-0002-7112-2596)
Pinxue Guo, Fudan University, Shanghai, China (ORCID: 0000-0002-4388-9757)
Zeng Tao, Fudan University, Shanghai, China (ORCID: 0009-0006-2998-6709)
Yan Wang, Fudan University, Shanghai, China (ORCID: 0000-0002-4953-2660)
Wenqiang Zhang, Fudan University, Shanghai, China (ORCID: 0000-0002-3339-8751)

MM '23: Proceedings of the 31st ACM International Conference on Multimedia, October 2023, Pages 8598–8607. https://doi.org/10.1145/3581783.3611827

Published: 27 October 2023

ABSTRACT

Video object segmentation (VOS) is a fundamental task in computer vision and multimedia. Despite significant recent progress in VOS models, their adversarial robustness has received little study, which poses serious security risks for practical applications such as autonomous driving and video surveillance. Adversarial robustness refers to a model's ability to resist attacks that use adversarial examples. To address this gap, we propose a one-shot adversarial robustness evaluation framework for VOS models, in which the adversary perturbs only the first frame, covering both white-box and black-box attacks. For white-box attacks, we introduce Objective Attention (OA) and Boundary Attention (BA) mechanisms, which focus the attack on objects at both the pixel and object levels while mitigating issues such as attack imbalance across multiple objects, attack bias towards the background, and boundary reservation. For black-box attacks, we propose the Video Diverse Input (VDI) module, which uses data augmentation to simulate historical information and thereby improves the black-box transferability of our method. We conduct extensive experiments to evaluate the adversarial robustness of VOS models with different structures. The results show that existing VOS models are more vulnerable to our attacks, both white-box and black-box, than to other state-of-the-art attacks. We further analyze how different designs (e.g., memory and matching mechanisms) influence adversarial robustness. Finally, we provide insights for designing more secure VOS models in the future.
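To make the one-shot setting concrete, the following is a minimal sketch of an attack that perturbs only the first frame, under loudly stated assumptions. It is not the authors' method: the toy "model" segments later frames by comparing each pixel with foreground and background intensity prototypes taken from the first frame, the loss and its analytic gradient are invented for illustration, and OA, BA, and VDI are not implemented. All function names, the step size `alpha`, the budget `eps`, and the iteration count are hypothetical. The only property it is meant to show is the constraint structure: an iterative sign-gradient update, projected onto an L-infinity ball, applied to frame 0 while every later frame stays untouched.

```python
# Toy one-shot attack: only the first frame of the video is perturbed.
# Frames are flat lists of pixel intensities in [0, 1]; masks are lists of bools.
# ASSUMPTION: the prototype-matching "model" below is a stand-in, not a real VOS model.

def prototypes(frame, mask):
    """Mean foreground and background intensity of the reference (first) frame."""
    fg = [x for x, m in zip(frame, mask) if m]
    bg = [x for x, m in zip(frame, mask) if not m]
    return sum(fg) / len(fg), sum(bg) / len(bg)

def matching_loss(first, first_mask, later, later_masks):
    """Squared distance between later-frame pixels and the prototype of their label."""
    fg, bg = prototypes(first, first_mask)
    return sum((x - (fg if m else bg)) ** 2
               for frame, mask in zip(later, later_masks)
               for x, m in zip(frame, mask))

def loss_grad_first_frame(first, first_mask, later, later_masks):
    """Analytic gradient of matching_loss with respect to the first-frame pixels."""
    fg, bg = prototypes(first, first_mask)
    n_fg = sum(first_mask)
    n_bg = len(first_mask) - n_fg
    d_fg = sum(-2.0 * (x - fg) for f, k in zip(later, later_masks)
               for x, m in zip(f, k) if m)
    d_bg = sum(-2.0 * (x - bg) for f, k in zip(later, later_masks)
               for x, m in zip(f, k) if not m)
    # Each first-frame pixel contributes 1/n to the prototype of its own region.
    return [d_fg / n_fg if m else d_bg / n_bg for m in first_mask]

def sign(v):
    return (v > 0) - (v < 0)

def one_shot_attack(video, masks, eps=8 / 255, alpha=2 / 255, steps=10):
    """Maximise matching_loss by changing frame 0 only, within an L-infinity budget."""
    clean = video[0]
    adv = list(clean)
    later, later_masks = video[1:], masks[1:]
    for _ in range(steps):
        grad = loss_grad_first_frame(adv, masks[0], later, later_masks)
        adv = [x + alpha * sign(g) for x, g in zip(adv, grad)]
        # Project back onto the eps-ball around the clean frame and onto [0, 1].
        adv = [min(max(x, c - eps, 0.0), c + eps, 1.0) for x, c in zip(adv, clean)]
    return [adv] + [list(f) for f in later]

# Small demo: a 16-pixel video with 4 object pixels and 3 frames.
mask = [True] * 4 + [False] * 12
video = [[0.8] * 4 + [0.2] * 12, [0.7] * 4 + [0.3] * 12, [0.7] * 4 + [0.3] * 12]
masks = [mask, mask, mask]
adv_video = one_shot_attack(video, masks)
clean_loss = matching_loss(video[0], mask, video[1:], masks[1:])
adv_loss = matching_loss(adv_video[0], mask, video[1:], masks[1:])
```

In this toy setup the perturbation shifts the first-frame prototypes away from the later-frame intensities, so the matching loss on unmodified later frames rises even though only frame 0 changed. A real evaluation would replace the analytic gradient with backpropagation through an actual VOS model.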

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Video segmentation

Published in
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy
Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adversarial robustness
one-shot attack
video object segmentation
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 995 of 4,171 submissions, 24%