ABSTRACT
Video object segmentation (VOS) is a fundamental task in computer vision and multimedia. Despite the significant progress of recent VOS models, their adversarial robustness has received little study, which poses serious security risks in practical applications such as autonomous driving and video surveillance. Adversarial robustness refers to a model's ability to withstand malicious attacks that use adversarial examples. To address this gap, we propose a one-shot adversarial robustness evaluation framework for VOS models (i.e., the adversary perturbs only the first frame), covering both white-box and black-box attacks. For white-box attacks, we introduce Objective Attention (OA) and Boundary Attention (BA) mechanisms, which focus the attack on objects at both the pixel and object levels while mitigating issues such as imbalanced attacks across multiple objects, attack bias towards the background, and boundary reservation. For black-box attacks, we propose the Video Diverse Input (VDI) module, which uses data augmentation to simulate historical information and thereby improves our method's black-box transferability. We conduct extensive experiments to evaluate the adversarial robustness of VOS models with different structures. Our experimental results reveal that existing VOS models are more vulnerable to our attacks (both white-box and black-box) than to other state-of-the-art attacks. We further analyze how different designs (e.g., memory and matching mechanisms) influence adversarial robustness. Finally, we provide insights for designing more secure VOS models in the future.
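The one-shot setting described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it omits OA, BA, and VDI, and it replaces a real VOS network with a hypothetical toy matcher (`toy_vos_logits`) that scores each later frame pixel-wise against the first frame. The attack itself is a plain projected sign-gradient ascent, and the names `one_shot_attack`, `eps`, `alpha`, `steps`, and `scale` are assumptions chosen for illustration. The only point shown is the threat model: the perturbation is confined to the first frame and bounded in the L-infinity norm, while the loss is accumulated over the later, unmodified frames.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def toy_vos_logits(first_frame, later_frames, scale=8.0):
    # Hypothetical stand-in for a VOS model (an assumption, not a real network):
    # every later frame is matched pixel-wise against the first (reference) frame.
    return scale * (later_frames * first_frame[None] - 0.25)


def bce_loss(logits, masks):
    # Mean binary cross-entropy between predicted and ground-truth masks.
    p = np.clip(sigmoid(logits), 1e-7, 1.0 - 1e-7)
    return float(-np.mean(masks * np.log(p) + (1.0 - masks) * np.log(1.0 - p)))


def one_shot_attack(first_frame, later_frames, masks,
                    eps=8 / 255, alpha=2 / 255, steps=10, scale=8.0):
    """Perturb only the first frame to raise the loss on all later frames."""
    adv = first_frame.copy()
    for _ in range(steps):
        logits = toy_vos_logits(adv, later_frames, scale)
        # Analytic gradient of the BCE loss w.r.t. the first frame, up to a
        # positive constant factor; only its sign is used below.
        grad = np.mean((sigmoid(logits) - masks) * scale * later_frames, axis=0)
        adv = adv + alpha * np.sign(grad)                        # ascent step
        adv = np.clip(adv, first_frame - eps, first_frame + eps)  # L-inf budget
        adv = np.clip(adv, 0.0, 1.0)                              # valid pixel range
    return adv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    first = rng.uniform(0.2, 0.8, size=(8, 8))
    later = rng.uniform(0.2, 0.8, size=(4, 8, 8))  # later frames stay untouched
    masks = np.zeros((4, 8, 8))
    masks[:, 2:6, 2:6] = 1.0                       # a square "object"

    adv_first = one_shot_attack(first, later, masks)
    print("clean loss:", bce_loss(toy_vos_logits(first, later), masks))
    print("adv loss:  ", bce_loss(toy_vos_logits(adv_first, later), masks))
    print("max |delta|:", np.max(np.abs(adv_first - first)))
```

In a real evaluation the toy matcher would be replaced by an actual VOS model and the analytic gradient by automatic differentiation; the budget projection and the restriction to the first frame would stay the same.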
Index Terms
- Exploring the Adversarial Robustness of Video Object Segmentation via One-shot Adversarial Attacks