ABSTRACT
Tag localization, which localizes the video clips relevant to an associated semantic tag, has become an important research topic in video retrieval and recommendation. Most existing approaches depend heavily on carefully selected, hand-crafted features designed by experts and do not account for multimodality. To exploit the complementarity of different modalities and the advantages of learned features, in this paper we propose a multimodal tag localization framework that uses deep learning to learn both visual and textual features of videos, followed by a multimodal fusion of the visual and textual results. Extensive experiments on a public dataset show that the proposed approach achieves promising results: tag localization based on visual deep learning greatly improves localization precision, and the multimodal fusion of the visual and textual modalities improves precision further, despite the low performance of the textual modality alone.
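The fusion step described above can be illustrated with a minimal sketch: each modality produces a per-clip relevance score for the tag, the scores are combined by a weighted sum, and the top-scoring clips are returned as the localized segments. The function names, the weight `alpha`, and the example scores below are illustrative assumptions, not the authors' actual implementation.

```python
def fuse_scores(visual_scores, textual_scores, alpha=0.7):
    """Weighted late fusion of per-clip relevance scores from two modalities.

    `alpha` is a hypothetical weight favoring the visual modality,
    mirroring the abstract's finding that it is the stronger one.
    """
    assert len(visual_scores) == len(textual_scores)
    return [alpha * v + (1.0 - alpha) * t
            for v, t in zip(visual_scores, textual_scores)]

def localize_tag(visual_scores, textual_scores, alpha=0.7, top_k=2):
    """Return the indices of the top_k clips most relevant to the tag."""
    fused = fuse_scores(visual_scores, textual_scores, alpha)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:top_k]

# Toy example: five clips of one video, scored by each modality.
visual = [0.1, 0.8, 0.6, 0.2, 0.9]
textual = [0.3, 0.2, 0.7, 0.1, 0.4]
print(localize_tag(visual, textual))  # -> [4, 2]
```

In practice the visual scores would come from a CNN applied to sampled frames and the textual scores from word-embedding similarity between the tag and clip-level text, but the fusion itself reduces to this kind of weighted combination.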