ABSTRACT
Tag localization, which localizes the video clips relevant to an associated semantic tag, has become an important research topic in video retrieval and recommendation. Most existing approaches depend heavily on carefully selected, hand-crafted features designed by experts and do not account for multimodality. To exploit the complementarity of different modalities and the advantages of learned features, in this paper we propose a multimodal tag localization framework that uses deep learning to learn both visual and textual features of videos, followed by a multimodal fusion of the visual and textual results. Extensive experiments on a public dataset show that the proposed approach achieves promising results: tag localization based on visual deep learning greatly improves localization precision, and the multimodal fusion of the visual and textual modalities improves precision further, despite the low performance of the textual modality alone.
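The fusion step described above can be illustrated with a minimal sketch: each modality produces a per-clip relevance score for the tag, the scores are combined by a weighted sum, and the top-scoring clips are returned as the localized segments. The function names, the weight `alpha`, and the example scores below are illustrative assumptions, not the authors' actual implementation.

```python
def fuse_scores(visual_scores, textual_scores, alpha=0.7):
    """Weighted late fusion of per-clip relevance scores from two modalities.

    `alpha` is a hypothetical weight favoring the visual modality,
    mirroring the abstract's finding that it is the stronger one.
    """
    assert len(visual_scores) == len(textual_scores)
    return [alpha * v + (1.0 - alpha) * t
            for v, t in zip(visual_scores, textual_scores)]

def localize_tag(visual_scores, textual_scores, alpha=0.7, top_k=2):
    """Return the indices of the top_k clips most relevant to the tag."""
    fused = fuse_scores(visual_scores, textual_scores, alpha)
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:top_k]

# Toy example: five clips of one video, scored by each modality.
visual = [0.1, 0.8, 0.6, 0.2, 0.9]
textual = [0.3, 0.2, 0.7, 0.1, 0.4]
print(localize_tag(visual, textual))  # -> [4, 2]
```

In practice the visual scores would come from a CNN applied to sampled frames and the textual scores from word-embedding similarity between the tag and clip-level text, but the fusion itself reduces to this kind of weighted combination.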