DOI: 10.1145/2808492.2808542
research-article

Multimodal tag localization based on deep learning

Published: 19 August 2015

ABSTRACT

Tag localization, which identifies the video clips relevant to an associated semantic tag, has become an important research topic in video retrieval and recommendation. Most existing approaches depend heavily on carefully selected features that are manually designed by experts, and they do not take multimodality into account. To exploit the complementarity of different modalities and take advantage of learned features, we propose a multimodal tag localization framework that uses deep learning to learn both visual and textual features of videos, then fuses the visual and textual results. Extensive experiments on a public dataset show that the proposed approach achieves promising results. Tag localization based on visual deep learning greatly improves precision, and fusing the visual and textual modalities improves precision further, despite the low performance of the textual modality on its own.
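
The abstract does not say how the visual and textual results are fused, so the following is only a minimal sketch of one common option, weighted late fusion of per-shot relevance scores. The function names `fuse_scores` and `localize_tag`, the `visual_weight` parameter, and the threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_scores(
    visual: Sequence[float],
    textual: Sequence[float],
    visual_weight: float = 0.8,  # assumed value; the paper's weighting is not given here
) -> List[float]:
    """Combine per-shot relevance scores from two modalities by a weighted sum."""
    if len(visual) != len(textual):
        raise ValueError("both modalities must score the same number of shots")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    textual_weight = 1.0 - visual_weight
    return [visual_weight * v + textual_weight * t for v, t in zip(visual, textual)]


def localize_tag(fused: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of shots whose fused score reaches the threshold."""
    return [i for i, score in enumerate(fused) if score >= threshold]


if __name__ == "__main__":
    # Hypothetical scores for three shots of one video and one tag.
    visual_scores = [0.9, 0.2, 0.6]
    textual_scores = [0.5, 0.1, 0.0]
    fused = fuse_scores(visual_scores, textual_scores)
    print(localize_tag(fused))
```

Weighting the visual modality more heavily reflects the abstract's remark that the textual modality performs poorly alone; in practice the weight would be tuned on held-out data.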

    • Published in
      ICIMCS '15: Proceedings of the 7th International Conference on Internet Multimedia Computing and Service
      August 2015
      397 pages
      ISBN:9781450335287
      DOI:10.1145/2808492
      • General Chairs:
      • Ramesh Jain,
      • Shuqiang Jiang,
      • Program Chairs:
      • John Smith,
      • Jitao Sang,
      • Guohui Li

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      ICIMCS '15 paper acceptance rate: 20 of 128 submissions (16%). Overall acceptance rate: 163 of 456 submissions (36%).
