ABSTRACT
Group cohesiveness reflects the level of intimacy that people feel with each other, and a dialogue robot that can understand group cohesiveness could help promote human communication. However, group cohesiveness is a complex concept that is difficult to predict from image pixels alone. Inspired by the fact that humans intuitively associate linguistic knowledge accumulated in the brain with the visual images they see, we propose a linguistic knowledge injectable deep neural network (LDNN) that builds a visual model (the visual LDNN) which predicts group cohesiveness by automatically associating the linguistic knowledge hidden behind images. LDNN consists of a visual encoder and a language encoder, and applies domain adaptation and a linguistic knowledge transfer mechanism to transfer linguistic knowledge from a language model into the visual LDNN. We train LDNN by adding descriptions to the training and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0), and test the visual LDNN without any description. On the test set, we compare the visual LDNN with various fine-tuned DNN models and three state-of-the-art models. The results demonstrate that the visual LDNN not only improves on the fine-tuned DNN models, achieving an MSE very close to the state-of-the-art models, but is also a practical and efficient method that requires relatively little preprocessing. Furthermore, ablation studies confirm that LDNN is an effective method for injecting linguistic knowledge into visual models.
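The abstract describes a two-branch design: a visual branch that must run alone at test time, and a language branch whose knowledge is transferred into it during training. Below is a minimal PyTorch sketch of how such an architecture could be wired up. All concrete choices here (the stand-in backbones, the 768-d text embedding, the feature dimension, and the feature-alignment loss used as the transfer mechanism) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDNNSketch(nn.Module):
    """Hypothetical sketch of a linguistic-knowledge-injectable network.

    During training, intermediate visual features are pulled toward the
    features of a language branch that encodes the image description, so
    that at test time the visual branch can run without any description.
    Backbones, dimensions, and losses are assumptions for illustration.
    """

    def __init__(self, feat_dim=512, text_dim=768):
        super().__init__()
        # Visual encoder: in practice a pretrained CNN; a tiny stand-in here.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # Language encoder: stand-in for projecting embeddings from a
        # pretrained language model such as BERT.
        self.language_encoder = nn.Sequential(
            nn.Linear(text_dim, feat_dim), nn.ReLU(),
        )
        # Regression head predicting the cohesiveness score.
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, image, text_emb=None):
        v = self.visual_encoder(image)
        score = self.head(v)
        if text_emb is None:
            # Test time: image only, no description available.
            return score, None
        t = self.language_encoder(text_emb)
        # Knowledge-transfer term: align visual features with the
        # (frozen) linguistic features of the paired description.
        transfer_loss = F.mse_loss(v, t.detach())
        return score, transfer_loss
```

In this sketch the training objective would combine the regression loss with the transfer term, e.g. `mse(score, label) + lam * transfer_loss` for some hypothetical weight `lam`, while inference passes only the image, matching the paper's setup of testing the visual LDNN without any description.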