Abstract
The need for human-centered, affective multimedia interfaces has motivated research in automatic emotion recognition. In this article, we focus on facial emotion recognition and, specifically, on a domain in which speakers produce emotional facial expressions while speaking. The main challenge of this domain is the presence of modulations due to both emotion and speech. For example, an individual's mouth movement may be similar when smiling and when pronouncing the phoneme /IY/, as in “cheese”. This confusion decreases the performance of facial emotion recognition systems. In our previous work, we investigated the joint effects of emotion and speech on facial movement and found that it is critical to employ proper temporal segmentation and to leverage knowledge of spoken content to improve classification performance. In the current work, we investigate the temporal characteristics of specific regions of the face: the forehead, eyebrow, cheek, and mouth. We present a methodology that uses the temporal patterns of these regions in the context of a facial emotion recognition system. We test our proposed approaches on two emotion datasets, IEMOCAP and SAVEE. Our results demonstrate that combining emotion recognition systems based on different facial regions improves overall accuracy compared to systems that do not leverage the distinct characteristics of individual regions.
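The region-combination idea described above can be illustrated with a minimal decision-level fusion sketch: train one classifier per facial region and average their posterior probabilities. This is a hypothetical illustration, not the authors' actual system; the features, model choice, and synthetic data are placeholder assumptions.

```python
# Hypothetical sketch of decision-level fusion across facial regions.
# NOT the authors' pipeline: features, classifiers, and data below are
# synthetic stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
REGIONS = ["forehead", "eyebrow", "cheek", "mouth"]
N_TRAIN, N_TEST, N_FEAT, N_EMOTIONS = 200, 20, 10, 4

# Synthetic per-region feature vectors standing in for temporal
# statistics of facial movement within each region.
X_train = {r: rng.normal(size=(N_TRAIN, N_FEAT)) for r in REGIONS}
X_test = {r: rng.normal(size=(N_TEST, N_FEAT)) for r in REGIONS}
y_train = rng.integers(0, N_EMOTIONS, size=N_TRAIN)

# One classifier per facial region.
models = {r: LogisticRegression(max_iter=1000).fit(X_train[r], y_train)
          for r in REGIONS}

# Fuse by averaging the per-region posterior probabilities, then take
# the argmax to obtain one emotion label per test utterance.
probs = np.mean([models[r].predict_proba(X_test[r]) for r in REGIONS],
                axis=0)
fused_prediction = probs.argmax(axis=1)
```

Averaging posteriors is only one of many fusion schemes (weighted voting or a second-stage classifier are common alternatives); it is used here because it keeps the per-region systems independent, mirroring the idea of combining region-specific recognizers.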
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for “Emotion Recognition During Speech Using Dynamics of Multiple Regions of the Face”