Abstract
The need for human-centered, affective multimedia interfaces has motivated research in automatic emotion recognition. In this article, we focus on facial emotion recognition and, specifically, on a domain in which speakers produce emotional facial expressions while speaking. The main challenge of this domain is the presence of modulations due to both emotion and speech. For example, an individual's mouth movement may be similar when smiling and when pronouncing the phoneme /IY/, as in “cheese”. This confusion decreases the performance of facial emotion recognition systems. In our previous work, we investigated the joint effects of emotion and speech on facial movement and found that it is critical to employ proper temporal segmentation and to leverage knowledge of spoken content to improve classification performance. In the current work, we investigate the temporal characteristics of specific regions of the face: the forehead, eyebrow, cheek, and mouth. We present a methodology that uses the temporal patterns of these regions in the context of a facial emotion recognition system. We test our proposed approaches on two emotion datasets, IEMOCAP and SAVEE. Our results demonstrate that combining emotion recognition systems based on different facial regions improves overall accuracy compared to systems that do not leverage the distinct characteristics of individual regions.
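The region-combination idea described above can be illustrated with a minimal decision-level fusion sketch: train one classifier per facial region and average their posterior probabilities. This is a hypothetical illustration, not the authors' actual system; the features, model choice, and synthetic data are placeholder assumptions.

```python
# Hypothetical sketch of decision-level fusion across facial regions.
# NOT the authors' pipeline: features, classifiers, and data below are
# synthetic stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
REGIONS = ["forehead", "eyebrow", "cheek", "mouth"]
N_TRAIN, N_TEST, N_FEAT, N_EMOTIONS = 200, 20, 10, 4

# Synthetic per-region feature vectors standing in for temporal
# statistics of facial movement within each region.
X_train = {r: rng.normal(size=(N_TRAIN, N_FEAT)) for r in REGIONS}
X_test = {r: rng.normal(size=(N_TEST, N_FEAT)) for r in REGIONS}
y_train = rng.integers(0, N_EMOTIONS, size=N_TRAIN)

# One classifier per facial region.
models = {r: LogisticRegression(max_iter=1000).fit(X_train[r], y_train)
          for r in REGIONS}

# Fuse by averaging the per-region posterior probabilities, then take
# the argmax to obtain one emotion label per test utterance.
probs = np.mean([models[r].predict_proba(X_test[r]) for r in REGIONS],
                axis=0)
fused_prediction = probs.argmax(axis=1)
```

Averaging posteriors is only one of many fusion schemes (weighted voting or a second-stage classifier are common alternatives); it is used here because it keeps the per-region systems independent, mirroring the idea of combining region-specific recognizers.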
Supplemental Material
Available for Download
Supplemental movie, appendix, image, and software files for “Emotion Recognition During Speech Using Dynamics of Multiple Regions of the Face”