Emotion Recognition During Speech Using Dynamics of Multiple Regions of the Face

Published: 21 October 2015

Abstract

The need for human-centered, affective multimedia interfaces has motivated research in automatic emotion recognition. In this article, we focus on facial emotion recognition. Specifically, we target a domain in which speakers produce emotional facial expressions while speaking. The main challenge of this domain is the presence of modulations due to both emotion and speech. For example, an individual's mouth movement may be similar when he smiles and when he pronounces the phoneme /IY/, as in “cheese”. The result of this confusion is a decrease in performance of facial emotion recognition systems. In our previous work, we investigated the joint effects of emotion and speech on facial movement. We found that it is critical to employ proper temporal segmentation and to leverage knowledge of spoken content to improve classification performance. In the current work, we investigate the temporal characteristics of specific regions of the face, such as the forehead, eyebrow, cheek, and mouth. We present methodology that uses the temporal patterns of specific regions of the face in the context of a facial emotion recognition system. We test our proposed approaches on two emotion datasets, the IEMOCAP and SAVEE datasets. Our results demonstrate that the combination of emotion recognition systems based on different facial regions improves overall accuracy compared to systems that do not leverage different characteristics of individual regions.
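
To make the fusion idea in the final sentence concrete, the sketch below shows one simple way to combine region-specific classifiers: train one classifier per region named above (forehead, eyebrow, cheek, mouth) and average their class posteriors. This is a minimal illustration under stated assumptions, not the paper's actual pipeline; the feature dimensions, synthetic data, choice of logistic regression, and score-averaging fusion rule are assumptions made only for illustration.

```python
# Minimal sketch (not the paper's system): late fusion of per-region emotion
# classifiers. Assumes each facial region has already been reduced to a
# fixed-length vector of temporal features for every utterance; the feature
# size, synthetic data, and averaging rule below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
REGIONS = ["forehead", "eyebrow", "cheek", "mouth"]
N_UTTERANCES, N_FEATURES, N_EMOTIONS = 200, 24, 4  # hypothetical sizes

# Synthetic per-region feature matrices and utterance-level emotion labels.
X = {r: rng.normal(size=(N_UTTERANCES, N_FEATURES)) for r in REGIONS}
y = rng.integers(0, N_EMOTIONS, size=N_UTTERANCES)

# One classifier per facial region.
models = {r: LogisticRegression(max_iter=1000).fit(X[r], y) for r in REGIONS}

# Late fusion: average the per-region class posteriors and pick the emotion
# with the highest combined score.
posteriors = np.mean([models[r].predict_proba(X[r]) for r in REGIONS], axis=0)
predicted_emotion = posteriors.argmax(axis=1)
print(predicted_emotion[:10])
```

Averaging posteriors is only one possible fusion rule; weighted combinations or a second-stage classifier trained on the concatenated region scores are common alternatives.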

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 12, Issue 1s
Special Issue on Smartphone-Based Interactive Technologies, Systems, and Applications and Special Issue on Extended Best Papers from ACM Multimedia 2014
October 2015, 317 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/2837676

          Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 21 October 2015
• Accepted: 1 July 2015
• Revised: 1 March 2015
• Received: 1 February 2015

Published in TOMM, Volume 12, Issue 1s
