The International Arab Journal of Information Technology (IAJIT)


Evaluation of Influence of Arousal-Valence Primitives on Speech Emotion Recognition

Speech emotion recognition is a challenging research problem of significant scientific interest, and the field has seen considerable research and development in recent years. In this article, we present a study that aims to improve the accuracy of speech emotion recognition using a hierarchical method based on Gaussian Mixture Models and Support Vector Machines for dimensional, continuous prediction of emotions in valence (positive vs. negative emotion) and arousal (degree of emotional intensity) space. Along these dimensions, emotions are first categorized into N broad groups; these N groups are then further subdivided using a spectral representation. We verify and compare the behaviour of the different proposed multi-level models in order to study the differential effects of emotional valence and arousal on the recognition of a basic emotion. Experiments are performed on the Berlin Emotional Speech database and the Surrey Audio-Visual Expressed Emotion corpus, which express different emotions in German and English, respectively.
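The hierarchy described in the abstract can be sketched as a two-stage classifier: a first stage that assigns an utterance to a broad arousal/valence group by maximum-likelihood scoring over per-group Gaussian Mixture Models, and a second stage that resolves the basic emotion within that group with a Support Vector Machine. The sketch below is a minimal illustration, not the authors' implementation: the group-to-emotion mapping, the feature vectors (synthetic stand-ins for acoustic features such as MFCCs) and all hyperparameters are assumptions.

```python
# Minimal sketch of a two-stage (GMM -> SVM) hierarchical emotion classifier.
# Assumptions: an illustrative arousal-based grouping of four emotions, and
# synthetic fixed-length feature vectors in place of real acoustic features.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative mapping: basic emotions -> broad arousal group.
GROUPS = {"high_arousal": ["anger", "happiness"],
          "low_arousal": ["sadness", "boredom"]}

def make_data(n_per_emotion=60, dim=13):
    """Synthetic well-separated feature vectors, one cluster per emotion."""
    centers = {e: rng.normal(scale=4.0, size=dim)
               for emotions in GROUPS.values() for e in emotions}
    X, y = [], []
    for emotion, c in centers.items():
        X.append(c + rng.normal(size=(n_per_emotion, dim)))
        y += [emotion] * n_per_emotion
    return np.vstack(X), np.array(y)

class HierarchicalGmmSvm:
    def fit(self, X, y):
        self.gmms, self.svms = {}, {}
        for group, emotions in GROUPS.items():
            mask = np.isin(y, emotions)
            # Stage 1: one GMM per arousal group, fit on that group's data.
            self.gmms[group] = GaussianMixture(
                n_components=2, random_state=0).fit(X[mask])
            # Stage 2: one SVM per group, trained on that group's emotions.
            self.svms[group] = SVC(kernel="rbf").fit(X[mask], y[mask])
        return self

    def predict(self, X):
        out = []
        for x in X:
            x = x.reshape(1, -1)
            # Stage 1: pick the group whose GMM has the highest likelihood.
            group = max(self.gmms, key=lambda g: self.gmms[g].score(x))
            # Stage 2: resolve the basic emotion within that group.
            out.append(self.svms[group].predict(x)[0])
        return np.array(out)

X, y = make_data()
model = HierarchicalGmmSvm().fit(X, y)
acc = float((model.predict(X) == y).mean())
```

The two-stage split mirrors the paper's idea that arousal/valence primitives constrain which basic emotions remain plausible, so the second-stage SVM only has to discriminate within a smaller, acoustically more homogeneous group.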

Valence recognition accuracy (%):

Emotion            Emo-DB   SAVEE
Positive valence   76.19    50.00
Negative valence   73.91    71.25
Neutral valence    71.42    82.50

[5] Ekman P., Friesen W., and Ellsworth P., Emotion in the human face: Guidelines for Research and an Integration of Findings, Elsevier, 2013.

[6] Ekman P. and Friesen W., Head and Body Cues in the Judgment of Emotion: A Reformulation, Perceptual and Motor Skills, vol. 24, no. 3, pp. 711-724, 1967.

[7] Grandjean D., Sander D., and Sherer K., Conscious Emotional Experience Emerges as a Function of Multilevel, Appraisal-Driven Response Synchronization, Consciousness and Cognition, vol. 17, no. 2, pp. 484-495, 2008.

[8] Haq S. and Jackson P., Multimodal Emotion Recognition. Machine Audition: Principles, Algorithms and Systems, IGI Global Press, 2010.

[9] Karadogan S. and Larsen J., Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech, in Proceedings of IEEE 3rd International Workshop on: Cognitive Information Processing, Baiona, pp. 1-6, 2012.

[10] Kamaruddin N. and Wahab A., Human Behavior State Profile Mapping Based on Recalibrated Speech Affective Space Model, in Proceedings of the 34th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Diego, pp. 2021-2024, 2012.

[11] Mehrabian A. and Russell J., an Approach to Environmental Psychology, MIT Press, 1974.

[12] Marchi E., Schuller B., Batliner A., Fridenzon S., Tal S., and Golan O., Emotion in the Speech of Children with Autism Spectrum Conditions: Prosody and Everything else, in Proceedings of 3rd Workshop on Child, Computer and Interaction, pp. 17-24, 2012.

[13] Mittal T. and Sharma R., Multiclass SVM based Spoken Hindi Numerals Recognition, The International Arab Journal of Information Technology, vol. 12, no. 6A, pp. 666-671, 2015.

[14] Sherer K., Schorr A., and Johnstone T., Appraisal Processes in Emotion: Theory, Methods, Research, Oxford University Press, 2001.

[15] Trabelsi I. and Ben-Ayed D., A Multi Level Data Fusion Approach for Speaker Identification on Telephone Speech, International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 6, no. 2, 2013.

[16] Trabelsi I. and Ben Ayed D., On the Use of Different Feature Extraction Methods for Linear and Non Linear Kernels, in Proceedings of IEEE 6th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications, Sousse, pp. 1-11, 2012.

[17] Trabelsi I., Ben-Ayed D., and Ellouze N., Improved Frame Level Features and SVM Supervectors Approach for the Recognition of Emotional States from Speech: Application to categorical and dimensional states, International Journal of Image, Graphics and Signal Processing, vol. 5, no. 9, pp. 8-13, 2013.

Imen Trabelsi received her MS degree in signal processing in 2011 from the Institute of Computer Science of Tunis (ISI, Tunisia) and her PhD degree in electrical engineering, with a specialization in signal processing, in 2015 from the University of Tunis El Manar (Tunisia). Her main areas of interest include speech processing, pattern recognition, machine learning, artificial intelligence and emotion recognition. She has published research papers in international journals and conference proceedings.

Dorra Ben Ayed received her computer science engineering degree in 1995 from the National School of Computer Science (ENSI, Tunisia), her MS degree in electrical engineering (signal processing) in 1997 from the National School of Engineers of Tunis (ENIT, Tunisia), and her PhD degree in electrical engineering (signal processing) in 2003 from ENIT. She is currently an associate professor in the computer science department at the High Institute of Computer Science of Tunis (ISI, Tunisia). Her research interests include fuzzy logic, support vector machines, artificial intelligence, pattern recognition, speech recognition and speaker identification.

Noureddine Ellouze received a PhD degree in 1977 from the Institut National Polytechnique at Paul Sabatier University (Toulouse, France), and an Electronic Engineer diploma from ENSEEIHT at the same university in 1968. In 1978, Dr Ellouze joined the Department of Electrical Engineering at the National School of Engineers of Tunis (ENIT, Tunisia) as an assistant professor in statistics, electronics, signal processing and computer architecture. In 1990, he became a professor in signal processing, digital signal processing and stochastic processes.
He has also served as Director of the Electrical Department at ENIT from 1978 to 1983, as General Manager and President of the Research Institute on Informatics and Telecommunications (IRSIT) from 1987 to 1990, and as President of the same institute from 1990 to 1994. He is now Director of the Signal Processing Research Laboratory (LSTS) at ENIT and is in charge of the Control and Signal Processing Master's degree at ENIT. Pr Ellouze has been an IEEE fellow since 1987; he has supervised numerous Master's and PhD theses and has published over 200 scientific papers in journals and conference proceedings. He is chief editor of the scientific journal Annales Maghrébines de l'Ingénieur. His research interests include neural networks and fuzzy classification, pattern recognition, and signal and image processing applied to biomedicine, multimedia, and man-machine communication.