Abstract
Speech is the most natural way of expressing ourselves as humans. Identifying emotion from speech is a nontrivial task due to the ambiguous definition of emotion itself. Speaker Emotion Recognition (SER) is essential for understanding human emotional behavior. The SER task is challenging due to the variety of speakers, background noise, complexity of emotions, and speaking styles. It has many applications in education, healthcare, customer service, and Human-Computer Interaction (HCI). Previously, conventional machine learning methods such as SVM, HMM, and KNN have been used for the SER task. In recent years, deep learning methods have become popular, with convolutional neural networks and recurrent neural networks being used for SER tasks. The input of these methods is mostly spectrograms and hand-crafted features. In this work, we study the use of self-supervised transformer-based models, Wav2Vec2 and HuBERT, to determine the emotion of speakers from their voice. The models automatically extract features from raw audio signals, which are then used for the classification task. The proposed solution is evaluated on reputable datasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show the effectiveness of the proposed method on different datasets. Moreover, the model has been used for real-world applications like call center conversations, and the results demonstrate that the model accurately predicts emotions.