Multimodal Speech Emotion Recognition Using Audio and Text

Abstract

Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
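
The dual-encoder idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the recurrent cell, the feature dimensions, or how the two encodings are combined, so the vanilla tanh RNN, the concatenation of final hidden states, the random untrained weights, and every name and dimension below are assumptions chosen only to show the data flow (audio sequence and text sequence in, one of four emotion classes out).

```python
import math
import random

# The four emotion categories named in the abstract.
EMOTIONS = ["angry", "happy", "sad", "neutral"]


def rand_matrix(rows, cols, rng):
    """Small random weights (untrained); a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class RNNEncoder:
    """Vanilla tanh RNN (assumed cell type) that returns its last hidden state."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.w_in = rand_matrix(hidden_dim, input_dim, rng)
        self.w_h = rand_matrix(hidden_dim, hidden_dim, rng)

    def encode(self, sequence):
        h = [0.0] * self.hidden_dim
        for x in sequence:
            from_input = matvec(self.w_in, x)
            from_state = matvec(self.w_h, h)
            h = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return h


class DualEncoderClassifier:
    """One encoder per modality; fusion by concatenation is an assumption."""

    def __init__(self, audio_dim, text_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.audio_enc = RNNEncoder(audio_dim, hidden_dim, rng)
        self.text_enc = RNNEncoder(text_dim, hidden_dim, rng)
        self.w_out = rand_matrix(len(EMOTIONS), 2 * hidden_dim, rng)

    def predict_proba(self, audio_frames, token_vectors):
        # Concatenate the final hidden states of the two encoders.
        fused = self.audio_enc.encode(audio_frames) + self.text_enc.encode(token_vectors)
        logits = matvec(self.w_out, fused)
        # Numerically stable softmax over the four classes.
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def predict(self, audio_frames, token_vectors):
        probs = self.predict_proba(audio_frames, token_vectors)
        return EMOTIONS[probs.index(max(probs))]


# Toy inputs: 20 audio frames of 13 features and 6 token vectors of size 8.
# These shapes are hypothetical placeholders, not values from the paper.
data_rng = random.Random(1)
audio = [[data_rng.gauss(0, 1) for _ in range(13)] for _ in range(20)]
text = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]

model = DualEncoderClassifier(audio_dim=13, text_dim=8, hidden_dim=16)
probs = model.predict_proba(audio, text)
label = model.predict(audio, text)
```

Because the weights are random, the predicted label here carries no meaning; the sketch only shows that the two sequences may differ in length and feature size, since each is reduced to a fixed-size vector before the two are combined.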
