Abstract
In this paper, we propose a novel deep inductive transfer learning framework, named feature distribution adaptation network, to tackle the challenging multi-modal speech emotion recognition problem. Our method uses deep transfer learning strategies to align visual and audio feature distributions and obtain a consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, a pre-trained ResNet-34 is used to extract features from facial expression images and acoustic Mel spectrograms, respectively. Then, a cross-attention mechanism is introduced to model the intrinsic similarity relationships of the multi-modal features. Finally, multi-modal feature distribution adaptation is performed efficiently with a feed-forward network, which is extended using the local maximum mean discrepancy loss. Experiments are carried out on two benchmark datasets, and the results demonstrate that our model achieves excellent performance compared with existing methods.
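To make the distribution alignment step concrete, the following is a minimal sketch of a class-wise (local) maximum mean discrepancy between two sets of features, written with NumPy. It is an illustration under stated assumptions, not the implementation used in this work: the Gaussian kernel with a single bandwidth `sigma`, the use of hard emotion labels for both modalities, and the function names `gaussian_kernel`, `class_weights`, and `local_mmd` are all hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def class_weights(labels, num_classes):
    """Per-sample weights that sum to 1 within each class present in labels."""
    one_hot = np.eye(num_classes)[labels]        # shape (n, num_classes)
    counts = one_hot.sum(axis=0, keepdims=True)  # shape (1, num_classes)
    counts[counts == 0] = 1.0                    # avoid division by zero
    return one_hot / counts


def local_mmd(feat_a, feat_b, labels_a, labels_b, num_classes, sigma=1.0):
    """Squared MMD computed per class, averaged over classes seen on both sides.

    Assumption: hard labels are available for both feature sets.
    """
    w_a = class_weights(labels_a, num_classes)
    w_b = class_weights(labels_b, num_classes)
    k_aa = gaussian_kernel(feat_a, feat_a, sigma)
    k_bb = gaussian_kernel(feat_b, feat_b, sigma)
    k_ab = gaussian_kernel(feat_a, feat_b, sigma)

    total, used = 0.0, 0
    for c in range(num_classes):
        wa, wb = w_a[:, c], w_b[:, c]
        if wa.sum() == 0 or wb.sum() == 0:
            continue  # class missing from one feature set
        total += wa @ k_aa @ wa + wb @ k_bb @ wb - 2.0 * wa @ k_ab @ wb
        used += 1
    return total / max(used, 1)
```

In this sketch, `feat_a` and `feat_b` would stand for the visual and audio feature batches. The value is close to zero when the two sets have matching class-conditional distributions and grows as they drift apart, which is the property that lets such a term serve as an alignment loss.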