Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition

Abstract

Though multimodal emotion recognition has achieved significant progress over recent years, the potential of rich synergic relationships across the modalities is not fully exploited. In this paper, we introduce Recursive Joint Cross-Modal Attention (RJCMA) to effectively capture both intra- and inter-modal relationships across audio, visual and text modalities for dimensional emotion recognition. In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities to simultaneously capture intra- and inter-modal relationships across the modalities. The attended features of the individual modalities are again fed as input to the fusion model in a recursive mechanism to obtain more refined feature representations. We have also explored Temporal Convolutional Networks (TCNs) to improve the temporal modeling of the feature representations of individual modalities. Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset. By effectively capturing the synergic intra- and inter-modal relationships across audio, visual and text modalities, the proposed fusion model achieves a Concordance Correlation Coefficient (CCC) of 0.585 (0.542) and 0.659 (0.619) for valence and arousal respectively on the validation set (test set). This shows a significant improvement over the baseline of 0.24 (0.211) and 0.20 (0.191) for valence and arousal respectively on the validation set (test set) of the valence-arousal challenge of the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition.
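
For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a Concordance Correlation Coefficient can be computed from predicted and annotated valence or arousal sequences. It uses the standard CCC definition with population (biased) variance and covariance; the function name `concordance_cc` and the toy inputs are illustrative assumptions, not taken from the paper or the ABAW evaluation code.

```python
import numpy as np


def concordance_cc(predictions, labels):
    """Standard CCC: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2).

    Hypothetical helper for illustration; not the authors' evaluation script.
    """
    x = np.asarray(predictions, dtype=float)
    y = np.asarray(labels, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    # Population covariance and variances (divide by N, not N - 1).
    covariance = ((x - mean_x) * (y - mean_y)).mean()
    denominator = x.var() + y.var() + (mean_x - mean_y) ** 2
    return 2.0 * covariance / denominator


# Toy sequences (assumed values, for illustration only).
perfect = concordance_cc([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
inverted = concordance_cc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Unlike Pearson correlation, the CCC also penalizes differences in mean and scale: the `shifted` example is perfectly correlated with its target but scores below 1 because of the constant offset. The sketch does not guard against the degenerate case where both sequences are constant and equal, in which the denominator is zero.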
