Self-attention fusion for audiovisual emotion recognition with incomplete data

Abstract

In this paper, we consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition. We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms. While most previous works consider the ideal scenario in which both modalities are present at all times during inference, we evaluate the robustness of the model in unconstrained settings where one modality is absent or noisy, and propose a method to mitigate these limitations in the form of modality dropout. Most importantly, we find that this approach not only improves performance drastically when one modality is absent or its representation is noisy, but also improves performance in the standard ideal setting, outperforming the competing methods.
