Multimodal Language Analysis with Recurrent Multistage Fusion

Abstract

Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations.
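
To make the multistage idea concrete, the following is a minimal illustrative sketch of fusion at a single time step, not the RMFN implementation. It assumes one reading of the abstract: each stage computes soft attention weights over the concatenated language, visual, and acoustic features, conditioned on the previous stage's fused representation, then fuses the highlighted features with that representation. The names `multistage_fusion`, `W_highlight`, and `W_fuse`, the sharing of weights across stages, the feature sizes, and the random parameters are all hypothetical, and the recurrent networks that model temporal and intra-modal interactions are omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def multistage_fusion(features, num_stages, W_highlight, W_fuse):
    """Fuse one time step of concatenated modality features in several stages.

    features    : 1-D array of size d (language, visual, acoustic concatenated)
    W_highlight : (d, d + h) matrix scoring each feature dimension (assumed form)
    W_fuse      : (h, d + h) matrix producing the fused representation (assumed form)
    """
    h = W_fuse.shape[0]
    fused = np.zeros(h)          # no fused information before the first stage
    stage_weights = []
    for _ in range(num_stages):
        # Highlight: pick a soft subset of signals, given what was fused so far.
        context = np.concatenate([features, fused])
        attn = softmax(W_highlight @ context)
        highlighted = attn * features
        # Fuse: combine highlighted signals with the previous stage's output.
        fused = np.tanh(W_fuse @ np.concatenate([highlighted, fused]))
        stage_weights.append(attn)
    return fused, stage_weights

# Toy usage with made-up sizes: 2 features per modality, 4 fused units.
rng = np.random.default_rng(0)
language, visual, acoustic = rng.normal(size=2), rng.normal(size=2), rng.normal(size=2)
features = np.concatenate([language, visual, acoustic])
d, h = features.shape[0], 4
W_highlight = rng.normal(size=(d, d + h))
W_fuse = rng.normal(size=(h, d + h))

fused, stage_weights = multistage_fusion(features, 3, W_highlight, W_fuse)
print(fused.shape, len(stage_weights))
```

Inspecting `stage_weights` per stage loosely mirrors the kind of visualization the abstract mentions, since each entry records which feature dimensions that stage emphasized.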
