DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis

Abstract

Multimodal Sentiment Analysis (MSA) leverages heterogeneous modalities, such as language, vision, and audio, to enhance the understanding of human sentiment. While existing models often focus on extracting shared information across modalities or directly fusing heterogeneous modalities, such approaches can introduce redundancy and conflicts due to equal treatment of all modalities and the mutual transfer of information between modality pairs. To address these issues, we propose a Disentangled-Language-Focused (DLF) multimodal representation learning framework, which incorporates a feature disentanglement module to separate modality-shared and modality-specific information. To further reduce redundancy and enhance language-targeted features, four geometric measures are introduced to refine the disentanglement process. A Language-Focused Attractor (LFA) is further developed to strengthen language representation by leveraging complementary modality-specific information through a language-guided cross-attention mechanism. The framework also employs hierarchical predictions to improve overall accuracy. Extensive experiments on two popular MSA datasets, CMU-MOSI and CMU-MOSEI, demonstrate the significant performance gains achieved by the proposed DLF framework. Comprehensive ablation studies further validate the effectiveness of the feature disentanglement module, language-focused attractor, and hierarchical predictions. Our code is available at https://github.com/pwang322/DLF.
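
As a rough illustration of what "language-guided cross-attention" can mean, the sketch below uses language features as queries and another modality's features (for example, audio or vision) as keys and values, so that the language sequence decides what to pull from the other modality. This is not the DLF implementation: the function name, the absence of learned projection matrices and multiple heads, and the residual addition back onto the language features are all assumptions made to keep the example short and dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_cross_attention(language, other):
    """Hypothetical single-head cross-attention without learned weights.

    language: list of T_l feature vectors, used as queries.
    other:    list of T_o feature vectors from another modality,
              used as both keys and values.
    Both inputs are assumed to share the same feature dimension d.
    Returns T_l vectors: each language vector plus its attended context.
    """
    d = len(language[0])
    scale = math.sqrt(d)
    enhanced = []
    for q in language:
        # Scaled dot-product score between this language step and every
        # step of the other modality.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in other]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors (the values).
        context = [sum(w * v[j] for w, v in zip(weights, other)) for j in range(d)]
        # Residual connection (an assumption): language stays the dominant
        # signal and the other modality only adds complementary information.
        enhanced.append([qi + ci for qi, ci in zip(q, context)])
    return enhanced


if __name__ == "__main__":
    language = [[1.0, 0.0], [0.0, 1.0]]            # 2 language steps, d = 2
    audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 audio steps, d = 2
    for row in language_guided_cross_attention(language, audio):
        print(row)
```

The output has one vector per language step regardless of how long the other modality's sequence is, which is the property that makes language the "target" of the fusion in this kind of design.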
