Multimodal Representation Learning by Alternating Unimodal Adaptation

Abstract

Multimodal learning, which integrates data from diverse sensory modes, plays a pivotal role in artificial intelligence. However, existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning, resulting in suboptimal performance. To address this challenge, we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process, thereby minimizing interference between modalities. Simultaneously, it captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. This optimization process is controlled by a gradient modification mechanism to prevent the shared head from losing previously acquired information. During the inference phase, MLA utilizes a test-time uncertainty-based model fusion mechanism to integrate multimodal information. Extensive experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities. These experiments demonstrate the superiority of MLA over competing prior approaches.
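The abstract does not specify how uncertainty is measured or how the per-modality predictions are combined at test time. As a minimal sketch of one plausible reading, the example below assumes that each modality produces its own logits through the shared head, that uncertainty is the entropy of the softmax output, and that lower-entropy predictions receive larger fusion weights. The function name `fuse_by_uncertainty` and the example logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_by_uncertainty(logits_per_modality):
    """Hypothetical fusion rule: weight each modality by (max entropy - entropy).

    Assumption (not from the paper): entropy is the uncertainty measure,
    and fusion is a weighted average of per-modality class probabilities.
    """
    probs = [softmax(logits) for logits in logits_per_modality]
    num_classes = len(probs[0])
    max_entropy = math.log(num_classes)
    # Confidence is how far each prediction is from the uniform distribution.
    # The small floor avoids division by zero if every modality is uniform.
    confidences = [max(max_entropy - entropy(p), 1e-12) for p in probs]
    total = sum(confidences)
    weights = [c / total for c in confidences]
    fused = [
        sum(w * p[k] for w, p in zip(weights, probs))
        for k in range(num_classes)
    ]
    return fused, weights


# Illustrative logits only: a confident "audio" prediction and a
# near-uniform "video" prediction over three classes.
audio_logits = [4.0, 0.0, 0.0]
video_logits = [0.1, 0.2, 0.0]
fused, weights = fuse_by_uncertainty([audio_logits, video_logits])
```

Under these assumptions the confident modality dominates the fused prediction, and a missing modality can be handled by simply leaving its logits out of the input list.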
