U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

Abstract

Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.
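
To make the parameter-reduction claim concrete, the following is a minimal sketch of the general LoRA counting argument, not the authors' implementation. LoRA keeps a dense weight W (d_out x d_in) frozen and trains only a low-rank update B·A, where A is rank x d_in and B is d_out x rank. The helper name `lora_param_counts`, the layer size 768, and the rank 8 are illustrative assumptions and are not taken from the paper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Full fine-tuning updates every entry of the dense weight W (d_out x d_in).
    full = d_in * d_out
    # LoRA freezes W and trains A (rank x d_in) and B (d_out x rank) instead.
    lora = rank * (d_in + d_out)
    return full, lora


# Assumed sizes: a 768-wide projection adapted at rank 8.
full, lora = lora_param_counts(d_in=768, d_out=768, rank=8)
ratio = lora / full
print(f"full={full} lora={lora} ratio={ratio:.4f}")
```

Under these assumed sizes the adapter trains roughly 2% of the parameters of the layer it adapts. The actual saving in U2A depends on the encoders, the adapted layers, and the rank reported in the paper.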
