Improving Multimodal Learning Balance and Sufficiency through Data Remixing

Abstract

Different modalities hold considerable gaps in optimization trajectories, including speeds and paths, which lead to modality laziness and modality clash when jointly training multimodal models, resulting in insufficient and imbalanced multimodal learning. Existing methods focus on enforcing the weak modality by adding modality-specific optimization objectives, aligning their optimization speeds, or decomposing multimodal learning to enhance unimodal learning. These methods fail to achieve both unimodal sufficiency and multimodal balance. In this paper, we, for the first time, address both concerns by proposing multimodal Data Remixing, including decoupling multimodal data and filtering hard samples for each modality to mitigate modality imbalance; and then batch-level reassembling to align the gradient directions and avoid cross-modal interference, thus enhancing unimodal learning sufficiency. Experimental results demonstrate that our method can be seamlessly integrated with existing approaches, improving accuracy by approximately 6.50%$\uparrow$ on CREMAD and 3.41%$\uparrow$ on Kinetic-Sounds, without training set expansion or additional computational overhead during inference. The source code is available at https://github.com/MatthewMaxy/Remix_ICML2025.
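
The remixing steps named in the abstract (decoupling, per-modality hard-sample filtering, batch-level reassembling) can be illustrated with a small sketch. This is not the authors' implementation: the function name `remix`, the use of per-sample unimodal losses as the difficulty signal, and the mean-loss threshold are all assumptions made here for illustration. The only property taken from the abstract is that each modality keeps its own hard samples and that every reassembled batch contains samples of a single modality.

```python
import random
from typing import Dict, List, Tuple


def remix(
    losses: Dict[str, List[float]],
    batch_size: int,
    seed: int = 0,
) -> List[Tuple[str, List[int]]]:
    """Hypothetical sketch: build single-modality batches of hard samples.

    losses maps a modality name to per-sample unimodal losses (assumed
    difficulty signal). Returns a list of (modality, sample indices).
    """
    rng = random.Random(seed)
    batches: List[Tuple[str, List[int]]] = []
    for modality, per_sample in losses.items():
        if not per_sample:
            continue
        # Decouple: each modality is handled on its own.
        # Filter: keep samples at or above the mean loss (assumed rule).
        threshold = sum(per_sample) / len(per_sample)
        hard = [i for i, loss in enumerate(per_sample) if loss >= threshold]
        rng.shuffle(hard)
        # Reassemble: every batch holds indices from one modality only.
        for start in range(0, len(hard), batch_size):
            batches.append((modality, hard[start:start + batch_size]))
    # Interleave the single-modality batches across training steps.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    toy_losses = {
        "audio": [0.1, 0.9, 0.8, 0.2],
        "video": [1.0, 1.2, 0.1, 0.3],
    }
    for modality, indices in remix(toy_losses, batch_size=2):
        print(modality, sorted(indices))
```

In a training loop, a batch tagged with one modality would update only that modality's pathway, which is one way to read the abstract's goal of avoiding cross-modal interference; how the published method selects hard samples and forms batches should be checked against the linked repository.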
