LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Abstract

Video-language (VL) pretraining has achieved remarkable improvements on multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N >= 3) beyond vision and language. We thus propose LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M, a dataset of Video, Infrared, Depth, and Audio with their corresponding Language. In VIDAL-10M, all videos are from short-video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 5.8% R@1 on the MSR-VTT dataset in the zero-shot video-text retrieval task with only 15% of the parameters. Beyond this, LanguageBind achieves large improvements in zero-shot video, audio, depth, and infrared understanding tasks. For instance, LanguageBind surpasses InternVideo by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet. On the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind by 23.8% and 11.1% in top-1 accuracy, respectively. Code is available at https://github.com/PKU-YuanGroup/LanguageBind.
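
To make the training idea in the abstract concrete, the following is a minimal sketch of contrastive alignment between embeddings from a frozen language encoder and embeddings from a trainable encoder for another modality, together with the R@1 retrieval metric quoted above. The abstract does not specify the loss, so the symmetric InfoNCE form, the temperature value of 0.07, and all function names here are assumptions chosen for illustration; they are not taken from the LanguageBind code base. The sketch uses plain Python lists in place of real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(modality_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; entry [i][j] compares sample i with text j."""
    m = [l2_normalize(v) for v in modality_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(mi, tj)) / temperature for tj in t] for mi in m]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_denominator = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_denominator - row[i]
    return total / len(logits)


def symmetric_info_nce(modality_embs, text_embs, temperature=0.07):
    """Assumed loss: average of modality-to-text and text-to-modality InfoNCE terms.

    In training, text_embs would come from the frozen language encoder and
    only the modality encoder would receive gradients.
    """
    logits = similarity_matrix(modality_embs, text_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def recall_at_1(modality_embs, text_embs):
    """Fraction of samples whose most similar text is their own paired text."""
    logits = similarity_matrix(modality_embs, text_embs, temperature=1.0)
    hits = 0
    for i, row in enumerate(logits):
        best = max(range(len(row)), key=row.__getitem__)
        hits += 1 if best == i else 0
    return hits / len(logits)


# Toy data (hypothetical): two text embeddings and two candidate sets of
# modality embeddings, one paired correctly and one paired in the wrong order.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.2, 0.8]]
shuffled = [[0.2, 0.8], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(aligned, text)
loss_shuffled = symmetric_info_nce(shuffled, text)
```

On this toy data the correctly paired embeddings yield a lower loss and a higher R@1 than the shuffled ones, which is the behavior the contrastive objective is meant to reward. Because the text side stays fixed, every additional modality is pulled toward the same language-defined feature space.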
