LangBridge: Interpreting Image as a Combination of Language Embeddings

Abstract

Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever the LLM backbone is switched. To address these limitations, we first investigate the working principles of MLP adapters and discover that they progressively learn to project visual embeddings into subspaces spanned by the corresponding text embeddings. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B while maintaining competitive performance. Overall, LangBridge enables interpretable vision-language alignment by grounding visual representations in LLM vocabulary embeddings, while its plug-and-play design ensures efficient reuse across multiple LLMs with nearly no performance degradation. See our project page at https://curryx-001.github.io/LangBridge.github.io/
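
The core idea stated above, mapping each visual token to a linear combination of LLM vocabulary embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the two-layer MLP, the softmax normalization of the combination weights, the random parameters, all dimensions, and the function name visual_to_language are assumptions made for illustration. The sketch also assumes, hypothetically, that both LLMs expose embeddings for the same vocabulary entries, so the same combination weights can be reused with a different embedding table.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def visual_to_language(visual_feats, w1, w2, vocab_emb):
    """Map visual tokens to combinations of vocabulary embeddings.

    visual_feats: (n_tokens, d_vis) visual encoder outputs
    w1, w2:       weights of a hypothetical two-layer MLP
    vocab_emb:    (vocab_size, d_llm) embedding table of the target LLM
    """
    hidden = np.maximum(visual_feats @ w1, 0.0)   # ReLU hidden layer
    weights = softmax(hidden @ w2, axis=-1)       # (n_tokens, vocab_size); softmax is an assumption
    return weights @ vocab_emb, weights           # each row lies in the span of vocab_emb rows

rng = np.random.default_rng(0)
n_tokens, d_vis, d_hidden, vocab_size = 4, 16, 32, 100

# Randomly initialized stand-ins for trained adapter weights (assumption).
w1 = rng.normal(size=(d_vis, d_hidden)) * 0.1
w2 = rng.normal(size=(d_hidden, vocab_size)) * 0.1
visual_feats = rng.normal(size=(n_tokens, d_vis))

# Stand-in embedding tables for a small and a large LLM over the same
# vocabulary entries (hypothetical; real models differ in tokenizer and size).
vocab_small = rng.normal(size=(vocab_size, 64))
vocab_large = rng.normal(size=(vocab_size, 256))

out_small, weights = visual_to_language(visual_feats, w1, w2, vocab_small)
out_large, weights_large = visual_to_language(visual_feats, w1, w2, vocab_large)

print(out_small.shape, out_large.shape)
```

In this sketch the adapter's parameters never depend on the LLM's hidden size: only the embedding table is swapped, which is one plausible reading of why such an adapter could be moved between backbones without a new pretraining stage.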
