Abstract
Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge is trained in two stages. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent using our newly proposed multimodal instruction tuning dataset, named MULTIS, which covers 16 multimodal tasks across the text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.