VITA: Towards Open-Source Interactive Omni Multimodal LLM

Abstract

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and then apply bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in an MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains before VITA approaches its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
