Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.