Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or even no vision-understanding ability. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, we propose a lightweight style module for flexible control of speech style (e.g., emotion and pitch). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.