EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

  • 2024-09-26 17:44:02
  • Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible control of speech style (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.

 
