OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

  • 2025-01-09 15:54:14
  • Run Luo, Ting-En Lin, Haonan Zhang, Yuchuan Wu, Xiong Liu, Min Yang, Yongbin Li, Longze Chen, Jiaming Li, Lei Zhang, Yangyi Chen, Hamid Alinejad-Rokny, Fei Huang

Abstract

Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.

 
