OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

Abstract

Recent advancements in omnimodal learning have been achieved in understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
