Abstract
Omnimodal learning has recently advanced understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges of real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method that combines omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.