Abstract
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text-speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalised long-horizon speech generation.