Abstract
Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.