Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
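
The abstract does not state the guidance formula, so the following is only a minimal sketch of how multi-condition classifier-free guidance is commonly written, not necessarily the formulation used in MegaTTS 3. It assumes the model is evaluated three times per step (unconditional, text only, text plus speaker prompt) and that two separate scales are applied; the function name, argument names, and the nested form of the combination are all hypothetical.

```python
def multi_condition_cfg(uncond, text_only, text_and_speaker,
                        text_scale, speaker_scale):
    """Combine three model outputs using two guidance scales.

    Assumed (hypothetical) nested form:
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (text_and_speaker - text_only)
    Inputs are equal-length sequences of floats standing in for the
    model's predicted vectors at one sampling step.
    """
    return [
        u + text_scale * (t - u) + speaker_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]


# Toy values standing in for model predictions at one step.
uncond = [0.0, 0.0]
text_only = [1.0, 2.0]
text_and_speaker = [3.0, 2.0]

# With both scales at 1 the result equals the fully conditioned output.
plain = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 1.0)

# Raising speaker_scale pushes the output further along the
# speaker-prompt direction while leaving the text term unchanged.
stronger = multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 2.0)
```

Under this assumed form, the speaker scale acts as an independent knob on how strongly the output follows the prompt's characteristics, which is one plausible way a control such as accent intensity could be exposed.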
