Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis

Abstract

In this paper we propose Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis with control over speech variation and style transfer. Flowtron borrows insights from IAF and revamps Tacotron in order to provide high-quality and expressive mel-spectrogram synthesis. Flowtron is optimized by maximizing the likelihood of the training data, which makes training simple and stable. Flowtron learns an invertible mapping of data to a latent space that can be manipulated to control many aspects of speech synthesis (pitch, tone, speech rate, cadence, accent). Our mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality. In addition, we provide results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training. Code and pre-trained models will be made publicly available at https://github.com/NVIDIA/flowtron.
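
To make the two central ideas of the abstract concrete (an invertible mapping from data to a latent space, and training by maximizing the likelihood), the following is a minimal toy sketch of an affine autoregressive flow. It is not Flowtron's architecture: it works on a sequence of scalars rather than mel-spectrogram frames, it has no text or speaker conditioning, and the `conditioner` function with its constants is a hypothetical stand-in for a learned network. The names `forward` and `inverse` are likewise chosen here for illustration.

```python
import math


def conditioner(prev):
    """Hypothetical toy conditioner (an assumption, not from the paper):
    maps the previous data value to a shift `mu` and a log-scale `log_sigma`."""
    return 0.5 * prev, 0.1 * math.tanh(prev)


def forward(x):
    """Map a data sequence x to latents z and return (z, log-likelihood of x).

    Each step applies an invertible affine transform whose parameters depend
    only on earlier values of x, so the Jacobian is triangular and its
    log-determinant is the sum of the per-step -log_sigma terms.
    """
    z, log_lik, prev = [], 0.0, 0.0
    for x_t in x:
        mu, log_sigma = conditioner(prev)
        z_t = (x_t - mu) * math.exp(-log_sigma)
        # Standard normal log-density of z_t plus the change-of-variables term.
        log_lik += -0.5 * (z_t * z_t + math.log(2.0 * math.pi)) - log_sigma
        z.append(z_t)
        prev = x_t
    return z, log_lik


def inverse(z):
    """Map latents z back to data, one step at a time (autoregressive)."""
    x, prev = [], 0.0
    for z_t in z:
        mu, log_sigma = conditioner(prev)
        x_t = z_t * math.exp(log_sigma) + mu
        x.append(x_t)
        prev = x_t
    return x


if __name__ == "__main__":
    data = [0.3, -1.2, 0.8]
    latents, log_lik = forward(data)
    print("latents:", latents)
    print("log-likelihood:", log_lik)
    print("reconstruction:", inverse(latents))
```

In this sketch, training would mean adjusting the conditioner's parameters to increase the value returned by `forward` on training data, and sampling or manipulation would mean choosing or editing `z` and passing it through `inverse`. Because the transform is exactly invertible, `inverse(forward(x)[0])` recovers `x` up to floating-point error.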
