Hierarchical Generative Modeling for Controllable Speech Synthesis

  • 2018-10-16 18:20:02
  • Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

Abstract

This paper proposes a neural end-to-end text-to-speech (TTS) model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, it is capable of consistently synthesizing high-quality clean speech regardless of the quality of the training data for the target speaker.
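
The two-level latent structure described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the diagonal covariance, the number of components, and the latent dimension are all assumptions made for illustration, and in the actual model the parameters would be learned rather than set by hand. The sketch draws a categorical variable y (the attribute group), then a Gaussian variable z conditioned on y (the attribute configuration), and also shows how a given z can be assigned back to a group under the resulting GMM.

```python
import numpy as np


def sample_hierarchical_latent(rng, weights, means, log_stds, n_samples=1):
    """Draw y ~ Categorical(weights), then z | y ~ N(means[y], diag(exp(log_stds[y])**2)).

    All parameters are hypothetical placeholders, not values from the paper.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)        # shape (K, D)
    log_stds = np.asarray(log_stds, dtype=float)  # shape (K, D)
    # First level: pick a mixture component (e.g. a clean or a noisy group).
    y = rng.choice(len(weights), size=n_samples, p=weights)
    # Second level: sample from that component's diagonal Gaussian.
    eps = rng.standard_normal((n_samples, means.shape[1]))
    z = means[y] + np.exp(log_stds[y]) * eps
    return y, z


def component_posterior(z, weights, means, log_stds):
    """Return p(y | z) under the mixture, computed with Bayes' rule in log space."""
    z = np.atleast_2d(np.asarray(z, dtype=float))  # shape (N, D)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    log_stds = np.asarray(log_stds, dtype=float)
    # Standardized distance of every sample to every component: shape (N, K, D).
    diff = (z[:, None, :] - means[None]) / np.exp(log_stds)[None]
    # Diagonal Gaussian log-density, summed over dimensions: shape (N, K).
    log_lik = -0.5 * np.sum(
        diff ** 2 + 2.0 * log_stds[None] + np.log(2.0 * np.pi), axis=-1
    )
    log_joint = np.log(weights)[None] + log_lik
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

Under these assumptions, fixing y and varying individual dimensions of z corresponds to the fine-grained control the abstract mentions, while the posterior over y gives the interpretable group assignment.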
