NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Abstract

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge human-level quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measurement and describing guidelines for judging it, and then proposing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key designs to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed rank test at p-level p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time on this dataset.
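
To make the significance criterion above concrete, the following is a minimal sketch of how a CMOS and a two-sided Wilcoxon signed rank test could be computed from paired comparative ratings. It is not the authors' evaluation code. The function names, the normal approximation used for the p-value, and the example ratings are assumptions made purely for illustration; an actual study would more likely rely on an established statistics package and real listener judgments.

```python
import math
from statistics import mean


def cmos(scores):
    """Comparative mean opinion score: mean of per-pair ratings
    (system minus human recording). Hypothetical helper."""
    return mean(scores)


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed rank test using a normal approximation.

    Zero differences are dropped and tied absolute values share an
    average rank. Returns (W_plus, p_value). Hypothetical helper.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0

    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t  # tie correction for the variance
        i = j

    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    mean_w = n * (n + 1) / 4.0
    var_w = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, p_value


# Made-up comparative ratings on a -3..3 scale, for illustration only.
example_scores = [0, 1, -1, 0, 2, -2, 1, -1, 0, 0, -1, 1]
score = cmos(example_scores)
w_plus, p_value = wilcoxon_signed_rank(example_scores)
print(f"CMOS={score:.2f}, W+={w_plus}, p={p_value:.3f}")
```

Under the criterion described in the abstract, a CMOS close to zero together with p > 0.05 would be read as no statistically significant difference between the system and the human recordings.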
