Abstract
This paper proposes a new end-to-end text-to-speech (E2E-TTS) model based on neural machine translation (NMT). The proposed model consists of two components: a non-autoregressive vector quantized variational autoencoder (VQ-VAE) model and an autoregressive Transformer-NMT model. The VQ-VAE model learns a mapping function from a speech waveform into a sequence of discrete symbols, and then the Transformer-NMT model is trained to estimate this discrete symbol sequence from a given input text. Since the VQ-VAE model can learn such a mapping in a fully data-driven manner, we do not need to consider the hyperparameters of the feature extraction required in conventional E2E-TTS models. Thanks to the use of discrete symbols, we can use various techniques developed in NMT and automatic speech recognition (ASR), such as beam search, subword units, and fusion with a language model. Furthermore, we can avoid the over-smoothing problem of predicted features, which is one of the common issues in TTS. The experimental evaluation with the JSUT corpus shows that the proposed method outperforms the conventional Transformer-TTS model with a non-autoregressive neural vocoder in naturalness, achieving performance comparable to the reconstruction of the VQ-VAE model.
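
As a rough illustration of what mapping speech into "a sequence of discrete symbols" means, the following minimal sketch shows only the nearest-neighbor codebook lookup that is typical of vector quantization. It is not the proposed model: the encoder, decoder, and codebook training are omitted, and the function names, the toy 2-D frame vectors, and the 3-entry codebook are all hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each frame vector by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete symbol."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data: in a real VQ-VAE the frames would come from a
# learned encoder applied to the waveform, and the codebook would be learned.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.4]]

symbols = quantize(frames, codebook)        # discrete symbol sequence
reconstruction = dequantize(symbols, codebook)
print(symbols)
```

In this toy setting, the integer sequence `symbols` plays the role of the target that a text-to-symbol translation model would be trained to predict, which is why search and subword techniques designed for discrete outputs become applicable.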