Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Abstract

We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
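
The data flow between the three components can be sketched as follows. This is a toy illustration only, not the authors' implementation: each function is a hypothetical stand-in for a trained network, and the constants `EMBED_DIM`, `N_MELS`, and `HOP` are assumed values chosen for readability. Unit-normalizing the embedding and drawing random embeddings as normalized Gaussian vectors are also assumptions made for this sketch; the abstract does not specify either.

```python
import math
import random

EMBED_DIM = 8   # assumed size; the abstract only says "fixed-dimensional"
N_MELS = 4      # assumed number of mel channels
HOP = 3         # assumed waveform samples produced per spectrogram frame


def _unit_norm(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def speaker_encoder(reference_audio):
    """Stand-in for component (1): variable-length audio -> fixed-size embedding."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    pooled = [sum(reference_audio[i * chunk:(i + 1) * chunk]) for i in range(EMBED_DIM)]
    return _unit_norm(pooled)


def synthesizer(text, embedding):
    """Stand-in for component (2): text + speaker embedding -> mel frames."""
    bias = sum(embedding) / len(embedding)  # crude "conditioning" on the speaker
    return [[(ord(ch) % 7) * 0.1 + bias for _ in range(N_MELS)] for ch in text]


def vocoder(mel):
    """Stand-in for component (3): mel frames -> time-domain samples."""
    return [math.tanh(sum(frame) / len(frame)) for frame in mel for _ in range(HOP)]


def random_speaker_embedding(rng):
    """Draw a fictitious speaker embedding (assumed: normalized Gaussian vector)."""
    return _unit_norm([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])


rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(160)]  # fake reference audio

embedding = speaker_encoder(reference)      # (1) reference speech -> embedding
mel = synthesizer("hello", embedding)       # (2) text + embedding -> spectrogram
waveform = vocoder(mel)                     # (3) spectrogram -> waveform

# A sampled embedding can replace the encoder output without touching (2) or (3).
novel_mel = synthesizer("hello", random_speaker_embedding(rng))
```

Because the synthesizer only ever sees an embedding vector, the speaker encoder can be trained separately on untranscribed speech, and a sampled vector can be substituted for an encoded one at inference time.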
