Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning

Abstract

We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages. Moreover, the model is able to transfer voices across languages, e.g. synthesize fluent Spanish speech using an English speaker's voice, without training on any bilingual or parallel examples. Such transfer works across distantly related languages, e.g. English and Mandarin. Critical to achieving this result are: 1. using a phonemic input representation to encourage sharing of model capacity across languages, and 2. incorporating an adversarial loss term to encourage the model to disentangle its representation of speaker identity (which is perfectly correlated with language in the training data) from the speech content. Further scaling up the model by training on multiple speakers of each language, and incorporating an autoencoding input to help stabilize attention during training, results in a model which can be used to consistently synthesize intelligible speech for training speakers in all languages seen during training, and in native or foreign accents.
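To make the adversarial loss term concrete, here is a minimal sketch of one common way such a term can be formulated. It is not taken from the paper: it assumes a separate speaker classifier that tries to predict the speaker from the text encoding, and a text encoder that is penalized whenever that classifier succeeds. The function names and the `adv_weight` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def speaker_cross_entropy(speaker_logits, speaker_id):
    """Cross-entropy of a (hypothetical) speaker classifier reading the text encoding."""
    return -math.log(softmax(speaker_logits)[speaker_id])


def encoder_objective(reconstruction_loss, speaker_logits, speaker_id, adv_weight=0.5):
    """Loss as seen by the text encoder (assumed formulation, not the paper's exact one).

    The classifier itself is trained to minimize the cross-entropy; the encoder
    is trained to increase it, so the term enters here with a negative sign.
    """
    adversarial = speaker_cross_entropy(speaker_logits, speaker_id)
    return reconstruction_loss - adv_weight * adversarial


# Same reconstruction quality, different amounts of speaker leakage.
leaky = encoder_objective(1.0, [4.0, 0.0], speaker_id=0)    # classifier is confident
neutral = encoder_objective(1.0, [0.0, 0.0], speaker_id=0)  # classifier is at chance
```

Under these assumptions, `neutral` is lower than `leaky`, so the encoder is rewarded for producing text encodings from which the speaker cannot be identified, which is the sense in which speaker identity is disentangled from speech content.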
