Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

Abstract

Modeling voices for multiple speakers and multiple languages in one text-to-speech system has long been a challenge. This paper presents an extension of Tacotron2 that achieves bilingual multispeaker speech synthesis when the data for each language are limited. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic representations for input, while the language attribute and the speaker identity are independently controlled by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis with and without a bilingual dataset during training. With the bilingual dataset, the model can not only generate high-fidelity speech for all speakers in the language they speak, but also generate accented, yet fluent and intelligible, speech for monolingual speakers in their non-native language. For example, the Mandarin speaker can speak English fluently. Furthermore, the model trained with the bilingual dataset is robust for code-switching text-to-speech, as shown in our results and provided samples (https://caizexin.github.io/mlms-syn-samples/index.html).
