Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

  • 2020-05-21 03:03:34
  • Zexin Cai, Yaogen Yang, Ming Li
Modeling voices for multiple speakers and multiple languages in one text-to-speech system has long been a challenge. This paper presents an extension of Tacotron2 that achieves bilingual multispeaker speech synthesis when only limited data are available for each language. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic representations for input, while the language attribute and the speaker identity are independently controlled by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis with and without a bilingual dataset during training. With the bilingual dataset, not only can the model generate high-fidelity speech for all speakers in the language they speak, but it can also generate accented, yet fluent and intelligible, speech for monolingual speakers in their non-native language. For example, a Mandarin speaker can speak English fluently. Furthermore, the model trained with the bilingual dataset is robust for code-switching text-to-speech, as shown in our results and provided samples.
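The input scheme described in the abstract (one phoneme inventory shared by both languages, with language and speaker controlled separately) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy phoneme set, the `<en>` and `<zh>` token names, the choice to place a language token at the start of each segment, and the `build_input` helper are hypothetical and are not taken from the paper, which may inject language information differently.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical shared phoneme inventory used by both languages (toy-sized).
PHONEMES = ["<pad>", "n", "i", "h", "ao", "eh", "l", "ow"]
# Hypothetical language tokens; the paper's actual token names are not given.
LANGUAGE_TOKENS = ["<en>", "<zh>"]

SYMBOLS = PHONEMES + LANGUAGE_TOKENS
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


@dataclass
class SynthesisInput:
    symbol_ids: List[int]  # language tokens and phonemes, in order
    speaker_id: int        # index used to look up a speaker embedding


def build_input(segments: List[Tuple[str, List[str]]], speaker_id: int) -> SynthesisInput:
    """Encode (language, phonemes) segments for one speaker.

    Assumption: each segment is prefixed with its language token, so a
    code-switched utterance is simply a list with more than one segment.
    The speaker is passed separately, so language and speaker identity
    can be varied independently of each other.
    """
    ids: List[int] = []
    for language, phonemes in segments:
        ids.append(SYMBOL_TO_ID[f"<{language}>"])
        ids.extend(SYMBOL_TO_ID[p] for p in phonemes)
    return SynthesisInput(symbol_ids=ids, speaker_id=speaker_id)


# A code-switched utterance (Mandarin segment, then English segment)
# rendered in the voice of hypothetical speaker 3.
code_switched = build_input(
    [("zh", ["n", "i", "h", "ao"]), ("en", ["h", "eh", "l", "ow"])],
    speaker_id=3,
)
```

Because the speaker index is independent of the symbol sequence, the same encoded text can be paired with any speaker, which is the property that cross-lingual synthesis for monolingual speakers relies on.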


