End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

Abstract

End-to-end text-to-speech (TTS) has shown great success on large quantities of paired text and speech data. However, such laborious data collection remains difficult for at least 95% of the world's languages, which hinders the development of TTS in different languages. In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available. We show that such TTS systems can be effectively constructed by transferring knowledge from a high-resource (source) language. Since the model trained on the source language cannot be directly applied to the target language due to input space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transfer procedure. Preliminary experiments show that we only need around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.
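
The abstract does not say how the symbol mapping is parameterized, so the following is only an illustrative sketch of one possible form: each target symbol is represented as a softmax-weighted mixture of source symbol embeddings, so that a model trained on source symbols can consume target input. The function names, the toy embeddings, and the score values are all hypothetical; in a real system the scores would be learned from the limited paired target data.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def map_target_symbol(scores, source_embeddings):
    """Embed one target symbol as a weighted mixture of source embeddings.

    scores: one (assumed learnable) score per source symbol.
    source_embeddings: one vector per source symbol, taken from the
    source-language model.
    """
    weights = softmax(scores)
    dim = len(source_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, source_embeddings))
        for d in range(dim)
    ]

# Hypothetical toy data: three source symbols with 2-dimensional embeddings.
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical scores for one target symbol, strongly favoring source symbol 0.
scores_for_target = [5.0, 0.0, 0.0]

target_embedding = map_target_symbol(scores_for_target, source_embeddings)
```

With scores this peaked, the resulting target embedding lies close to that of the first source symbol, which is the sense in which pronunciation information carried by the source embeddings could be reused for the target language.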
