Multilingual Byte2Speech Models for Scalable Low-resource Speech Synthesis

  • 2021-07-09 11:29:43
  • Mutian He, Jingzhou Yang, Lei He, Frank K. Soong
To scale neural speech synthesis to various real-world languages, we present a multilingual end-to-end framework that maps byte inputs to spectrograms, thus allowing arbitrary input scripts. Besides strong results on 40+ languages, the framework demonstrates the capability to adapt to new languages under extreme low-resource and even few-shot scenarios of merely 40 seconds of transcribed recordings, without the need for per-language resources such as a lexicon, extra corpora, auxiliary models, or linguistic expertise, thus ensuring scalability, while retaining satisfactory intelligibility and naturalness that match those of rich-resource models. Exhaustive comparative and ablation studies are performed to reveal the potential of the framework for low-resource languages. Furthermore, we propose a novel method to extract language-specific sub-networks in a multilingual model for a better understanding of its mechanism.
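As a rough illustration of why byte inputs allow arbitrary scripts, the sketch below encodes text as UTF-8 and treats each byte as an input ID in the range 0 to 255. The function names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical, and the abstract does not specify the framework's actual preprocessing; this only shows the general idea that a fixed vocabulary of 256 symbols can cover any script.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte.

    Hypothetical helper: assumes the model consumes raw UTF-8 bytes.
    """
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids by decoding the byte sequence as UTF-8."""
    return bytes(ids).decode("utf-8")


if __name__ == "__main__":
    # Latin, Chinese, and Cyrillic inputs all map into the same 256-symbol space.
    for sample in ["hello", "你好", "привет"]:
        ids = text_to_byte_ids(sample)
        print(sample, len(ids), ids)
```

Non-Latin characters expand to several bytes each, so input sequences grow longer than with character or phoneme inputs; in exchange, no per-language symbol inventory or lexicon is needed.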


