Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

  • 2023-03-07 14:31:55
  • Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available at https://aka.ms/vallex.


