Abstract
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, which can be controlled by a language ID. Audio samples are available at \url{https://aka.ms/vallex}.