Abstract
This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.