Unsupervised Cross-lingual Representation Learning for Speech Recognition

Abstract

This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.
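
The reported gains are relative reductions, not absolute differences in error rate. As a minimal sketch of that arithmetic, the snippet below computes a relative reduction from two error rates. The function name `relative_reduction` and the numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Return the relative error rate reduction as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - new) / baseline


# Hypothetical error rates (in percent), used only to illustrate the formula.
baseline_error = 25.0
new_error = 7.0

reduction = relative_reduction(baseline_error, new_error)
print(f"relative reduction: {reduction:.0%}")
```

Under this definition, the same relative reduction can correspond to very different absolute improvements depending on how high the baseline error rate is.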
