Unsupervised Cross-lingual Representation Learning for Speech Recognition

  • 2020-06-24 18:25:05
  • Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli
  • 58


This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.
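
The contrastive task mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact objective, so this assumes an InfoNCE-style loss: for one masked time step, the context vector must identify the true quantized latent among a set of distractors, scored by cosine similarity with a temperature. The names `cosine`, `contrastive_loss`, and the default `temperature=0.1` are hypothetical choices for illustration, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among all candidates.

    context:     context vector at a masked time step (assumed name)
    target:      true quantized latent for that step (assumed name)
    distractors: quantized latents sampled from other steps (assumed name)
    temperature: softmax temperature (illustrative value, not from the paper)
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # The true target sits at index 0 of the candidate list.
    return log_norm - logits[0]


if __name__ == "__main__":
    target = [1.0, 0.0]
    distractors = [[0.0, 1.0], [-1.0, 0.0]]
    # A context aligned with the target gives a small loss;
    # a context aligned with a distractor gives a large one.
    print(contrastive_loss([1.0, 0.0], target, distractors))
    print(contrastive_loss([0.0, 1.0], target, distractors))
```

In this sketch the loss falls as the context vector aligns with the true latent and rises as it aligns with a distractor. A contrastive pretraining objective of this kind would exploit that signal. The joint learning of the quantization is not modeled here.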


