CLSRIL-23: Cross Lingual Speech Representations for Indic Languages

Abstract

We present CLSRIL-23, a self-supervised, pre-trained audio model that learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to study the effects of monolingual and multilingual pretraining. Performance on some downstream fine-tuning tasks for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual training, both in terms of learning speech representations that encode the phonetic similarity of languages and in terms of performance on downstream tasks. A decrease of 5% in WER and 9.5% in CER is observed when a multilingual pretrained model is used for fine-tuning in Hindi. All the code and models are also open sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be created using the self-supervised approach, especially for low-resource Indic languages.
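
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed from a reference transcript and a model hypothesis. It is not the evaluation code used for CLSRIL-23; the function names are hypothetical, and counting spaces as characters in CER is an assumption, since conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are counted as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")
    print(f"CER = {cer(ref, hyp):.3f}")
```

Both metrics are lower-is-better and can exceed 1.0 when the hypothesis contains many insertions.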
