MuRIL: Multilingual Representations for Indian Languages

Abstract

India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second largest (and an ever growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of IN languages in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), as limited data doesn't help capture the various nuances of a language. One also commonly observes IN language text transliterated to Latin or code-mixed with English, especially in informal settings (for example, on social media platforms) (Rijhwani et al., 2017). This phenomenon is not adequately handled by current state-of-the-art multilingual LMs. To address the aforementioned gaps, we propose MuRIL, a multilingual LM specifically built for IN languages. MuRIL is trained on significantly large amounts of IN text corpora only. We explicitly augment monolingual text corpora with both translated and transliterated document pairs, that serve as supervised cross-lingual signals in training. MuRIL significantly outperforms multilingual BERT (mBERT) on all tasks in the challenging cross-lingual XTREME benchmark (Hu et al., 2020). We also present results on transliterated (native to Latin script) test sets of the chosen datasets and demonstrate the efficacy of MuRIL in handling transliterated data.
