Abstract
Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, incorporating a new language in an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper we argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs, and propose RelateLM. We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in our case). While exploiting similar sentence structures, RelateLM utilizes readily available bilingual dictionaries to pseudo translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets provide validation to our hypothesis that using a related language as pivot, along with transliteration and pseudo translation based data augmentation, can be an effective way to adapt LMs for LRLs, rather than direct training or pivoting through English.
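
The two augmentation steps named above, transliteration into the RPL script and dictionary-based pseudo translation, can be illustrated with a minimal sketch. This is not the RelateLM implementation: the abstract does not specify how either step is carried out, so the sketch assumes (a) transliteration by code point offset, relying on the fact that the Unicode blocks for the major Brahmic-derived scripts are laid out in parallel, and (b) word-by-word dictionary substitution that keeps the source word order. The function names `transliterate_to_devanagari` and `pseudo_translate`, and the `SCRIPT_START` table, are hypothetical names introduced for illustration.

```python
from typing import Dict

# Illustrative sketch only; not the method used in the paper.

DEVANAGARI_START = 0x0900  # first code point of the Unicode Devanagari block
BLOCK_SIZE = 0x80          # each of these Indic script blocks spans 128 code points

# First code point of the Unicode block for a few Brahmic-derived scripts.
SCRIPT_START = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}


def transliterate_to_devanagari(text: str, script: str) -> str:
    """Map each character of `script` to the Devanagari code point at the
    same offset within its block. This is an approximation (assumption):
    some offsets are unassigned or script-specific in one block or the other."""
    start = SCRIPT_START[script]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(DEVANAGARI_START + offset))
        else:
            out.append(ch)  # spaces, digits, Latin text, etc. pass through
    return "".join(out)


def pseudo_translate(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace each whitespace-separated token with its dictionary entry,
    keeping word order unchanged; tokens without an entry are left as-is."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())
```

Keeping the word order fixed in `pseudo_translate` only makes sense when the two languages share similar sentence structure, which is the relatedness assumption the abstract relies on. A practical system would also need to handle morphology, multi-word entries, and words with several candidate translations, none of which this sketch attempts.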