Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study

Abstract

Recent research in multilingual language models (LM) has demonstrated their ability to effectively handle multiple languages in a single model. This holds promise for low web-resource languages (LRL) as multilingual models can enable transfer of supervision from high resource languages to LRLs. However, incorporating a new language in an LM still remains a challenge, particularly for languages with limited corpora and in unseen scripts. In this paper we argue that relatedness among languages in a language family may be exploited to overcome some of the corpora limitations of LRLs, and propose RelateLM. We focus on Indian languages, and exploit relatedness along two dimensions: (1) script (since many Indic scripts originated from the Brahmic script), and (2) sentence structure. RelateLM uses transliteration to convert the unseen script of limited LRL text into the script of a Related Prominent Language (RPL) (Hindi in our case). While exploiting similar sentence structures, RelateLM utilizes readily available bilingual dictionaries to pseudo translate RPL text into LRL corpora. Experiments on multiple real-world benchmark datasets provide validation to our hypothesis that using a related language as pivot, along with transliteration and pseudo translation based data augmentation, can be an effective way to adapt LMs for LRLs, rather than direct training or pivoting through English.
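
The two augmentation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a naive code-point offset between Unicode blocks is an acceptable stand-in for a real transliterator (the Brahmi-derived blocks are laid out in parallel, but not every offset is assigned in every block, so some outputs may be unassigned code points), and that pseudo translation is plain word-by-word dictionary lookup. The names `SCRIPT_BASE`, `transliterate`, and `pseudo_translate` are hypothetical.

```python
# Base code points of a few Brahmi-derived Unicode blocks.
# Each block spans 128 code points with a largely parallel layout.
SCRIPT_BASE = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, src: str, tgt: str) -> str:
    """Shift characters from the src script block to the same offset in tgt.

    Assumption: offset mapping approximates transliteration; characters
    outside the src block (spaces, digits, Latin text) pass through unchanged.
    """
    src_base, tgt_base = SCRIPT_BASE[src], SCRIPT_BASE[tgt]
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)
    return "".join(out)


def pseudo_translate(sentence: str, dictionary: dict) -> str:
    """Replace each whitespace-separated token with its dictionary entry.

    Assumption: word order is kept as-is (relying on similar sentence
    structure), and tokens missing from the dictionary are left unchanged.
    """
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())
```

For instance, `transliterate("\u0995", "bengali", "devanagari")` maps the Bengali letter KA to the Devanagari letter KA at the same block offset. A production pipeline would more likely use a dedicated transliteration tool and handle morphology, which this sketch ignores.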
