Bilingual Language Modeling, A transfer learning technique for Roman Urdu

Abstract

Pretrained language models are now widely used in Natural Language Processing. Despite their success, applying them to low-resource languages is still a huge challenge. Although multilingual models hold great promise, applying them to specific low-resource languages, e.g. Roman Urdu, can be excessive. In this paper, we show how the code-switching property of languages may be used to perform cross-lingual transfer learning from a corresponding high-resource language. We also show how this transfer learning technique, termed Bilingual Language Modeling, can be used to produce better-performing models for Roman Urdu. To enable training and experimentation, we also present a collection of novel corpora for Roman Urdu extracted from various sources and social networking sites, e.g. Twitter. We train monolingual, multilingual, and bilingual models of Roman Urdu; the proposed bilingual model achieves 23% accuracy, compared to 2% and 11% for the monolingual and multilingual models respectively, on the Masked Language Modeling (MLM) task.
