Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Abstract

In this paper, we explore the utility of \textit{Translationese} as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been a growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding (NLU) and generation (NLG) tasks is only 3.56\% and 1.51\% poorer, respectively, than that of LMs pre-trained on clean data. Further, we propose the use of lightweight \textit{TinyLMs} pre-trained on clean data to filter synthetic data efficiently, which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pre-training on a tiny fraction (10\%) of clean data. We release the data we collected and created as a part of this work, \textit{IndicMonoDoc}, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
