Do Not Worry if You Do Not Have Data: Building Pretrained Language Models Using Translationese

Abstract

In this paper, we explore the utility of Translationese as synthetic data created using machine translation for pre-training language models (LMs). Pre-training requires vast amounts of monolingual data, which is mostly unavailable for languages other than English. Recently, there has been growing interest in using synthetic data to address this data scarcity. We take the case of English and Indic languages and translate web-crawled monolingual documents (clean) into the target language. Then, we train language models containing 28M and 85M parameters on this translationese data (synthetic). We show that their performance on downstream natural language understanding (NLU) and natural language generation (NLG) tasks is only 3.56% poorer on NLU tasks and 1.51% poorer on NLG tasks than that of LMs pre-trained on clean data. Further, we propose the use of lightweight TinyLMs pre-trained on clean data to filter synthetic data efficiently, which significantly improves the performance of our models. We also find that LMs trained on synthetic data strongly benefit from extended pre-training on a tiny fraction (10%) of clean data. We release the data we collected and created as a part of this work, IndicMonoDoc, the largest collection of monolingual document-level corpora, which we hope will help bridge the gap between English and non-English performance for large language models.
