Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

  • 2021-04-21 10:21:24
  • Tim Isbister, Fredrik Carlsson, Magnus Sahlgren
Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling the use of pretrained, large-scale English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception to this is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages. This paper therefore strives to make a provocative but important point. As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is from an empirical and environmental standpoint more effective to translate data from low-resource languages into English than to build language models for such languages.
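
The translate-then-classify idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`translate_then_classify`, `toy_translate`, `toy_english_classifier`), the tiny Swedish-to-English lexicon, and the choice of a sentiment-style labelling task are all assumptions made for the example. In practice the two callables would wrap a real machine translation system and a pretrained English language model.

```python
from typing import Callable, Iterable, List

# Both stages are plain callables so any MT system or English model can be plugged in.
Translator = Callable[[str], str]
Classifier = Callable[[str], str]


def translate_then_classify(
    texts: Iterable[str],
    translate: Translator,
    classify: Classifier,
) -> List[str]:
    """Translate each source-language text into English, then label it
    with a classifier that only understands English."""
    return [classify(translate(text)) for text in texts]


# --- Hypothetical stand-ins, for illustration only ---------------------------
# A word-by-word lexicon is NOT real machine translation; it just makes the
# sketch runnable without external dependencies.
TOY_LEXICON = {"bra": "good", "dålig": "bad", "film": "movie"}


def toy_translate(text: str) -> str:
    """Replace each known Swedish word with its English entry."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


def toy_english_classifier(text: str) -> str:
    """Stand-in for a large English model: a single keyword rule."""
    return "positive" if "good" in text.split() else "negative"


predictions = translate_then_classify(
    ["Bra film", "Dålig film"], toy_translate, toy_english_classifier
)
print(predictions)
```

Keeping the translator and the classifier as separate, swappable callables mirrors the argument of the abstract: either component can be upgraded as English models and translation systems improve, without training a new native language model.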


