WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Abstract

Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language and token embeddings are initialized such that they are semantically similar to the English tokens by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). We also study the benefits of our method on very low-resource languages. WECHSEL improves over proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available.
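
The initialization idea can be illustrated with a minimal sketch. It is not the released implementation, and the abstract does not specify the exact procedure. The sketch assumes each source and target subword already has a vector in a shared (aligned) static embedding space, and that a target embedding is formed as a similarity-weighted average of the embeddings of its nearest source subwords. The function name `init_target_embeddings` and the parameters `k` and `temperature` are hypothetical.

```python
import numpy as np


def init_target_embeddings(src_emb, src_static, tgt_static, k=2, temperature=0.1):
    """Illustrative sketch (assumed procedure, not the authors' code).

    src_emb:    (n_src, d) trained subword embeddings of the source LM
    src_static: (n_src, m) static vectors for source subwords, aligned space
    tgt_static: (n_tgt, m) static vectors for target subwords, same space
    Returns an (n_tgt, d) matrix to initialize the target LM's embeddings.
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    s = src_static / (np.linalg.norm(src_static, axis=1, keepdims=True) + eps)
    t = tgt_static / (np.linalg.norm(tgt_static, axis=1, keepdims=True) + eps)

    # Cosine similarity between every target and every source subword.
    sim = t @ s.T  # (n_tgt, n_src)

    # Indices and similarities of the k most similar source subwords.
    k = min(k, sim.shape[1])
    nn_idx = np.argsort(-sim, axis=1)[:, :k]              # (n_tgt, k)
    nn_sim = np.take_along_axis(sim, nn_idx, axis=1)      # (n_tgt, k)

    # Softmax over the k neighbours (assumed weighting scheme).
    logits = nn_sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted average of the neighbours' trained source embeddings.
    return np.einsum("tk,tkd->td", weights, src_emb[nn_idx])
```

In this sketch only the embedding matrix is re-initialized; all other parameters of the source model would be copied unchanged before continuing training on target-language text.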
