Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Abstract

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
