LangSAMP: Language-Script Aware Multilingual Pretraining

Abstract

Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings (learnable vectors assigned to different languages). These embeddings are discarded for two main reasons: (1) mPLMs are expected to have a single, unified parameter set across all languages, and (2) they need to function seamlessly as universal text encoders without requiring language IDs as input. However, this removal increases the burden on token embeddings to encode all language-specific information, which may hinder the model's ability to produce more language-neutral representations. To address this challenge, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning while maintaining a simple architecture. Specifically, we integrate these embeddings into the output of the transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline. Extensive analysis further shows that language/script embeddings encode language/script-specific information, which improves the selection of source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
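
The placement described in the abstract, with language and script embeddings applied to the transformer output just before the language modeling head, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract only says the embeddings are "integrated", so combining them by element-wise addition is a guess, and the helper names (`make_table`, `add_lang_script`, `lm_head`), the dimensions, and the use of plain Python lists with random values in place of learned parameters are all hypothetical. It is not the released implementation.

```python
import random

def make_table(rows, dim, rng):
    """Hypothetical helper: random vectors standing in for learnable parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]

def add_lang_script(hidden, lang_vec, script_vec):
    """Add the language and script vectors to every token's final hidden state.

    Element-wise addition is an assumption; the abstract does not specify
    how the embeddings are integrated.
    """
    return [[h + l + s for h, l, s in zip(tok, lang_vec, script_vec)]
            for tok in hidden]

def lm_head(hidden, out_weights):
    """Project each token vector to vocabulary logits (one dot product per vocab row)."""
    return [[sum(h * w for h, w in zip(tok, row)) for row in out_weights]
            for tok in hidden]

rng = random.Random(0)
dim, vocab, n_langs, n_scripts = 4, 6, 3, 2   # toy sizes, chosen arbitrarily
lang_emb = make_table(n_langs, dim, rng)      # one vector per language ID
script_emb = make_table(n_scripts, dim, rng)  # one vector per script ID
out_weights = make_table(vocab, dim, rng)     # language modeling head weights

# Stand-in for the transformer-block output of a 5-token sequence.
hidden = make_table(5, dim, rng)

# Pretraining path: language/script IDs are consumed only right before the head.
logits = lm_head(add_lang_script(hidden, lang_emb[1], script_emb[0]), out_weights)

# Encoder path: `hidden` is used directly, so no language ID is needed as input.
sentence_vec = [sum(col) / len(hidden) for col in zip(*hidden)]
```

Under these assumptions, the language and script vectors affect only the prediction path, so the representations taken from the transformer output can still be used without supplying a language ID, which matches the second design constraint named in the abstract.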
