Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment

Abstract

Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and $\textbf{South+East Asian Languages}$, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at \url{https://github.com/cisnlp/Transliteration-PPA}.
