mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models

Abstract

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.
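As a rough illustration of the idea of inducing language similarity from a multi-parallel corpus, the sketch below scores a language pair by the mean cosine similarity of row-aligned sentence embeddings and then ranks candidate source languages for a target. This is an assumption-laden toy, not the procedure specified in the abstract: the function names `language_similarity` and `rank_sources` are hypothetical, the choice of cosine similarity and averaging is assumed, and random vectors stand in for embeddings that would in practice be taken from a chosen layer of an mPLM.

```python
import numpy as np


def language_similarity(emb_a, emb_b):
    """Mean cosine similarity between row-aligned sentence embeddings.

    Row i of emb_a and row i of emb_b are assumed to encode the same
    sentence of a multi-parallel corpus in two different languages.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))


def rank_sources(target, embeddings):
    """Rank every other language by its similarity to the target language."""
    scores = {
        lang: language_similarity(embeddings[target], emb)
        for lang, emb in embeddings.items()
        if lang != target
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy stand-in for mPLM sentence embeddings (hypothetical data):
# 100 parallel sentences, 16 dimensions, three made-up languages.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 16))
embeddings = {
    "tgt": base,
    "near": base + 0.1 * rng.normal(size=base.shape),  # small perturbation
    "far": base + 2.0 * rng.normal(size=base.shape),   # large perturbation
}

ranking = rank_sources("tgt", embeddings)
best_source = ranking[0][0]  # top-ranked candidate source language
```

In a transfer setting, the top-ranked language would be used as the source for fine-tuning before zero-shot evaluation on the target. Since the abstract reports that similarity results vary across mPLMs and across layers, the model and layer supplying the embeddings would need to be chosen deliberately.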
