A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias

Abstract

Building on Large Language Models (LLMs), Multilingual LLMs (MLLMs) have been developed to address the challenges of multilingual natural language processing, with the aim of transferring knowledge from high-resource languages to low-resource languages. However, significant limitations and challenges remain, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we provide a comprehensive analysis of MLLMs and discuss these critical issues in depth. First, we present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Second, we explore the multilingual training corpora of MLLMs and the multilingual datasets for downstream tasks, both of which are crucial to enhancing the cross-lingual capability of MLLMs. Third, we survey state-of-the-art studies of multilingual representations and investigate whether current MLLMs can learn a universal language representation. Fourth, we discuss bias in MLLMs, including its categories, evaluation metrics, and debiasing techniques. Finally, we discuss existing challenges and point out promising research directions for MLLMs.
