MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing

Abstract

Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data. Enhancing non-English language capabilities through post-pretraining often results in catastrophic forgetting of the model's abilities in its original languages. Previous methods either achieve good expansion with severe forgetting or slight forgetting with poor expansion, indicating the challenge of balancing language expansion while preventing forgetting. In this paper, we propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to alleviate this problem. MoE-LPR employs a two-stage training approach to enhance the multilingual capability. First, the model is post-pretrained into a Mixture-of-Experts (MoE) architecture by upcycling, where all the original parameters are frozen and new experts are added. In this stage, we focus on improving the ability on expanded languages, without using any original language data. Then, the model reviews the knowledge of the original languages with replay data amounting to less than 1% of the post-pretraining data, where we incorporate language priors routing to better recover the abilities of the original languages. Evaluations on multiple benchmarks show that MoE-LPR outperforms other post-pretraining methods. Freezing the original parameters preserves original language knowledge, while adding new experts preserves the learning ability. Reviewing with LPR enables effective utilization of multilingual knowledge within the parameters. Additionally, the MoE architecture maintains the same inference overhead while increasing total model parameters. Extensive experiments demonstrate MoE-LPR's effectiveness in improving expanded languages and preserving original language proficiency with superior scalability. Code and scripts are freely available at https://github.com/zjwang21/MoE-LPR.git.
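
To make the two ingredients named in the abstract concrete, the following is a minimal, framework-free sketch of top-k expert routing with a language prior. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the form of the prior, so the cross-entropy term toward a designated "original" expert is one plausible reading, and all function names, the scalar experts, and the choice of expert 0 as the frozen original are hypothetical. For the actual method, consult the released repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Return (expert_index, renormalized_weight) pairs for the k highest-scoring experts."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, logits, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(logits, k))

def language_prior_loss(logits, is_original_language, original_expert=0):
    """Assumed prior (not taken from the abstract): cross-entropy that pushes
    original-language tokens toward the expert holding the original weights.
    Tokens from expanded languages contribute nothing."""
    if not is_original_language:
        return 0.0
    return -math.log(softmax(logits)[original_expert])

# Hypothetical scalar "experts": index 0 stands in for the frozen original
# feed-forward block (a fixed function here), the rest for newly added experts.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
```

With these stand-ins, `moe_forward(3.0, experts, [2.0, 1.0, 0.0], k=1)` routes the token only to expert 0, and `language_prior_loss` is smaller when the router already favors that expert for an original-language token. In a real model the experts would be feed-forward networks and freezing would mean excluding the original parameters from gradient updates.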
