Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts

Abstract

Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving its proficiency in old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts, and avoids catastrophic forgetting of old languages by routing the corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find that different layers in LLMs exhibit different representation similarities between languages, and we then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher the similarity, the fewer the experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
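
The allocation idea stated above (the higher a layer's cross-lingual similarity, the fewer new experts it receives) can be illustrated with a minimal sketch. The abstract does not give the actual allocation formula, so everything concrete here is an assumption made for illustration: the function name `allocate_experts`, the proportional-to-`(1 - similarity)` rule, the fixed total budget, and the per-layer minimum are all hypothetical and should not be read as the authors' LayerMoE algorithm.

```python
from typing import List


def allocate_experts(similarities: List[float],
                     total_budget: int,
                     min_experts: int = 1) -> List[int]:
    """Split a budget of new experts across layers.

    ASSUMPTION (not from the paper): each layer gets `min_experts`, and the
    remaining budget is shared in proportion to (1 - similarity), so layers
    whose old/new language representations are more similar get fewer experts.
    `similarities` holds one value in [0, 1] per layer.
    """
    n_layers = len(similarities)
    if total_budget < min_experts * n_layers:
        raise ValueError("budget too small for the per-layer minimum")

    # Lower similarity -> larger weight -> more experts.
    weights = [1.0 - s for s in similarities]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every layer is maximally similar: fall back to an even split.
        weights = [1.0] * n_layers
        total_weight = float(n_layers)

    spare = total_budget - min_experts * n_layers
    raw = [spare * w / total_weight for w in weights]
    counts = [min_experts + int(r) for r in raw]

    # Hand out the experts lost to rounding, largest fractional part first.
    leftover = total_budget - sum(counts)
    order = sorted(range(n_layers),
                   key=lambda i: raw[i] - int(raw[i]),
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical similarity scores for three layers (illustrative values only).
example_counts = allocate_experts([0.9, 0.5, 0.1], total_budget=9)
```

In this sketch the layer with similarity 0.9 receives the fewest experts and the layer with similarity 0.1 the most, while the counts always sum to the requested budget. How the similarity itself is measured, and how the classifier in front of the router is trained, are outside the scope of this example.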
