Abstract
Large language models (LLMs), with their powerful generative capabilities and vast knowledge, empower various tasks in everyday life. However, these abilities are primarily concentrated in high-resource languages, leaving low-resource languages with weaker generative capabilities and relatively limited knowledge. Enhancing the multilingual capabilities of LLMs is therefore crucial for serving over 100 linguistic communities worldwide. An intuitive approach to enhancing multilingual capabilities would be to construct instruction data for various languages, but constructing instruction data for over 100 languages is prohibitively costly. In this paper, we introduce BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages through language alignment. To achieve this, we constructed a dataset of 3.2 million instructions, comprising high-resource language instructions (Chinese and English) and cross-lingual instructions for 100+ languages, and performed instruction tuning on this dataset to facilitate capability transfer between languages. Using Llama as the foundation model, we developed BayLing-2-7B, BayLing-2-13B, and BayLing-3-8B, and conducted a comprehensive evaluation of BayLing. For multilingual translation across 100+ languages, BayLing shows superior performance compared to open-source models of similar scale. On multilingual knowledge and understanding benchmarks, BayLing achieves significant improvements across over 20 low-resource languages, demonstrating effective knowledge transfer from high-resource to low-resource languages. Furthermore, results on English benchmarks indicate that BayLing maintains high performance in high-resource languages while enhancing performance in low-resource languages. Demo, homepage, code and models of BayLing are available.