Abstract
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high-, medium-, and low-resource languages with pre-training corpora of varying qualities.