Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Abstract

We identify \textbf{Cross-lingual Collapse}, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) trained with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performance by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate this issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM8K and SimpleRL-Zoo datasets in three languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and the language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) a language consistency reward mitigates this drift, but at the expense of a drop of roughly 5 to 10 percentage points in accuracy; and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: \textit{not all languages are trained equally for reasoning}. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.
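To make the notion of language consistency concrete, the following is a minimal sketch of how one might score a reasoning chain against its target language. It is not the measurement used in the paper, which the abstract does not specify. The function name `language_consistency`, the `SCRIPT_PREFIXES` table, and the script-based heuristic are all assumptions introduced here for illustration.

```python
import unicodedata

# Hypothetical table: target language code -> prefixes of the Unicode
# character names used by that language's script. This is an illustrative
# assumption, not the paper's language-identification method.
SCRIPT_PREFIXES = {
    "zh": ("CJK",),
    "ko": ("HANGUL",),
    "uk": ("CYRILLIC",),
}


def language_consistency(text: str, lang: str) -> float:
    """Return the fraction of alphabetic characters in the target script.

    Digits, punctuation, and whitespace are ignored, so numeric answers
    inside a reasoning chain do not affect the score.
    """
    prefixes = SCRIPT_PREFIXES[lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith(prefixes)
    )
    return in_script / len(letters)


# A Korean reasoning chain that stays in Korean versus one that has
# drifted to English.
consistent = language_consistency("답은 42입니다", "ko")
collapsed = language_consistency("The answer is 42", "ko")
```

A score near 1 suggests the chain stayed in the target script, while a score near 0 suggests it drifted to another language. The heuristic is coarse: it cannot separate Ukrainian from other Cyrillic-script languages or Chinese from Hanja or Kanji, so a real pipeline would more plausibly rely on a trained language identifier.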
