Crosslingual Reasoning through Test-Time Scaling

  • 2025-05-08 17:50:06
  • Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji

Abstract

Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages, including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms, and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.

 
