Abstract
Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages, including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages. Finally, we observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English. Overall, we demonstrate the potential, study the mechanisms, and outline the limitations of crosslingual generalization of English reasoning test-time scaling. We conclude that practitioners should let English-centric RLMs reason in high-resource languages, while further work is needed to improve reasoning in low-resource languages and out-of-domain contexts.