Abstract
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.