Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
