Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

  • 2025-06-17 21:24:00
  • Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
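
For readers unfamiliar with the Pass@k metric mentioned above, the following is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are verified correct. The abstract does not state which estimator the authors use, so this is an assumption for illustration only; the function name `pass_at_k` and its parameters are hypothetical and are not taken from the released code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem (illustrative sketch, not the authors' code).

    n: number of sampled responses for the problem
    c: number of those responses judged correct by the verifier
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is then the mean over problems, for example `mean_pass_at_k([(16, 2), (16, 0)], k=4)`. Comparing this value between a base model and its RL-trained counterpart at several values of k is one way to examine the kind of Pass@k improvement the abstract describes.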

 
