Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Abstract

Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists: RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses, and this guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; and (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
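
To make the two roles of a rubric concrete, the following is a minimal illustrative sketch, not the released implementation. The abstract does not specify the decay schedule, how rubric items are selected per rollout, or how judge verdicts are aggregated, so the linear decay, the random subset sampling, and the points-weighted score below are all assumptions, as are the names `Criterion`, `scaffold_ratio`, `build_prompt`, and `rubric_reward`. The per-criterion verdicts would come from an LLM judge, which is represented here only as a list of booleans.

```python
import random
from dataclasses import dataclass


@dataclass
class Criterion:
    """One checklist item of a rubric (hypothetical structure)."""
    text: str
    points: float  # assumed positive weight of this item


def scaffold_ratio(step: int, total_steps: int) -> float:
    """Fraction of rubric items shown at a training step.

    Assumption: linear decay from 1.0 to 0.0; the abstract only says
    the guidance is gradually decayed over time.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def build_prompt(instruction: str, rubric: list[Criterion],
                 ratio: float, rng: random.Random) -> str:
    """Attach a random subset of rubric items to the task instruction.

    Sampling a different subset per rollout is one assumed way to give
    each rollout different guidance.
    """
    k = int(ratio * len(rubric))
    if k == 0:
        return instruction  # scaffolding fully removed
    shown = rng.sample(rubric, k)
    checklist = "\n".join(f"- {c.text}" for c in shown)
    return f"{instruction}\n\nChecklist to satisfy:\n{checklist}"


def rubric_reward(rubric: list[Criterion], judged: list[bool]) -> float:
    """Points-weighted fraction of criteria the judge marked as met.

    Assumption: `judged[i]` is the judge's yes/no verdict for rubric[i].
    """
    if len(rubric) != len(judged):
        raise ValueError("one verdict per criterion is required")
    total = sum(c.points for c in rubric)
    if total <= 0:
        return 0.0
    earned = sum(c.points for c, ok in zip(rubric, judged) if ok)
    return earned / total


if __name__ == "__main__":
    rubric = [
        Criterion("States when to seek emergency care", 2.0),
        Criterion("Avoids giving a definitive diagnosis", 1.0),
        Criterion("Asks about symptom duration", 1.0),
    ]
    rng = random.Random(0)
    for step in (0, 5, 10):
        ratio = scaffold_ratio(step, total_steps=10)
        print(f"--- step {step}, ratio {ratio:.1f}")
        print(build_prompt("Answer the patient's question.", rubric, ratio, rng))
    print(rubric_reward(rubric, [True, False, True]))
```

In this sketch the full rubric is always used for scoring, while the portion shown in the prompt shrinks to nothing by the final step, so late rollouts are rewarded against criteria the model no longer sees.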
