SHARP: Synthesizing High-quality Aligned Reasoning Problems for Large Reasoning Models Reinforcement Learning

Abstract

Training large reasoning models (LRMs) with reinforcement learning in STEM domains is hindered by the scarcity of high-quality, diverse, and verifiable problem sets. Existing synthesis methods, such as Chain-of-Thought prompting, often generate oversimplified or uncheckable data, limiting model advancement on complex tasks. To address these challenges, we introduce SHARP, a unified approach to Synthesizing High-quality Aligned Reasoning Problems for LRMs reinforcement learning with verifiable rewards (RLVR). SHARP encompasses a strategic set of self-alignment principles -- targeting graduate and Olympiad-level difficulty, rigorous logical consistency, and unambiguous, verifiable answers -- and a structured three-phase framework (Alignment, Instantiation, Inference) that ensures thematic diversity and fine-grained control over problem generation. We implement SHARP by leveraging a state-of-the-art LRM to infer and verify challenging STEM questions, then employ a reinforcement learning loop to refine the model's reasoning through verifiable reward signals. Experiments on benchmarks such as GPQA demonstrate that SHARP-augmented training substantially outperforms existing methods, markedly improving complex reasoning accuracy and pushing LRM performance closer to expert-level proficiency. Our contributions include the SHARP strategy, framework design, end-to-end implementation, and experimental evaluation of its effectiveness in elevating LRM reasoning capabilities.
