GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.
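
To make the idea of difficulty-aware prompt refinement concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify how difficulty is detected or how guidance is injected, so everything here is an assumption: a problem is treated as "beyond the model's reach" when every sampled response receives zero reward, and guidance takes the form of a prefix of a reference solution appended to the prompt. The names `needs_guidance`, `refine_prompt`, and `hint_fraction` are hypothetical.

```python
from typing import Sequence


def needs_guidance(rewards: Sequence[float]) -> bool:
    """Assumed difficulty test: all sampled responses failed verification."""
    return len(rewards) > 0 and all(r <= 0.0 for r in rewards)


def refine_prompt(
    question: str,
    reference_solution: str,
    rewards: Sequence[float],
    hint_fraction: float = 0.5,  # hypothetical knob: share of the solution revealed
) -> str:
    """Return the prompt to train on for this problem.

    If at least one sampled response was rewarded, the question is left
    unchanged so the model keeps exploring on its own. Otherwise a partial
    reference solution is appended as a hint (an assumed form of guidance).
    """
    if not needs_guidance(rewards):
        return question

    words = reference_solution.split()
    if not words:
        # No reference solution available, so nothing to reveal.
        return question

    n_hint = max(1, int(len(words) * hint_fraction))
    hint = " ".join(words[:n_hint])
    return f"{question}\nHint (partial solution): {hint}"
```

Under these assumptions, a problem with rewards such as `[0.0, 1.0]` keeps its original prompt, while one with rewards `[0.0, 0.0]` is retried with part of the reference solution revealed, which pushes training on that problem toward imitation rather than pure exploration.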
