Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

  • 2025-10-22 17:41:30
  • Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia

Abstract

Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO's effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
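
The advantage collapse described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-relative normalization (each reward minus the group mean, divided by the group standard deviation); the abstract itself does not state the formula, and the function name and the `eps` stabilizer are hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    Assumption: this mirrors the usual GRPO-style advantage; `eps` is an
    illustrative stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout in the group fails: all rewards are zero, so every
# advantage is zero and the problem contributes no gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single success in the group restores a non-zero signal: the successful
# rollout gets a positive advantage and the failures get negative ones.
mixed = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

print(all_fail)
print(mixed)
```

Under this assumption, a problem the model never solves produces a group of identical rewards, which is why one successful rollout, for example one obtained with the help of an in-prompt hint, is enough to make the problem visible to the optimizer again.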

 
