EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity

Abstract

Large Language Models (LLMs) have made remarkable progress in enhancing step-by-step reasoning through reinforcement learning. However, the Group Relative Policy Optimization (GRPO) algorithm, which relies on sparse reward rules, often encounters the issue of identical rewards within groups, leading to the advantage collapse problem. Existing works typically address this challenge from two perspectives: enforcing model reflection to enhance response diversity, and introducing internal feedback to augment the training signal (advantage). In this work, we begin by analyzing the limitations of model reflection and investigating the policy entropy of responses at the fine-grained sample level. Based on our experimental findings, we propose the EDGE-GRPO algorithm, which adopts \textbf{E}ntropy-\textbf{D}riven Advantage and \textbf{G}uided \textbf{E}rror Correction to effectively mitigate the problem of advantage collapse. Extensive experiments on several main reasoning benchmarks demonstrate the effectiveness and superiority of our approach. The code is available at https://github.com/ZhangXJ199/EDGE-GRPO.
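
To make the advantage collapse problem concrete, the following is a minimal sketch of group-relative advantage normalization as commonly described for GRPO; it is not the EDGE-GRPO method and is not taken from the linked repository. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    Assumptions: population standard deviation and a small eps in the
    denominator; actual implementations may differ on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes under a sparse 0/1 reward rule:
# correct responses get positive advantages, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response receives the same reward:
# every advantage is zero, so the group yields no training signal.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(collapsed)
```

When all rewards in a group are identical, each reward equals the group mean, so every normalized advantage is zero regardless of how the responses differ in quality or confidence.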
