Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

  • 2025-08-20 09:10:05
  • Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Abstract

Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
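To make the "numerical feedback" side concrete, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: each sampled response's scalar reward is centered and scaled by the statistics of its group. It is a minimal illustration, not the paper's implementation. The function name `group_relative_advantages`, the `eps` constant, the example rewards, and the choice to pool initial responses and critique-guided refinements into one group are all assumptions made for illustration; the shaping function mentioned in the abstract is not specified there and is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale scalar rewards within one sampled group.

    Hypothetical helper: subtracts the group mean and divides by the
    group (population) standard deviation, with eps guarding against
    a zero spread when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed toy data: 1.0 = correct answer, 0.0 = incorrect answer.
initial_rewards = [0.0, 0.0, 1.0, 0.0]   # sampled initial responses
refinement_rewards = [1.0, 0.0]          # critique-guided refinements

# Assumption: both kinds of response are scored in a single group.
advantages = group_relative_advantages(initial_rewards + refinement_rewards)
```

Under this sketch, correct responses receive positive advantages and incorrect ones negative advantages, so a correct refinement of a persistently failed problem contributes a positive learning signal even when most initial responses fail.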

 
