Abstract
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.