Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

  • 2025-10-28 11:37:01
  • Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
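
The two reward signals described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Episode` record, the binary 0/1 reward values, the weight `beta`, and the choice to model the stage II regularization as an added discriminability bonus are all assumptions made for illustration. The abstract does not specify the exact form of either reward or of the regularizer.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One actor -> critic -> actor round (hypothetical record)."""
    initial_correct: bool      # rule-based check of the actor's first response
    critic_says_correct: bool  # the critic's verdict on that first response
    refined_correct: bool      # rule-based check of the actor's refined response


def discriminability_reward(ep: Episode) -> float:
    # Direct signal: 1 if the critic's verdict matches the rule-based label.
    return 1.0 if ep.critic_says_correct == ep.initial_correct else 0.0


def helpfulness_reward(ep: Episode) -> float:
    # Indirect signal: 1 if the actor's refined response ends up correct.
    return 1.0 if ep.refined_correct else 0.0


def stage_one_reward(ep: Episode) -> float:
    # Stage I: optimize discriminability only.
    return discriminability_reward(ep)


def stage_two_reward(ep: Episode, beta: float = 0.5) -> float:
    # Stage II: helpfulness plus an assumed discriminability term,
    # used here as a stand-in for the paper's regularization.
    return helpfulness_reward(ep) + beta * discriminability_reward(ep)
```

Under these assumptions, a critic that wrongly flags a correct response earns no stage I reward, and in stage II it earns only the helpfulness term even if the refined response stays correct, which is the pressure that keeps discriminability from degrading.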

 
