Abstract
Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
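To make the distinction between the two reward signals concrete, the following is a minimal illustrative sketch, not the paper's implementation. The names `Episode`, `discriminability_reward`, `helpfulness_reward`, `stage_reward`, and the weight `beta` are all hypothetical. In particular, the abstract only states that stage II uses "appropriate regularization"; modeling that as a weighted discriminability term is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One actor -> critic -> actor-refinement round (hypothetical structure)."""
    initial_correct: bool      # rule-based check of the actor's first response
    critic_says_correct: bool  # the critic's verdict on that first response
    refined_correct: bool      # rule-based check of the actor's refined response


def discriminability_reward(ep: Episode) -> float:
    """Direct signal: 1.0 when the critic's verdict matches the rule-based check."""
    return 1.0 if ep.critic_says_correct == ep.initial_correct else 0.0


def helpfulness_reward(ep: Episode) -> float:
    """Indirect signal: 1.0 when the actor's refined response is correct."""
    return 1.0 if ep.refined_correct else 0.0


def stage_reward(ep: Episode, stage: int, beta: float = 0.5) -> float:
    """Critic reward per stage.

    Stage 1 uses only the direct signal. Stage 2 uses the indirect signal
    plus an ASSUMED regularizer: beta times the direct signal.
    """
    if stage == 1:
        return discriminability_reward(ep)
    if stage == 2:
        return helpfulness_reward(ep) + beta * discriminability_reward(ep)
    raise ValueError("stage must be 1 or 2")


# A critic that wrongly rejects a correct response, after which the actor
# still produces a correct refinement.
misjudged = Episode(initial_correct=True, critic_says_correct=False, refined_correct=True)
indirect_only = helpfulness_reward(misjudged)
stage_one = stage_reward(misjudged, stage=1)
```

In this toy case the indirect signal alone gives the critic full credit even though its verdict was wrong, which is one way to picture why optimizing only on the actor's outputs can leave discriminability poor.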