COPO: Consistency-Aware Policy Optimization

Abstract

Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. However, a common challenge observed across many replication and extension efforts is that when multiple responses sampled for a single prompt converge to identical outcomes, whether correct or incorrect, the group-based advantage degenerates to zero. This leads to vanishing gradients and renders the corresponding samples ineffective for learning, ultimately limiting training efficiency and downstream performance. To address this issue, we propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency. The global loss built on this reward ensures that, even when model outputs show high intra-group consistency, the training process still receives meaningful learning signals, which encourages the generation of correct and self-consistent reasoning paths from a global perspective. Furthermore, we incorporate an entropy-based soft blending mechanism that adaptively balances local advantage estimation with global optimization, enabling dynamic transitions between exploration and convergence throughout training. Our method introduces several key innovations in both reward design and optimization strategy. We validate its effectiveness through substantial performance gains on multiple mathematical reasoning benchmarks, highlighting the proposed framework's robustness and general applicability. The code for this work has been released at https://github.com/hijih/copo-code.git.
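
The degeneracy described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used group-normalized advantage, (r_i - mean) / (std + eps), computed over the rule-based rewards of responses sampled for one prompt; the function name `group_advantages`, the `eps` value, and the 0/1 reward values are illustrative assumptions, not taken from the released code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage (assumed form): (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct responses get positive advantage, incorrect negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Identical outcomes: every r_i equals the group mean, so every advantage is 0
# and these samples contribute no policy-gradient signal.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_correct)
print(all_wrong)
```

Under these assumptions, both the all-correct and the all-incorrect group yield only zero advantages, which is the situation the proposed global, consistency-based reward is meant to address.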
