DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization

Abstract

Recent large reasoning models (LRMs) driven by reinforcement learning algorithms (e.g., GRPO) have achieved remarkable performance on challenging reasoning tasks. However, these models suffer from overthinking, generating unnecessarily long and redundant reasoning even for simple questions, which substantially increases computational cost and response latency. While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning. To overcome this, we propose Decoupled Reward Policy Optimization (DRPO), a novel framework that decouples the length-based learning signal of correct rollouts from that of incorrect ones. DRPO ensures that reward signals for correct rollouts are normalized solely within the positive group, shielding them from interference by negative samples. DRPO's objective is built by integrating an optimized positive data distribution, which maximizes length-based rewards under KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient computation of the objective and its gradients using only on-policy data and importance weighting. Of independent interest, this formulation is general and can incorporate preference rewards on positive data other than length. Experiments on mathematical reasoning tasks demonstrate DRPO's significant superiority over six efficient reasoning baselines. Notably, with a 1.5B model, our method achieves a 77% length reduction with only a 1.1% performance loss on simple questions such as those in the GSM8k dataset, while the follow-up baseline sacrifices 4.3% for a 68% length reduction.
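
The failure mode described in the abstract can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not the paper's implementation: the reward shape (`length_penalized_reward`), the penalty strength `alpha`, the temperature `lam`, and the toy rollouts are all assumptions. It first computes GRPO-style group-relative advantages and shows a correct but long rollout receiving a negative advantage. It then computes weights normalized only over the correct rollouts, using the standard closed form of KL-regularized reward maximization (weights proportional to exp(r / lam)), which may differ in detail from the DRPO objective.

```python
import math
from statistics import mean, pstdev


def length_penalized_reward(correct, length, max_length, alpha=0.8):
    """Hypothetical reward: 0 if wrong, else 1 minus a linear length penalty."""
    if not correct:
        return 0.0
    return 1.0 - alpha * length / max_length


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards across the whole group.

    Population std is used here; implementations differ on this detail,
    but the sign of each advantage is unaffected.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def positive_group_weights(rewards, lam=0.5):
    """Softmax of r / lam over correct rollouts only.

    This is the self-normalized form of q(y) proportional to
    p(y) * exp(r(y) / lam), evaluated on samples drawn from p.
    """
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / lam) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


# Toy group: (is_correct, response_length_in_tokens). Assumed values.
rollouts = [(True, 200), (True, 250), (True, 1000), (False, 400)]
rewards = [length_penalized_reward(c, n, max_length=1000) for c, n in rollouts]

advantages = grpo_advantages(rewards)
positive_rewards = [r for (c, _), r in zip(rollouts, rewards) if c]
weights = positive_group_weights(positive_rewards)

for (c, n), r, a in zip(rollouts, rewards, advantages):
    print(f"correct={c!s:5} length={n:4d} reward={r:.2f} advantage={a:+.2f}")
print("positive-group weights:", [round(w, 3) for w in weights])
```

In this toy group the third rollout is correct, yet its penalized reward falls below the group mean, so its group-relative advantage is negative and the update would push probability away from a valid solution. Under positive-only normalization the same rollout keeps a strictly positive weight and is merely down-weighted relative to the shorter correct answers.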
