Value-Free Policy Optimization via Reward Partitioning

Abstract

Single-trajectory reinforcement learning (RL) methods aim to optimize policies from datasets consisting of (prompt, response, reward) triplets, where scalar rewards are directly available. This supervision format is highly practical, as it mirrors real-world human feedback, such as thumbs-up/down signals, and avoids the need for structured preference annotations. In contrast, pairwise preference-based methods like Direct Preference Optimization (DPO) rely on datasets with both preferred and dispreferred responses, which are harder to construct and less natural to collect. Among single-trajectory approaches, Direct Reward Optimization (DRO) has shown strong empirical performance due to its simplicity and stability. However, DRO requires approximating a value function, which introduces several limitations: high off-policy variance, coupling between policy and value learning, and a lack of absolute supervision on the policy itself. We introduce Reward Partitioning Optimization (RPO), a new method that resolves these limitations by removing the need to model the value function. Instead, RPO normalizes observed rewards using a partitioning approach estimated directly from data. This leads to a straightforward supervised learning objective on the policy, with no auxiliary models and no joint optimization. RPO provides direct and stable supervision on the policy, making it robust and easy to implement in practice. We validate RPO on scalar-feedback language modeling tasks using Flan-T5 encoder-decoder models. Our results demonstrate that RPO outperforms existing single-trajectory baselines such as DRO and Kahneman-Tversky Optimization (KTO). These findings confirm that RPO is a simple, effective, and theoretically grounded method for single-trajectory policy optimization.
