Intra-Trajectory Consistency for Reward Modeling

Abstract

Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for those responses. However, because response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization that enforces that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to an advanced outcome reward model, improving its performance on RewardBench. In addition, we show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is available at https://github.com/chaoyang101/ICRM.
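
To make the idea of the regularization concrete, the following is a minimal sketch and not the loss used in the paper or in the ICRM repository. It assumes, purely for illustration, that the penalty is a squared difference between the rewards of adjacent prefixes, weighted by the next-token generation probability. The function name `consistency_penalty` and both argument names are hypothetical.

```python
def consistency_penalty(rewards, next_token_probs):
    """Illustrative penalty on reward jumps between adjacent prefixes.

    rewards[t]          : reward assigned to the prefix ending at token t.
    next_token_probs[t] : generator probability of token t+1 given that prefix.

    ASSUMPTION: probability-weighted squared difference. This is only one
    possible reading of "higher probability implies more consistent rewards";
    the paper derives its own form from a Bayesian analysis.
    """
    if len(rewards) < 2:
        return 0.0
    if len(next_token_probs) != len(rewards) - 1:
        raise ValueError("need exactly one probability per adjacent pair")

    total = 0.0
    for t, prob in enumerate(next_token_probs):
        jump = rewards[t + 1] - rewards[t]
        # A likely next token makes a reward jump more costly.
        total += prob * jump * jump
    return total / len(next_token_probs)


# Usage: the same reward jump is penalized more when the token was likely.
likely = consistency_penalty([0.0, 1.0], [0.9])
unlikely = consistency_penalty([0.0, 1.0], [0.1])
```

In a training setup, a term of this kind would be added to the usual response-level reward loss with a small weight, so that the response-level label still drives learning while the penalty spreads it across prefixes.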
