Abstract
Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the tradeoff between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines.