Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

Abstract

Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e., the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at https://github.com/typoverflow/WiseRL.
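
The two ideas in the abstract, a preference model built on per-step rewards and a step reward obtained by marginalizing over future outcomes, can be illustrated with a minimal sketch. It is not the released implementation. It assumes a Bradley-Terry preference model (a common choice in preference-based RL, not stated in the abstract) and replaces the variational auto-encoder with a generic, hypothetical sampler `sample_outcome`. All function and parameter names are illustrative.

```python
import math
import random


def preference_prob(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Assumption: a Bradley-Terry model over summed per-step rewards.
    Each list holds the (hindsight-conditioned) rewards of one segment.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def marginal_reward(reward_fn, state, action, sample_outcome,
                    n_samples=64, rng=None):
    """Monte Carlo estimate of E_z[r(state, action, z)].

    `reward_fn(state, action, z)` is a hypothetical hindsight-conditioned
    reward; `sample_outcome(state, action, rng)` is a hypothetical stand-in
    for sampling a future outcome z from a learned prior (a VAE in the paper).
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        z = sample_outcome(state, action, rng)
        total += reward_fn(state, action, z)
    return total / n_samples


if __name__ == "__main__":
    # Toy setup: the outcome z is a noisy copy of the action,
    # and the reward is higher when the outcome is larger.
    toy_reward = lambda s, a, z: z
    toy_sampler = lambda s, a, rng: a + rng.gauss(0.0, 0.1)
    r = marginal_reward(toy_reward, state=0.0, action=1.0,
                        sample_outcome=toy_sampler)
    print("marginal reward estimate:", r)
    print("P(A preferred over B):", preference_prob([r, r], [0.0, 0.0]))
```

In this sketch the marginalized reward depends only on the state and action, so it can be attached to every transition of an unlabeled offline dataset before running a standard offline RL algorithm.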
