Preference Elicitation for Offline Reinforcement Learning

Abstract

Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
