Preference-based Reinforcement Learning with Finite-Time Guarantees

Abstract

Preference-based Reinforcement Learning (PbRL) replaces reward values intraditional reinforcement learning by preferences to better elicit humanopinion on the target objective, especially when numerical reward values arehard to design or interpret. Despite promising results in applications, thetheoretical understanding of PbRL is still in its infancy. In this paper, wepresent the first finite-time analysis for general PbRL problems. We first showthat a unique optimal policy may not exist if preferences over trajectories aredeterministic for PbRL. If preferences are stochastic, and the preferenceprobability relates to the hidden reward values, we present algorithms forPbRL, both with and without a simulator, that are able to identify the bestpolicy up to accuracy $\varepsilon$ with high probability. Our method exploresthe state space by navigating to under-explored states, and solves PbRL using acombination of dueling bandits and policy search. Experiments show the efficacyof our method when it is applied to real-world problems.

Quick Read (beta)

loading the full paper ...