Abstract
In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present Dueling Posterior Sampling (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the user's preferences. Because preference feedback is provided on trajectories rather than individual state/action pairs, we develop a Bayesian approach to solving the credit assignment problem, translating user preferences to a posterior distribution over state/action reward models. We prove an asymptotic no-regret rate for DPS with a Bayesian logistic regression credit assignment model; to our knowledge, this is the first regret guarantee for preference-based RL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.