Provable Reward-Agnostic Preference-Based Reinforcement Learning

Abstract

Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. Specifically, our framework can incorporate linear and low-rank MDPs with efficient sample complexity. Additionally, we investigate reward-agnostic RL with action-based comparison feedback and introduce an efficient querying algorithm tailored to this scenario.
