Preference-Guided Reinforcement Learning for Efficient Exploration

  • 2024-07-09 03:11:12
  • Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen, Xuyang Chen, Lin Zhao
In this paper, we investigate preference-based reinforcement learning (PbRL), which allows reinforcement learning (RL) agents to learn from human feedback. This is particularly valuable when defining a fine-grained reward function is not feasible. However, this approach is inefficient and impractical for promoting deep exploration in hard-exploration tasks with long horizons and sparse rewards. To tackle this issue, we introduce LOPE: Learning Online with trajectory Preference guidancE, an end-to-end preference-guided RL framework that enhances exploration efficiency in hard-exploration tasks. Our intuition is that LOPE directly adjusts the focus of online exploration by treating human feedback as guidance, which avoids learning a separate reward model from preferences. Specifically, LOPE includes a two-step sequential policy optimization process consisting of a trust-region-based policy improvement step and a preference guidance step. We reformulate preference guidance as a novel trajectory-wise state marginal matching problem that minimizes the maximum mean discrepancy distance between the preferred trajectories and the learned policy. Furthermore, we provide a theoretical analysis that characterizes the performance improvement bound and evaluates LOPE's effectiveness. When assessed in various challenging hard-exploration environments, LOPE outperforms several state-of-the-art methods in convergence rate and overall performance. The code used in this study is available at \url{}.
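As a rough illustration of the maximum mean discrepancy (MMD) that the preference guidance step minimizes, the sketch below computes a biased estimate of the squared MMD between two sets of trajectories. It is not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the function names, and the representation of each trajectory as a fixed-length feature vector are all assumptions made for this example, since the abstract does not specify them.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two equal-length feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples xs and ys."""

    def mean_kernel(us, vs):
        # Average kernel value over all pairs drawn from the two samples.
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical data: each trajectory is summarized as a 2-D feature vector.
preferred = [[0.9, 1.0], [1.0, 1.1], [1.1, 0.9]]
policy_near = [[1.0, 1.0], [0.9, 1.1], [1.1, 1.0]]
policy_far = [[-1.0, -1.0], [-0.9, -1.2], [-1.1, -0.8]]

print("near:", mmd_squared(preferred, policy_near))
print("far: ", mmd_squared(preferred, policy_far))
```

Under these assumptions, a policy whose trajectories resemble the preferred ones yields a smaller value than one whose trajectories lie far away, which is the quantity a guidance step of this kind would drive down.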


