Query-Policy Misalignment in Preference-Based Reinforcement Learning

Abstract

Preference-based reinforcement learning (PbRL) provides a natural way to align RL agents' behavior with human-desired outcomes, but is often constrained by costly human feedback. To improve feedback efficiency, most existing PbRL methods focus on selecting queries that maximally improve the overall quality of the reward model. Counter-intuitively, we find that this does not necessarily lead to improved performance. To explain this, we identify a long-neglected issue in the query selection schemes of existing PbRL studies: Query-Policy Misalignment. We show that the seemingly informative queries selected to improve the overall quality of the reward model may not align with RL agents' interests, thus offering little help for policy learning and eventually resulting in poor feedback efficiency. We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay, which together enforce bidirectional query-policy alignment. Simple yet elegant, our method can be easily incorporated into existing approaches by changing only a few lines of code. We show in comprehensive experiments that our method achieves substantial gains in both human feedback and RL sample efficiency, demonstrating the importance of addressing query-policy misalignment in PbRL tasks.
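
The two ingredients named in the abstract, near on-policy query and hybrid experience replay, can be illustrated with a toy buffer. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `HybridReplayBuffer`, the parameters `recent_size` and `recent_fraction`, and the choice to define "near on-policy" as "drawn from the most recently added transitions" are all hypothetical and chosen only for illustration. Episode boundaries are ignored for brevity.

```python
import random
from collections import deque


class HybridReplayBuffer:
    """Toy buffer (hypothetical): transitions are kept in arrival order."""

    def __init__(self, capacity=10_000, recent_size=1_000):
        self.data = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.recent_size = recent_size      # assumed window treated as "near on-policy"

    def add(self, transition):
        self.data.append(transition)

    def recent(self):
        """Return the most recently added transitions (at most recent_size)."""
        return list(self.data)[-self.recent_size:]

    def sample_query_pair(self, segment_len, rng=random):
        """Near on-policy query: both segments come from recent data only,
        so the preference label concerns behavior close to the current policy."""
        recent = self.recent()
        if len(recent) < segment_len:
            raise ValueError("not enough recent transitions for a segment")

        def segment():
            start = rng.randrange(len(recent) - segment_len + 1)
            return recent[start:start + segment_len]

        return segment(), segment()

    def sample_batch(self, batch_size, recent_fraction=0.5, rng=random):
        """Hybrid replay: part of the batch comes from recent data,
        the rest is drawn uniformly from the whole buffer."""
        items = list(self.data)
        recent = self.recent()
        n_recent = int(batch_size * recent_fraction)
        batch = [rng.choice(recent) for _ in range(n_recent)]
        batch += [rng.choice(items) for _ in range(batch_size - n_recent)]
        return batch
```

In this sketch, queries are restricted to the recent window so that feedback is spent where the current policy actually operates, while the mixed batch lets policy updates emphasize the same region without discarding older experience. Swapping a uniform sampler for these two methods is the kind of few-line change the abstract alludes to, although the exact mechanism in the paper may differ.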
