Abstract
Preference-based reinforcement learning (PBRL) in the offline setting has achieved great success in industrial applications such as chatbots. A two-step learning framework, in which a reward modeling step is followed by a reinforcement learning step, has been widely adopted for this problem. However, such a method faces two challenges: the risk of reward hacking and the complexity of reinforcement learning. Our insight is that both challenges come from the state-actions not supported in the dataset. Such state-actions are unreliable and increase the complexity of the reinforcement learning problem at the second step. Based on this insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to limit the reinforcement learning agent to optimizing over a constrained action space that excludes the out-of-distribution state-actions. We empirically verify that our method has high learning efficiency on various datasets in robotic control environments.