Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning

Abstract

We study the problem of safe offline reinforcement learning (RL), where the goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment. This problem is more appealing for real-world RL applications, in which data collection is costly or dangerous. Enforcing constraint satisfaction is non-trivial, especially in offline settings, as there is a potentially large discrepancy between the policy distribution and the data distribution, causing errors in estimating the value of safety constraints. We show that naïve approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions. We thus develop a simple yet effective algorithm, Constraints Penalized Q-Learning (CPQ), to solve the problem. Our method admits the use of data generated by mixed behavior policies. We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines.
