Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints

Abstract

In safe Reinforcement Learning (RL), the safety cost is typically defined as a function of the immediate state and action. In practice, safety constraints can often be non-Markovian due to the insufficient fidelity of the state representation, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are as follows. First, we design a safety model that specifically performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance. We rewrite the constrained optimization problem into its dual problem and derive a gradient-based method to dynamically adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints.
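
The idea of adjusting a tradeoff coefficient by gradient steps on a dual problem can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a generic primal-dual update on a toy one-dimensional problem, chosen only to show the mechanics. The objective -(x - 2)^2, the cost c(x) = x, the cost limit of 1, and the names update_tradeoff and solve_toy_problem are all assumptions made for illustration.

```python
def update_tradeoff(lam, cost, cost_limit, lr):
    """One projected gradient-ascent step on the dual variable (tradeoff coefficient).

    The coefficient grows while the constraint is violated (cost > cost_limit),
    shrinks otherwise, and is clipped so it never becomes negative.
    """
    return max(0.0, lam + lr * (cost - cost_limit))


def solve_toy_problem(steps=2000, lr_primal=0.05, lr_dual=0.05):
    """Toy problem (assumed): maximize -(x - 2)^2 subject to x <= 1.

    Lagrangian: L(x, lam) = -(x - 2)^2 - lam * (x - 1).
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: gradient ascent on the Lagrangian with respect to x.
        grad_x = -2.0 * (x - 2.0) - lam
        x += lr_primal * grad_x
        # Dual step: adapt the tradeoff coefficient from the observed cost.
        lam = update_tradeoff(lam, cost=x, cost_limit=1.0, lr=lr_dual)
    return x, lam


if __name__ == "__main__":
    x, lam = solve_toy_problem()
    print(f"x = {x:.4f}, lam = {lam:.4f}")
```

In this toy setting the unconstrained optimum x = 2 violates the constraint, so the coefficient rises until the iterate settles at the constrained optimum x = 1, where the stationarity condition gives a coefficient of 2. In an RL setting the scalar x would be replaced by policy parameters and the cost by an estimate from a safety model, which this sketch does not attempt to reproduce.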
