Safe Reinforcement Learning with Learned Non-Markovian Safety Constraints

  • 2024-05-05 18:27:22
  • Siow Meng Low, Akshat Kumar
In safe Reinforcement Learning (RL), safety cost is typically defined as a function of the immediate state and action. In practice, safety constraints can often be non-Markovian due to the insufficient fidelity of the state representation, and the safety cost may not be known. We therefore address a general setting where safety labels (e.g., safe or unsafe) are associated with state-action trajectories. Our key contributions are as follows. First, we design a safety model that specifically performs credit assignment to assess the contributions of partial state-action trajectories to safety. This safety model is trained using a labeled safety dataset. Second, using an RL-as-inference strategy, we derive an effective algorithm for optimizing a safe policy using the learned safety model. Finally, we devise a method to dynamically adapt the tradeoff coefficient between reward maximization and safety compliance. We rewrite the constrained optimization problem into its dual problem and derive a gradient-based method to dynamically adjust the tradeoff coefficient during training. Our empirical results demonstrate that this approach is highly scalable and can satisfy sophisticated non-Markovian safety constraints.
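
The dual-problem idea in the last contribution can be illustrated with a minimal sketch. This is not the authors' algorithm: it is generic projected gradient descent on a Lagrange multiplier, applied to a hypothetical one-parameter toy problem. The names `best_response`, `update_tradeoff`, and `run_dual_ascent`, the quadratic reward, and the safety probability 1 - p are all assumptions made for illustration. The toy problem maximizes reward -(p - 1)^2 subject to a safety probability of at least 0.8, with Lagrangian L(p, lam) = reward(p) + lam * (safe_prob(p) - target).

```python
def best_response(lam):
    """Toy 'policy optimization' step (assumed closed form).

    Maximizes -(p - 1)**2 + lam * ((1 - p) - target) over p in [0, 1].
    Setting the derivative to zero gives p = 1 - lam / 2.
    """
    return min(1.0, max(0.0, 1.0 - lam / 2.0))


def update_tradeoff(lam, safe_prob, target, lr):
    """One projected gradient step on the dual variable.

    The gradient of the Lagrangian w.r.t. lam is (safe_prob - target),
    so minimizing over lam raises it when the constraint is violated
    and lowers it when there is slack. Projection keeps lam >= 0.
    """
    return max(0.0, lam + lr * (target - safe_prob))


def run_dual_ascent(target=0.8, lr=0.5, steps=200):
    """Alternate the toy policy step and the tradeoff-coefficient step."""
    lam = 0.0
    for _ in range(steps):
        p = best_response(lam)          # inner maximization over the policy
        safe_prob = 1.0 - p             # toy stand-in for a learned safety model
        lam = update_tradeoff(lam, safe_prob, target, lr)
    return lam, best_response(lam)


if __name__ == "__main__":
    lam, p = run_dual_ascent()
    print(f"tradeoff coefficient: {lam:.4f}, policy parameter: {p:.4f}")
```

In this toy setting the coefficient grows while the safety constraint is violated and settles once the constraint is met with equality. In an actual RL setting, the closed-form `best_response` would be replaced by policy-gradient updates and `safe_prob` by an estimate from the learned safety model.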


