Learn Zero-Constraint-Violation Policy in Model-Free Constrained Reinforcement Learning

Abstract

In the trial-and-error mechanism of reinforcement learning (RL), a notorious contradiction arises when we expect to learn a safe policy: how can a safe policy be learned without sufficient data or a prior model of the dangerous region? Existing methods mostly use a posterior penalty for dangerous actions, which means that the agent is not penalized until it experiences danger. As a result, the agent cannot learn a zero-violation policy even after convergence; otherwise, it would no longer receive any penalty and would lose its knowledge about danger. In this paper, we propose the safe set actor-critic (SSAC) algorithm, which confines the policy update using safety-oriented energy functions, or safety indexes. The safety index is designed to increase rapidly for potentially dangerous actions, which allows us to locate the safe set on the action space, or the control safe set. Therefore, we can identify dangerous actions prior to taking them, and further obtain a zero-constraint-violation policy after convergence. We claim that the energy function can be learned in a model-free manner, similar to learning a value function. By using the energy function transition as the constraint objective, we formulate a constrained RL problem. We prove that our Lagrangian-based solution ensures that the learned policy converges to the constrained optimum under some assumptions. The proposed algorithm is evaluated on both complex simulation environments and a hardware-in-loop (HIL) experiment with a real controller from an autonomous vehicle. Experimental results suggest that the converged policy in all environments achieves zero constraint violation and performance comparable to model-based baselines.
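To make the idea of an energy-function transition constraint concrete, the following is a minimal Python sketch, not the SSAC implementation. Everything in it is an assumption made for illustration: the distance-based `safety_index`, the constraint form `phi(s') <= max(phi(s) - eta, 0)`, the parameters `d_min`, `eta`, `dt`, and `lr`, and the known one-dimensional dynamics (the abstract states that the paper learns the energy function model-free, so the explicit model here is only a stand-in). It shows how such a constraint could flag dangerous actions before they are taken, and how a Lagrange multiplier might be adjusted by projected dual ascent.

```python
def safety_index(distance, d_min=1.0):
    """Hypothetical safety index: positive once the agent is closer than d_min."""
    return d_min - distance


def transition_violation(phi_now, phi_next, eta=0.1):
    """Assumed constraint: phi_next <= max(phi_now - eta, 0).

    A positive return value means the constraint is violated.
    """
    return phi_next - max(phi_now - eta, 0.0)


def safe_actions(distance, candidates, dt=0.1, d_min=1.0, eta=0.1):
    """Keep candidate actions whose one-step transition satisfies the constraint.

    An action is a velocity toward the obstacle. The toy dynamics
    (next_distance = distance - action * dt) are assumed known here
    purely for illustration.
    """
    phi_now = safety_index(distance, d_min)
    kept = []
    for action in candidates:
        next_distance = distance - action * dt
        phi_next = safety_index(next_distance, d_min)
        if transition_violation(phi_now, phi_next, eta) <= 0.0:
            kept.append(action)
    return kept


def dual_ascent_step(multiplier, violation, lr=0.5):
    """Projected gradient ascent on a Lagrange multiplier (kept non-negative)."""
    return max(0.0, multiplier + lr * violation)


if __name__ == "__main__":
    # Agent is 1.05 away from the obstacle; the fastest approach is filtered out.
    print(safe_actions(1.05, [0.0, 0.4, 1.0]))
    # The multiplier grows while the constraint is violated, shrinks otherwise.
    print(dual_ascent_step(1.0, 0.2), dual_ascent_step(0.1, -1.0))
```

In this toy setting the multiplier grows only while the constraint is violated and is clipped at zero otherwise, which is the usual behavior of a Lagrangian-based constrained update; the update rule actually used by SSAC is specified in the full paper, not here.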
