Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Abstract

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most existing methods for safe reinforcement learning can handle only transition disturbance or only observation disturbance, since these two kinds of disturbance affect different parts of the agent; moreover, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric, the Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on this analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
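
To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes CVaR is estimated empirically as the mean of the worst alpha-fraction of sampled episode returns (a lower-tail convention; the paper may instead define it over losses with the opposite sign), and that VFR is the difference between the largest and smallest value over a finite set of state values. The function names `empirical_cvar` and `value_function_range` and all numbers are hypothetical.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of returns (lower-tail convention, assumed)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if len(returns) == 0:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)  # ascending, so the worst outcomes come first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the tail
    tail = ordered[:k]
    return sum(tail) / len(tail)


def value_function_range(values):
    """Gap between the best and worst state values over a finite set of states."""
    return max(values) - min(values)


# Illustrative episode returns (made up): two rare catastrophic episodes.
episode_returns = [10.0, 12.0, -50.0, 11.0, 9.0, 13.0, 8.0, -30.0]
mean_return = sum(episode_returns) / len(episode_returns)
cvar_25 = empirical_cvar(episode_returns, alpha=0.25)  # averages -50 and -30

# Illustrative state values (made up).
vfr = value_function_range([1.0, 4.5, -2.0])
```

In this toy data the mean return hides the two catastrophic episodes, while the tail average exposes them. That is the motivation for constraining a tail statistic rather than only the expectation or the single worst case.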
