Policy Bifurcation in Safe Reinforcement Learning

Abstract

Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume continuity in policy functions, where policies map states to actions in a smooth, uninterrupted manner. However, our research finds that in some scenarios the feasible policy should be discontinuous or multi-valued, because interpolating between discontinuous local optima inevitably leads to constraint violations. We are the first to identify the generating mechanism of this phenomenon, and we employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy is required to be bifurcated, meaning its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which utilizes a Gaussian mixture distribution as the policy output. The bifurcated behavior can be achieved by selecting the Gaussian component with the highest mixing coefficient. MUPO also integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments with vehicle control tasks show that our algorithm successfully learns the bifurcated policy and achieves satisfactory safety, while a continuous policy suffers from inevitable constraint violations.
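The mode-selection step described in the abstract can be illustrated with a minimal sketch. This is not the MUPO implementation: the two-component mixture, the function names, the gain value, and the obstacle-avoidance setup are all hypothetical and chosen only to show why picking the component with the highest mixing coefficient yields an abrupt action change, while averaging the components does not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_mixture(lateral_offset):
    """Hypothetical two-component mixture head for a 1-D steering action.

    Assumption for illustration only: the vehicle faces an obstacle on its
    centerline, and the mixing weights favor steering left (+1.0) when the
    vehicle is slightly left of center (offset > 0), otherwise right (-1.0).
    """
    gain = 5.0  # illustrative value, not taken from the paper
    logits = [gain * lateral_offset, -gain * lateral_offset]
    weights = softmax(logits)  # mixing coefficients
    means = [1.0, -1.0]        # component means: steer left / steer right
    return weights, means


def select_mode_action(weights, means):
    """Bifurcated behavior: return the mean of the highest-weight component."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return means[best]


def blended_action(weights, means):
    """Continuous alternative: weighted average of the component means."""
    return sum(w * mu for w, mu in zip(weights, means))


if __name__ == "__main__":
    for offset in (0.01, -0.01):
        weights, means = toy_mixture(offset)
        print(offset, select_mode_action(weights, means),
              blended_action(weights, means))
```

In this toy setup, a small change in the state flips the selected action between the two modes, whereas the blended action stays close to zero and would steer almost straight at the obstacle, which is the kind of interpolation failure the abstract refers to.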
