Efficient Risk-Averse Reinforcement Learning

Abstract

In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We also devise a novel Cross Entropy module for risk sampling, which (1) preserves risk aversion despite the soft risk; (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR (Cross-entropy Soft-Risk optimization algorithm), which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.
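To make "a risk measure that focuses on the worst returns" concrete, here is a minimal Python sketch. It rests on two assumptions the abstract does not state. First, it uses the empirical conditional value at risk (CVaR) as the example risk measure. Second, it reads "soft risk" as a risk level that starts risk-neutral and is gradually tightened toward a target. The names `cvar`, `soft_risk_level`, and `warmup_fraction` are hypothetical and chosen for illustration only. This is not the authors' implementation.

```python
import math


def cvar(returns, alpha):
    """Empirical CVaR at level alpha: mean of the worst ceil(alpha * n) returns.

    alpha = 1.0 recovers the plain mean (risk-neutral);
    a small alpha looks only at the lowest-return episodes.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    k = max(1, math.ceil(alpha * len(returns)))
    worst = sorted(returns)[:k]  # the k lowest returns
    return sum(worst) / k


def soft_risk_level(step, total_steps, target_alpha, warmup_fraction=0.8):
    """Assumed linear schedule: risk level goes from 1.0 down to target_alpha.

    Early in training the optimizer sees all episodes, including
    high-return ones; later it focuses on the worst alpha-fraction.
    """
    warmup = max(1, int(warmup_fraction * total_steps))
    progress = min(1.0, step / warmup)
    return 1.0 - progress * (1.0 - target_alpha)


if __name__ == "__main__":
    episode_returns = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    for step in (0, 25, 50):
        alpha = soft_risk_level(step, 100, target_alpha=0.05, warmup_fraction=0.5)
        print(step, round(alpha, 3), cvar(episode_returns, alpha))
```

With a fixed small `alpha` from the first step, the objective would only ever see the lowest-return episodes. The schedule keeps high-return episodes in the objective early on, which is one plausible way around the local-optimum barrier the abstract describes.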
