Risk-Averse Offline Reinforcement Learning

Abstract

Training Reinforcement Learning (RL) agents in high-stakes applications may be prohibitive because of the risk associated with exploration. In such settings, the agent can only use data previously collected by safe policies. While previous work considers optimizing the average performance using offline data, we focus on optimizing a risk-averse criterion, namely the Conditional Value-at-Risk (CVaR). In particular, we present the Offline Risk-Averse Actor-Critic (O-RAAC), a model-free RL algorithm that is able to learn risk-averse policies in a fully offline setting. We show that O-RAAC learns policies with higher CVaR than risk-neutral approaches in different robot control tasks. Furthermore, considering risk-averse criteria guarantees distributional robustness of the average performance with respect to particular distribution shifts. We demonstrate empirically that in the presence of natural distribution shifts, O-RAAC learns policies with good average performance.
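
To make the CVaR criterion concrete, here is a minimal sketch of a sample-based CVaR estimate over episode returns. It is not the O-RAAC implementation: the function name `empirical_cvar`, the lower-tail convention (average of the worst `alpha` fraction of returns), and the toy return arrays are all assumptions chosen for illustration.

```python
import numpy as np


def empirical_cvar(returns, alpha=0.1):
    """Hypothetical helper: mean of the worst alpha-fraction of returns.

    Assumes higher returns are better, so the "risky" tail is the lower one.
    """
    sorted_returns = np.sort(np.asarray(returns, dtype=float))
    # Number of samples in the lower tail; always keep at least one.
    k = max(1, int(np.ceil(alpha * len(sorted_returns))))
    return sorted_returns[:k].mean()


# Toy data (made up): policy A is better on average but has a rare large loss.
returns_a = np.array([10.0] * 9 + [-50.0])
# Policy B is slightly worse on average but never fails badly.
returns_b = np.array([3.0] * 10)

mean_a, mean_b = returns_a.mean(), returns_b.mean()
cvar_a = empirical_cvar(returns_a, alpha=0.2)
cvar_b = empirical_cvar(returns_b, alpha=0.2)
```

In this toy setup a risk-neutral criterion prefers policy A (higher mean), while a CVaR criterion at `alpha=0.2` prefers policy B, which is the kind of trade-off a risk-averse objective is meant to capture.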
