Conservative Offline Distributional Reinforcement Learning

Abstract

Many reinforcement learning (RL) problems in practice are offline, learning purely from observational data. A key challenge is how to ensure the learned policy is safe, which requires quantifying the risk associated with different actions. In the online setting, distributional RL algorithms do so by learning the distribution over returns (i.e., cumulative rewards) instead of the expected return; beyond quantifying risk, they have also been shown to learn better representations for planning. We propose Conservative Offline Distributional Actor Critic (CODAC), an offline RL algorithm suitable for both risk-neutral and risk-averse domains. CODAC adapts distributional RL to the offline setting by penalizing the predicted quantiles of the return for out-of-distribution actions. We prove that CODAC learns a conservative return distribution -- in particular, for finite MDPs, CODAC converges to a uniform lower bound on the quantiles of the return distribution; our proof relies on a novel analysis of the distributional Bellman operator. In our experiments, on two challenging robot navigation tasks, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents. Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of both expected and risk-sensitive performance.
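
To make "penalizing the predicted quantiles of the return for out-of-distribution actions" more concrete, here is a minimal sketch of one way such a penalty could be computed for a single state. It is an illustration under stated assumptions, not the paper's implementation: the function name `conservative_quantile_penalty`, the finite action set, and the log-sum-exp form (in the style of conservative Q-learning penalties) are all assumptions introduced here.

```python
import numpy as np


def conservative_quantile_penalty(quantiles, data_action):
    """Illustrative (assumed) penalty on predicted return quantiles.

    quantiles:   array of shape [num_actions, num_quantiles]; entry (a, i) is
                 the predicted i-th quantile of the return for action a.
    data_action: index of the action actually observed in the offline data.

    For each quantile level, take a soft maximum (log-sum-exp) over all
    actions and subtract the value of the dataset action. Minimizing this
    pushes quantiles down for actions the data does not support, relative
    to the action it does support.
    """
    q = np.asarray(quantiles, dtype=float)
    # Numerically stable log-sum-exp over the action axis.
    m = q.max(axis=0)
    lse = m + np.log(np.exp(q - m).sum(axis=0))
    # Average the gap over quantile levels.
    return float(np.mean(lse - q[data_action]))


# Hypothetical numbers: action 0 is in the dataset, action 1 is not.
predicted = np.array([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])
penalty = conservative_quantile_penalty(predicted, data_action=0)
```

Under these assumptions the penalty is never negative, and it grows as the quantiles predicted for the unobserved action rise above those of the dataset action. In training, a term like this would be added to the usual distributional (quantile regression) loss with a weighting coefficient.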
