Abstract
We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite-horizon MDP. MVPI enjoys great flexibility in that any policy evaluation method and risk-neutral control method can be dropped in for risk-averse control off the shelf, in both on- and off-policy settings. We propose risk-averse TD3 as an example instantiating MVPI, which outperforms vanilla TD3 and many previous risk-averse control methods in challenging MuJoCo robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first to introduce deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance boost we show in MuJoCo domains. MVPI adopts a per-step reward perspective (Bisi et al., 2019) for risk-averse control, instead of the commonly used total reward perspective.