Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds

Abstract

We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes: a model-free one and a model-based one. We prove that both attain a regret upper bound of $\tilde{\mathcal{O}}(\frac{\exp(|\beta|H)-1}{|\beta|H}H\sqrt{HS^2AT})$, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon, and $T$ the total number of time steps. This matches the bound of RSVI2 proposed in \cite{fei2021exponential}, with a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, which bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the case $\beta>0$, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
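
For orientation, the following is a short sketch of why the bounds reduce to the risk-neutral ones as $\beta \to 0$. It assumes the standard definition of the entropic risk measure, which the abstract does not spell out; the symbols $U_\beta$ and $Z$ are notation introduced here for illustration only.

$$
U_\beta(Z) = \frac{1}{\beta}\log \mathbb{E}\left[e^{\beta Z}\right] \quad (\beta \neq 0), \qquad \lim_{\beta \to 0} U_\beta(Z) = \mathbb{E}[Z].
$$

Since $(e^{x}-1)/x \to 1$ as $x \to 0$, the risk-dependent factors in the two bounds satisfy

$$
\lim_{\beta \to 0} \frac{\exp(|\beta|H)-1}{|\beta|H} = 1, \qquad \lim_{\beta \to 0^{+}} \frac{\exp(\beta H/6)-1}{\beta H} = \frac{1}{6}\lim_{\beta \to 0^{+}} \frac{\exp(\beta H/6)-1}{\beta H/6} = \frac{1}{6}.
$$

Under this reading, the upper bound becomes $\tilde{\mathcal{O}}(H\sqrt{HS^2AT})$ and the lower bound becomes $\Omega(H\sqrt{SAT})$ up to the constant $1/6$, consistent with the risk-neutral limit stated in the abstract.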
