Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret

Abstract

We study risk-sensitive reinforcement learning in episodic Markov decisionprocesses with unknown transition kernels, where the goal is to optimize thetotal reward under the risk measure of exponential utility. We propose twoprovably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI)and Risk-Sensitive Q-learning (RSQ). These algorithms implement a form ofrisk-sensitive optimism in the face of uncertainty, which adapts to bothrisk-seeking and risk-averse modes of exploration. We prove that RSVI attainsan $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^{3} S^{2}AT} \big)$regret, while RSQ attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot\sqrt{H^{4} SAT} \big)$ regret, where $\lambda(u) = (e^{3u}-1)/u$ for $u>0$. Inthe above, $\beta$ is the risk parameter of the exponential utility function,$S$ the number of states, $A$ the number of actions, $T$ the total number oftimesteps, and $H$ the episode length. On the flip side, we establish a regretlower bound showing that the exponential dependence on $|\beta|$ and $H$ isunavoidable for any algorithm with an $\tilde{O}(\sqrt{T})$ regret (even whenthe risk objective is on the same scale as the original reward), thuscertifying the near-optimality of the proposed algorithms. Our resultsdemonstrate that incorporating risk awareness into reinforcement learningnecessitates an exponential cost in $|\beta|$ and $H$, which quantifies thefundamental tradeoff between risk sensitivity (related to aleatoricuncertainty) and sample efficiency (related to epistemic uncertainty). To thebest of our knowledge, this is the first regret analysis of risk-sensitivereinforcement learning with the exponential utility.

Quick Read (beta)

loading the full paper ...