Abstract
Risk-averse reinforcement learning finds application in various high-stakes fields. Unlike classical reinforcement learning, which aims to maximize expected returns, risk-averse agents choose policies that minimize risk, occasionally sacrificing expected value. These preferences can be framed through utility theory. We focus on the specific case of the exponential utility function, where we can derive the Bellman equations and employ various reinforcement learning algorithms with few modifications. However, these methods suffer from numerical instability because exponentials must be computed throughout the process. To address this, we introduce a numerically stable and mathematically sound loss function based on the Itakura-Saito divergence for learning state-value and action-value functions. We evaluate our proposed loss function against established alternatives, both theoretically and empirically. In the experimental section, we explore multiple financial scenarios, some with known analytical solutions, and show that our loss function outperforms the alternatives.
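
The instability mentioned above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not necessarily the exact loss used in the paper. It assumes the Itakura-Saito divergence d_IS(x, y) = x/y - log(x/y) - 1 is applied to the exponentiated quantities x = exp(alpha * target) and y = exp(alpha * pred). The names `alpha`, `pred`, and `target`, and both function names, are hypothetical. Because the divergence depends on x and y only through their ratio, it can be computed from the difference alpha * (target - pred) without forming either exponential.

```python
import math


def itakura_saito(x: float, y: float) -> float:
    """Itakura-Saito divergence between two positive scalars."""
    r = x / y
    return r - math.log(r) - 1.0


def is_loss_from_difference(pred: float, target: float, alpha: float) -> float:
    """Same divergence for x = exp(alpha * target), y = exp(alpha * pred).

    Assumption: `alpha` is a nonzero risk parameter; all names are illustrative.
    Only the difference d = alpha * (target - pred) is exponentiated, so large
    values of `pred` and `target` do not overflow on their own.
    """
    d = alpha * (target - pred)
    # exp(d) - d - 1, written with expm1 for accuracy when d is near zero
    return math.expm1(d) - d


# Moderate values: both routes agree.
naive = itakura_saito(math.exp(0.5 * 2.0), math.exp(0.5 * 1.0))
stable = is_loss_from_difference(pred=1.0, target=2.0, alpha=0.5)

# Large values: the naive route overflows, the difference form does not.
try:
    itakura_saito(math.exp(1000.5), math.exp(1000.0))
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
stable_large = is_loss_from_difference(pred=1000.0, target=1000.5, alpha=1.0)
```

In this form the loss is zero when `pred` equals `target` and positive otherwise. The exponential is still applied to the scaled residual, so a very large residual can still overflow.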