Abstract
Average-reward Markov decision processes (MDPs) provide a foundationalframework for sequential decision-making under uncertainty. However,average-reward MDPs have remained largely unexplored in reinforcement learning(RL) settings, with the majority of RL-based efforts having been allocated toepisodic and discounted MDPs. In this work, we study a unique structuralproperty of average-reward MDPs and utilize it to introduce Reward-ExtendedDifferential (or RED) reinforcement learning: a novel RL framework that can beused to effectively and efficiently solve various subtasks simultaneously inthe average-reward setting. We introduce a family of RED learning algorithmsfor prediction and control, including proven-convergent algorithms for thetabular case. We then showcase the power of these algorithms by demonstratinghow they can be used to learn a policy that optimizes, for the first time, thewell-known conditional value-at-risk (CVaR) risk measure in a fully-onlinemanner, without the use of an explicit bi-level optimization scheme or anaugmented state-space.