Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Abstract

Motivated by practical applications where stable long-term performance iscritical-such as robotics, operations research, and healthcare-we study theproblem of distributionally robust (DR) average-reward reinforcement learning.We propose two algorithms that achieve near-optimal sample complexity. Thefirst reduces the problem to a DR discounted Markov decision process (MDP),while the second, Anchored DR Average-Reward MDP, introduces an anchoring stateto stabilize the controlled transition kernels within the uncertainty set.Assuming the nominal MDP is uniformly ergodic, we prove that both algorithmsattain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}|t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy aswell as the robust average reward under KL and $f_k$-divergence-baseduncertainty sets, provided the uncertainty radius is sufficiently small. Here,$\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denotethe sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixingtime of the nominal MDP. This represents the first finite-sample convergenceguarantee for DR average-reward reinforcement learning. We further validate theconvergence rates of our algorithms through numerical experiments.

Quick Read (beta)

loading the full paper ...