Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

Abstract

We study off-dynamics Reinforcement Learning (RL), where the policy trainingand deployment environments are different. To deal with this environmentalperturbation, we focus on learning policies robust to uncertainties intransition dynamics under the framework of distributionally robust Markovdecision processes (DRMDPs), where the nominal and perturbed dynamics arelinear Markov Decision Processes. We propose a novel algorithm We-DRIVE-U thatenjoys an average suboptimality $\widetilde{\mathcal{O}}\big({d H \cdot \min\{1/{\rho}, H\}/\sqrt{K} }\big)$, where $K$ is the number of episodes, $H$ isthe horizon length, $d$ is the feature dimension and $\rho$ is the uncertaintylevel. This result improves the state-of-the-art by$\mathcal{O}(dH/\min\{1/\rho,H\})$. We also construct a novel hard instance andderive the first information-theoretic lower bound in this setting, whichindicates our algorithm is near-optimal up to $\mathcal{O}(\sqrt{H})$ for anyuncertainty level $\rho\in(0,1]$. Our algorithm also enjoys a 'rare-switching'design, and thus only requires $\mathcal{O}(dH\log(1+H^2K))$ policy switchesand $\mathcal{O}(d^2H\log(1+H^2K))$ calls for oracle to solve dual optimizationproblems, which significantly improves the computational efficiency of existingalgorithms for DRMDPs, whose policy switch and oracle complexities are both$\mathcal{O}(K)$.

Quick Read (beta)

loading the full paper ...