Sample Complexity of Robust Reinforcement Learning with a Generative Model

Abstract

The Robust Markov Decision Process (RMDP) framework focuses on designingcontrol policies that are robust against the parameter uncertainties due to themismatches between the simulator model and real-world settings. An RMDP problemis typically formulated as a max-min problem, where the objective is to findthe policy that maximizes the value function for the worst possible model thatlies in an uncertainty set around a nominal model. The standard robust dynamicprogramming approach requires the knowledge of the nominal model for computingthe optimal robust policy. In this work, we propose a model-based reinforcementlearning (RL) algorithm for learning an $\epsilon$-optimal robust policy whenthe nominal model is unknown. We consider three different forms of uncertaintysets, characterized by the total variation distance, chi-square divergence, andKL divergence. For each of these uncertainty sets, we give a precisecharacterization of the sample complexity of our proposed algorithm. Inaddition to the sample complexity results, we also present a formal analyticalargument on the benefit of using robust policies. Finally, we demonstrate theperformance of our algorithm on two benchmark problems.

Quick Read (beta)

loading the full paper ...