Policy Gradient Method For Robust Reinforcement Learning

Abstract

This paper develops the first policy gradient method with a global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch. Robust reinforcement learning aims to learn a policy that is robust to model mismatch between the simulator and the real environment. We first develop the robust policy (sub-)gradient, which is applicable to any differentiable parametric policy class. We show that the proposed robust policy gradient method converges to the global optimum asymptotically under direct policy parameterization. We further develop a smoothed robust policy gradient method and show that, to achieve an $\epsilon$-global optimum, the complexity is $\mathcal{O}(\epsilon^{-3})$. We then extend our methodology to the general model-free setting and design the robust actor-critic method with a differentiable parametric policy class and value function. We further characterize its asymptotic convergence and sample complexity under the tabular setting. Finally, we provide simulation results to demonstrate the robustness of our methods.
