Corruption Robust Exploration in Episodic Reinforcement Learning

Abstract

We initiate the study of multi-stage episodic reinforcement learning underadversarial manipulations in both the rewards and the transition probabilitiesof the underlying system. Existing efficient algorithms heavily rely on the"optimism under uncertainty" principle which dictates their behavior and doesnot allow flexibility to perform corruption-robust exploration. We address thisby (i) departing from the optimistic behavior, and (ii) creating a generalframework that incorporates the principle of action-elimination. (Thisprinciple has been essential for corruption-robust exploration in multi-armedbandits, a degenerate special case of episodic reinforcement learning.) Despiteconstructing a lower bound for a straightforward implementation ofaction-elimination, we provide a clean and modular way to transfer it toepisodic reinforcement learning. Our algorithm enjoys near-optimal guaranteesin the absence of adversarial manipulations, has performance that degradesgracefully as the amount of corruption increases, and does not need to knowthis amount. Our results shed new light on the broader question of robustexploration, and suggest a way to address a rather daunting mismatch betweenoptimistic algorithms and algorithms with higher flexibility. To demonstratethe applicability of our framework, we provide a second instantiation thereof,showing how it can provide efficient guarantees for the stochastic setting,despite doing almost uniform exploration across plausibly optimal actions.

Quick Read (beta)

loading the full paper ...