On Corruption-Robustness in Performative Reinforcement Learning

Abstract

In performative Reinforcement Learning (RL), an agent faces a policy-dependent environment: the reward and transition functions depend on the agent's policy. Prior work on performative RL has studied the convergence of repeated retraining approaches to a performatively stable policy. In the finite sample regime, these approaches repeatedly solve for a saddle point of a convex-concave objective, which estimates the Lagrangian of a regularized version of the reinforcement learning problem. In this paper, we aim to extend such repeated retraining approaches, enabling them to operate under corrupted data. More specifically, we consider Huber's $\epsilon$-contamination model, where an $\epsilon$ fraction of data points is corrupted by arbitrary adversarial noise. We propose a repeated retraining approach based on convex-concave optimization under corrupted gradients and a novel problem-specific robust mean estimator for the gradients. We prove that our approach exhibits last-iterate convergence to an approximately stable policy, with the approximation error linear in $\sqrt{\epsilon}$. We experimentally demonstrate the importance of accounting for corruption in performative RL.
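
To make the contamination model concrete, the following is a minimal sketch, not the estimator proposed in the paper. It assumes Gaussian stand-ins for per-sample gradients, a point-mass adversary, and a coordinate-wise median swapped in as a generic robust mean estimator; the helper names `contaminate` and `coordinatewise_median` and all numeric values are hypothetical, chosen only for illustration.

```python
import numpy as np


def contaminate(clean, eps, outlier_value, rng):
    """Huber-style contamination: each row is independently replaced,
    with probability eps, by an adversarial value (a point mass here,
    which is an assumption of this sketch)."""
    corrupted = clean.copy()
    mask = rng.random(len(clean)) < eps
    corrupted[mask] = outlier_value
    return corrupted, mask


def coordinatewise_median(samples):
    """Generic robust location estimate; NOT the paper's estimator."""
    return np.median(samples, axis=0)


rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])  # hypothetical "true gradient"
clean = true_mean + rng.normal(size=(2000, 2))
corrupted, mask = contaminate(clean, eps=0.1, outlier_value=50.0, rng=rng)

# Error of the plain sample mean versus the robust estimate.
naive_error = np.linalg.norm(corrupted.mean(axis=0) - true_mean)
robust_error = np.linalg.norm(coordinatewise_median(corrupted) - true_mean)
print(f"sample mean error: {naive_error:.2f}")
print(f"median error:      {robust_error:.2f}")
```

In this toy setting the sample mean is pulled far toward the adversarial value, while the median moves only slightly. That gap is the reason a gradient-based retraining loop under corruption needs a robust estimator in place of plain averaging.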
