Policy Teaching via Environment Poisoning: Training-time Adversarial Attacks against Reinforcement Learning

Abstract

We study a security threat to reinforcement learning where an attacker poisons the learning environment to force the agent into executing a target policy chosen by the attacker. As a victim, we consider RL agents whose objective is to find a policy that maximizes average reward in undiscounted infinite-horizon problem settings. The attacker can manipulate the rewards or the transition dynamics in the learning environment at training-time and is interested in doing so in a stealthy manner. We propose an optimization framework for finding an \emph{optimal stealthy attack} for different measures of attack cost. We provide sufficient technical conditions under which the attack is feasible and provide lower/upper bounds on the attack cost. We instantiate our attacks in two settings: (i) an \emph{offline} setting where the agent is doing planning in the poisoned environment, and (ii) an \emph{online} setting where the agent is learning a policy using a regret-minimization framework with poisoned feedback. Our results show that the attacker can easily succeed in teaching any target policy to the victim under mild conditions and highlight a significant security threat to reinforcement learning agents in practice.
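
To make the offline reward-poisoning setting concrete, here is a minimal Python sketch on a hypothetical two-state, two-action MDP. It is not the optimization framework proposed in the paper: it does not minimize any attack cost, and it only poisons rewards, not transitions. Instead it applies one uniform penalty to every non-target action and increases that penalty until the target policy beats every other deterministic policy in average reward by a margin `eps`. All numbers and names (`LEAVE`, `REWARD`, `poison_rewards`, `eps`, `step`) are assumptions made up for illustration.

```python
from itertools import product

# Hypothetical MDP. LEAVE[s][a] is the probability of moving to the other
# state when taking action a in state s; REWARD[s][a] is the true reward.
# Every entry of LEAVE lies strictly in (0, 1), so each policy induces an
# ergodic two-state chain with a unique stationary distribution.
LEAVE = [[0.2, 0.8], [0.6, 0.3]]
REWARD = [[1.0, 2.0], [0.0, 0.5]]
POLICIES = list(product((0, 1), repeat=2))  # all deterministic policies


def average_reward(policy, reward, leave=LEAVE):
    """Long-run average reward of a deterministic policy (one action per state)."""
    p = leave[0][policy[0]]  # P(state 0 -> state 1) under the policy
    q = leave[1][policy[1]]  # P(state 1 -> state 0) under the policy
    mu0, mu1 = q / (p + q), p / (p + q)  # stationary distribution of the chain
    return mu0 * reward[0][policy[0]] + mu1 * reward[1][policy[1]]


def best_policy(reward, leave=LEAVE):
    """What a planning victim would compute: the policy with maximal average reward."""
    return max(POLICIES, key=lambda pi: average_reward(pi, reward, leave))


def poison_rewards(reward, target, eps=0.1, step=0.01, leave=LEAVE):
    """Heuristic attack (an assumption, not the paper's method): subtract a
    uniform penalty from all non-target actions, growing it until the target
    policy is better than every other policy by at least eps."""
    k = 0
    while True:
        delta = k * step
        poisoned = [
            [r if a == target[s] else r - delta for a, r in enumerate(row)]
            for s, row in enumerate(reward)
        ]
        target_gain = average_reward(target, poisoned, leave)
        if all(
            target_gain >= average_reward(pi, poisoned, leave) + eps
            for pi in POLICIES
            if pi != target
        ):
            return poisoned, delta
        k += 1


target_policy = (0, 0)  # the policy the attacker wants the victim to adopt
poisoned_reward, penalty = poison_rewards(REWARD, target_policy)
```

The loop terminates because the target policy's average reward is unchanged by the penalty, while every other policy takes a penalized action in at least one state that it visits with positive stationary probability. A cost-aware attack would replace the uniform penalty with an optimization over the per-entry reward changes.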
