Abstract
Reinforcement learning (RL) is an actively growing field that is seeing increased usage in real-world, safety-critical applications -- making it paramount to ensure the robustness of RL algorithms against adversarial attacks. In this work we explore a particularly stealthy form of training-time attacks against RL -- backdoor poisoning. Here the adversary intercepts the training of an RL agent with the goal of reliably inducing a particular action when the agent observes a pre-determined trigger at inference time. We uncover theoretical limitations of prior work by proving their inability to generalize across domains and MDPs. Motivated by this, we formulate a novel poisoning attack framework which interlinks the adversary's objectives with those of finding an optimal policy -- guaranteeing attack success in the limit. Using insights from our theoretical analysis we develop ``SleeperNets'' as a universal backdoor attack which exploits a newly proposed threat model and leverages dynamic reward poisoning techniques. We evaluate our attack in 6 environments spanning multiple domains and demonstrate significant improvements in attack success over existing methods, while preserving benign episodic return.