Reward Delay Attacks on Deep Reinforcement Learning

Abstract

Most reinforcement learning algorithms implicitly assume strong synchrony. We present novel attacks targeting Q-learning that exploit a vulnerability entailed by this assumption by delaying the reward signal for a limited time period. We consider two types of attack goals: targeted attacks, which aim to cause a target policy to be learned, and untargeted attacks, which simply aim to induce a policy with a low reward. We evaluate the efficacy of the proposed attacks through a series of experiments. Our first observation is that reward-delay attacks are extremely effective when the goal is simply to minimize reward. Indeed, we find that even naive baseline reward-delay attacks are highly successful in minimizing the reward. Targeted attacks, on the other hand, are more challenging, although we nevertheless demonstrate that the proposed approaches remain highly effective at achieving the attacker's targets. In addition, we introduce a second threat model that captures a minimal mitigation that ensures that rewards cannot be used out of sequence. We find that this mitigation remains insufficient to ensure robustness to attacks that delay rewards but preserve their order.
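
To make the threat model concrete, the following is a minimal sketch of an order-preserving reward delay applied to tabular Q-learning. It is an illustration under stated assumptions, not the attack proposed in the paper: the delay here is a fixed, uniform shift rather than one chosen adversarially, the learner is tabular rather than deep, and the names `delay_rewards`, `q_update`, and the toy trajectory are all hypothetical.

```python
from collections import deque


def delay_rewards(rewards, delay, fill=0.0):
    """Return the reward stream a learner sees if every reward arrives `delay` steps late.

    Order is preserved (FIFO). Rewards still in flight when the stream ends
    are never delivered. `fill` is what the learner sees before the first
    delayed reward arrives (an assumption of this sketch).
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    in_flight = deque([fill] * delay)
    observed = []
    for r in rewards:
        in_flight.append(r)
        observed.append(in_flight.popleft())
    return observed


def q_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning update on a dict-of-lists Q-table."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


# Hypothetical toy trajectory: (state, action, true reward, next state).
trajectory = [(0, 1, 0.0, 1), (1, 1, 1.0, 2), (2, 0, 0.0, 0)]
true_rewards = [r for _, _, r, _ in trajectory]
seen_rewards = delay_rewards(true_rewards, delay=1)

q_clean = {s: [0.0, 0.0] for s in range(3)}
q_attacked = {s: [0.0, 0.0] for s in range(3)}
for (s, a, r, s_next), r_seen in zip(trajectory, seen_rewards):
    q_update(q_clean, s, a, r, s_next)          # learner with synchronous rewards
    q_update(q_attacked, s, a, r_seen, s_next)  # learner with delayed rewards
```

In this toy run the clean learner credits the rewarded pair (state 1, action 1), while the learner that sees delayed rewards credits (state 2, action 0) instead, even though no reward was reordered. This is the kind of misattribution that an order-preserving delay can still cause.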
