Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning

Abstract

Deep reinforcement learning models are vulnerable to adversarial attacks that can decrease a victim's cumulative expected reward by manipulating the victim's observations. Despite the efficiency of previous optimization-based methods for generating adversarial noise in supervised learning, such methods might not be able to achieve the lowest cumulative reward since they do not explore the environmental dynamics in general. In this paper, we provide a framework to better understand the existing methods by reformulating the problem of adversarial attacks on reinforcement learning in the function space. Our reformulation generates an optimal adversary in the function space of the targeted attacks, repelling them via a generic two-stage framework. In the first stage, we train a deceptive policy by hacking the environment, and discover a set of trajectories routing to the lowest reward or the worst-case performance. Next, the adversary misleads the victim to imitate the deceptive policy by perturbing the observations. Compared to existing approaches, we theoretically show that our adversary is stronger under an appropriate noise level. Extensive experiments demonstrate our method's superiority in terms of efficiency and effectiveness, achieving the state-of-the-art performance in both Atari and MuJoCo environments.
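The two-stage idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a hypothetical 5-state chain environment, replaces the training of the deceptive policy with value iteration on the negated reward, and assumes a linear victim, for which a sign step is the exact maximizer of the targeted logit gap under an L-infinity budget. All names (`step`, `deceptive_policy`, `perturb`, `rollout`, `W`) are made up for illustration.

```python
# Toy illustration only: a 5-state chain, a linear victim, an L-infinity noise budget.
# Every name below is hypothetical and not taken from the paper.

N = 5             # states 0..4; states 0 and 4 are terminal
START = 2
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics; returns (next_state, reward, done)."""
    nxt = state + (1 if action == 1 else -1)
    if nxt == N - 1:
        return nxt, 1.0, True    # right end: best outcome for the victim
    if nxt == 0:
        return nxt, -1.0, True   # left end: worst outcome for the victim
    return nxt, 0.0, False


def deceptive_policy(gamma=0.9, sweeps=50):
    """Stage 1 (assumed stand-in): value iteration on the NEGATED reward,
    so the greedy policy drives the victim toward its lowest return."""
    values = [0.0] * N

    def q(s, a):
        nxt, r, done = step(s, a)
        return -r + (0.0 if done else gamma * values[nxt])

    for _ in range(sweeps):
        for s in range(1, N - 1):
            values[s] = max(q(s, a) for a in ACTIONS)
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in range(1, N - 1)}


# Assumed victim: logits[a] = sum_i W[a][i] * obs[i].
# On clean one-hot observations it always moves right.
W = [[0.0] * N, [1.0] * N]


def victim_action(obs):
    logits = [sum(w * x for w, x in zip(W[a], obs)) for a in ACTIONS]
    return max(ACTIONS, key=lambda a: logits[a])


def perturb(obs, target, eps):
    """Stage 2: move each coordinate by +/- eps in the direction that raises
    the target action's logit relative to the other action's logit."""
    other = 1 - target

    def sign(v):
        return (v > 0) - (v < 0)

    return [x + eps * sign(W[target][i] - W[other][i]) for i, x in enumerate(obs)]


def rollout(eps, policy, max_steps=20):
    """Run the victim on perturbed observations and return its total reward."""
    state, total = START, 0.0
    for _ in range(max_steps):
        obs = [1.0 if i == state else 0.0 for i in range(N)]
        action = victim_action(perturb(obs, policy[state], eps))
        state, reward, done = step(state, action)
        total += reward
        if done:
            break
    return total
```

In this toy setting, `rollout(0.0, deceptive_policy())` and `rollout(0.1, deceptive_policy())` both return `1.0`, because the noise is too small to flip the victim's action, while `rollout(0.3, deceptive_policy())` returns `-1.0`, because the victim follows the deceptive policy to the worst terminal state. The threshold behaviour mirrors the abstract's qualifier that the adversary is stronger under an appropriate noise level, although the numbers here say nothing about the paper's actual experiments.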
