Abstract
Control policies, trained using the Deep Reinforcement Learning, have beenrecently shown to be vulnerable to adversarial attacks introducing even verysmall perturbations to the policy input. The attacks proposed so far have beendesigned using heuristics, and build on existing adversarial example craftingtechniques used to dupe classifiers in supervised learning. In contrast, thispaper investigates the problem of devising optimal attacks, depending on awelldefined attacker's objective, e.g., to minimize the main agent averagereward. When the policy and the system dynamics, as well as rewards, are knownto the attacker, a scenario referred to as a whitebox attack, designingoptimal attacks amounts to solving a Markov Decision Process. For what we callblackbox attacks, where neither the policy nor the system is known, optimalattacks can be trained using Reinforcement Learning techniques. Throughnumerical experiments, we demonstrate the efficiency of our attacks compared toexisting attacks (usually based on Gradient methods). We further quantify thepotential impact of attacks and establish its connection to the smoothness ofthe policy under attack. Smooth policies are naturally less prone to attacks(this explains why Lipschitz policies, with respect to the state, are moreresilient). Finally, we show that from the main agent perspective, the systemuncertainties and the attacker can be modeled as a Partially Observable MarkovDecision Process. We actually demonstrate that using Reinforcement Learningtechniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads tomore resilient policies.
Quick Read (beta)
Optimal Attacks on Reinforcement Learning Policies
Abstract
Control policies, trained using the Deep Reinforcement Learning, have been recently shown to be vulnerable to adversarial attacks introducing even very small perturbations to the policy input. The attacks proposed so far have been designed using heuristics, and build on existing adversarial example crafting techniques used to dupe classifiers in supervised learning. In contrast, this paper investigates the problem of devising optimal attacks, depending on a welldefined attacker’s objective, e.g., to minimize the main agent average reward. When the policy and the system dynamics, as well as rewards, are known to the attacker, a scenario referred to as a whitebox attack, designing optimal attacks amounts to solving a Markov Decision Process. For what we call blackbox attacks, where neither the policy nor the system is known, optimal attacks can be trained using Reinforcement Learning techniques. Through numerical experiments, we demonstrate the efficiency of our attacks compared to existing attacks (usually based on Gradient methods). We further quantify the potential impact of attacks and establish its connection to the smoothness of the policy under attack. Smooth policies are naturally less prone to attacks (this explains why Lipschitz policies, with respect to the state, are more resilient). Finally, we show that from the main agent perspective, the system uncertainties and the attacker can be modelled as a Partially Observable Markov Decision Process. We actually demonstrate that using Reinforcement Learning techniques tailored to POMDP (e.g. using Recurrent Neural Networks) leads to more resilient policies.
Optimal Attacks on Reinforcement Learning Policies
Alessio Russo Division of Decision and Control Systems KTH Royal Institute of Technology Stockholm, Sweden [email protected] Alexandre Proutiere Division of Decision and Control Systems KTH Royal Institute of Technology Stockholm, Sweden [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Advances in Deep Reinforcement Learning (RL) have made it possible to train endtoend policies achieving superhuman performance on a large variety of tasks, such as playing Atari games [1, 2], playing Go [3, 4], as well as controlling systems with continuous state and action spaces [5, 6, 7]. Recently, some of these policies have been shown to be vulnerable to adversarial attacks at test time see e.g. [8, 9]. Even if these attacks only introduce small perturbations to the successive inputs of the policies, they can significantly impact the average collected reward by the agent.
So far, attacks on RL policies have been designed using heuristic techniques, based on gradient methods, essentially the Fast Gradient Sign Method (FGSM). As such, they do not explicitly aim at minimizing the reward collected by the agent. In contrast, in this paper, we investigate the problem of casting optimal attacks with welldefined objectives, for example minimizing the average reward collected by the agent. We believe that casting optimal attacks is crucial when assessing the robustness of RL policies, since ideally, the agent should learn and apply policies that resist any possible attack (of course with limited and reasonable amplitude).
For a given policy learnt by the agent, we show that the problem of devising an optimal attack can be formulated as a Markov Decision Process (MDP), defined through the system dynamics, the agent’s reward function and policy. We are mainly interested in blackbox attacks: to devise them, the adversary only observes the variables from which she builds her reward. For example, if the objective is to lead the system to a certain set of (bad) states, the adversary may just observe the states as the system evolves. If the goal is to minimize the cumulative reward collected by the agent, the adversary may just observe the system states and the instantaneous rewards collected by the agent. In blackbox attacks, the aforementioned MDP is unknown, and optimal attacks can be trained using RL algorithms. The action selected by the adversary in this MDP corresponds to a perturbed state to be fed to the agent’s policy. As a consequence, the action space is typically very large. To deal with this issue, our optimal attack is trained using DDPG [5], a Deep RL algorithm initially tailored to continuous action space.
We train and evaluate optimal attacks for some OpenAI Gym [10] environments with discrete action spaces (discrete MountainCar, Cartpole, and Pong), and continuous stateaction spaces (continuous MountainCar and continuous LunarLander). As expected, optimal attacks outperform existing attacks.
We discuss how the methods used to train RL policies impact their resilience to attacks. We show that the damages caused by attacks are upper bounded by a quantity directly related to the smoothness of the policy under attack^{1}^{1} 1 A policy is smooth when the action it selects smoothly varies with the state. Refer to proposition 5.1 for details.. Hence, training methods such as DDPG leading to smooth policies should be more robust. Finally, we remark that when under attack, the agent faces an environment with uncertain state feedback, and her sequential decision problem can be modelled as a Partially Observable MDP (POMDP). This suggests that training methods specifically tailored to POMDP (e.g. DRQN [11]) result in more resistant policies. These observations are confirmed by our experiments.
2 Related Work
Adversarial attacks on RL policies have received some attention over the last couple of years. As for now, attacks concern mainly the system at test time, where the agent has trained a policy using some Deep RL algorithm (except for [12], which considers an attack during training). The design of these attacks [8, 9, 13] are inspired by the Fast Gradient Sign Method (FGSM) [14], which was originally used to fool Deep Learning classifiers. FGSM is a gradient method that changes the input with the aim of optimizing some objective function. This function, and its gradient, are sampled using observations: in case of an agent trained using Deep RL, this function can be hence related to a distribution of actions given a state (the actornetwork outputs a distribution over actions), or the $Q$value function (the critic outputs approximate $Q$values). In [8], FGSM is used to minimize the probability of the main agent selecting the optimal action according to the agent. On the other hand, in [9], the authors develop a gradient method that maximizes the probability of taking the action with minimal $Q$value. The aforementioned attacks are based on the unperturbed policy of the agent, or its $Q$value function, and do not consider optimizing the attack over all possible trajectories. Therefore, they can be considered a firstorder approximation of an optimal attack. If the adversary wishes to minimize the average cumulative reward of the agent, she needs to work on the perturbed agent policy (to assess the impact of an attack, we need to compute the rewards after the attack).
Proactive and reactive defence methods have been proposed to make RL policies robust to gradient attacks. Proactive methods are essentially similar to those in supervised learning, i.e., they are based on injecting adversarial inputs during training [15, 9]. A few reactive methods have also been proposed [13]. There, the idea is to train a separate network predicting the next state. This network is then used to replace the state input by the predicted state if this input is believed to be too corrupted. But as we show in the paper, adversarial examples introduce partial observability, and the problem is not Markov anymore. It is known [16] that deterministic memoryless policies are inadequate to deal with partial observability, and some kind of memory is needed to find a robust policy. In this work, we make use of recurrent neural networks to show that we can use techniques from the literature of Partially Observable MDPs to robustify RL policies.
It is finally worth mentioning that there has been some work studying the problem of robust RL with a gametheoretical perspective [17, 18]. However, these papers consider a very different setting where the adversary has a direct impact on the system (not indirectly through the agent as in our work). In a sense, they are more related to RL for multiagent systems.
3 Preliminaries
This section provides a brief technical background on Markov Decision Process and Deep Reinforcement Learning. The notions and methods introduced here are used extensively in the remainder of the paper.
Markov Decision Process. A Markov Decision Process (MDP) defines a discretetime controlled Markov chain. It is defined by a tuple $\mathcal{M}=\u27e8\mathcal{S},({\mathcal{A}}_{s},s\in \mathcal{S}),P,{p}_{0},q\u27e9$. $\mathcal{S}$ is the finite state space, ${\mathcal{A}}_{s}$ is the finite set of actions available in state $s$, $P$ represents the system dynamics where $P(\cdot s,a)$ is the distribution of the next state given that the current state is $s$ and the selected action is $a$, ${p}_{0}$ is the distribution of the initial state, and $q$ characterizes rewards where $q(\cdot s,a)$ is the distribution of the collected reward in state is $s$ when action $a$ is selected. We denote by $r(s,a)$ the expectation of this reward. We assume that the average reward is bounded: for all $(s,a)$, $r(s,a)\le R$. A policy $\pi $ for $\mathcal{M}$ maps the state and to a distribution of the selected action. For $a\in {\mathcal{A}}_{s}$, $\pi (as)$ denotes the probability of choosing action $a$ in state $s$. The value function ${V}^{\pi}$ of a policy $\pi $ defines for every $s$ the average cumulative discounted reward under $\pi $ when the system starts in state $s$: ${V}^{\pi}(s)=\mathbb{E}[{\sum}_{t\ge 0}{\gamma}^{t}r({s}_{t}^{\pi},{a}_{t}^{\pi})]$ where ${s}_{t}^{\pi}$ and ${a}_{t}^{\pi}$ are the state and the selected action at time $t$ under policy $\pi $, and $\gamma \in (0,1)$ is the discount factor. The $Q$value function ${Q}^{\pi}$ of policy $\pi $ maps a (state, action) pair $(s,a)$ to the average cumulative discounted reward obtained starting in state $s$ when the first selected action is $a$ and subsequent actions are chosen according to $\pi $: ${Q}^{\pi}(s,a)=r(s,a)+\gamma {\sum}_{{s}^{\prime}}{\sum}_{a}\pi (as)P({s}^{\prime}s,a){V}^{\pi}({s}^{\prime})$. For a given MDP $\mathcal{M}$, the objective is to devise a policy with maximal value in every state.
Deep Reinforcement Learning To train policies with maximal rewards for MPDs with large state or action spaces, Deep Reinforcement Learning consists in parametrizing the set of policies or some particular functions of interest such as the value function or the $Q$value function of a given policy by a deep neural network. In this paper, we consider various Deep RL techniques to train either the policy of the agent or the attack. These include:
DQN [1]: the network aims at approximating the $Q$value function of the optimal policy.
DDPG [5]: an actorcritic method developed to deal with continuous stateaction spaces. To this aim, the actor network returns a single action rather than a distribution over actions. DDPG is particularly wellsuited to train attacks (since it corresponds to solving an MDP with large action space).
DRQN [11]: introduces memory in DQN by replacing the (postconvolutional) layer by a recurrent LSTM. The use of this recurrent structure allows us to deal with partial observability, and DRQN works better with POMDPs.
4 Optimal Adversarial Attacks
To attack a given policy $\pi $ trained by the main agent, an adversary can slightly modify the sequence of inputs of this policy. The changes are imposed to be small, according to some distance, so that the attack remains difficult to detect. The adversary aims at optimizing inputs with a precise objective in mind, for example, to cause as much damage as possible to the agent.
We denote by $\mathcal{M}=\u27e8\mathcal{S},({\mathcal{A}}_{s},s\in \mathcal{S}),P,{p}_{0},q\u27e9$ the MDP solved by the agent, i.e., the agent policy $\pi $ has been trained for $\mathcal{M}$.
4.1 The attack MDP
To attack the policy $\pi $ the adversary proceeds as follows. At time $t$, the adversary observes the system state ${s}_{t}$. She then selects a perturbed state ${\overline{s}}_{t}$, which becomes the input of the agent policy $\pi $. The agent hence chooses an action according to $\pi (\cdot {\overline{s}}_{t})$. The adversary successively collects a random reward defined depending on her objective, which is assumed to be a function of the true state and the action selected by the agent. An attack is defined by a mapping $\chi :\mathcal{S}\to \mathcal{S}$ where $\overline{s}=\chi (s)$ is the perturbed state given that the true state is $s$.
Constrained perturbations. For the attack to be imperceptible, the input should be only slightly modified. Formally, we assume that we can define a notion of distance $d(s,{s}^{\prime})$ between two states $s,{s}^{\prime}\in \mathcal{S}$. The state space of most RL problems can be seen as a subset of a Euclidean space, in which case $d$ is just the Euclidean distance. We impose that given a state $s$ of the system, the adversary can only select a perturbed state $\overline{s}$ in ${\overline{\mathcal{A}}}_{s}^{\u03f5}=\{\overline{s}\in \mathcal{S}:d(s,\overline{s})\le \u03f5\}$. $\u03f5$ is the maximal amplitude of the state perturbation.
System dynamics and agent’s reward under attack. Given that the state is ${s}_{t}=s$ at time $t$ and that the adversary selects a modified state $\overline{s}\in {\overline{\mathcal{A}}}_{s}^{\u03f5}$, the agent selects an action according to the distribution $\pi (\cdot \overline{s})$. Thus, the system state evolves to a random state with distribution $\overline{P}(\cdot s,\overline{s}):={\sum}_{a}\pi (a\overline{s})P(\cdot s,a)$. The agent instantaneous reward is a random variable with distribution ${\sum}_{a}\pi (a\overline{s})q(\cdot s,a)$.
Adversary’s reward. The attack is shaped depending on the adversary’s objective. The adversary typically defines her reward as a function of the true state and the action selected by the agent, or more concisely as a direct function of the reward collected by the agent. The adversary might be interested in guiding the agent to some specific states. In control systems, the adversary may wish to induce oscillations in the system output or to reduce its controllability (this can be realized by choosing a reward equal to the energy spent by the agent, i.e., proportional to ${a}^{2}$ if the agent selects $a$). The most natural objective for the adversary is to minimize the average cumulative reward collected by the agent. In this case, the adversary would set her instantaneous reward equal to the opposite of the agent’s reward.
We denote by $\overline{q}(\cdot s,\overline{s})$ the distribution of the reward collected by the adversary in state $s$ when she modifies the input to $\overline{s}$, and by $\overline{r}(s,\overline{s})$ its expectation. For example, when the adversary wishes to minimize the agent’s average cumulative reward, we have $\overline{r}(s,\overline{s})={\sum}_{a}\pi (a\overline{s})r(s,a)$.
We have shown that designing an optimal attack corresponds to identifying a policy $\chi $ that solves the following MDP: $\overline{\mathcal{M}}=\u27e8\mathcal{S},({\overline{\mathcal{A}}}_{s}^{\u03f5},s\in \mathcal{S}),\overline{P},{p}_{0},\overline{q}\u27e9$. When the parameters of this MDP are known to the adversary, a scenario referred to as white box attack, finding the optimal attack accounts to solving the MDP, e.g., by using classical methods (value or policy iteration). More realistically, the adversary may ignore the parameters of $\overline{\mathcal{M}}$, and only observe the state evolution and her successive instantaneous rewards. This scenario is called blackbox attack, and in this case, the adversary can identify the optimal attack using Reinforcement Learning algorithms.
Observe that when the adversarial attack policy is $\chi $, then the system dynamics and rewards correspond to a scenario where the agent applies the perturbed policy $\pi \circ \chi $, which is defined by the distributions $\pi \circ \chi (\cdot s):=\pi (\cdot \chi (s)),s\in \mathcal{S}$.
4.2 Minimizing agent’s average reward
Next, we give interesting properties of the MDP $\overline{\mathcal{M}}$. When the adversary aims at minimizing the average cumulative reward of the agent, then the reward collected by the agent can be considered as a cost by the adversary. In this case, $\overline{r}(s,\overline{s})={\sum}_{a}\pi (a\overline{s})r(s,a)$, and one can easily relate the value function (for the MDP $\overline{\mathcal{M}}$) of an attack policy $\chi $ to that of the agent policy $\pi \circ \chi $ (for the MDP $\mathcal{M}$):
$${V}^{\chi}(s)={V}^{\pi \circ \chi}(s),\forall s\in \mathcal{S}.$$  (1) 
The $Q$—value function of an attack policy $\chi $ can also be related to the $Q$—value function of the agent policy under attack $\pi \circ \chi $. Indeed: for all $s,\overline{s}\in \mathcal{S}$.
${Q}^{\chi}(s,\overline{s})$  $=\overline{r}(s,\overline{s})+\gamma {\displaystyle \sum _{{s}^{\prime}}}{\displaystyle \sum _{a}}\pi (a\overline{s})P({s}^{\prime}s,a){V}^{\chi}({s}^{\prime}),$  
$={\displaystyle \sum _{a}}\pi (a\overline{s})\left\{r(s,a)+\gamma {\displaystyle \sum _{{s}^{\prime}}}P({s}^{\prime}s,a){V}^{\pi \circ \chi}({s}^{\prime})\right\}={\displaystyle \sum _{a}}\pi (a\overline{s}){Q}^{\pi \circ \chi}(s,a).$ 
In particular, if the agent policy is deterministic ($\pi (s)\in {\mathcal{A}}_{s}$ is the action selected under $\pi $ in state $s$), we simply have:
$${Q}^{\chi}(s,\overline{s})={Q}^{\pi \circ \chi}(s,\pi (\overline{s})),\forall s,\overline{s}\in \mathcal{S}.$$  (2) 
Equation (2) has an important consequence when we train an attack policy using Deep Learning algorithms. It implies that evaluating the $Q$value of an attack policy $\chi $ can be done by evaluating the $Q$value function of the perturbed agent policy $\pi \circ \chi $. In the case of offpolicy learning, the former evaluation would require to store in the replay buffer experiences of the type $(s,\overline{s},r)$ ($r$ is the observed reward), whereas the latter just stores $(s,\pi (\overline{s}),r)$ (but it requires to observe the actions of the agent). This simplifies considerably when the state space is much larger than the action space.
4.3 Training optimal attacks
Training an optimal attack can be very complex, since it corresponds to solving an MDP where the action space size is similar to that of the state space. In our experiments, we use two (potentially combined) techniques to deal with this issue: (i) feature extraction when this is at all possible, and (ii) specific Deep RL techniques tailored to very large (or even continuous) state and action spaces, such as DDPG.
Feature extraction. One may extract important features of the system admitting a lowdimensional representation. Formally, this means that by using some expert knowledge about the system, one can identify a bijective mapping between the state $s$ in $\mathcal{S}$, and its features $z$ in $\mathcal{Z}$ of lower dimension. In general, this mapping can also be chosen such that its image of the ball ${\mathcal{A}}_{s}^{\u03f5}$ is easy to represent in $\mathcal{Z}$ (typically this image would also be a ball around $z$, the feature representation of $s$). It is then enough to work in $\mathcal{Z}$. When $\mathcal{Z}$ is of reasonable size, one may then rely on existing valuebased techniques (e.g. DQN) to train the attack. An example where features can be easily extracted is the Atari game of pong: the system state is just represented by the positions of the ball and the two bars. Finally, we assume to be very likely that an adversary trying to disrupt the agent policy would conduct an advanced feature engineering work before casting her attack.
Deep RL: DDPG. In many systems, it may not be easy to extract interesting features. In that case, one may rely on Deep RL algorithms to train the attack. The main issue here is the size of the action space. This size prevents us to use DQN (that would require to build a network with as many outputs as the number of possible actions), but also policygradient and actorcritic methods that parametrize randomized policies such as TRPO (indeed we can simply not maintain distributions over actions). This leads us to use DDPG, an actorcritic method that parametrizes deterministic policies. In DDPG, we consider policies parametrized by $\varphi $ (${\chi}_{\varphi}$ is the policy parametrized by $\varphi $), and its approximated $Q$—value function ${Q}_{\theta}$ parametrized by $\theta $. We also maintain two corresponding target networks. The objective function to maximize is $J(\varphi )={\mathbb{E}}_{s\sim {p}_{0}}[{V}^{{\chi}_{\varphi}}(s)]$. $\varphi $ is updated using the following gradient [19]:
$${\nabla}_{\varphi}J(\varphi )={\mathbb{E}}_{s\sim {\rho}^{{\chi}^{e}}}[{\nabla}_{\overline{s}}{Q}_{\theta}(s,{\chi}_{\varphi}(s)){\nabla}_{\varphi}{\chi}_{\varphi}(s)],$$  (3) 
where ${\chi}^{e}$ corresponds to ${\chi}_{\varphi}$ with additional exploration noise, and ${\rho}^{{\chi}^{e}}$ denotes the discounted state visitation distribution under the policy ${\chi}^{e}$. In the case where the adversary’s objective is to minimize the average reward of the agent, in view of (2), the above gradient becomes:
$${\nabla}_{\varphi}J(\varphi )={\mathbb{E}}_{s\sim {\rho}^{{\chi}^{e}}}[{\nabla}_{a}{Q}_{\theta}(s,\pi ({\chi}_{\varphi}(s))){\nabla}_{\overline{s}}\pi ({\chi}_{\varphi}(s)){\nabla}_{\varphi}{\chi}_{\varphi}(s)].$$  (4) 
Finally, to ensure that the selected perturbation computed by ${\chi}_{\varphi}$ belongs to ${\mathcal{A}}_{s}^{\u03f5}$, we add a last fixed layer to the network that projects onto the ball centred in $0$, with radius $\u03f5$. This is done by applying the function $x\mapsto \mathrm{min}(1,\frac{\u03f5}{\parallel x\parallel})x$ to $x+e$, where $x$ is the output before the projection layer, and $e$ is the exploration noise. After this projection, we add the state $s$ to get ${\chi}^{e}(s)$.
The pseudocode of the algorithm is provided in Algorithm 1 in the Appendix. Algorithm 2 presents a version of the algorithm using the gradient (4). A diagram of the actor and critic networks are also provided in Figure 8 of the Appendix.
Gradientbased exploration. In case the adversary knows the policy $\pi $ of the main agent, we can leverage this knowledge to tune the exploration noise $e$. Similarly, to gradient methods [8, 9], we suggest using the quantity ${\nabla}_{s}J(\pi (\cdot s),y)$ to improve the exploration process , where $J$ is the crossentropy loss between $\pi (\cdot s)$ and a onehot vector $y$ that encodes the action with minimum probability in state $s$. This allows exploring directions that might minimize the value of the unperturbed policy, which boosts the rate at which the optimal attack is learnt. At time $t$, the exploration noise could be set to ${e}_{t}^{\prime}$, a convex combination of a white exploration noise ${e}_{t}$ and $f({s}_{t})=g({\nabla}_{s}J(\pi (\cdot {s}_{t}),{y}_{t}))$, where $g$ is a normalizing function. Following this gradient introduces bias, hence we randomly choose when to use ${e}_{t}$ or ${e}_{t}^{\prime}$ (more details can be found in the Appendix).
5 Robustness of Policies
In this section, we briefly discuss which training methods should lead to agent policies more resilient to attacks. We first quantify the maximal impact of an attack on the cumulative reward of the agent policy, and show that it is connected to the smoothness of the agent policy. This connection indicates that the smoother the policy is, the more resilient it is to attacks. Next, we show that from the agent perspective, an attack induces a POMDP. This suggests that if training the agent policy using Deep RL algorithms tailored to POMDP yields more resilient policies.
Impact of an attack and policy smoothness. When the agent policy $\pi $ is attacked using $\chi $, one may compute the cumulative reward gathered by the agent. Indeed, the system evolves as if the agent policy was $\pi \circ \chi $. The corresponding value function satisfies the Bellman equation: ${V}^{\pi \circ \chi}(s)={\mathbb{E}}_{a\sim \pi \circ \chi (\cdot s)}\left[r(s,a)+\gamma {\mathbb{E}}_{{s}^{\prime}\sim P(\cdot s,a)}[{V}^{\pi \circ \chi}({s}^{\prime})]\right]$. Starting from this observation, we derive an upper bound on the impact of an attack (see the Appendix for a proof):
Proposition 5.1.
For any $s\mathrm{\in}\mathrm{S}$, let^{2}^{2} 2 $\mathrm{\parallel}\mathrm{\cdot}{\mathrm{\parallel}}_{T\mathit{}V}$ is the Total Variation distance, and for any $V\mathrm{\in}{\mathrm{R}}^{\mathrm{}\mathrm{S}\mathrm{}}$, ${\mathrm{\parallel}V\mathrm{\parallel}}_{\mathrm{\infty}}\mathrm{=}{\mathrm{max}}_{s\mathrm{\in}\mathrm{S}}\mathit{}\mathrm{}V\mathit{}\mathrm{(}s\mathrm{)}\mathrm{}$. ${\alpha}_{\pi \mathrm{,}\epsilon}\mathrm{(}s\mathrm{)}\mathrm{=}{\mathrm{max}}_{\overline{s}\mathrm{\in}{\mathrm{A}}_{s}^{\epsilon}}\mathrm{\parallel}\pi \mathrm{(}\mathrm{\cdot}\mathrm{}s\mathrm{)}\mathrm{}\pi \mathrm{(}\mathrm{\cdot}\mathrm{}\overline{s}\mathrm{)}{\mathrm{\parallel}}_{T\mathit{}V}$. We have:
$${\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}\le 2\frac{{\parallel {\alpha}_{\pi ,\epsilon}\parallel}_{\mathrm{\infty}}}{1\gamma}\left(R+\gamma {\parallel {V}^{\pi}\parallel}_{\mathrm{\infty}}\right).$$  (5) 
Assume now that the agent policy $\pi $ is smooth in the sense that, for all $s\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{\in}\mathrm{S}$, $\mathrm{\parallel}\pi \mathrm{(}\mathrm{\cdot}\mathrm{}s\mathrm{)}\mathrm{}\pi \mathrm{(}\mathrm{\cdot}\mathrm{}{s}^{\mathrm{\prime}}\mathrm{)}{\mathrm{\parallel}}_{T\mathit{}V}\mathrm{\le}Ld\mathrm{(}s\mathrm{,}{s}^{\mathrm{\prime}}\mathrm{)}$, then
$${\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}\le 2\frac{L\epsilon}{1\gamma}\left(R+\gamma {\parallel {V}^{\pi}\parallel}_{\mathrm{\infty}}\right).$$  (6) 
In the above, ${\alpha}_{\pi ,\epsilon}(s)$ quantifies the potential impact of applying a feasible perturbation to the state $s$ on the distribution of actions selected by the agent in this state (note that it decreases to 0 as $\epsilon $ goes to 0). The proposition establishes the intuitive fact that when the agent policy is smooth (i.e., varies smoothly with the input), it should be more resilient to attacks. Indeed, in our experiments, policies trained using Deep RL methods known to lead to smooth policies (e.g. DDPG for RL problems with continuous state and action spaces) resist better to attacks.
Induced POMDP and DRQN. When the agent policy is under the attack $\chi $, the agent perceives the true state $s$ only through its perturbation $\chi (s)$. More precisely, the agent observes a perturbed state $\overline{s}$ only, with probability $\mathcal{O}(\overline{s}s)=\mathrm{\U0001d7d9}[\chi (s)=\overline{s}]$ (in general $\mathcal{O}$ could also be a distribution over states if the attack is randomized). It is known [20, 21, 16] that solving POMDPs requires "memory", i.e., policies whose decisions depend on the past observations (nonMarkovian). Hence, we expect that training the agent policy using Deep RL algorithms tailored to POMDP produces more resilient policies. In [11] they empirically demonstrate that recurrent controllers have a certain degree of robustness against missing information, even when trained with full state information. We use this to proactively extract robust features, not biased by a specific adversary policy $\chi $, as in adversarial training. This is confirmed in our experiments, comparing the impact of attacks on policies trained by DRQN to those trained using Deep RL algorithms without recurrent structure (without memory).
6 Experimental Evaluation
We evaluate our attacks on four OpenAI Gym [10] environments (A: discrete MountainCar, B: Cartpole, C: continuous MountainCar, and D: continuous LunarLander), and on E: Pong, an Atari 2600 game in the Arcade Learning Environment [22]. In Appendix, we also present the results of our attacks for the toy example of the Grid World problem, to illustrate why gradientbased attack are suboptimal.
Agent’s policies. The agent policies are trained using DQN or DRQN in case of discrete action spaces (environments A, B, and E), and using DDPG when the action space is continuous (environments C and D). In each environment and for each training algorithm, we have obtained from 3 policies for DQN, 1 for DRQN (achieving at least 95% of the performance of the best policy reported in OpenAI Gym).
Adversarial attack. For each environment A, B, C, and D, we train one adversarial attack using DDPG against one of the agent’s policies, and for a perturbation constraint $\epsilon =0.05$ (we use the same normalization method as in [8]). For results presented below, we use uniformly distributed noise; results with gradientbased noise are discussed in the Appendix. To get attacks for other values of $\epsilon $, we do not retrain our attack but only change the projection layer in our network. For environment E, we represent the state as a 4dimensional feature vector, and train the attack using DQN in the feature space. The full experimental setup is presented in the Appendix.
6.1 Optimal vs. gradientbased attacks
We compare our attacks to the gradientbased attacks proposed in [9] (we found that these are the most efficient among gradientbased attacks). To test a particular attack, we run it against the 3 agent’s policies, using 10 random seeds (e.g. involved in the state initialization), and 30 episodes. We hence generate up to 1800 episodes, and average the cumulative reward of the agent.
In Figure 1, we plot the performance loss for different the perturbation amplitudes $\epsilon $ in the 4 environments. Observe that the optimal attack consistently outperforms the gradientbased attack, and significantly impact the performance of the agent. Also note that since we have trained our attack for $\epsilon =0.05$, it seems that this attack generalizes well for various values of $\epsilon $. In Appendix, we present similar plots but not averaged over the 3 agent’s policies, and we do not see any real difference – suggesting that attacks generalize well to different agent’s policies.
From Figure 1, for environments with discrete action spaces (the top two sets of curves), the resilience of policies trained by DRQN is confirmed: these policies are harder to attack.
Policies trained using DDPG for environments with continuous action spaces (the bottom two sets of curves in Figure 1) are more difficult to perturb than those trained in environments with discrete action spaces. The gradientbased attack does not seem to be efficient at all, at least in the range of attack amplitude $\epsilon $ considered. In the case of LunarLander some state components were not normalized, which may explain why a bigger value of $\epsilon $ is needed in order to see a major performance loss (at least $\epsilon =0.1$). Our optimal attack performs better, but the impact of the attack is not as significant as in environments with discrete action spaces: For instance, for discrete and continuous MountainCar, our attack with amplitude $\epsilon =0.07$ decreases the performance of the agent by 30% and 13%, respectively.
6.2 Attack on Pong, based on feature extraction
We consider the game of Pong where the state is an 84x84 pixel image. The agent’s policies were trained using DQN using as inputs the successive images. As for the attack, we extracted a 4dimensional feature vector fully representing the state: this vector encodes the position of the centre of mass of the three relevant objects in the image (the ball and the two bars). Extracting the features is done using a high pass filter to detect the contours. We need 4dimensional vectors since the two bars have a single degree of freedom, and the ball has two.
We limit the amplitude of the attack in the feature space directly. More precisely, we impose that the adversary just changes one of the 4 components of the feature vector by one pixel only. For example, the adversary can move up one bar by one pixel. We found that the amplitude of the resulting attack in the true state space (full images) corresponds to $\epsilon =0.013$.
The attack is trained using DQN in the feature space. In Figure 2 (left part), we present a few true and perturbed images, and below the corresponding distributions of the actions selected by the agent without and with the attack (the probabilities are calculated from the output of the network using a softmax function with temperature $T=1$). Moving objects by one pixel can perturb considerably the agent’s policy. In Figure 2 (right part), we show the evolution of the performance of the agent’s policy under our attack during the training of the latter. Initially, the agent’s policy achieves the maximum score, but after 500 episodes of training, the agent’s score has become close to the lowest possible score in this game (20). This score can be reached in an even fewer number of episodes when tuning the training parameters. The gradientbased attacks performed in [8] cannot be directly compared to ours, since these attacks change $4$ frames at every time step, while ours modifies one only. They also modify the values of every pixel in frames (using 32bit precision over the reals [0,1]), whereas we change only the values of just a few (using only 8bit precision over the integers [0,255]).
7 Conclusion
In this paper we have formulated the problem of devising an optimal blackbox attack that does not require access to the underlying policy of the main agent. Previous attacks, such as FGSM [8], make use of whitebox assumptions, i.e., knowing the actionvalue function Q, or the policy, of the main agent. In our formulation, we do not assume this knowledge. Deriving an optimal attack is important in order to understand how to build RL policies robust to adversarial perturbations. The problem has been formulated as a Markov Decision Problem, where the goal of the adversary is encapsulated in the adversarial reward function. The problem becomes intractable when we step out of toy problems: we propose a variation of DDPG to compute the optimal attack. In the whitebox case, instead of using FGSM, we propose to use a gradientbased method to improve exploration. Adapting to such attacks requires solving a Partially Observable MDP, hence we have to resort to nonmarkovian policies [16]. In deep learning this can be done by adopting recurrent layers, as in DRQN [11]. We also show that Lipschitz policies may have better robustness properties. We validated our algorithm on different environments. In all cases we have found out that our attack outperforms gradient methods. This is more evident in discrete action spaces, whilst in continuous spaces it is more difficult to perturb for small values of $\epsilon $, which may be explained by the bound we provide for Lipschitz policies. Finally, policies that use memory seem to be more robust in general.
References
 [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop, 2013.
 [2] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
 [3] Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 [4] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354 EP –, Oct 2017.
 [5] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
 [6] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [7] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. Endtoend training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 [8] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. Workshop Track of the 5th International Conference on Learning Representations, Toulon, France, 2017.
 [9] Anay Pattanaik, Zhenyi Tang, Shuijing Liu, Gautham Bommannan, and Girish Chowdhary. Robust deep reinforcement learning with adversarial attacks. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2040–2042. International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 [10] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 [11] Matthew Hausknecht and Peter Stone. Deep recurrent qlearning for partially observable mdps. In 2015 AAAI Fall Symposium Series, 2015.
 [12] Vahid Behzadan and Arslan Munir. Vulnerability of deep reinforcement learning to policy induction attacks. In International Conference on Machine Learning and Data Mining in Pattern Recognition, pages 262–275. Springer, 2017.
 [13] YenChen Lin, ZhangWei Hong, YuanHong Liao, MengLi Shih, MingYu Liu, and Min Sun. Tactics of adversarial attack on deep reinforcement learning agents. Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, 2017.
 [14] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.
 [15] Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. Workshop Track of the 5th International Conference on Learning Representations, Toulon, France, 2017.
 [16] Satinder P Singh, Tommi Jaakkola, and Michael I Jordan. Learning without stateestimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pages 284–292. Elsevier, 1994.
 [17] Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.
 [18] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 2817–2826. JMLR. org, 2017.
 [19] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 [20] Matthijs TJ Spaan. Partially observable markov decision processes. In Reinforcement Learning, pages 387–414. Springer, 2012.
 [21] Karl J Astrom. Optimal control of markov processes with incomplete state information. Journal of mathematical analysis and applications, 10(1):174–205, 1965.
 [22] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 [23] Matthias Plappert. kerasrl. https://github.com/kerasrl/kerasrl, 2016.
8 Other experimental results
Gridworld. We use as a toyexample a $6\times 6$ gridworld, with ${\mathrm{\ell}}_{1}$ distance, and perturbation magnitude $\epsilon =1.0$. The goal of the game is to reach the topleft or bottomright corner in the fewest number of steps, with reward $1$ for each step spent in the game. We implemented both the gradient methods inspired by Huang et al. [8] and Pattanaik et al. [9], and compared them with our proposed method to find the optimal perturbation policy $\chi $. The implementation of those algorithms is shown in Algorithm 10.2, 10.2. In figure 3 is shown the undiscounted value of the various attacks: we can see the value of the worst attack defined in [9] does not achieve the same values that are achieved from the optimal perturbation $\chi $.
OpenAI Gym Environments. Figure 4 presents the results for the discrete MountainCar environment A. There, we plot the distribution of the performance loss (in %) due to the attack for $\epsilon =0.05$, when the agent policies have been trained using DQN (top) or DRQN (bottom).
In Figure 58 we show the performance loss distribution for $\epsilon =0.05$, and in Figure 9 the average performance loss for different values of $\epsilon $, all computed when the adversary attacks the agent’s policy it was trained for. Additionally, for environment B, C, D, we show the loss distribution averaged across all the main agent’s policies.
Gradient based exploration Gradientbased exploration tries to exploit knowledge of the unperturbed policy $\pi $ of the main agent. Caution should be taken, in order to reduce bias, and to minimize the risk of exploring like an onpolicy method. If we follow the information contained in $\pi $ without restriction we would end up in a suboptimal solution, that minimizes the value of the unperturbed policy $\pi $. The exploration noise is given by the following equation:
$${\widehat{e}}_{t}=\{\begin{array}{cc}{e}_{t},\hfill & {X}_{t}=0,\hfill \\ (1{\omega}_{t}){e}_{t}+{\omega}_{t}g({\nabla}_{s}J(\pi (a{s}_{t}),y)),\hfill & {X}_{t}=1\hfill \end{array}$$  (7) 
where $g$ is a normalizing function such as $g(x)=\frac{x}{\parallel x\parallel}$, or $g(x)=\mathrm{sign}(x)$, ${X}_{t}$ is a Bernoulli random variable with $P({X}_{t}=1)=p$, and ${\omega}_{t}={\mathrm{max}}_{a}\pi (a{s}_{t}){\mathrm{min}}_{a}\pi (a{s}_{t})$. $J$ is the crossentropy loss between $\pi (a{s}_{t})$ and a onehot vector $y$ that encodes the action with minimum probability in state ${s}_{t}$. If $\pi $ is not known, but the actionvalue function ${Q}^{\pi}$ is known, we can compute $\pi (\cdot s)$ by applying the softmax function on the actionvalues of a particular state $s$.
The gradient of the crossentropy loss computes the direction that increases the probability of taking the action with minimum value when the policy $\pi $ is unperturbed. The random variable ${X}_{t}$, instead, is necessary in order to reduce bias induced by following this gradient, while ${\omega}_{t}$ quantifies how strongly an agent prefers the optimal action in state ${s}_{t}$. Intuitively, in case $\pi $ is an optimal policy, the lower ${\omega}_{t}$ is, the less impact the optimal action has in that state. This guarantees that we follow the gradient of $J$ mostly when we are in critical states. In Figure 10 we show the lowpass filtered loss during the training of the attack. We clearly see an improvement in performance, although it highly depends on the selection of the parameters.
We have tested this type of exploration on Mountaincar, with $p=0.35$, and temperature of the softmax function set to $1$. The exploration noise ${e}_{t}$ is drawn from a Uniform distribution with parameters shown in Table 3 (Mountaincar). In Figure 10 we compare gradientbased exploration vs. uniform noise exploration. On the left, we show the lowpass filtered loss during the training of the attack, with $\epsilon =0.05$ and ${\mathrm{\ell}}_{2}$ norm. On the right of Figure 10 we compare episodes reward statistics after training. Results in the table were averaged across $5$ different seeds, with $100$ episodes for each seed, for a total of $500$ episodes for each method. We see an improvement in performance, although it highly depends on the selection of the parameters.
9 Proof of Proposition 5.1
Proof.
Let
$${g}^{\pi}(s,a)={\mathbb{E}}_{{s}^{\prime}\sim P(\cdot s,a)}[{V}^{\pi}({s}^{\prime})],\text{and}\mathit{\hspace{1em}}{g}^{\pi \circ \chi}(s,a)={\mathbb{E}}_{{s}^{\prime}\sim P(\cdot s,a)}[{V}^{\pi \circ \chi}({s}^{\prime})].$$  (8) 
We will now proceed by bounding the quantity $({V}^{\pi}{V}^{\pi \circ \chi})(s)$:
$({V}^{\pi}{V}^{\pi \circ \chi})(s)$  $=\left{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)+\gamma {g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi \circ \chi (\cdot s)}[r(s,a)+\gamma {g}^{\pi \circ \chi}(s,a)]\right,$  (9)  
$=\left{\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left[{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)+\gamma {g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)+\gamma {g}^{\pi \circ \chi}(s,a)]\right]\right,$  (10) 
where in the last line we made use of the following equality ${\mathbb{E}}_{a\sim \pi \circ \chi (\cdot s)}[\cdot ]={\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[\cdot ]]$. Now, by making use of $\mathbb{E}[x]\le \mathbb{E}[x]$ we get:
$({V}^{\pi}{V}^{\pi \circ \chi})(s)$  $=\left{\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left[{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)+\gamma {g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)+\gamma {g}^{\pi \circ \chi}(s,a)]\right]\right,$  (11)  
$\le {\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)+\gamma {g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)+\gamma {g}^{\pi \circ \chi}(s,a)].$  (12) 
At this point consider a random variable $X$ that can assume values only in a bounded set $K$. Let ${X}_{\mathrm{\infty}}$ be a bound on the absolute value of $X$. Then, for any two distributions ${f}_{1},{f}_{2}$ whose support is $K$ we have ${\mathbb{E}}_{{f}_{1}}[X]{\mathbb{E}}_{{f}_{2}}[X]\le 2{X}_{\mathrm{\infty}}{\parallel {f}_{1}{f}_{2}\parallel}_{TV}$:
$${\mathbb{E}}_{{f}_{1}}[X]{\mathbb{E}}_{{f}_{2}}[X]=\sum _{x\in K}x{f}_{1}(x)\sum _{x\in K}x{f}_{2}(x))\le X{}_{\mathrm{\infty}}\sum {}_{x\in K}f{}_{1}(x)f{}_{2}(x)=2X{}_{\mathrm{\infty}}\parallel f{}_{1}f{}_{2}\parallel {}_{TV}.$$  (13) 
By using the previous inequality we can bound the reward difference:
${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)]$  $\le 2R{\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}{\parallel \pi (s)\pi (\overline{s})\parallel}_{TV},$  (14)  
$\le 2R{\alpha}_{\pi ,\epsilon}(s),$  (15) 
where ${\alpha}_{\pi ,\epsilon}(s)={\mathrm{max}}_{\overline{s}\in {X}_{s}^{\epsilon}}{\parallel \pi (s)\pi (\overline{s})\parallel}_{TV}$. The inequality becomes
$$({V}^{\pi}{V}^{\pi \circ \chi})(s)\le 2{\alpha}_{\epsilon}(s)R+\gamma {\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left{\mathbb{E}}_{a\sim \pi (\cdot s)}[{g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[{g}^{\pi \circ \chi}(s,a)]\right.$$  (16) 
Now, let ${g}^{\pi}(s,a)={g}^{\pi \circ \chi}(s,a)+\mathrm{\Delta}(s,a)$, and observe that ${g}^{\pi}(s,a)$ is uniformly bounded by ${V}_{\mathrm{\infty}}^{\pi}={\parallel {V}^{\pi}\parallel}_{\mathrm{\infty}}$, and $\mathrm{\Delta}(s,a)$ by ${\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}$, thus
${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left{\mathbb{E}}_{a\sim \pi (\cdot s)}[{g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[{g}^{\pi \circ \chi}(s,a)]\right,$  (17)  
$=$  ${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left{\mathbb{E}}_{a\sim \pi (\cdot s)}[{g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[{g}^{\pi}(s,a)\mathrm{\Delta}(s,a)]\right,$  (18)  
$\le $  ${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left[2{\alpha}_{\pi ,\epsilon}(s){V}_{\mathrm{\infty}}^{\pi}+{\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}\right]=2{\alpha}_{\pi ,\epsilon}(s){V}_{\mathrm{\infty}}^{\pi}+{\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}.$  (19) 
Hence, we can deduce the following inequality:
$({V}^{\pi}{V}^{\pi \circ \chi})(s)$  $\le 2R{\alpha}_{\pi ,\epsilon}(s)+2\gamma {\alpha}_{\pi ,\epsilon}(s){V}_{\mathrm{\infty}}^{\pi}+\gamma {\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}},$  (20)  
$=2{\alpha}_{\pi ,\epsilon}(s)(R+\gamma {V}_{\mathrm{\infty}}^{\pi})+\gamma {\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}.$  (21) 
By taking the term $\gamma {\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}$ on the left hand side, dividing by $1\gamma $, and finally taking the supremum over $s\in \mathcal{S}$ we obtain the result:
$${\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}\le 2\frac{{\parallel {\alpha}_{\pi ,\epsilon}\parallel}_{\mathrm{\infty}}}{1\gamma}(R+\gamma {V}_{\mathrm{\infty}}^{\pi}).$$  (22) 
Assume now the main agent’s policy is smooth, such that ${\parallel \pi (s)\pi ({s}^{\prime})\parallel}_{TV}\le Ld(s,{s}^{\prime}),L\in {\mathbb{R}}_{\ge 0}$. We can now bound the term ${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)]$ as follows:
${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}{\mathbb{E}}_{a\sim \pi (\cdot s)}[r(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[r(s,a)]$  $\le {\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[2R{\parallel \pi (s)\pi (\overline{s})\parallel}_{TV}],$  (23)  
$\le 2LR{\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[d(s,\overline{s})],$  (24)  
$=2LR\overline{d}(s),$  (25) 
where $\overline{d}(s)={\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[d(s,\overline{s})]$ denotes the average weighted distance between elements of ${X}_{s}^{\epsilon}$ and $s$, according to the distribution $\chi (\cdot s)$. Equivalently we also have
${\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}\left{\mathbb{E}}_{a\sim \pi (\cdot s)}[{g}^{\pi}(s,a)]{\mathbb{E}}_{a\sim \pi (\cdot \overline{s})}[{g}^{\pi \circ \chi}(s,a)]\right$  $\le {\mathbb{E}}_{\overline{s}\sim \chi (\cdot s)}[2{V}_{\mathrm{\infty}}^{\pi}\parallel \pi (s)\pi (\overline{s}){\parallel}_{TV}$  (26)  
$\mathrm{\hspace{1em}}+\parallel {V}^{\pi}{V}^{\pi \circ \chi}{\parallel}_{\mathrm{\infty}}],$  (27)  
$\le 2L{V}_{\mathrm{\infty}}^{\pi}[\overline{d}(s)+{\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}],$  (28) 
from which we obtain
$({V}^{\pi}{V}^{\pi \circ \chi})(s)$  $\le 2LR\overline{d}(s)+\gamma 2L\overline{d}(s){V}_{\mathrm{\infty}}^{\pi}+\gamma {\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}},$  (29)  
$=2L\overline{d}(s)(R+\gamma {V}_{\mathrm{\infty}}^{\pi})+\gamma {\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}},$  (30) 
and
$${\parallel {V}^{\pi}{V}^{\pi \circ \chi}\parallel}_{\mathrm{\infty}}\le 2\underset{s\in \mathcal{S}}{sup}\frac{L\overline{d}(s)}{1\gamma}(R+\gamma {V}_{\mathrm{\infty}}^{\pi})\le 2\frac{L\epsilon}{1\gamma}(R+\gamma {V}_{\mathrm{\infty}}^{\pi}).$$  (31) 
∎
10 Algorithms
10.1 Attack training algorithm
In this section we describe the DDPG algorithms discussed in Section 4.3. We will denote the projection layer and the output before the projection layer respectively by ${l}_{\mathrm{\Pi}}$ and ${x}_{\varphi ,t}$. Algorithm 10.1 refers to the blackbox attack, whilst Algorithm 10.1 makes use of $\pi $, if it is known. Apart from the projection layer, and the change in the gradient step in the whitebox case, both algorithms follow the same logic as the usual DDPG algorithm, explained in [5]. Moreover, projections have a regularization term $\lambda $ in order to prevent division by $0$. {algorithm}[!h] \[email protected]@algorithmic[1] \ProcedureAttackTraining$\pi ,{T}_{\text{training}},{N}_{B}$ \StateInitialize critic and actor network ${Q}_{\theta},{\chi}_{\varphi}$ with weights $\theta $ and $\varphi $. \StateInitialize target critic and actor networks ${Q}_{{\theta}^{}},{\chi}_{{\varphi}^{}}$ with weights ${\theta}^{}\leftarrow \theta ,{\varphi}^{}\leftarrow \varphi $. \StateInitialize replay buffer $\mathcal{D}$. \Forepisode 1:M \StateInitialize environment and observe state ${s}_{0}$, and set initial time $t\leftarrow 0$. \Repeat\StateSelect a perturbation and add exploration noise ${\overline{s}}_{t}\leftarrow {s}_{t}+{l}_{\mathrm{\Pi}}({x}_{\varphi ,t}+{e}_{t})$. \StateFeed perturbation ${\overline{s}}_{t}$ to main agent and observe reward and state ${\overline{r}}_{t},{s}_{t+1}$. \StateAppend transition $({s}_{t},{\overline{s}}_{t},{\overline{r}}_{t},{s}_{t+1})$ to $\mathcal{D}$. \If$t\equiv 0mod{T}_{\text{training}}$ \StateSample random minibatch $B\sim \mathcal{D}$ of ${N}_{B}$ elements. \StateSet target for the ith element of $B$ to ${y}_{i}={\overline{r}}_{i}+\gamma {Q}_{{\theta}^{}}({s}_{i+1},{\chi}_{{\varphi}^{}}({s}_{i+1}))$. \StateUpdate critic by minimizing loss $\frac{1}{{N}_{B}}{\sum}_{i=1}^{{N}_{B}}{({y}_{i}{Q}_{\theta}({s}_{i},{\overline{s}}_{i}))}^{2}$. \StateUpdate actor using sampled policy gradient
$${\nabla}_{\varphi}J\approx {\frac{1}{{N}_{B}}\sum _{i=1}^{{N}_{B}}{\nabla}_{\overline{s}}{Q}_{\theta}({s}_{i},\overline{s}){\nabla}_{\varphi}{\chi}_{\varphi}({s}_{i})}_{\overline{s}=\chi ({s}_{i})}.$$ 
Slowly update the target networks:
$${\theta}^{}\leftarrow (1\tau ){\theta}^{}+\tau \theta ,{\varphi}^{}\leftarrow (1\tau ){\varphi}^{}+\tau \varphi .$$ 
$t\leftarrow t+1$. \Untilepisode is terminal. \EndFor\EndProcedure {algorithm}[!h] \[email protected]@algorithmic[1] \ProcedureAttackTraining2$\pi ,{T}_{\text{training}},{N}_{B}$ \StateInitialize critic network ${Q}_{\theta}$ and actor ${\chi}_{\varphi}$ with weights $\theta $ and $\varphi $. \StateInitialize target critic and actor networks ${Q}_{{\theta}^{}},{\chi}_{{\varphi}_{T}}$ with weights ${\theta}^{}\leftarrow \theta ,{\varphi}^{}\leftarrow \varphi $. \StateInitialize replay buffer $\mathcal{D}$. \Forepisode 1:M \StateInitialize environment and observe state ${s}_{0}$. \StateSet initial time $t\leftarrow 0$. \Repeat\StateSelect a perturbation and add exploration noise ${\overline{s}}_{t}\leftarrow s+{l}_{\mathrm{\Pi}}({x}_{\varphi ,t}+{e}_{t})$. \StateFeed perturbation ${\overline{s}}_{t}$ to main agent and observe reward and state ${\overline{r}}_{t},{s}_{t+1}$. \StateAppend transition $({s}_{t},{a}_{t},{\overline{r}}_{t},{s}_{t+1})$ to $\mathcal{D}$, where ${a}_{t}=\pi ({\overline{s}}_{t})$. \If$t\equiv 0mod{T}_{\text{training}}$ \StateSample random minibatch $B\sim \mathcal{D}$ of ${N}_{B}$ elements. \StateSet target for the ith element of $B$ to ${y}_{i}={\overline{r}}_{i}\gamma {Q}_{{\theta}^{}}({s}_{i+1},\pi ({\chi}_{{\varphi}^{}}({s}_{i+1})))$. \StateUpdate critic by minimizing loss $\frac{1}{{N}_{B}}{\sum}_{i=1}^{{N}_{B}}{({y}_{i}+{Q}_{\theta}({s}_{i},{a}_{i}))}^{2}$. \StateUpdate actor using sampled policy gradient
$${\nabla}_{\varphi}J\approx {\frac{1}{{N}_{B}}\sum _{i=1}^{{N}_{B}}{\nabla}_{a}{Q}_{\theta}({s}_{i},a){\nabla}_{\overline{s}}\pi (\overline{s}){\nabla}_{\varphi}{\chi}_{\varphi}({s}_{i})}_{a=\pi (\chi ({s}_{i})),\overline{s}=\chi ({s}_{i})}.$$ 
Slowly update the target networks:
${\theta}^{}\leftarrow (1\tau ){\theta}^{}+\tau \theta ,$  
${\varphi}^{}\leftarrow (1\tau ){\varphi}^{}+\tau \varphi .$ 
$t\leftarrow t+1$. \Untilepisode is terminal. \EndFor\EndProcedure
10.2 Gradientbased attacks
Gradientbased attacks can be adapted to MDPs by solving the optimization problem these methods are trying to solve. There are two main techniques that were developed, one by Huang et al. [8], and the other one from Pattanaik et al. [9]: the former tries to minimize the probability of the optimal action $a*$, whilst the latter maximizes the probability of taking the worst action, ${a}_{*}$, defined as the action that attains the minimum value for the unperturbed policy $\pi .$ Pseudocode for the former is given in Algorithm 10.2, while for the latter is in Algorithm 10.2. {algorithm}[!h] \[email protected]@algorithmic[1] \ProcedureAttackProcedure1$s,\pi ,{X}_{s}^{\epsilon}$ \State${a}^{*}\leftarrow {\mathrm{arg}\mathrm{max}}_{a}\pi (as)$\CommentBest action in state $s$ \State$\overline{s}\leftarrow s$ \ForAll${s}^{\prime}\in {X}_{s}^{\epsilon}$\CommentLoop trough the $\epsilon $neighbor states \If$$ \CommentEvaluate if ${a}^{*}$ is not optimal in $j$ \State$\overline{s}\leftarrow {s}^{\prime}$ \EndIf\EndFor\Statereturn $\overline{s}$ \EndProcedure
[!h] \[email protected]@algorithmic[1] \ProcedureAttackProcedure$s,\pi ,{Q}^{\pi},{X}_{s}^{\epsilon}$ \State${q}^{*}\leftarrow {Q}^{\pi}(s,{\mathrm{arg}\mathrm{max}}_{a}\pi (as))$ \State$\overline{s}\leftarrow s$ \ForAll${s}^{\prime}\in {X}_{s}^{\epsilon}$\CommentLoop trough the $\epsilon $neighbor states \State$q\leftarrow {Q}^{\pi}(s,{\mathrm{arg}\mathrm{max}}_{a}\pi (a{s}^{\prime})))$ \If$$ \CommentEvaluate if optimal action in ${s}^{\prime}$ is worse in $s$ \State${q}^{*}\leftarrow q$ \State$\overline{s}\leftarrow {s}^{\prime}$ \EndIf\EndFor\Statereturn $\overline{s}$ \EndProcedure
11 Experimental setup
Hardware and software setup. All experiments were executed on a stationary desktop computer, featuring an Intel Xeon Silver 4110 CPU, 48GB of RAM and a GeForce GTX 1080 graphical card. Ubuntu 18.04 was installed on the computer, together with Tensorflow 1.13.1 and CUDA 10.0.
Deeplearning framework. We set up our experiments within the KerasRL framework [23], together with OpenAI Gym [10] as interface to the test environments used in this paper.
All the adversarial methods used in this paper, including Gradient attacks, and DRQN, have been implemented from scratch using the KerasRL framework.
Neural network architectures.
Depending on the environment we use a different neural network architecture, as shown in table 1. The network architecture was not optimised, but was chosen large enough also to have a better comparison with Gradient methods since they depend on the dimensionality of the network, and may fail for smallsized networks. For all environments we set the frame skip to 4.

MountainCar  MountainCar  LunarLander  Cartpole  

DQN  Actor  QCritic  Actor  QCritic  DQN  
Layer 1  FC(512, ReLU)  FC(400, ReLU)  FC(400, ReLU)  FC(400, ReLU)  FC(400, ReLU)  FC(256, ReLU)  
Layer 2  FC(256, ReLU)  FC(300, ReLU)  FC(300, ReLU)  FC(300, ReLU)  FC(300, ReLU)  FC(256, ReLU)  
Layer 3  FC(64, ReLU)  FC(1, Tanh)  FC(1)  FC(2, Tanh)  FC(1)  FC(2)  
Layer 4  FC(3)           
For the adversary, instead, we used a different actorcritic architecture, although we have not tried to optimise it. The model can be seen in figure 11. In the figure $\text{dim}(S)$ denotes the dimensionality of the state space. The neural network structure for the game of Pong is the same as the one used in [8], for both the adversary and the main agent.
Data collection and preprocessing. Every data point is an average across $10$ different seeds, where for each seed we have tested an attack against all of the main agent’s policies, and each test considers $30$ episodes ($10$ in the case of FGSM for Continuous Mountaincar and LunarLander). In Table 2 is shown the number of policies for the main agent and the number of episodes collected. Density plots were preprocessed using the function seaborn.kdeplot with Gaussian kernel and bandwidth $0.05$ ($0.15$ for LunarLander). Gradientbased exploration data was preprocessed using a first order acausal Butterworth filter with critical frequency ${\omega}_{n}=0.1$.
Algorithm/Environment  MountainCar  MountainCarContinuous  Cartpole  LunarLander 

DQN  3 (1800)    3 (1800)   
DRQN  1 (600)    1 (600)   
DDPG    3 (1200)    3 (1200) 
Settings for gradientattacks. For gradientattacks we have always used the same norm as the one used by the optimal attack. For the Softmax layer we have always used a temperature of $T=1$. We compute the gradient using the method from Pattanaik et al. [9], which has proven to be better than the one proposed by Huang et al.[8]. In order to speedup the computation we do not sample the attack magnitude from a random distribution, but fix it to the maximum value $\epsilon $.
Training procedure and parameters. All the training parameters for the main agent are shown in table 3. Parameters were chosen based on previous experience, so that the policy trains in a fixed number of training steps. Policies were trained until they were within $5\%$ of the optimal value referenced in the OpenAI Gym documentation [10], across an average of the last $100$ episodes. For the game of Pong we use same as [8]
Parameters/Environment  MountainCar  MountainCar  LunarLander  Cartpole  Pong 

Action space  Discrete  Continuous  Continuous  Discrete  Discrete 
Algorithm used  DQN/DRQN  DDPG  DDPG  DQN/DRQN  DQN 
Reference value  $50.0$  $90.0$  $200.0$  $195.0$  $20$ 
Number of steps  $4\cdot {10}^{4}$  $2\cdot {10}^{4}$  $3\cdot {10}^{4}$  $22\cdot {10}^{3}$  $4\cdot {10}^{6}$ 
Number of warmup steps  $100$  ${10}^{3}$  $3\cdot {10}^{3}$  ${10}^{3}$  $5\cdot {10}^{4}$ 
Replay memory size  $4\cdot {10}^{4}$  $2\cdot {10}^{4}$  $15\cdot {10}^{3}$  $22\cdot {10}^{3}$  $5\cdot {10}^{5}$ 
Discount factor  $0.99$  $0.99$  $0.99$  $0.99$  $0.99$ 
Learning rate  ${10}^{3}$  $({10}^{4},{10}^{3})$  $({10}^{4},{10}^{3})$  ${10}^{4}$  $25\cdot {10}^{4}$ 
Optimizer  Adam  Adam  Adam  Adam  Adam 
Gradient clipping  Not set  $1.0$  $1.0$  $1.0$  Not set 
Exploration style  $\u03f5$greedy  Gaussian noise  Gaussian noise  $\u03f5$greedy  $\u03f5$greedy 
Initial exploration rate/ Std (DDPG)  $0.95$  $2$  $1$  $0.95$  $0.95$ 
Final exploration rate/ Std (DDPG)  $0.1$  ${10}^{3}$  ${10}^{3}$  $0.1$  $0.05$ 
Exploration steps  $28\cdot {10}^{3}$  $14\cdot {10}^{3}$  $21\cdot {10}^{3}$  $11\cdot {10}^{3}$  $35\cdot {10}^{5}$ 
Batch size  $32$  $64$  $192$  $256$  $32$ 
Target model update  ${10}^{2}$  ${10}^{3}$  ${10}^{3}$  $200$  ${10}^{4}$ 
Initial random steps  Training  $0$  $0$  $0$  $0$  $0$ 
Initial random steps  Testing  $10$  $10$  $10$  $10$  $50$ 
Frameskip  $4$  $4$  $4$  $4$  $4$ 
For the adversary, we used similar parameters, shown in Table 4.
Parameters/Environment  MountainCar 

LunarLander  Cartpole  Pong  

Algorithm used  DDPG  DDPG  DDPG  DDPG  DQN+Features  
Number of steps  $4\cdot {10}^{4}$  $6\cdot {10}^{4}$  $5\cdot {10}^{4}$  $4\cdot {10}^{4}$  $2\cdot {10}^{6}$  
Number of warmup steps  $4\cdot {10}^{3}$  $3\cdot {10}^{3}$  $5\cdot {10}^{3}$  $4\cdot {10}^{3}$  ${10}^{4}$  
Replay memory size  $15\cdot {10}^{3}$  $3\cdot {10}^{4}$  $15\cdot {10}^{3}$  $15\cdot {10}^{3}$  $50\cdot {10}^{4}$  
Discount factor  $0.99$  $0.99$  $0.99$  $0.99$  $0.99$  
Learning rate  $({10}^{4},{10}^{3})$  $({10}^{4},{10}^{3})$  $({10}^{4},{10}^{3})$  $({10}^{3},{10}^{2})$  $25\cdot {10}^{4}$  
Optimizer  Adam  Adam  Adam  Adam  Adam  
Gradient clipping  $1.0$  $1.0$  $1.0$  $1.0$  Not set  
Exploration style  Uniform noise  Uniform noise  Uniform noise  Uniform noise  $\u03f5$greedy  
Initial exploration settings  $[0.5,0.5]$  $2$  $[0.5,0.5]$  $[0.3,0.3]$  $0.95$  
Final exploration settings  $[{10}^{3},{10}^{3}]$  $[{10}^{3},{10}^{3}]$  $[{10}^{3},{10}^{3}]$  $[{10}^{3},{10}^{3}]$  $0.05$  
Exploration steps  $28\cdot {10}^{3}$  $48\cdot {10}^{3}$  $35\cdot {10}^{3}$  $28\cdot {10}^{3}$  $15\cdot {10}^{5}$  
Batch size  $164$  $32$  $256$  $92$  $32$  
Target model update  ${10}^{3}$  ${10}^{3}$  ${10}^{3}$  ${10}^{2}$  ${10}^{3}$  
Initial random steps  Training  $0$  $0$  $0$  $0$  $0$  
Initial random steps  Testing  $10$  $10$  $10$  $10$  $10$  
Frameskip  $4$  $4$  $4$  $4$  $4$  
Projection  Regularization term  ${10}^{6}$  ${10}^{6}$  ${10}^{6}$  ${10}^{6}$   