Spatiotemporally Constrained Action Space Attacks on Deep Reinforcement Learning Agents

  • 2019-11-19 03:36:57
  • Xian Yeow Lee, Sambit Ghadai, Kai Liang Tan, Chinmay Hegde, Soumik Sarkar
  • 0

Abstract

Robustness of Deep Reinforcement Learning (DRL) algorithms towardsadversarial attacks in real world applications such as those deployed incyber-physical systems (CPS) are of increasing concern. Numerous studies haveinvestigated the mechanisms of attacks on the RL agent's state space.Nonetheless, attacks on the RL agent's action space (AS) (corresponding toactuators in engineering systems) are equally perverse; such attacks arerelatively less studied in the ML literature. In this work, we first frame theproblem as an optimization problem of minimizing the cumulative reward of an RLagent with decoupled constraints as the budget of attack. We propose awhite-box Myopic Action Space (MAS) attack algorithm that distributes theattacks across the action space dimensions. Next, we reformulate theoptimization problem above with the same objective function, but with atemporally coupled constraint on the attack budget to take into account theapproximated dynamics of the agent. This leads to the white-box Look-aheadAction Space (LAS) attack algorithm that distributes the attacks across theaction and temporal dimensions. Our results shows that using the same amount ofresources, the LAS attack deteriorates the agent's performance significantlymore than the MAS attack. This reveals the possibility that with limitedresource, an adversary can utilize the agent's dynamics to malevolently craftattacks that causes the agent to fail. Additionally, we leverage these attackstrategies as a possible tool to gain insights on the potential vulnerabilitiesof DRL agents.

 

Quick Read (beta)

Spatiotemporally Constrained Action Space Attacks On Deep Reinforcement Learning Agents

Xian Yeow Lee,1 Sambit Ghadai,1 Kai Liang Tan,1 Chinmay Hegde,2 Soumik Sarkar1
1Department of Mechanical Engineering, Iowa State University, Ames, IA 50011
2 Tandon School of Engineering, New York University, Brooklyn, NY 11201
{xylee, sambitg, kailiang, soumiks}@iastate.edu, [email protected]
Corresponding Author
Abstract

Robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real world applications such as those deployed in cyber-physical systems (CPS) are of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent’s state space. Nonetheless, attacks on the RL agent’s action space (corresponding to actuators in engineering systems) are equally perverse, but such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the budget of attack. We propose the white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results showed that using the same amount of resources, the LAS attack deteriorates the agent’s performance significantly more than the MAS attack. This reveals the possibility that with limited resource, an adversary can utilize the agent’s dynamics to malevolently craft attacks that causes the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights on the potential vulnerabilities of DRL agents.

Spatiotemporally Constrained Action Space Attacks On Deep Reinforcement Learning Agents


Xian Yeow Lee,1 Sambit Ghadai,1 Kai Liang Tan,1 Chinmay Hegde,2 Soumik Sarkar1thanks: Corresponding Author 1Department of Mechanical Engineering, Iowa State University, Ames, IA 50011 2 Tandon School of Engineering, New York University, Brooklyn, NY 11201 {xylee, sambitg, kailiang, soumiks}@iastate.edu, [email protected]

Preprint.

Introduction

The spectrum of Reinforcement Learning (RL) applications ranges from engineering design and control (??) to business (?) and creative design (?). As RL-based frameworks are increasingly deployed in real-world, it is imperative that the safety and reliability of these frameworks are well understood. While any adversarial infiltration of these systems can be costly, the safety of DRL systems deployed in cyber-physical systems (CPS) such as industrial robotic applications and self-driving vehicles are especially safety and life-critical.

A root cause of these safety concerns is that in certain applications, the inputs to an RL system can be accessed and modified adversarially to cause the RL agent to take sub-optimal (or even harmful) actions. This is especially true when deep neural networks (DNNs) are used as key components (e.g., to represent policies) of RL agents. Recently, a wealth of results in the ML literature demonstrated that DNNs can be fooled to misclassify images by perturbing the input by an imperceptible amount (?) or by introducing specific natural looking attributes (?). Such adversarial perturbations have also demonstrated the impacts of attacks on an RL agent’s state space as shown by (?).

Besides perturbing the RL agent’s state space, it is also important to consider adversarial attacks on the agent’s action space, which in engineering systems, represents physically manipulable actuators. We note that (model-based) actuator attacks have been studied in the cyber-physical security community, including vulnerability of continuous systems to discrete time attacks (?); theoretical characteristics of undetectable actuator attacks (?); and “defense” schemes that re-stabilizes a system when under actuation attacks (?). However, the issue of adversarial attacks on a RL agent’s action space has relatively been ignored in the DRL literature. In this work, we present a suite of novel attack strategies on a RL agent’s action space.

Our contributions:

  1. 1.

    We formulate a white-box Myopic Action Space (MAS) attack strategy as an optimization problem with decoupled constraints.

  2. 2.

    We extend the formulation above by coupling constraints to compute a non-myopic attack that is derived from the agent’s state-action dynamics and develop a white-box Look-ahead Action Space (LAS) attack strategy. Empirically, we show that LAS crafts a stronger attack than MAS using the same budget.

  3. 3.

    We illustrate how these attack strategies can be used to understand a RL agent’s vulnerabilities.

  4. 4.

    We present analysis to show that our proposed attack algorithms leveraging projected gradient descent on the surrogate reward function (represented by the trained RL agent model) converges to the same effect of applying projected gradient descent on the true reward function.

Table 1: Landscape of adversarial attack strategies on RL agents. First column denotes if the attack takes into account the dynamics of the agent. Second column shows the method of computing the attacks; O denotes an optimization-based method and M denotes a model-based method where the parameters of a model needs to be learned. Last column represents if the attacks are mounted on agent’s state space (S) or action space (A).
Method Includes Dynamics Method Space of Attack
FGSM on Policies (?) X O S
ATN (?) X M S
Gradient based Adversarial Attack (?) X O S
Policy Induction Attacks (?) X O S
Strategically-Timed and Enchanting Attack (?) O, M S
NR-MDP (?) X M A
Myopic Action Space (MAS) X O A
Look-ahead Action Space (LAS) O A

Related Works

Due to the large amount of recent progress in the area of adversarial machine learning, we only focus on reviewing the most recent attack and defense mechanisms proposed for DRL models. Table 1 presents the primary landscape of this area of research to contextualize our work.

Adversarial Attacks on RL Agent

Several studies of adversarial attacks on DRL systems have been conducted recently. (?) extended the idea of FGSM attacks in deep learning to RL agent’s policies to degrade the performance of a trained RL agent. Furthermore, (?) showed that these attacks on the agent’s state space are transferable to other agents. Additionally, (?) proposed attaching an Adversarial Transformer Network (ATN) to the RL agent to learn perturbations that will deceive the RL agent to pursue an adversarial reward. While the attack strategies mentioned above are effective, they do not consider the dynamics of the agent. One exception is the work by (?) that proposed two attack strategies. One strategy was to attack the agent when the difference in probability/value of the best and worst action crosses a certain threshold. The other strategy was to combine a video prediction model that predicts future states and a sampling-based action planning scheme to craft adversarial inputs to lead the agent to an adversarial goal, which might not be scalable.

Other studies of adversarial attacks on the specific application of DRL for path-finding have also been conducted by (?) and (?), which results in the RL agent failing to find a path to the goal or planning a path that is more costly.

Robustification of RL Agents

As successful attack strategies are being developed for RL models, various works on training RL agents to be robust against attacks have also been conducted. (?) proposed that a more severe attack can be engineered by increasing the probability of the worst action rather than decreasing the probability of the best action. They showed that the robustness of an RL agent can be improved by training the agent using these adversarial examples. More recently, (?) presented a method to robustify RL agent’s policy towards action space perturbations by formulating the problem as a zero-sum Markov game. In their formulation, a separate nominal and adversary policy are trained simultaneously with a critic network being updated over the mixture of both policies to improve both adversarial and nominal policies. Meanwhile, (?) proposed a method to detect and mitigate attacks by employing a hierarchical learning framework with multiple sub-policies. They showed that the framework reduces agent’s bias to maintain high nominal rewards in the absence of adversaries. We note that other methods to defend against adversarial attacks exist, such as studies done by (?) and (?). These works are done mainly in the context of a DNN but may be extendable to DRL agents that employs DNN as policies, however discussing these works in detail goes beyond the scope of this work.

Mathematical Formulation

Preliminaries

We focus exclusively on model-free RL approaches. Below, let st and at denote the (continuous, possibly high-dimensional) vector variables denoting state and action, respectively, at time t. We assume a state evolution function, st+1=E(st,at) and let R(st,at) denote the reward signal the agent receives for taking the action at, given st. The goal of the RL agent is to choose actions that maximizes the cumulative reward, tR(st,at), given access to the trajectory, τ, comprising all past states and actions. In value-based methods, the RL agent determines action at each time step by finding an intermediate quantity called the value function that satisfies the recursive Bellman Equations. One example of such method is Q-learning (?) where the agent discovers the Q-function, defined recursively as:

Qt(st,at)=R(st,at)+maxaQt+1(E(st,at),a).

The optimal action (or “policy”) at each time step is to deterministically select the action that maximizes this Q-function conditioned on the observed state, i.e.,

at*=argmaxaQ(st,a).

In DRL, the Q-function in the above formulation is approximated via a parametric neural network Θ; methods to train these networks include Deep Q-networks (?).

In policy-based methods such as policy gradients (?), the RL agent directly maps trajectories to policies. In contrast with Q-learning, the selected action is sampled from the policy parameterized by a probability distribution, π=(a|s,Θ), such that the expected rewards (with expectations taken over π) are maximized:

π*=argmaxπE[R(τ)],at*π*.

In DRL, the optimal policy π is the output of a parametric neural network Θ, and actions at each time step are sampled; methods to train this neural network include proximal policy optimization (PPO) (?).

Threat Model

Our goal is to identify adversarial vulnerabilities in RL agents in a principled manner. To this end, We define a formal threat model, where we assume the adversary possesses the following capabilities:

  1. 1.

    Access to RL agent’s action stream. The attacker can directly perturb the agent’s nominal action adversarially (under reasonable bounds, elaborated below). The nominal agent is also assumed to be a closed-loop system and have no active defense mechanisms.

  2. 2.

    Access to RL agent’s training environment. This is required to perform forward simulations to design an optimal sequence of perturbations (elaborated below).

  3. 3.

    Knowledge of trained RL agent’s DNN. This is needed to understand how the RL agent acts under nominal conditions, and to compute gradients. In the adversarial ML literature, this assumption is commonly made under the umbrella of white-box attacks.

In the context of the above assumptions, the goal of the attacker is to choose a (bounded) action space perturbation that minimizes long-term discounted rewards. Based on how the attacker chooses to perturb actions, we define and construct two types of optimization-based attacks. We note that alternative approaches, such as training another RL agent to learn a sequence of attacks, is also plausible. However, an optimization-based approach is computationally more tractable to generate on-the-fly attacks for a target agent compared to training another RL agent (especially for high-dimensional continuous action spaces considered here) to generate attacks. Therefore, we restrict our focus on optimization-based approaches in this paper.

Myopic Action-Space (MAS) Attack Model

We first consider the case where the attacker is myopic, i.e., at each time step, they design perturbations in a greedy manner without regards to future considerations. Formally, let δt be the action space perturbation (to be determined) and b be a budget constraint on the magnitude of each δt 11 1 Physically, the budget may reflect a real physical constraint, such as the energy requirements to influence an actuation, or it may be a reflection on the degree of imperceptibility of the attack.. At each time step t, the attacker designs δt such that the anticipated future reward is minimized

minδtRadv(δt)=R(st,at+δt)+j=t+1TR(sj,aj) (1)
subject to :δtpb,
sj+1=E(sj,aj),
aj=Θ(sj)(forj=t,,T),

where p denotes the p-norm for some p1. Observe that while the attacker ostensibly solves separate (decoupled) problems at each time, the states themselves are not independent since given any trajectory, sj+1=E(sj,aj), where E(sj,aj) is the transition of the environment based on sj and aj. Therefore, 𝐑 is implicitly coupled through time since it depends heavily on the evolution of state trajectories rather than individual state visitations. Hence, the adversary perturbations solved above are strictly myopic and we consider this a static attack on the agent’s action space.

Look-ahead Action Space (LAS) Attack Model

Next, we consider the case where the attacker is able to look ahead and chooses a designed sequence of future perturbations. Using the same notation as above, let j=tt+HR(sj,aj+δj) denote the sum of rewards until a horizon parameter H, and let j=t+H+1TR(sj,aj) be the future sum of rewards from time j=t+H+1. Additionally, we consider the (concatenated) matrix Δ=[δt,δt+1δt+H] and B denote a budget parameter. The attacker solves the optimization problem:

minΔRadv(Δ)=j=tt+HR(sj,aj+δj)+j=t+H+1TR(sj,aj) (2)
subject to : Δp,qB,Δ=[δt,δt+1,,δH],
sj+1=E(sj,aj),
aj=Θ(sj)

where p,q denotes the p,q-norm (?). By coupling the objective function and constraints through the temporal dimension, the solution to the optimization problem above is then forced to take the dynamics of the agent into account in an explicit manner.

Proposed Algorithms

In this section, we present two attack algorithms based on the optimization formulations presented in previous section.

Algorithm for Mounting MAS Attacks

We observe that (1) is a nonlinear constrained optimization problem; therefore, an immediate approach to solve it is via projected gradient descent (PGD). Specifically, let 𝒮 denote the p ball of radius b in the action space. We compute the gradient of the adversarial reward, Radv w.r.t. the action space variables and obtain the unconstrained adversarial action a^t+12 using gradient descent with step size η. Next, we calculate the unconstrained perturbation δt and project in onto 𝒮 to get δt :

a^t+12=at-ηRadv(st,a^t), (3)
δt=a^t+12-at,
δt=P𝒮(δt).

Here, at represents the nominal action. We note that this approach resembles the fast gradient-sign method (FGSM) (?), although we compute standard gradients here. As a variation, we can compute multiple steps of gradient descent w.r.t the action variable prior to projection; this is analogous to the basic iterative method (or iterative FGSM) (?). The MAS attack algorithm is shown in the supplementary material.

We note that in DRL approaches, only a noisy proxy of the true reward function is available: In value-based methods, we utilize the learned Q-function (for example, a DQN) as an approximate of the true reward function, while in policy-iteration methods, we use the probability density function returned by the optimal policy as a proxy of the reward, under the assumption that actions with high probability induces a high expected reward. Since DQN selects the action based on the argmax of Q-values and policy iteration samples the action with highest probability, the Q-values/action-probability remains a useful proxy for the reward in our attack formulation. Therefore, our proposed MAS attack is technically a version of noisy projected gradient descent on the policy evaluation of the nominal agent. We elaborate on this further in the theoretical analysis section.

Algorithm for Mounting LAS Attacks

\SetAlgoLined\SetNoFillComment\DontPrintSemicolon Initialize nominal and adversary environments Enom, Eadv with same random seed
Initialize nominal agent πnom weights, θ
Initialize budget B, adversary action buffer Aadv, horizon H
\WhiletT Reset Aadv
\uIfH = 0 Reset H and B \WhilekH Compute gradient of surrogate reward Radv
Compute adversarial action a^t+12,k using Radv
Compute δt,k=a^t+12,k-at,k
Append δt,k to Aadv
Step through Eadv with at,k to get next state Compute ||δt,k||p for each element in Aadv
Project sequence of ||δt,k||p in Aadv on to ball of size B to obtain look-ahead sequence of budgets [bt,k, bt,k+1 bt,k+H]
Project each δt,k in Aadv on to look-ahead sequence of budgets computed in the previous step to get sequence [δt,k δt,k+1 δt,k+H]
Compute projected adversarial action a^t = at+δt,k
Step through Enom with a^t
Bmax(0,B-δt,k)
HH-1
\algorithmcfname 1 Look-ahead Action Space (LAS) Attack

The previous algorithm is myopic and can be interpreted as a purely spatial attack. In this section, we propose a spatiotemporal attack algorithm by solving Eq. (2) over a given time window H. Due to the coupling of constraints in time, this approach is more involved. We initialize a copy of both the nominal agent and environment, called the adversary and adversarial environment respectively. At time t, we sample a virtual roll-out trajectory up until a certain horizon t+H using the pair of adversarial agent and environment. At each time step k of the virtual roll-out, we compute action space perturbations δt,k by taking (possibly multiple) gradient updates. Next, we compute the norms of each δt,k and project the sequence of norms back onto an q-ball of radius B. The resulting projected norms at each time point now represents the individual budgets, bk, of the spatial dimension at each time step. Finally, we project the original δt,k obtained in the previous step onto the p-balls of radii bk, respectively to get the final perturbations δt,k.22 2 Intuitively, these steps represent the allocation of overall budget B across different time steps. We note that to perform virtual roll-outs at every time step t, the state of the Eadv has to be the same as the state of Enom at t. To accomplish this, we saved the history of all previous actions to re-compute the state of the Eadv at time t from t=0. While this current implementation may be time-consuming, we believe that this problem can be avoided by giving the adversary direct access to the current state of the nominal agent through platform API-level modifications; or explicit observations (in real-life problems).

In subsequent time steps, the procedure above is repeated with a reduced budget of B-t=0tδt and reduced horizon H-t until H reaches zero. The horizon H is then reset again for planning a new spatiotemporal attack. An alternative formulation could also be shifting the window without reducing its length until the adversary decides to stop the attack. However, we consider the first formulation such that we can compare the performance of LAS with MAS for an equal overall budget. This technique of re-planning the δt at every step while shifting the window of H is similar to the concept of receding horizons regularly used in optimal control (?). It is evident that using this form of dynamic re-planning mitigates the planning error that occurs when the actual and simulated state trajectories diverge due to error accumulation (?). Hence, we perform this re-planning at every t to account for this deviation. The pseudocode is provided in Alg. 1.

Figure 1: Visual comparison of MAS and LAS. In MAS, each δt is computed via multi-step gradient descent w.r.t. expected rewards for the current step. In LAS, each δt,k is computed w.r.t. the dynamics of the agent with receding horizon. An adversarial agent & environment is used to compute LAS for each step. Projection is applied to each δt in the temporal domain. The final perturbed action is obtained by adding the first δt,k to the nominal action. This is done until the end of the attack window, i.e., H-t=0.

Theoretical Analysis

We can show that projected gradient descent on the surrogate reward function (modeled by the RL agent network) to generate both MAS and LAS attacks provably converges; this can be accomplished since gradient descent on a surrogate function is akin to a noisy gradient descent on the true adversarial reward.

As described in previous sections, our MAS/LAS algorithms are motivated in terms of the adversarial reward Radv. However, if we use either DQN or policy gradient networks, we do not have direct access to the reward function, but only its noisy proxy, defined via a neural network. Therefore, we need to argue that performing (projected) gradient descent using this proxy loss function is a sound procedure. To do this, we appeal to a recent result by (?), who prove convergence of noisy gradient descent approximately converges to a local minimum. More precisely, consider a general constrained nonlinear optimization problem:

minf(x)
s.t.c(x)=0,

where c is an arbitrary (differentiable, possibly vector-valued) function encoding the constraints. Define S={x|c(x)=0} define the constraint set. We attempt to minimize the objective function via noisy (projected) gradient updates:

xt+1/2 =xt-ηf(xt)+ξt,
xt+1 =PS(xt+1/2).

Theorem 1. (Convergence of noisy projected gradients.) Assume that the noise terms {ξt} are i.i.d., satisfying E[ξ]=0,E[ξξT]=σ2Id, ξO(1) almost surely. Assume that both the constraint function c()˙ and the objective function f() is β-smooth, L-Lipschitz, and possesses ρi-Lipschitz Hessian. Assume further that the objective function f is B-bounded. Then, there exists a learning rate η=O(1) such that with high probability, in 𝑝𝑜𝑙𝑦𝑙𝑜𝑔(1/η2) iterations, noisy projected gradient descent converges to a point x^ that is 𝑝𝑜𝑙𝑦𝑙𝑜𝑔(η)-close to some local minimum of f.

In our case, f and ξ depends on the structure of the RL agent’s neural network. (Smoothness assumptions of f can perhaps be justified by limiting the architecture of the network, but the iid-ness assumption on ξ is hard to verify). As such, it is difficult to ascertain whether the assumptions of the above theorem are satisfied in specific cases. Nonetheless, an interesting future theoretical direction is to understand Lipschitz-ness properties of specific families of DQN/policy gradient agents.

We defer further analysis of the double projection step onto mixed-norm balls used in our proposed LAS algorithms to the supplementary material.

Experimental Results & Discussion

To demonstrate the effectiveness and versatility of our methods, we implemented them on RL agents with continuous action environments from OpenAI’s gym (?) as they reflect the type of action space in most practical applications 33 3 Codes and links to supplementary are available at https://github.com/xylee95/Spatiotemporal-Attack-On-Deep-RL-Agents. For policy-based methods, we trained a nominal agent using the PPO algorithm and a DoubleDQN (DDQN) agent (?) for value-based methods44 4 The only difference in implementation of policy vs value-based methods is that in policy methods, we take analytical gradients of a distribution to compute the attacks (e.g., in line 10 of Algorithm 1) while for value-based methods, we randomly sample adversarial actions to compute numerical gradients.. Additionally, we utilize Normalized Advantage Functions (?) to convert the discrete nature of DDQN’s output to continuous action space. For succinctness, we present the results of the attack strategies only on PPO agent for the Lunar-Lander environment. Additional results of DDQN agent in Lunar Lander and Bipedal-Walker environments and PPO agent in Bipedal-Walker, Mujoco Hopper, Half-Cheetah and Walker environments are provided in the supplementary materials. As a baseline, we implemented a random action space attack, where a random perturbation bounded by the same budget b is applied to the agent’s action space at every step. For MAS attacks, we implemented two different spatial projection schemes, 1 projection based on (?) that represents a sparser distribution and 2 projection that represents a denser distribution of attacks. For LAS attacks, all combinations of spatial and temporal projection for 1 and 2 were implemented.

Figure 2: Box plots of PPO Lunar Lander showing average cumulative reward across 10 episodes for each attack methods. The top Figs. (a-c) have H=5 with B= 3, 4, and 5 respectively. For a direct comparison, corresponding MAS budgets are taken as b=B/H. Similarly, Figs.(d-f) have the same B values but with H=10. An obvious trend is that as B increases, the effectiveness of LAS over MAS becomes more evident as seen in the decreasing trend of the reward.

Comparison of MAS and LAS Attacks

Fig. 2 shows distributions of cumulative rewards obtained by the PPO agent across ten episodes in a Lunar Lander environment, with each subplot representing different combinations of budget, B and horizon, H. Top three subplots show experiments with a H value of 5 time steps and b value of 3, 4, and 5 from left to right respectively. Bottom row of figures show a similar set of experiments but with a H value of 10. For a direct comparison between MAS and LAS attacks with equivalent budgets across time, we have assigned the corresponding MAS budget values as b=B/H. This assumes that the total budget B is allocated uniformly across every time step for a given H, while LAS has the flexibility to allocate the attack budget non-uniformly in the same interval, conditioned on the dynamics of the agent.

We note that keeping H constant while increasing B provides both MAS and LAS with a higher budget to inject δt to the nominal actions. We observe that with a low budget of 3 (Fig. 2a), only LAS is successful in attacking the RL agent, as seen by the corresponding decrease in rewards. With a higher budget of 5 (Fig. 2c), MAS has a more apparent effect on the performance of the RL agent while LAS reduces the performance of the agent severely.

With B constant, increasing H allows the allocated B to be distributed along the increased time horizon. In other words, LAS virtually looks-ahead further into the future. In the most naive case, a longer horizon dilutes the severity of each δt in compared to shorter horizons. By comparing similar budget values of different horizons (i.e horizons 5 and 10 for budget 3, Fig. 2a and Fig. 2d respectively), attacks for H=10 are generally less severe than their H=5 counterparts. For all B and H combinations, we observe that MAS attacks are generally less effective compared to LAS. We note that this is a critical result of the study as most literature on static attacks have shown that the attacks can be ineffective below a certain budget. Here, we demonstrate that while MAS attacks can seemingly look ineffective for a given budget, a stronger and more effective attack can essentially be crafted using LAS with the same budget.

In the following sections, we further study the difference between MAS and LAS as well as demonstrate how the attacks can be utilized to understand the vulnerabilities of the agent in different environments.

Figure 3: Time vs Attack magnitude along action dimension for LAS attacks with B=4,H=5 in Lunar Lander environment with PPO RL agent. (a) Variation of attack magnitude along Up-Down and Left-Right action dimensions through different episodes. In all episodes except episode 2, Up-Down action is more heavily attacked than Left-Right. (b) Variation of attack magnitude through time for episode 1 of (a). After 270 steps, the agent is not attacked in the Left-Right dimension, but heavily attacked in Up-Down directions. (c) Actual rendering of Lunar Lander environment for episode 1 of (a) corresponding to (b). Frame 1-5 are strictly increasing time steps showing trajectory of the RL agent controlling the lunar lander.

Action Dimension Decomposition of LAS Attacks

Fig. 3 shows action dimension decomposition of LAS attacks. The example shown in Fig. 3 is the result of 2 projection in action space with 2 projection in time. From Fig. 3a, we observe that through all the episodes of LAS attacks, one of the action dimension (i.e., Up - Down direction of lunar lander) is consistently perturbed more, i.e., accumulates more attack, than Left-Right direction.

Fig. 3b shows a detailed view of action dimension attacks for an episode (Episode 1). It is evident from the figure that the Up-Down actions of the lunar lander are more prone to attacks throughout the episode than Left-Right actions. Additionally, Left-Right action attacks are restricted after certain time steps and only the Up-Down actions are attacked further. Fig. 3c further corroborates the observation in the Lunar Lander environment. As the episode progresses in Fig. 3c, the lunar lander initially lands on the ground in frame 3, but lifts up and hovers until the episode ends in frame 5. This observation supports the fact that the proposed attacks are effective in perturbing the action dimensions in an optimal manner; as in this case, perturbing the lunar lander in the horizontal direction will not further decrease rewards. On the other hand, hovering the lunar lander will cause the agent to use more fuel, which consequently decreases the total reward. From these studies, it can be concluded that LAS attacks (correlated with projections of actions in time) can clearly isolate vulnerable action dimension(s) of the RL agent to mount a successful attack.

Ablation Study of Horizon and Budget

Lastly, we performed multiple ablation studies to compare the effectiveness of LAS and MAS attacks. While we have observed that LAS attacks are generally stronger than MAS, we hypothesize that there will be an upper limit to LAS’s advantage as the allowable budget increases. We take the difference of each attack’s reduction in rewards (i.e. attack - nominal) and visualize how much rewards LAS reduces as compared to MAS under different conditions of B and H. In the case of PPO in Lunar Lander, we observe that the reduction in rewards of LAS vs MAS becomes less drastic as budget increases, hence showing that LAS has diminishing returns as both MAS and LAS saturates at higher budgets. We defer detailed discussions and additional figures of the ablation study to the supplementary materials.

Conclusion & Future Work

In this study, we present two novel attack strategies on an RL agent’s action space; a myopic attack (MAS) and a non-myopic attack (LAS). The results show that LAS attacks, that were crafted with explicit use of the agent’s dynamics information, are more powerful than MAS attacks. Additionally, we observed that applying LAS attacks on RL agents reveals the possible vulnerable actuators of an agent, as seen by the non-uniform distribution of attacks on certain action dimensions. This can be leveraged as a tool to identify the vulnerabilities and plan a mitigation strategy under similar attacks. Possible future works include extending the concept of LAS attacks to state space attacks where the agent’s observations are perturbed instead of the agent’s actions while taking into account the dynamics of the agent. Additionally, while we did not focus on the imperceptibility and deployment aspects of the proposed attacks in this study, defining a proper metric in terms of detectability in action space and optimizing the budget to remain undetected for different environments will be a future research direction.

Acknowledgement

This work was supported in part by NSF grants CNS-1845969 and CCF-2005804, and AFOSR YIP Grant FA9550-17-1-0220.

References

Appendix A Pseudocode of MAS Attack

\SetAlgoLined\SetNoFillComment\DontPrintSemicolon Initialize nominal environment, Enom, nominal agent πnom with weights, θ
Initialize budget b
\Whilet T Compute gradient of surrogate reward Radv
Compute adversarial action a^t+12 using Radv
Compute δt=a^t+12-at, project δt onto ball of size b to get δt
Compute projected adversarial action a^t = at+δt
Step through Enom with a^t to get next state
\algorithmcfname 2 Myopic Action Space (MAS) Attack

Appendix B Analysis

Projections onto Mixed-norm Balls

The Look-ahead Action Space (LAS) Attack Model described above requires projecting onto the mixed-norm p,q-ball of radius B in a vector space. Below, we show how to provably compute such projections in a computationally efficient manner. For a more complete treatment, we refer to (?). Recall the definition of the (p,q)-norm. Let Xm×n be partitioned into sub-vectors xi,i[n] of length-m. Then,

Xp,q:=(i=1nxiqp)1/p.

Due to scale invariance of norms, we can assume B=1. We consider the following special cases:

  1. 1.

    p=1,q=1: this reduces to the case of the ordinary 1-norm in mn. Projection onto the unit 1-ball can be achieved via soft-thresholding every entry in X:

    PS(Xi,j)=sign(Xi,j)(|Xi,j|-λ)+,

    where λ>0 is a KKT parameter that can be discovered by a simple sorting the (absolute) values of X. See (?).

  2. 2.

    p=2,q=2: this reduces to the case of the isotropic 2-norm in mn. Projection onto the unit 2-ball can be achieved by simple normalization:

    PS(Xi,j)=Xi,j/X2,2.
  3. 3.

    p=1,q=2: this is a “hybrid” combination of the above two cases, and corresponds to the procedure that we use in mounting our LAS attack. Projection onto this ball can be achieved by a three-step method. First, we compute the n-dimensional vector, v, of column-wise 2-norms. Then, we project v onto the unit 1-ball; essentially, this enables us to “distribute” the (unit) budget across columns. Since 1-projection is achieved via soft-thresholding, a number of coordinates of this vector are zeroed out, and others undergo a shrinkage. Call this (sparsified) projected vector vp. Finally, we perform an 2 projection, i.e., we scale each column of X by dividing by its norm and multiplying by the entries of vp:

    PS(Xi,j)=Xi,jxj2vp(i).

Appendix C Additional Experiments

Comparison of Attacks Mounted on PPO Agent in Bipedal-Walker Environment

Figure 4: Boxplots showing cumulative rewards of PPO agent in Bipedal-Walker environment under different attack strategies across 10 different episodes. Plots (a), (b) and (c) are attacked with a horizon of 5 time steps with budget value of 7, 8, and 9 respectively. (d), (e), and (f) are attacked with a horizon value of 10 time steps with budget value B of 7, 8, and 9.

The results in Fig. 4 depicts the comparison between the MAS and LAS attacks applied on a PPO agent in the Bipedal-Walker environment. A similar trend is observed where LAS attacks are generally more severe than MAS attacks. We acknowledge that in this environment, MAS attacks are sometimes effective in reducing the rewards as well. However, this can be attributed to the Bipedal Walker having more dimensions (4 dimensions) in terms of it’s action space in compared to the Lunar-Lander (2 dimensions) environment. In addition, the actions of the Bipedal Walker is also highly coupled, in compared to the actions of the Lunar Lander. Hence, the agent for Bipedal-Walker is more sensitive towards perturbations, which explains the increase efficacy of MAS attacks.

Comparison of Attacks Mounted on DDQN Agent in Lunar Lander and Bipedal-Walker Environments

In Figures 5 and 6, we present additional results on the efficacy of different attack strategies for a DoubleDQN agent trained in the Lunar Lander and Bipedal Walker environment. An interesting observation is that in both environment, the effects of the attacks are more severe for the DDQN agent in compared to the PPO agent as seen in the box-plots. We conjecture that this is due to the deterministic nature of DDQN method where the optimal action is unique. In contrast, the PPO agent was trained with a certain stochastic in the process, which might explain the additional robustness of PPO agent to noise or perturbations. Nevertheless, in the DDQN agent, we still observe a similar trend where given the same about of budget, more severe attacks can be crafted using the LAS strategy in compared to MAS or random attacks.

Figure 5: DDQN Lunar Lander box plots showing average cumulative reward across 10 episodes for each attack method. Plots (a), (b) and (c) are attacked with a horizon of 5 time steps with budget value of 3, 4, and 5 respectively. (d), (e), and (f) are attacked with a horizon value of 10 time steps with budget value of 3, 4, and 5 respectively. Given the same horizon and budget, it is evident LAS attacks are more severe than MAS attacks, which in turn are generally more effective than random attacks.
Figure 6: DDQN Bipedal Walker box plots showing average cumulative reward across 10 episodes for each attack method. Plots (a), (b) and (c) are attacked with a horizon of 5 time steps with budget value of 3, 4, and 5 respectively. (d), (e), and (f) are attacked with a horizon value of 10 time steps with budget value of 3, 4, and 5 respectively. Given the same horizon and budget, it is evident LAS attacks are more severe than MAS attacks, which in turn are generally more effective than random attacks.

Comparison of Attacks Mounted on PPO Agent Mujoco Control Environments

Figure 7: Box plots depicting distribution of rewards obtained by PPO agent in Mujoco Hopper environment under different attacks. In this set of experiments, we used a values of B = 2, 3, and 4 for values of H=5 and 10. We observe similar reward trends where LAS attacks are generally stronger than MAS attacks across all values of budget and horizon.
Figure 8: Box plots depicting distribution of rewards obtained by PPO agent in Mujoco Half-Cheetah environment under different attacks. In this set of experiments, we used a values of B = 2, 3, and 4 for values of H=5 and 10. In this environment, the agent is more sensitive to the attacks for the budget values tested, where even a B=2 significantly decreases the distribution of rewards for MAS attacks. Nonetheless, we still observe similar reward trends where LAS attacks are generally stronger than MAS attacks across all values of budget and horizon. This suggests that the agent in this environment can easily be adversarially perturbed even with limited budget.
Figure 9: Box plots depicting distribution of rewards obtained by PPO agent in Mujoco Walker environment under different attacks. In this set of experiments, we used same values of B = 2, 3, and 4 for values of H=5 and 10 as the other Mujoco environments tested above. Similar reward trends where LAS attacks decreases rewards more than MAS attacks are also observed across all values of budget and horizon. In this environment, the shifts in reward distribution are less drastic, especially for B=2, when comparing different attacks. This suggests that the agent might be more robust in this case and information from the environment dynamics is actually required to significantly affect the agent as seen in plot (d).

In addition to the attacks mounted on the agents in the 2 environments above, we also compare the effect of the attacks on a PPO agent trained in 3 different Mujoco control environments. Figures 789 illustrates the distribution of rewards obtained by the agent in the Hopper, Half-Cheetah and Walker environments respectively. In all three environments, we observe that LAS attacks are generally more effective in reducing the rewards of the agent across different values of budget and horizon, which reinforces the fact that LAS attacks are stronger than MAS attacks. However, it is also interesting to note that agents in different environments have different sensitivity to the budget of attacks. In Hopper(Figure 7) and Walker(Figure 9), we see that MAS attacks have the effect of shifting the median and increasing the variance of the reward distortion with respect to the nominal, which highlights the fact that there are some episodes which MAS attack fails to affect the agent. In contrast, in the Half-Cheetah environment(Figure 8), we see that MAS attacks shifts the whole distribution of rewards downwards, showing that the agent is more sensitive in to MAS in this environment as compared to the other two environments. This suggests that the budget and horizon values are also hyper-parameters which should be tuned according to the environment.

Comparison of Temporal Projections in LAS

In this section, we present additional visualizations to further understand why 2 temporal projections results in more severe attacks in compared to 1 temporal projections. Figure 10,  11,  12 and 13 presents the δt usage plot across 100 time steps for both PPO and DDQN in the Lunar Lander and Bipedal-Walker environment. The left subplot represents 1 projections in the spatial dimension while the right subplot represents 2 projections in the spatial dimensions. These plots directly compare the difference in amount of δt used between 1 and 2 temporal projections for both 1 and 2 spatial attacks.

In most cases with the exception of Figure 12, we see a clear trend that 1 temporal projections results in a sparser but more concentrated peaks of δ utilization (corresponding to a few instance of strong attacks). In contrast, 2 temporal projections results in a more distributed but frequent form of of δ utilization (corresponding to more frequent but weak instances of attacks). We note that while 1 projections produces stronger attacks, there is a diminishing return on allocating more attacks to a certain time point as after a certain limit. Hence, this explains the weaker effect of 1 temporal projections since it concentrates the attacks to a few points but ultimately gives time for the agent to recover. In contrast, 2 temporal projections distributes the attacks more frequently that causes the agent to follow a diverging trajectory that is hard to recover from.

As an anecdotal example in the Lunar Lander environment, we observe that attacks with 1 temporal projection tend to turn off the vertical thrusters of the lunar lander. However, due to the sparsity of the attacks, the RL agent could possibly be fire the upward thrusters in time to prevent a free-fall landing. With 2 temporal projections, the agent is attacked continuously. Consequently, the agent has no chance to return to a nominal state and quickly diverges towards a terminal state.

Ablation Study

For this section, we present an ablation study to investigate the effect of different budget and horizon parameters on the effectiveness of LAS vs MAS. As mentioned in the main manuscript, we take the difference of each attack’s reduction in rewards (i.e. attack - nominal) and visualize how much rewards LAS reduces as compared to MAS under different conditions of B and H. Fig 14 illustrates the ablation study of a PPO agent in Lunar Lander. The figure is categorized by different spatial projections, where 1 spatial projections are shown on the left figure while 2 spatial projections are shown on the right. Both subplots are shown for 2 time projection attacks. Each individual subplot shows three different lines with different H, with each line visualizing the change in mean cumulative reward as budget increases along the x-axis. As budget increases, attacks in both 1 and 2 spatial projection shows a monotonic decrease in cumulative rewards. Attacks in each spatial projection with a H value of 5 shows different trends, where 2 decreases linearly with increasing budget while 1 became stagnant after B value of 3. This can be attributed to the fact that the attacks are more sparsely distributed in 1 attacks, causing most of the perturbations to be distributed into one action dimension. Thus, as budget increases, we see a diminishing return of LAS attacks since attacking a single action dimension beyond a certain limit doesn’t decrease reward any further. The study was also conducted for PPO Bipedal-Walker and both DDQN Lunar Lander and Bipedal-Walker as shown in Figure 16,  15 and  17 respectively. We only consider attacks in 2 temporal projection attacks for both 1 and 2 spatial projections. At a glance, we see different trends across each figures due to the different environment dynamics. However, in all cases, the decrease in reduction of rewards is always lesser than or equals to zero, which infers that LAS attacks are at least as effective than MAS attacks. We observed that attacks for horizon value of 5 becomes ineffective after a certain budget value. This shows that after some budget value, MAS attacks are as effective as LAS attacks because LAS might be operating at maximum attack capacity. Interesting to note that Bipedal-Walker for PPO needed a higher budget compared to the DDQN counterpart due to the PPO being more robust to attacks.

Effect of Horizon Parameter in LAS

In this section, we further describe the effect of horizon parameter H on the effectiveness of LAS attacks that we empirically observed. H defines a fixed time horizon (e.g., steps in DRL environments) to expend a given budget B. For a fixed B and short H, LAS favors injecting stronger perturbations in each step. Hence, we would intuitively hypothesize that given a shorter H, the severity of LAS attacks will increase as H decrease, as shown by the reduction in rewards between MAS and LAS in Figure 3 of the main paper. In most cases, the reduction is negative, hence showing that LAS attacks are indeed more severe. However in some cases as shown in Figure 8, 9 & 10, a shorter H does result in LAS being not as effective as a longer H (though still stronger than MAS as evident from negative values of y-axis). This can be attributed to the nonlinear reward function of the environments and consequent failure modes of the agent. For example, attacks on Lunar Lander PPO agent causes failure by constantly firing thruster engines to prevent Lunar Lander from landing, hence accumulating negative rewards over a long time. In contrast, attacks on the DQN agent causes Lunar Lander to crash immediately, hence terminating the episode before too much negative reward is accumulated. Thus, while the effect of H on LAS attacks sometimes do not show a consistent trend, we it is a key parameter that can be tuned to control the failure modes of the RL agent.

Action Space Dimension Decomposition

We provide additional results on using the LAS attack scheme as a tool to understand the RL agent’s action space vulnerabilities for a DoubleDQN agent in both Lunar Lander and Bipedal Walker environment. It is interesting to note in Figure 18, even with agents trained with a different philosophy (value-based vs policy based, shown in main manuscript), the attack scheme still distributes the attack to a similar dimension (Up-Down action for Lunar Lander), which highlights the importance of the that particular dimension. In Figure 19, we show the outcome of LAS attack scheme on Bipedal Walker environment having four action space dimensions. The four joints of the bipedal walker, namely Left Hip, Left Knee, Right Hip and Right Knee are attacked in this case, and from Figure 19, we see that the left hip is attacked more than any other action dimension in most of the episodes. This supports our inference that LAS attacks can bring out the vulnerabilities in the action space dimensions (actuators in case of CPS RL agents).

Figure 10: Comparison of δt used across time for a single episode in PPO Lunar Lander for different spatial projections with 1 and 2 temporal projection. Left plot illustrates 1 spatial projection and right plot shows 2 spatial projection. In both plots, the magnitude of attacks with 1 temporal projection attacks dropped to zero from time step 30. However, the magnitude of attacks in 2 temporal projection remains high through the episode. Hence, we observe that 1 temporal projections essentially allows the agent sufficient time to recover from earlier attacks. In the case of Lunar Lander, the agent might prevent a severe crash while landing or recover from a horizontal thruster boost induced by the attack.
Figure 11: δt usage plot of DDQN agent in Lunar Lander across time for a single episode. Left subplot illustrates 1 spatial attacks while right subplot shows 2 spatial attacks. In each subplot, attacks with 1 time projection attacks exhibit periodic spiked patterns while 2 time projection attacks are constantly activated with δt never reaching zero. Since 1 time projection attacks periodically inject δt into nominal actions of the DDQN agent, the agent has the opportunity to recover from the attacks.
Figure 12: δt usage plot for PPO agent in Bipedal-Walker across time for a single episode. Left subplot illustrates 1 spatial attacks while right subplot shows 2 spatial attacks. In this figure, both 1 and ell2 are seemingly well distributed, although the magnitude of δt used for both projection schemes are evidently lesser than the other agents in other environments. We speculate that this is due to the nature of the policy learnt by the PPO agent. In this environment, the PPO agent has learnt a policy to operate the Bipedal Walker by bending a knee joint and using the other knee joint to drag itself forward. Hence, in this situation, the agent has learnt a strong stable walking gait and there is not much room for δt to be applied.
Figure 13: δt usage plot for DDQN agent in Bipedal-Walker across time for a single episode. Left subplot illustrates 1 spatial attacks while the right subplot shows 2 spatial attacks. A trend similar to Fig 11 is observed here where the δt are utilized in sparse and concentrated instances for 1 temporal projections in compared to 2 temporal projections.
Figure 14: Ablation study for PPO Lunar Lander showing effectiveness of attacks comparing LAS with MAS for 2 time projection attacks. Left and right subplot shows 1 and 2 spatial projection respectively. Each subplot contains different lines representing different horizon, where budget is incrementally increased along each horizon. Both 1 and 2 spatial projection scales monotonically with increasing budget.
Figure 15: Ablation study for DDQN Lunar Lander showing effectiveness of attacks comparing LAS with MAS. Left subplot shows 1 spatial attacks while right subplot shows 2 spatial attacks. Both plots are in 2 temporal projection attacks. Each subplot contains different lines representing different horizon, where budget is incrementally increased along each horizon.
Figure 16: Ablation study for PPO Bipedal-Walker showing effectiveness of attacks comparing LAS with MAS. Left subplot shows 1 spatial attacks while right subplot shows 2 spatial attacks. Both plots are in 2 temporal projection attacks. Each subplot contains different lines representing different horizon, where budget is incrementally increased along each horizon.
Figure 17: Ablation study for DDQN Bipedal-Walker showing effectiveness of attacks comparing LAS with MAS. Left subplot shows 1 spatial attacks while right subplot shows 2 spatial attacks. Both plots are in 2 temporal projection attacks. Each subplot contains different lines representing different horizon, where budget is incrementally increased along each horizon.
Figure 18: Magnitude of attack with respect to different episodes for Lunar Lander environment with DoubleDQN RL agent. The two colors (Blue and Green) in the bar plots represent the attacks allocated to the two action space dimensions, Up-Down action and Left-Right action, respectively. The attack schema is LAS attack with a budget of 4 and Horizon of 5.
Figure 19: Magnitude of attack with respect to different episodes for Bipedal Walker environment with DoubleDQN RL agent. The four colors (Blue, Green, Red and Cyan) in the bar plots represent the attacks allocated to the four action space dimensions, Left Hip, Left Knee, Right Hip and Right Knee actions, respectively. The attack schema is LAS attack with a budget of 4 and Horizon of 5.