Abstract
Robustness of Deep Reinforcement Learning (DRL) algorithms towardsadversarial attacks in real world applications such as those deployed incyberphysical systems (CPS) are of increasing concern. Numerous studies haveinvestigated the mechanisms of attacks on the RL agent's state space.Nonetheless, attacks on the RL agent's action space (AS) (corresponding toactuators in engineering systems) are equally perverse; such attacks arerelatively less studied in the ML literature. In this work, we first frame theproblem as an optimization problem of minimizing the cumulative reward of an RLagent with decoupled constraints as the budget of attack. We propose awhitebox Myopic Action Space (MAS) attack algorithm that distributes theattacks across the action space dimensions. Next, we reformulate theoptimization problem above with the same objective function, but with atemporally coupled constraint on the attack budget to take into account theapproximated dynamics of the agent. This leads to the whitebox LookaheadAction Space (LAS) attack algorithm that distributes the attacks across theaction and temporal dimensions. Our results shows that using the same amount ofresources, the LAS attack deteriorates the agent's performance significantlymore than the MAS attack. This reveals the possibility that with limitedresource, an adversary can utilize the agent's dynamics to malevolently craftattacks that causes the agent to fail. Additionally, we leverage these attackstrategies as a possible tool to gain insights on the potential vulnerabilitiesof DRL agents.
Quick Read (beta)
Spatiotemporally Constrained Action Space Attacks On Deep Reinforcement Learning Agents
Abstract
Robustness of Deep Reinforcement Learning (DRL) algorithms towards adversarial attacks in real world applications such as those deployed in cyberphysical systems (CPS) are of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent’s state space. Nonetheless, attacks on the RL agent’s action space (corresponding to actuators in engineering systems) are equally perverse, but such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the budget of attack. We propose the whitebox Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the whitebox Lookahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results showed that using the same amount of resources, the LAS attack deteriorates the agent’s performance significantly more than the MAS attack. This reveals the possibility that with limited resource, an adversary can utilize the agent’s dynamics to malevolently craft attacks that causes the agent to fail. Additionally, we leverage these attack strategies as a possible tool to gain insights on the potential vulnerabilities of DRL agents.
Spatiotemporally Constrained Action Space Attacks On Deep Reinforcement Learning Agents
Xian Yeow Lee,^{1} Sambit Ghadai,^{1} Kai Liang Tan,^{1} Chinmay Hegde,^{2} Soumik Sarkar^{1}^{†}^{†}thanks: Corresponding Author ^{1}Department of Mechanical Engineering, Iowa State University, Ames, IA 50011 ^{2} Tandon School of Engineering, New York University, Brooklyn, NY 11201 {xylee, sambitg, kailiang, soumiks}@iastate.edu, [email protected]
Preprint.
Introduction
The spectrum of Reinforcement Learning (RL) applications ranges from engineering design and control (?; ?) to business (?) and creative design (?). As RLbased frameworks are increasingly deployed in realworld, it is imperative that the safety and reliability of these frameworks are well understood. While any adversarial infiltration of these systems can be costly, the safety of DRL systems deployed in cyberphysical systems (CPS) such as industrial robotic applications and selfdriving vehicles are especially safety and lifecritical.
A root cause of these safety concerns is that in certain applications, the inputs to an RL system can be accessed and modified adversarially to cause the RL agent to take suboptimal (or even harmful) actions. This is especially true when deep neural networks (DNNs) are used as key components (e.g., to represent policies) of RL agents. Recently, a wealth of results in the ML literature demonstrated that DNNs can be fooled to misclassify images by perturbing the input by an imperceptible amount (?) or by introducing specific natural looking attributes (?). Such adversarial perturbations have also demonstrated the impacts of attacks on an RL agent’s state space as shown by (?).
Besides perturbing the RL agent’s state space, it is also important to consider adversarial attacks on the agent’s action space, which in engineering systems, represents physically manipulable actuators. We note that (modelbased) actuator attacks have been studied in the cyberphysical security community, including vulnerability of continuous systems to discrete time attacks (?); theoretical characteristics of undetectable actuator attacks (?); and “defense” schemes that restabilizes a system when under actuation attacks (?). However, the issue of adversarial attacks on a RL agent’s action space has relatively been ignored in the DRL literature. In this work, we present a suite of novel attack strategies on a RL agent’s action space.
Our contributions:

1.
We formulate a whitebox Myopic Action Space (MAS) attack strategy as an optimization problem with decoupled constraints.

2.
We extend the formulation above by coupling constraints to compute a nonmyopic attack that is derived from the agent’s stateaction dynamics and develop a whitebox Lookahead Action Space (LAS) attack strategy. Empirically, we show that LAS crafts a stronger attack than MAS using the same budget.

3.
We illustrate how these attack strategies can be used to understand a RL agent’s vulnerabilities.

4.
We present analysis to show that our proposed attack algorithms leveraging projected gradient descent on the surrogate reward function (represented by the trained RL agent model) converges to the same effect of applying projected gradient descent on the true reward function.
Method  Includes Dynamics  Method  Space of Attack 

FGSM on Policies (?)  X  O  S 
ATN (?)  X  M  S 
Gradient based Adversarial Attack (?)  X  O  S 
Policy Induction Attacks (?)  X  O  S 
StrategicallyTimed and Enchanting Attack (?)  ✓  O, M  S 
NRMDP (?)  X  M  A 
Myopic Action Space (MAS)  X  O  A 
Lookahead Action Space (LAS)  ✓  O  A 
Related Works
Due to the large amount of recent progress in the area of adversarial machine learning, we only focus on reviewing the most recent attack and defense mechanisms proposed for DRL models. Table 1 presents the primary landscape of this area of research to contextualize our work.
Adversarial Attacks on RL Agent
Several studies of adversarial attacks on DRL systems have been conducted recently. (?) extended the idea of FGSM attacks in deep learning to RL agent’s policies to degrade the performance of a trained RL agent. Furthermore, (?) showed that these attacks on the agent’s state space are transferable to other agents. Additionally, (?) proposed attaching an Adversarial Transformer Network (ATN) to the RL agent to learn perturbations that will deceive the RL agent to pursue an adversarial reward. While the attack strategies mentioned above are effective, they do not consider the dynamics of the agent. One exception is the work by (?) that proposed two attack strategies. One strategy was to attack the agent when the difference in probability/value of the best and worst action crosses a certain threshold. The other strategy was to combine a video prediction model that predicts future states and a samplingbased action planning scheme to craft adversarial inputs to lead the agent to an adversarial goal, which might not be scalable.
Other studies of adversarial attacks on the specific application of DRL for pathfinding have also been conducted by (?) and (?), which results in the RL agent failing to find a path to the goal or planning a path that is more costly.
Robustification of RL Agents
As successful attack strategies are being developed for RL models, various works on training RL agents to be robust against attacks have also been conducted. (?) proposed that a more severe attack can be engineered by increasing the probability of the worst action rather than decreasing the probability of the best action. They showed that the robustness of an RL agent can be improved by training the agent using these adversarial examples. More recently, (?) presented a method to robustify RL agent’s policy towards action space perturbations by formulating the problem as a zerosum Markov game. In their formulation, a separate nominal and adversary policy are trained simultaneously with a critic network being updated over the mixture of both policies to improve both adversarial and nominal policies. Meanwhile, (?) proposed a method to detect and mitigate attacks by employing a hierarchical learning framework with multiple subpolicies. They showed that the framework reduces agent’s bias to maintain high nominal rewards in the absence of adversaries. We note that other methods to defend against adversarial attacks exist, such as studies done by (?) and (?). These works are done mainly in the context of a DNN but may be extendable to DRL agents that employs DNN as policies, however discussing these works in detail goes beyond the scope of this work.
Mathematical Formulation
Preliminaries
We focus exclusively on modelfree RL approaches. Below, let ${s}_{t}$ and ${a}_{t}$ denote the (continuous, possibly highdimensional) vector variables denoting state and action, respectively, at time $t$. We assume a state evolution function, ${s}_{t+1}=E({s}_{t},{a}_{t})$ and let $R({s}_{t},{a}_{t})$ denote the reward signal the agent receives for taking the action ${a}_{t}$, given ${s}_{t}$. The goal of the RL agent is to choose actions that maximizes the cumulative reward, ${\sum}_{t}R({s}_{t},{a}_{t})$, given access to the trajectory, $\tau $, comprising all past states and actions. In valuebased methods, the RL agent determines action at each time step by finding an intermediate quantity called the value function that satisfies the recursive Bellman Equations. One example of such method is Qlearning (?) where the agent discovers the Qfunction, defined recursively as:
$${Q}_{t}({s}_{t},{a}_{t})=R({s}_{t},{a}_{t})+\underset{{a}^{\prime}}{\mathrm{max}}{Q}_{t+1}(E({s}_{t},{a}_{t}),{a}^{\prime}).$$ 
The optimal action (or “policy”) at each time step is to deterministically select the action that maximizes this Qfunction conditioned on the observed state, i.e.,
$${a}_{t}^{*}=\mathrm{arg}\underset{a}{\mathrm{max}}Q({s}_{t},a).$$ 
In DRL, the Qfunction in the above formulation is approximated via a parametric neural network $\mathrm{\Theta}$; methods to train these networks include Deep Qnetworks (?).
In policybased methods such as policy gradients (?), the RL agent directly maps trajectories to policies. In contrast with Qlearning, the selected action is sampled from the policy parameterized by a probability distribution, $\pi =\mathbb{P}(as,\mathrm{\Theta})$, such that the expected rewards (with expectations taken over $\pi $) are maximized:
$${\pi}^{*}=\mathrm{arg}\underset{\pi}{\mathrm{max}}E[R(\tau )],{a}_{t}^{*}\sim {\pi}^{*}.$$ 
In DRL, the optimal policy $\pi $ is the output of a parametric neural network $\mathrm{\Theta}$, and actions at each time step are sampled; methods to train this neural network include proximal policy optimization (PPO) (?).
Threat Model
Our goal is to identify adversarial vulnerabilities in RL agents in a principled manner. To this end, We define a formal threat model, where we assume the adversary possesses the following capabilities:

1.
Access to RL agent’s action stream. The attacker can directly perturb the agent’s nominal action adversarially (under reasonable bounds, elaborated below). The nominal agent is also assumed to be a closedloop system and have no active defense mechanisms.

2.
Access to RL agent’s training environment. This is required to perform forward simulations to design an optimal sequence of perturbations (elaborated below).

3.
Knowledge of trained RL agent’s DNN. This is needed to understand how the RL agent acts under nominal conditions, and to compute gradients. In the adversarial ML literature, this assumption is commonly made under the umbrella of whitebox attacks.
In the context of the above assumptions, the goal of the attacker is to choose a (bounded) action space perturbation that minimizes longterm discounted rewards. Based on how the attacker chooses to perturb actions, we define and construct two types of optimizationbased attacks. We note that alternative approaches, such as training another RL agent to learn a sequence of attacks, is also plausible. However, an optimizationbased approach is computationally more tractable to generate onthefly attacks for a target agent compared to training another RL agent (especially for highdimensional continuous action spaces considered here) to generate attacks. Therefore, we restrict our focus on optimizationbased approaches in this paper.
Myopic ActionSpace (MAS) Attack Model
We first consider the case where the attacker is myopic, i.e., at each time step, they design perturbations in a greedy manner without regards to future considerations. Formally, let ${\delta}_{t}$ be the action space perturbation (to be determined) and $b$ be a budget constraint on the magnitude of each ${\delta}_{t}$ ^{1}^{1} 1 Physically, the budget may reflect a real physical constraint, such as the energy requirements to influence an actuation, or it may be a reflection on the degree of imperceptibility of the attack.. At each time step $t$, the attacker designs ${\delta}_{t}$ such that the anticipated future reward is minimized
$\underset{{\delta}_{t}}{\text{min}}{R}_{\text{adv}}({\delta}_{t})=R({s}_{t},{a}_{t}+{\delta}_{t})+{\displaystyle \sum _{j=t+1}^{T}}R({s}_{j},{a}_{j})$  (1)  
$\text{subject to :}{\parallel {\delta}_{t}\parallel}_{p}\le b,$  
${s}_{j+1}=E({s}_{j},{a}_{j}),$  
${a}_{j}=\mathrm{\Theta}({s}_{j})(\text{for}j=t,\mathrm{\dots},T),$ 
where $\parallel \cdot {\parallel}_{p}$ denotes the ${\mathrm{\ell}}_{p}$norm for some $p\ge 1$. Observe that while the attacker ostensibly solves separate (decoupled) problems at each time, the states themselves are not independent since given any trajectory, ${s}_{j+1}=E({s}_{j},{a}_{j})$, where $E({s}_{j},{a}_{j})$ is the transition of the environment based on ${s}_{j}$ and ${a}_{j}$. Therefore, $\mathbf{R}$ is implicitly coupled through time since it depends heavily on the evolution of state trajectories rather than individual state visitations. Hence, the adversary perturbations solved above are strictly myopic and we consider this a static attack on the agent’s action space.
Lookahead Action Space (LAS) Attack Model
Next, we consider the case where the attacker is able to look ahead and chooses a designed sequence of future perturbations. Using the same notation as above, let ${\sum}_{j=t}^{t+H}R({s}_{j},{a}_{j}+{\delta}_{j})$ denote the sum of rewards until a horizon parameter $H$, and let ${\sum}_{j=t+H+1}^{T}R({s}_{j},{a}_{j})$ be the future sum of rewards from time $j=t+H+1$. Additionally, we consider the (concatenated) matrix $\mathrm{\Delta}=[{\delta}_{t},{\delta}_{t+1}\mathrm{\dots}{\delta}_{t+H}]$ and $B$ denote a budget parameter. The attacker solves the optimization problem:
$\underset{\Delta}{\text{min}}{R}_{adv}(\mathrm{\Delta})={\displaystyle \sum _{j=t}^{t+H}}R({s}_{j},{a}_{j}+{\delta}_{j})+{\displaystyle \sum _{j=t+H+1}^{T}}R({s}_{j},{a}_{j})$  (2)  
$\text{subject to :}{\parallel \mathrm{\Delta}\parallel}_{p,q}\le B,\mathrm{\Delta}=[{\delta}_{t},{\delta}_{t+1},\mathrm{\dots},{\delta}_{H}],$  
${s}_{j+1}=E({s}_{j},{a}_{j}),$  
${a}_{j}=\mathrm{\Theta}({s}_{j})$ 
where $\parallel \cdot {\parallel}_{p,q}$ denotes the ${\mathrm{\ell}}_{p,q}$norm (?). By coupling the objective function and constraints through the temporal dimension, the solution to the optimization problem above is then forced to take the dynamics of the agent into account in an explicit manner.
Proposed Algorithms
In this section, we present two attack algorithms based on the optimization formulations presented in previous section.
Algorithm for Mounting MAS Attacks
We observe that (1) is a nonlinear constrained optimization problem; therefore, an immediate approach to solve it is via projected gradient descent (PGD). Specifically, let $\mathcal{S}$ denote the ${\mathrm{\ell}}_{p}$ ball of radius $b$ in the action space. We compute the gradient of the adversarial reward, $\nabla {R}_{\text{adv}}$ w.r.t. the action space variables and obtain the unconstrained adversarial action ${\widehat{a}}_{t+\frac{1}{2}}$ using gradient descent with step size $\eta $. Next, we calculate the unconstrained perturbation ${\delta}_{t}$ and project in onto $\mathcal{S}$ to get ${\delta}_{t}^{\prime}$ :
${\widehat{a}}_{t+\frac{1}{2}}={a}_{t}\eta \nabla {R}_{adv}({s}_{t},{\widehat{a}}_{t}),$  (3)  
${\delta}_{t}={\widehat{a}}_{t+\frac{1}{2}}{a}_{t},$  
${\delta}_{t}^{\prime}={P}_{\mathcal{S}}({\delta}_{t}).$ 
Here, ${a}_{t}$ represents the nominal action. We note that this approach resembles the fast gradientsign method (FGSM) (?), although we compute standard gradients here. As a variation, we can compute multiple steps of gradient descent w.r.t the action variable prior to projection; this is analogous to the basic iterative method (or iterative FGSM) (?). The MAS attack algorithm is shown in the supplementary material.
We note that in DRL approaches, only a noisy proxy of the true reward function is available: In valuebased methods, we utilize the learned Qfunction (for example, a DQN) as an approximate of the true reward function, while in policyiteration methods, we use the probability density function returned by the optimal policy as a proxy of the reward, under the assumption that actions with high probability induces a high expected reward. Since DQN selects the action based on the argmax of Qvalues and policy iteration samples the action with highest probability, the Qvalues/actionprobability remains a useful proxy for the reward in our attack formulation. Therefore, our proposed MAS attack is technically a version of noisy projected gradient descent on the policy evaluation of the nominal agent. We elaborate on this further in the theoretical analysis section.
Algorithm for Mounting LAS Attacks
The previous algorithm is myopic and can be interpreted as a purely spatial attack. In this section, we propose a spatiotemporal attack algorithm by solving Eq. (2) over a given time window $H$. Due to the coupling of constraints in time, this approach is more involved. We initialize a copy of both the nominal agent and environment, called the adversary and adversarial environment respectively. At time $t$, we sample a virtual rollout trajectory up until a certain horizon $t+H$ using the pair of adversarial agent and environment. At each time step $k$ of the virtual rollout, we compute action space perturbations ${\delta}_{t,k}$ by taking (possibly multiple) gradient updates. Next, we compute the norms of each ${\delta}_{t,k}$ and project the sequence of norms back onto an ${\mathrm{\ell}}_{q}$ball of radius $B$. The resulting projected norms at each time point now represents the individual budgets, ${b}_{k}$, of the spatial dimension at each time step. Finally, we project the original ${\delta}_{t,k}$ obtained in the previous step onto the ${\mathrm{\ell}}_{p}$balls of radii ${b}_{k}$, respectively to get the final perturbations ${\delta}_{t,k}^{\prime}$.^{2}^{2} 2 Intuitively, these steps represent the allocation of overall budget $B$ across different time steps. We note that to perform virtual rollouts at every time step $t$, the state of the ${E}_{adv}$ has to be the same as the state of ${E}_{nom}$ at $t$. To accomplish this, we saved the history of all previous actions to recompute the state of the ${E}_{adv}$ at time $t$ from $t=0$. While this current implementation may be timeconsuming, we believe that this problem can be avoided by giving the adversary direct access to the current state of the nominal agent through platform APIlevel modifications; or explicit observations (in reallife problems).
In subsequent time steps, the procedure above is repeated with a reduced budget of $B{\sum}_{t=0}^{t}{\delta}_{t}^{\prime}$ and reduced horizon $Ht$ until $H$ reaches zero. The horizon $H$ is then reset again for planning a new spatiotemporal attack. An alternative formulation could also be shifting the window without reducing its length until the adversary decides to stop the attack. However, we consider the first formulation such that we can compare the performance of LAS with MAS for an equal overall budget. This technique of replanning the ${\delta}_{t}^{\prime}$ at every step while shifting the window of $H$ is similar to the concept of receding horizons regularly used in optimal control (?). It is evident that using this form of dynamic replanning mitigates the planning error that occurs when the actual and simulated state trajectories diverge due to error accumulation (?). Hence, we perform this replanning at every $t$ to account for this deviation. The pseudocode is provided in Alg. 1.
Theoretical Analysis
We can show that projected gradient descent on the surrogate reward function (modeled by the RL agent network) to generate both MAS and LAS attacks provably converges; this can be accomplished since gradient descent on a surrogate function is akin to a noisy gradient descent on the true adversarial reward.
As described in previous sections, our MAS/LAS algorithms are motivated in terms of the adversarial reward ${R}_{adv}$. However, if we use either DQN or policy gradient networks, we do not have direct access to the reward function, but only its noisy proxy, defined via a neural network. Therefore, we need to argue that performing (projected) gradient descent using this proxy loss function is a sound procedure. To do this, we appeal to a recent result by (?), who prove convergence of noisy gradient descent approximately converges to a local minimum. More precisely, consider a general constrained nonlinear optimization problem:
$\mathrm{min}f(x)$  
$\text{s.t.}c(x)=0,$ 
where $c$ is an arbitrary (differentiable, possibly vectorvalued) function encoding the constraints. Define $S=\{xc(x)=0\}$ define the constraint set. We attempt to minimize the objective function via noisy (projected) gradient updates:
${x}_{t+1/2}$  $={x}_{t}\eta \nabla f({x}_{t})+{\xi}_{t},$  
${x}_{t+1}$  $={P}_{S}({x}_{t+1/2}).$ 
Theorem 1. (Convergence of noisy projected gradients.) Assume that the noise terms $\mathrm{\{}{\xi}_{t}\mathrm{\}}$ are i.i.d., satisfying $E\mathit{}\mathrm{[}\xi \mathrm{]}\mathrm{=}\mathrm{0}\mathrm{,}E\mathit{}\mathrm{[}\xi \mathit{}{\xi}^{T}\mathrm{]}\mathrm{=}{\sigma}^{\mathrm{2}}\mathit{}I\mathit{}d$, $\mathrm{\parallel}\xi \mathrm{\parallel}\mathrm{\le}O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}$ almost surely. Assume that both the constraint function $c\mathrm{(}\dot{\mathrm{)}}$ and the objective function $f\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ is $\beta $smooth, $L$Lipschitz, and possesses ${\rho}_{i}$Lipschitz Hessian. Assume further that the objective function $f$ is $B$bounded. Then, there exists a learning rate $\eta \mathrm{=}O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}$ such that with high probability, in $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}{\eta}^{\mathrm{2}}\mathrm{)}$ iterations, noisy projected gradient descent converges to a point $\widehat{x}$ that is $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}\sqrt{\eta}\mathrm{)}$close to some local minimum of $f$.
In our case, $f$ and $\xi $ depends on the structure of the RL agent’s neural network. (Smoothness assumptions of $f$ can perhaps be justified by limiting the architecture of the network, but the iidness assumption on $\xi $ is hard to verify). As such, it is difficult to ascertain whether the assumptions of the above theorem are satisfied in specific cases. Nonetheless, an interesting future theoretical direction is to understand Lipschitzness properties of specific families of DQN/policy gradient agents.
We defer further analysis of the double projection step onto mixednorm balls used in our proposed LAS algorithms to the supplementary material.
Experimental Results & Discussion
To demonstrate the effectiveness and versatility of our methods, we implemented them on RL agents with continuous action environments from OpenAI’s gym (?) as they reflect the type of action space in most practical applications ^{3}^{3} 3 Codes and links to supplementary are available at https://github.com/xylee95/SpatiotemporalAttackOnDeepRLAgents. For policybased methods, we trained a nominal agent using the PPO algorithm and a DoubleDQN (DDQN) agent (?) for valuebased methods^{4}^{4} 4 The only difference in implementation of policy vs valuebased methods is that in policy methods, we take analytical gradients of a distribution to compute the attacks (e.g., in line 10 of Algorithm 1) while for valuebased methods, we randomly sample adversarial actions to compute numerical gradients.. Additionally, we utilize Normalized Advantage Functions (?) to convert the discrete nature of DDQN’s output to continuous action space. For succinctness, we present the results of the attack strategies only on PPO agent for the LunarLander environment. Additional results of DDQN agent in Lunar Lander and BipedalWalker environments and PPO agent in BipedalWalker, Mujoco Hopper, HalfCheetah and Walker environments are provided in the supplementary materials. As a baseline, we implemented a random action space attack, where a random perturbation bounded by the same budget $b$ is applied to the agent’s action space at every step. For MAS attacks, we implemented two different spatial projection schemes, ${\mathrm{\ell}}_{1}$ projection based on (?) that represents a sparser distribution and ${\mathrm{\ell}}_{2}$ projection that represents a denser distribution of attacks. For LAS attacks, all combinations of spatial and temporal projection for ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ were implemented.
Comparison of MAS and LAS Attacks
Fig. 2 shows distributions of cumulative rewards obtained by the PPO agent across ten episodes in a Lunar Lander environment, with each subplot representing different combinations of budget, $B$ and horizon, $H$. Top three subplots show experiments with a $H$ value of 5 time steps and $b$ value of 3, 4, and 5 from left to right respectively. Bottom row of figures show a similar set of experiments but with a $H$ value of 10. For a direct comparison between MAS and LAS attacks with equivalent budgets across time, we have assigned the corresponding MAS budget values as $b=B/H$. This assumes that the total budget $B$ is allocated uniformly across every time step for a given $H$, while LAS has the flexibility to allocate the attack budget nonuniformly in the same interval, conditioned on the dynamics of the agent.
We note that keeping $H$ constant while increasing $B$ provides both MAS and LAS with a higher budget to inject ${\delta}_{t}$ to the nominal actions. We observe that with a low budget of $3$ (Fig. 2a), only LAS is successful in attacking the RL agent, as seen by the corresponding decrease in rewards. With a higher budget of 5 (Fig. 2c), MAS has a more apparent effect on the performance of the RL agent while LAS reduces the performance of the agent severely.
With $B$ constant, increasing $H$ allows the allocated $B$ to be distributed along the increased time horizon. In other words, LAS virtually looksahead further into the future. In the most naive case, a longer horizon dilutes the severity of each ${\delta}_{t}$ in compared to shorter horizons. By comparing similar budget values of different horizons (i.e horizons 5 and 10 for budget 3, Fig. 2a and Fig. 2d respectively), attacks for $H=10$ are generally less severe than their $H=5$ counterparts. For all $B$ and $H$ combinations, we observe that MAS attacks are generally less effective compared to LAS. We note that this is a critical result of the study as most literature on static attacks have shown that the attacks can be ineffective below a certain budget. Here, we demonstrate that while MAS attacks can seemingly look ineffective for a given budget, a stronger and more effective attack can essentially be crafted using LAS with the same budget.
In the following sections, we further study the difference between MAS and LAS as well as demonstrate how the attacks can be utilized to understand the vulnerabilities of the agent in different environments.
Action Dimension Decomposition of LAS Attacks
Fig. 3 shows action dimension decomposition of LAS attacks. The example shown in Fig. 3 is the result of ${\mathrm{\ell}}_{2}$ projection in action space with ${\mathrm{\ell}}_{2}$ projection in time. From Fig. 3a, we observe that through all the episodes of LAS attacks, one of the action dimension (i.e., Up  Down direction of lunar lander) is consistently perturbed more, i.e., accumulates more attack, than LeftRight direction.
Fig. 3b shows a detailed view of action dimension attacks for an episode (Episode 1). It is evident from the figure that the UpDown actions of the lunar lander are more prone to attacks throughout the episode than LeftRight actions. Additionally, LeftRight action attacks are restricted after certain time steps and only the UpDown actions are attacked further. Fig. 3c further corroborates the observation in the Lunar Lander environment. As the episode progresses in Fig. 3c, the lunar lander initially lands on the ground in frame 3, but lifts up and hovers until the episode ends in frame 5. This observation supports the fact that the proposed attacks are effective in perturbing the action dimensions in an optimal manner; as in this case, perturbing the lunar lander in the horizontal direction will not further decrease rewards. On the other hand, hovering the lunar lander will cause the agent to use more fuel, which consequently decreases the total reward. From these studies, it can be concluded that LAS attacks (correlated with projections of actions in time) can clearly isolate vulnerable action dimension(s) of the RL agent to mount a successful attack.
Ablation Study of Horizon and Budget
Lastly, we performed multiple ablation studies to compare the effectiveness of LAS and MAS attacks. While we have observed that LAS attacks are generally stronger than MAS, we hypothesize that there will be an upper limit to LAS’s advantage as the allowable budget increases. We take the difference of each attack’s reduction in rewards (i.e. attack  nominal) and visualize how much rewards LAS reduces as compared to MAS under different conditions of $B$ and $H$. In the case of PPO in Lunar Lander, we observe that the reduction in rewards of LAS vs MAS becomes less drastic as budget increases, hence showing that LAS has diminishing returns as both MAS and LAS saturates at higher budgets. We defer detailed discussions and additional figures of the ablation study to the supplementary materials.
Conclusion & Future Work
In this study, we present two novel attack strategies on an RL agent’s action space; a myopic attack (MAS) and a nonmyopic attack (LAS). The results show that LAS attacks, that were crafted with explicit use of the agent’s dynamics information, are more powerful than MAS attacks. Additionally, we observed that applying LAS attacks on RL agents reveals the possible vulnerable actuators of an agent, as seen by the nonuniform distribution of attacks on certain action dimensions. This can be leveraged as a tool to identify the vulnerabilities and plan a mitigation strategy under similar attacks. Possible future works include extending the concept of LAS attacks to state space attacks where the agent’s observations are perturbed instead of the agent’s actions while taking into account the dynamics of the agent. Additionally, while we did not focus on the imperceptibility and deployment aspects of the proposed attacks in this study, defining a proper metric in terms of detectability in action space and optimizing the budget to remain undetected for different environments will be a future research direction.
Acknowledgement
This work was supported in part by NSF grants CNS1845969 and CCF2005804, and AFOSR YIP Grant FA95501710220.
References
Appendix A Pseudocode of MAS Attack
Appendix B Analysis
Projections onto Mixednorm Balls
The Lookahead Action Space (LAS) Attack Model described above requires projecting onto the mixednorm ${\mathrm{\ell}}_{p,q}$ball of radius $B$ in a vector space. Below, we show how to provably compute such projections in a computationally efficient manner. For a more complete treatment, we refer to (?). Recall the definition of the $(p,q)$norm. Let $X\in {\mathbb{R}}^{m\times n}$ be partitioned into subvectors ${x}_{i},i\in [n]$ of length$m$. Then,
$${\parallel X\parallel}_{p,q}:={\left(\sum _{i=1}^{n}{\parallel {x}_{i}\parallel}_{q}^{p}\right)}^{1/p}.$$ 
Due to scale invariance of norms, we can assume $B=1$. We consider the following special cases:

1.
$p=1,q=1$: this reduces to the case of the ordinary ${\mathrm{\ell}}_{1}$norm in ${\mathbb{R}}^{mn}$. Projection onto the unit ${\mathrm{\ell}}_{1}$ball can be achieved via softthresholding every entry in $X$:
$${P}_{S}({X}_{i,j})=\text{sign}({X}_{i,j})\cdot {({X}_{i,j}\lambda )}_{+},$$ where $\lambda >0$ is a KKT parameter that can be discovered by a simple sorting the (absolute) values of $X$. See (?).

2.
$p=2,q=2$: this reduces to the case of the isotropic ${\mathrm{\ell}}_{2}$norm in ${\mathbb{R}}^{mn}$. Projection onto the unit ${\mathrm{\ell}}_{2}$ball can be achieved by simple normalization:
$${P}_{S}({X}_{i,j})={X}_{i,j}/{\parallel X\parallel}_{2,2}.$$ 
3.
$p=1,q=2$: this is a “hybrid” combination of the above two cases, and corresponds to the procedure that we use in mounting our LAS attack. Projection onto this ball can be achieved by a threestep method. First, we compute the $n$dimensional vector, $v$, of columnwise ${\mathrm{\ell}}_{2}$norms. Then, we project $v$ onto the unit ${\mathrm{\ell}}_{1}$ball; essentially, this enables us to “distribute” the (unit) budget across columns. Since ${\mathrm{\ell}}_{1}$projection is achieved via softthresholding, a number of coordinates of this vector are zeroed out, and others undergo a shrinkage. Call this (sparsified) projected vector ${v}_{p}$. Finally, we perform an ${\mathrm{\ell}}_{2}$ projection, i.e., we scale each column of $X$ by dividing by its norm and multiplying by the entries of ${v}_{p}$:
$${P}_{S}({X}_{i,j})=\frac{{X}_{i,j}}{{\parallel {x}_{j}\parallel}_{2}}\cdot {v}_{p}(i).$$
Appendix C Additional Experiments
Comparison of Attacks Mounted on PPO Agent in BipedalWalker Environment
The results in Fig. 4 depicts the comparison between the MAS and LAS attacks applied on a PPO agent in the BipedalWalker environment. A similar trend is observed where LAS attacks are generally more severe than MAS attacks. We acknowledge that in this environment, MAS attacks are sometimes effective in reducing the rewards as well. However, this can be attributed to the Bipedal Walker having more dimensions (4 dimensions) in terms of it’s action space in compared to the LunarLander (2 dimensions) environment. In addition, the actions of the Bipedal Walker is also highly coupled, in compared to the actions of the Lunar Lander. Hence, the agent for BipedalWalker is more sensitive towards perturbations, which explains the increase efficacy of MAS attacks.
Comparison of Attacks Mounted on DDQN Agent in Lunar Lander and BipedalWalker Environments
In Figures 5 and 6, we present additional results on the efficacy of different attack strategies for a DoubleDQN agent trained in the Lunar Lander and Bipedal Walker environment. An interesting observation is that in both environment, the effects of the attacks are more severe for the DDQN agent in compared to the PPO agent as seen in the boxplots. We conjecture that this is due to the deterministic nature of DDQN method where the optimal action is unique. In contrast, the PPO agent was trained with a certain stochastic in the process, which might explain the additional robustness of PPO agent to noise or perturbations. Nevertheless, in the DDQN agent, we still observe a similar trend where given the same about of budget, more severe attacks can be crafted using the LAS strategy in compared to MAS or random attacks.
Comparison of Attacks Mounted on PPO Agent Mujoco Control Environments
In addition to the attacks mounted on the agents in the 2 environments above, we also compare the effect of the attacks on a PPO agent trained in 3 different Mujoco control environments. Figures 7, 8, 9 illustrates the distribution of rewards obtained by the agent in the Hopper, HalfCheetah and Walker environments respectively. In all three environments, we observe that LAS attacks are generally more effective in reducing the rewards of the agent across different values of budget and horizon, which reinforces the fact that LAS attacks are stronger than MAS attacks. However, it is also interesting to note that agents in different environments have different sensitivity to the budget of attacks. In Hopper(Figure 7) and Walker(Figure 9), we see that MAS attacks have the effect of shifting the median and increasing the variance of the reward distortion with respect to the nominal, which highlights the fact that there are some episodes which MAS attack fails to affect the agent. In contrast, in the HalfCheetah environment(Figure 8), we see that MAS attacks shifts the whole distribution of rewards downwards, showing that the agent is more sensitive in to MAS in this environment as compared to the other two environments. This suggests that the budget and horizon values are also hyperparameters which should be tuned according to the environment.
Comparison of Temporal Projections in LAS
In this section, we present additional visualizations to further understand why ${\mathrm{\ell}}_{2}$ temporal projections results in more severe attacks in compared to ${\mathrm{\ell}}_{1}$ temporal projections. Figure 10, 11, 12 and 13 presents the $\parallel {\delta}_{t}\parallel $ usage plot across 100 time steps for both PPO and DDQN in the Lunar Lander and BipedalWalker environment. The left subplot represents ${\mathrm{\ell}}_{1}$ projections in the spatial dimension while the right subplot represents ${\mathrm{\ell}}_{2}$ projections in the spatial dimensions. These plots directly compare the difference in amount of $\parallel {\delta}_{t}\parallel $ used between ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ temporal projections for both ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ spatial attacks.
In most cases with the exception of Figure 12, we see a clear trend that ${\mathrm{\ell}}_{1}$ temporal projections results in a sparser but more concentrated peaks of $\parallel \delta \parallel $ utilization (corresponding to a few instance of strong attacks). In contrast, ${\mathrm{\ell}}_{2}$ temporal projections results in a more distributed but frequent form of of $\parallel \delta \parallel $ utilization (corresponding to more frequent but weak instances of attacks). We note that while ${\mathrm{\ell}}_{1}$ projections produces stronger attacks, there is a diminishing return on allocating more attacks to a certain time point as after a certain limit. Hence, this explains the weaker effect of ${\mathrm{\ell}}_{1}$ temporal projections since it concentrates the attacks to a few points but ultimately gives time for the agent to recover. In contrast, ${\mathrm{\ell}}_{2}$ temporal projections distributes the attacks more frequently that causes the agent to follow a diverging trajectory that is hard to recover from.
As an anecdotal example in the Lunar Lander environment, we observe that attacks with ${\mathrm{\ell}}_{1}$ temporal projection tend to turn off the vertical thrusters of the lunar lander. However, due to the sparsity of the attacks, the RL agent could possibly be fire the upward thrusters in time to prevent a freefall landing. With ${\mathrm{\ell}}_{2}$ temporal projections, the agent is attacked continuously. Consequently, the agent has no chance to return to a nominal state and quickly diverges towards a terminal state.
Ablation Study
For this section, we present an ablation study to investigate the effect of different budget and horizon parameters on the effectiveness of LAS vs MAS. As mentioned in the main manuscript, we take the difference of each attack’s reduction in rewards (i.e. attack  nominal) and visualize how much rewards LAS reduces as compared to MAS under different conditions of $B$ and $H$. Fig 14 illustrates the ablation study of a PPO agent in Lunar Lander. The figure is categorized by different spatial projections, where ${\mathrm{\ell}}_{1}$ spatial projections are shown on the left figure while ${\mathrm{\ell}}_{2}$ spatial projections are shown on the right. Both subplots are shown for ${\mathrm{\ell}}_{2}$ time projection attacks. Each individual subplot shows three different lines with different $H$, with each line visualizing the change in mean cumulative reward as budget increases along the xaxis. As budget increases, attacks in both ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ spatial projection shows a monotonic decrease in cumulative rewards. Attacks in each spatial projection with a $H$ value of 5 shows different trends, where ${\mathrm{\ell}}_{2}$ decreases linearly with increasing budget while ${\mathrm{\ell}}_{1}$ became stagnant after $B$ value of 3. This can be attributed to the fact that the attacks are more sparsely distributed in ${\mathrm{\ell}}_{1}$ attacks, causing most of the perturbations to be distributed into one action dimension. Thus, as budget increases, we see a diminishing return of LAS attacks since attacking a single action dimension beyond a certain limit doesn’t decrease reward any further. The study was also conducted for PPO BipedalWalker and both DDQN Lunar Lander and BipedalWalker as shown in Figure 16, 15 and 17 respectively. We only consider attacks in ${\mathrm{\ell}}_{2}$ temporal projection attacks for both ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ spatial projections. At a glance, we see different trends across each figures due to the different environment dynamics. However, in all cases, the decrease in reduction of rewards is always lesser than or equals to zero, which infers that LAS attacks are at least as effective than MAS attacks. We observed that attacks for horizon value of 5 becomes ineffective after a certain budget value. This shows that after some budget value, MAS attacks are as effective as LAS attacks because LAS might be operating at maximum attack capacity. Interesting to note that BipedalWalker for PPO needed a higher budget compared to the DDQN counterpart due to the PPO being more robust to attacks.
Effect of Horizon Parameter in LAS
In this section, we further describe the effect of horizon parameter $H$ on the effectiveness of LAS attacks that we empirically observed. $H$ defines a fixed time horizon (e.g., steps in DRL environments) to expend a given budget $B$. For a fixed $B$ and short $H$, LAS favors injecting stronger perturbations in each step. Hence, we would intuitively hypothesize that given a shorter $H$, the severity of LAS attacks will increase as $H$ decrease, as shown by the reduction in rewards between MAS and LAS in Figure 3 of the main paper. In most cases, the reduction is negative, hence showing that LAS attacks are indeed more severe. However in some cases as shown in Figure 8, 9 & 10, a shorter $H$ does result in LAS being not as effective as a longer $H$ (though still stronger than MAS as evident from negative values of yaxis). This can be attributed to the nonlinear reward function of the environments and consequent failure modes of the agent. For example, attacks on Lunar Lander PPO agent causes failure by constantly firing thruster engines to prevent Lunar Lander from landing, hence accumulating negative rewards over a long time. In contrast, attacks on the DQN agent causes Lunar Lander to crash immediately, hence terminating the episode before too much negative reward is accumulated. Thus, while the effect of $H$ on LAS attacks sometimes do not show a consistent trend, we it is a key parameter that can be tuned to control the failure modes of the RL agent.
Action Space Dimension Decomposition
We provide additional results on using the LAS attack scheme as a tool to understand the RL agent’s action space vulnerabilities for a DoubleDQN agent in both Lunar Lander and Bipedal Walker environment. It is interesting to note in Figure 18, even with agents trained with a different philosophy (valuebased vs policy based, shown in main manuscript), the attack scheme still distributes the attack to a similar dimension (UpDown action for Lunar Lander), which highlights the importance of the that particular dimension. In Figure 19, we show the outcome of LAS attack scheme on Bipedal Walker environment having four action space dimensions. The four joints of the bipedal walker, namely Left Hip, Left Knee, Right Hip and Right Knee are attacked in this case, and from Figure 19, we see that the left hip is attacked more than any other action dimension in most of the episodes. This supports our inference that LAS attacks can bring out the vulnerabilities in the action space dimensions (actuators in case of CPS RL agents).