# Certified Adversarial Robustness for Deep Reinforcement Learning

###### Abstract

Deep Neural Network-based systems are now the state-of-the-art in many robotics tasks, but their application in safety-critical domains remains dangerous without formal guarantees on network robustness. Small perturbations to sensor inputs (from noise or adversarial examples) are often enough to change network-based decisions, which was already shown to cause an autonomous vehicle to swerve into oncoming traffic. In light of these dangers, numerous algorithms have been developed as defensive mechanisms from these adversarial inputs, some of which provide formal robustness guarantees or certificates. This work leverages research on certified adversarial robustness to develop an online certified defense for deep reinforcement learning algorithms. The proposed defense computes guaranteed lower bounds on state-action values during execution to identify and choose the optimal action under a worst-case deviation in input space due to possible adversaries or noise. The approach is demonstrated on a Deep Q-Network policy and is shown to increase robustness to noise and adversaries in pedestrian collision avoidance scenarios and a classic control task.

Björn Lütjens, Michael Everett, Jonathan P. How

Department of Aeronautics and Astronautics

Massachusetts Institute of Technology

{lutjens, mfe, jhow}@mit.edu

3rd Conference on Robot Learning (CoRL 2019), Osaka, Japan.

Keywords: Adversarial Attacks, Reinforcement Learning, Collision Avoidance, Robustness Verification

## 1 Introduction

Deep reinforcement learning (RL) algorithms have achieved impressive success on robotic manipulation [Gu_2017] and robot navigation in pedestrian crowds [fan2019getting, Everett_2018]. Many of these systems utilize black-box predictions from deep neural networks (DNNs) to achieve state-of-the-art performance in prediction and planning tasks. However, the lack of formal robustness guarantees for DNNs currently limits their application in safety-critical domains, such as collision avoidance. In particular, even subtle perturbations to the input, known as adversarial examples, can lead to incorrect (but highly-confident) predictions from DNNs [Szegedy_2014, Akhtar_2018, Yuan_2019]. Furthermore, several recent works have demonstrated the danger of adversarial examples in real-world situations [Kurakin_2017b, Sharif_2016], including causing an autonomous vehicle to swerve into oncoming traffic [Tencent_2019]. The work in this paper addresses the lack of robustness against adversarial examples and sensor noise by proposing an online certified defense to add onto existing deep RL algorithms during execution.

Existing methods to defend against adversaries, such as adversarial training [Kurakin_2017, Madry_2018, Kos_2017], defensive distillation [Papernot_2016], or model ensembles [Tramer_2018], do not come with theoretical guarantees for reliably improving robustness and often prove ineffective against more advanced adversarial attacks [Carlini_2017, He_2017, Athalye_2018, Uesato_2018]. Verification methods do provide formal guarantees on the robustness of a given network, but finding the guarantees is an NP-complete problem and computationally intractable to solve in real time for applications like robot manipulation or navigation [Katz_2017, Lomuscio_2017, Tjeng_2019, Ehlers_2017, Huang_2017b]. Robustness certification methods relax the problem to make it tractable. Given an adversarial distortion of a nominal input, instead of finding exact bounds on the worst-case output deviation, these methods efficiently find certified lower bounds [Raghunathan_2018, Wong_2018, Weng_2018, Singh_2018, Wang_2018]. In particular, the work by [Weng_2018] runs in real time for small networks ($33$ to $14,000$ times faster than verification methods), its bound has been shown to be within $10\%$ error of the true bound, and it is compatible with many activation functions and neural network architectures [Weng_2018b, Boopathy_2019]. These methods were applied to computer vision tasks.

This work extends the tools for robustness certification against adversaries to deep RL tasks. As a motivating example, consider the collision avoidance setting in Figure 1, in which an adversary perturbs an agent’s (orange) observation of an obstacle (blue). An agent following a nominal/standard deep RL policy would observe ${s}_{adv}$ and select an action, ${a}_{nom}^{*}$, that collides with the obstacle’s true position, ${s}_{0}$, thinking that space is unoccupied. Our proposed approach assumes a worst-case deviation of the observed input, ${s}_{adv}$, bounded by $\epsilon$, and takes the optimal action, ${a}_{adv}^{*}$, under that perturbation, to safely avoid the true obstacle. Nominal robustness certification algorithms assume $\epsilon$ is a scalar, which makes sense for image inputs (all pixels have the same scale, e.g., $0$ to $255$ intensity). A key challenge in direct application to RL tasks is that the observation vector (network input) could have elements with substantially different scales (e.g., position, angle, joint torques) and associated measurement uncertainties, motivating our extension with $\epsilon$ as a vector.

This work contributes (i) the first formulation of robustness certification for deep RL problems, (ii) an extension of existing robustness certification algorithms to variable-scale inputs, (iii) an optimal action selection rule under worst-case state perturbations, and (iv) demonstrations of increased robustness to adversaries and sensor noise on cartpole and a pedestrian collision avoidance simulation.

## 2 Related work

### 2.1 Varieties of adversaries in RL

RL literature proposes many approaches to achieve adversarial robustness. Domain Randomization, also called perturbed simulation, adversarially chooses parameters that guide the physics of a simulation, such as mass, center of gravity, or friction during training [Rajeswaran_2017, Muratore_2018]. Other work investigates the addition of adversarially acting agents [Uther_1997, Pinto_2017] during training. The resulting policies are more robust to a distribution shift in the underlying physics from simulation to real-world, e.g., dynamics/kinematics. This work, in comparison, addresses adversarial threats in the observation space, not the underlying physics [Goodfellow_2015]. For example, adversarial threats could be created by small perturbations in a camera image, lidar pointcloud, or estimated positions/velocities of pedestrians.

### 2.2 Defenses to adversarial examples

Much of the existing work on robustness against adversarial attacks detects or defends against adversarial examples. Adversarial training or retraining augments the training dataset with adversaries [Kurakin_2017, Madry_2018, Kos_2017] to increase robustness during testing (empirically). Other works increase robustness by distilling networks [Papernot_2016] or comparing the output of model ensembles [Tramer_2018], or detect adversarial examples by comparing the input with a binary filtered transformation of the input [Xu_2018]. Although these approaches show impressive empirical success, they do not come with theoretical guarantees for reliably improving the robustness against a variety of adversarial attacks and are often ineffective against more sophisticated adversarial attacks [Carlini_2017, He_2017, Athalye_2018, Uesato_2018].

### 2.3 Formal robustness verification and certification

Verification methods provide these desired theoretical guarantees. The methods find theoretically proven bounds on the maximum output deviation, given a bounded input perturbation [Katz_2017, Lomuscio_2017, Tjeng_2019]. These methods rely on satisfiability modulo theory (SMT) [Ehlers_2017, Katz_2017, Huang_2017b], LP, or mixed-integer linear programming (MILP) solvers [Lomuscio_2017, Tjeng_2019], or zonotopes [Gehr_2018], to propagate constraints on the input space to the output space. The difficulty in this propagation arises through ReLU or other activation functions, which can be nonlinear between the upper and lower bound of the propagated input perturbation. In fact, the problem of finding the exact verification bounds is NP-complete [Katz_2017, Weng_2018] and thus currently infeasible to run online on a robot. A relaxed version of the verification problem provides certified bounds on the output deviation, given an input perturbation [Raghunathan_2018, Wong_2018, Weng_2018, Singh_2018, Wang_2018]. Fast-Lin [Weng_2018] offers certification of CIFAR networks in tens of seconds, provable guarantees, and extensibility to all possible activation functions [Weng_2018b]. This work extends Fast-Lin from computer vision tasks to be applicable in deep RL domains.

### 2.4 Safe and risk-sensitive reinforcement learning

Like this work, several Safe RL algorithms surveyed in [Garcia_2015] also optimize a worst-case criterion. Those algorithms, also called risk-sensitive RL, optimize for the reward under worst-case assumptions of environment stochasticity, rather than optimizing for the expected reward [Heger_1994, Tamar_2015, Geibel_2006]. The resulting policies are more risk-sensitive, i.e., robust to stochastic deviations in the input space, e.g., sensor noise, but could still fail on algorithmically crafted adversarial examples. To be fully robust against adversaries, this work assumes a worst-case deviation of the input space inside some bounds and takes the action with maximum expected reward. A concurrent line of work trains an agent to not change actions in the presence of adversaries, but does not guarantee that the agent takes the best action under a worst-case perturbation [Fischer_2019]. Other work in Safe RL focuses on parameter/model uncertainty, e.g., uncertainty in the model for novel observations (far from training data) in [Kahn_2017, Lutjens_2019]. Several robotics works avoid the sim-to-real transfer by learning policies online in the real world. However, learning in the real world is slow (requires many samples) and does not come with full safety guarantees [Sun_2017, Berkenkamp_2017].

## 3 Background

### 3.1 Robustness certification

In RL problems, the state-action value $Q=\mathbb{E}[{\sum}_{t=0}^{T}{\gamma}^{t}{r}_{t}]$ expresses the expected future reward, ${r}_{t}$, discounted by $\gamma $, from taking an action in a given state/observation. This work aims to find the action that maximizes state-action value under a worst-case perturbation of the observation by sensor noise or an adversary. This section explains how to obtain the certified lower bound on the DNN-predicted $Q$, given a bounded perturbation in the input space from the true state. The derivation is based on [Weng_2018], re-formulated for RL. We define the certified lower bound of the state-action value, ${Q}_{L}$, for each discrete action, ${a}_{j}$, as

$${Q}_{L}({s}_{adv},{a}_{j}):=\underset{s\in {B}_{p}({s}_{adv},\epsilon)}{\mathrm{min}}{Q}_{l}(s,{a}_{j}), \tag{1}$$

for all possible states, $s$, inside the $\epsilon$-ball around the observed input, ${s}_{adv}\in {\mathbb{R}}^{n}$, where ${B}_{p}({s}_{adv},\epsilon):=\{s:{||s-{s}_{adv}||}_{p}\le \epsilon\}$. ${Q}_{l}$ is the certified lower bound for a given state, ${Q}_{l}(s,{a}_{j})\le Q(s,{a}_{j})\;\forall s\in {B}_{p}({s}_{adv},\epsilon),\forall {a}_{j}\in \mathbb{A}$, and is calculated in Eq. (3). The ${L}_{p}$-norm bounds the input deviation that the adversary is allowed to apply, and is defined as ${||x||}_{p}={(|{x}_{1}{|}^{p}+\dots +|{x}_{n}{|}^{p})}^{1/p}$ for $x\in {\mathbb{R}}^{n}$, $p\ge 1$.
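To make the ball ${B}_{p}$ concrete, the following minimal sketch checks whether a candidate state lies within the ${L}_{p}$ ball around an observation. The helper names (`lp_norm`, `in_ball`) and the sample numbers are illustrative assumptions, not part of the original work.

```python
import math


def lp_norm(x, p):
    """L_p norm of a vector for p >= 1 (p = math.inf gives the max norm)."""
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def in_ball(s, s_adv, eps, p):
    """True if s lies in B_p(s_adv, eps), i.e. ||s - s_adv||_p <= eps."""
    diff = [a - b for a, b in zip(s, s_adv)]
    return lp_norm(diff, p) <= eps


# Hypothetical 2-D position observation and a candidate true state.
s_adv = [1.0, 2.0]
s = [1.3, 1.6]
print(in_ball(s, s_adv, eps=0.45, p=2))         # Euclidean ball
print(in_ball(s, s_adv, eps=0.45, p=math.inf))  # max-norm ball (a box)
```

For the same radius, the max-norm ball is a box that contains the Euclidean ball, so a state can fall inside the former while lying outside the latter.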

The certification essentially passes the interval bounds $[{l}^{(0)},{u}^{(0)}]=[{s}_{adv}-\epsilon,{s}_{adv}+\epsilon]$ from the DNN’s input layer to the output layer, where ${l}^{(k)}$ and ${u}^{(k)}$ denote the lower and upper bounds of the preReLU-activation, ${z}^{(k)}$, in the $k$-th layer of an $m$-layer DNN. The difficulty in passing these bounds from layer to layer arises from the nonlinear activation functions, such as ReLU, PReLU, tanh, or sigmoid. Although this work considers ReLU activations, it can be extended to general activation functions via the certification process in [Weng_2018b]. When passing interval bounds through a ReLU activation, the lower and upper preReLU bounds can be both positive $({l}^{(k)},{u}^{(k)}>0)$, both negative $({l}^{(k)},{u}^{(k)}<0)$, or of opposite sign $({l}^{(k)}<0<{u}^{(k)})$, in which case the ReLU status is called active, inactive, or undecided, respectively. In the active and inactive cases, bounds are passed to the next layer as normal. In the undecided case, the output of the ReLU is bounded through linear upper and lower bounds:

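$$\frac{{u}_{r}^{(k)}}{{u}_{r}^{(k)}-{l}_{r}^{(k)}}{z}_{r}^{(k)}\le \sigma ({z}_{r}^{(k)})\le \frac{{u}_{r}^{(k)}}{{u}_{r}^{(k)}-{l}_{r}^{(k)}}({z}_{r}^{(k)}-{l}_{r}^{(k)}). \tag{2}$$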

The diagonal matrix $D$ is introduced as the ReLU status matrix, $H$ as the lower/upper bounding factor, $W$ as the weight matrix, and $b$ as the bias in layer $(k)$, with $r,j$ as indices; the preReLU-activation, ${z}^{(k)}$, is replaced with ${W}_{r,:}^{(k)}s+{b}_{r}^{(k)}$. The ReLU bounding is then rewritten as

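$${D}_{r,r}^{(k)}({W}_{r,j}^{(k)}{s}_{j}+{b}_{r}^{(k)})\le \sigma ({W}_{r,j}^{(k)}{s}_{j}+{b}_{r}^{(k)})\le {D}_{r,r}^{(k)}({W}_{r,j}^{(k)}{s}_{j}+{b}_{r}^{(k)}-{H}_{r,j}^{(k)}),$$

$${D}_{r,r}^{(k)}=\begin{cases}\dfrac{{u}_{r}^{(k)}}{{u}_{r}^{(k)}-{l}_{r}^{(k)}} & \text{if } {l}_{r}^{(k)}<0<{u}_{r}^{(k)} \text{ (undecided)},\\ 1 & \text{if } {l}_{r}^{(k)},{u}_{r}^{(k)}>0 \text{ (active)},\\ 0 & \text{if } {l}_{r}^{(k)},{u}_{r}^{(k)}<0 \text{ (inactive)};\end{cases}\qquad {H}_{r,j}^{(k)}=\begin{cases}{l}_{r}^{(k)} & \text{if } {l}_{r}^{(k)}<0<{u}_{r}^{(k)} \text{ and } {A}_{j,r}^{(k)}<0,\\ 0 & \text{otherwise}.\end{cases}$$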
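The linear relaxation of an undecided ReLU can be sanity-checked numerically. The sketch below assumes the Fast-Lin form of the bounds from [Weng_2018], where both bounding lines share the slope $u/(u-l)$ and the upper line is shifted by $l$; the function name `relu_linear_bounds` and the sample interval are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def relu_linear_bounds(l, u):
    """Return (slope, offset) such that, for every z in [l, u],
    slope * z <= relu(z) <= slope * (z - offset)."""
    if l >= 0.0:   # active: ReLU is the identity on [l, u]
        return 1.0, 0.0
    if u <= 0.0:   # inactive: ReLU is zero on [l, u]
        return 0.0, 0.0
    # Undecided: assumed Fast-Lin relaxation with shared slope u / (u - l).
    return u / (u - l), l


# Check the sandwich on a grid of pre-activations in a hypothetical interval.
l, u = -1.0, 3.0
slope, offset = relu_linear_bounds(l, u)
for i in range(41):
    z = l + (u - l) * i / 40
    assert slope * z <= relu(z) + 1e-12
    assert relu(z) <= slope * (z - offset) + 1e-12
print("slope:", slope, "offset:", offset)
```

The upper line is the chord through $(l, 0)$ and $(u, u)$, which lies above the convex ReLU on the whole interval; the lower line has the same slope and passes through the origin.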

Similar to the closed-form forward pass in a DNN, one can formulate the closed form solution for the guaranteed lower bound of the state-action value for a single state $s$:

$${Q}_{l}(s,{a}_{j})={A}_{j,:}^{(0)}s+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)}), \tag{3}$$

where the matrix $A$ contains the network weights and ReLU activation, recursively for all layers: ${A}^{(k-1)}={A}^{(k)}{W}^{(k)}{D}^{(k-1)}$, with identity in the final layer: ${A}^{(m)}=\mathbb{1}$.
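As a rough illustration of how the closed form is evaluated, the sketch below computes the certified lower bound for the smallest interesting case: a two-layer network ($m=2$) with an $L_\infty$ ball of scalar radius around the observation. It assumes the Fast-Lin choices for $D$ and $H$ from [Weng_2018]; the function names and toy weights are hypothetical, and the code is not the authors' implementation.

```python
def q_values(W1, b1, W2, b2, s):
    """Forward pass Q(s) = W2 relu(W1 s + b1) + b2."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, s)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def certified_q_lower_bounds(W1, b1, W2, b2, s_adv, eps):
    """Lower-bound every Q output over the L_inf ball of radius eps."""
    # Pre-ReLU interval bounds of the hidden layer (exact for an affine map).
    centers = [sum(w * x for w, x in zip(row, s_adv)) + b
               for row, b in zip(W1, b1)]
    radii = [eps * sum(abs(w) for w in row) for row in W1]
    lows = [c - r for c, r in zip(centers, radii)]
    ups = [c + r for c, r in zip(centers, radii)]
    # Diagonal of D: 1 (active), 0 (inactive), u / (u - l) (undecided).
    D = [1.0 if l >= 0.0 else 0.0 if u <= 0.0 else u / (u - l)
         for l, u in zip(lows, ups)]
    bounds = []
    for j, row2 in enumerate(W2):
        A1 = [w * d for w, d in zip(row2, D)]               # A^(1)_{j,:}
        A0 = [sum(A1[r] * W1[r][i] for r in range(len(W1)))  # A^(0)_{j,:}
              for i in range(len(s_adv))]
        # H_{r,j} = l_r for undecided neurons with negative A^(1)_{j,r}.
        H = [l if (l < 0.0 < u and a < 0.0) else 0.0
             for l, u, a in zip(lows, ups, A1)]
        const = b2[j] + sum(a * (b - h) for a, b, h in zip(A1, b1, H))
        nominal = sum(a * x for a, x in zip(A0, s_adv))
        worst = eps * sum(abs(a) for a in A0)  # dual (L_1) norm for p = inf
        bounds.append(nominal - worst + const)
    return bounds


# Hypothetical toy network: 2 inputs, 3 hidden units, 2 actions.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]
b1 = [0.0, -0.5, 0.2]
W2 = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]
b2 = [0.1, -0.1]
print(certified_q_lower_bounds(W1, b1, W2, b2, s_adv=[0.3, 0.1], eps=0.2))
```

Because the bound is certified, the true Q-value at any state in the ball can never fall below it; sampling states in the ball and comparing against `q_values` is a cheap way to check an implementation.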

### 3.2 Pedestrian simulation

Among the many RL tasks, a particularly challenging safety-critical task is collision avoidance for a robotic vehicle among pedestrians. Because learning a policy in the real world is dangerous and time consuming, this work uses a kinematic simulation environment for learning pedestrian avoidance policies. The decision process of an RL agent in the environment can be described as a Partially Observable Markov Decision Process (POMDP) with the tuple $\langle S,A,T,R,\Omega ,O,\gamma \rangle$. The environment state $S$ is fully described by the behavior policy, position, velocity, radius, and goal position of each agent. In this example, the RL policy controls one of two agents with $11$ discrete heading actions in $A=[{a}_{\mathrm{min}},{a}_{\mathrm{max}}]=[-\pi /6,+\pi /6]$ and constant velocity $v=1$ m/s. The environment executes the selected action under unicycle kinematics, and controls the other agent from a diverse set of fixed policies (static, non-cooperative, ORCA [Berg_2009], GA3C-CADRL [Everett_2018]). The sparse reward is $1$ for reaching the goal and $-0.25$ for colliding. The partial observation consists of the x-y position, x-y velocity, and radius of each agent, and the RL agent’s goal, as in [Everett_2018].
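A minimal sketch of the action space and motion model described above follows. It assumes the $11$ heading actions are evenly spaced over $[-\pi/6, +\pi/6]$, that each action is a heading change applied before moving, and a time step `dt`; none of these details are stated in the text, and the function names are hypothetical.

```python
import math

NUM_ACTIONS = 11
A_MIN, A_MAX = -math.pi / 6, math.pi / 6
SPEED = 1.0  # constant velocity in m/s


def heading_actions():
    """Evenly spaced heading changes in [A_MIN, A_MAX] (spacing is assumed)."""
    step = (A_MAX - A_MIN) / (NUM_ACTIONS - 1)
    return [A_MIN + i * step for i in range(NUM_ACTIONS)]


def unicycle_step(x, y, heading, delta_heading, dt=0.1):
    """Turn by delta_heading, then move at constant speed for dt seconds."""
    new_heading = heading + delta_heading
    new_x = x + SPEED * dt * math.cos(new_heading)
    new_y = y + SPEED * dt * math.sin(new_heading)
    return new_x, new_y, new_heading


actions = heading_actions()
state = (0.0, 0.0, 0.0)
state = unicycle_step(*state, actions[-1])  # apply the largest left turn
print(state)
```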

## 4 Approach

This work develops an add-on certified defense for existing Deep RL algorithms to ensure robustness against sensor noise or adversarial examples during test time.

### 4.1 System architecture

Figure 2 depicts the system architecture of a standard model-free RL framework with the added-on certification. In an offline training phase, an agent uses a deep RL algorithm, here DQN [Mnih_2015], to train a DNN that maps non-corrupted state observations, $s$, to state-action values, $Q(s,a)$. Action selection during training uses the nominal cost function, ${a}_{nom}^{*}={\mathrm{arg}\mathrm{max}}_{a}Q(s,a)$. We assume the training process causes the network to converge to the optimal value function, ${Q}^{*}(s,a)$, and focus on the challenge of handling perturbed observations during execution.

During online execution, the agent only receives corrupted state observations from the environment, and passes those through the DNN. The certification node uses the DNN architecture and weights, $W$, to compute lower bounds on $Q$ under a bounded perturbation of the input $s\pm \epsilon$, which are used for robust action selection during execution (described below).

### 4.2 Optimal cost function under worst-case perturbation

We consider robustness to an adversary who picks the worst-possible state observation, ${s}_{adv}$, within a small perturbation, $\epsilon$, of the true state, ${s}_{0}$. The adversary assumes the RL agent follows a nominal policy (as in, e.g., DQN) of selecting the action with highest Q-value at the current observation. A worst possible state observation, ${s}_{adv}$, is therefore any one which causes the RL agent to take the action with lowest Q-value in the true state ${s}_{0}$,

$${s}_{adv}\in \{s:s\in {B}_{p}({s}_{0},\epsilon)\text{ and }\underset{a}{\mathrm{arg}\,\mathrm{max}}\,Q(s,a)=\underset{a}{\mathrm{arg}\,\mathrm{min}}\,Q({s}_{0},a)\}. \tag{4}$$

After the adversary picks the state observation, the agent selects an action. Instead of trusting the observation (and thus choosing the worst action for the true state), the agent leverages the fact that the true state ${s}_{0}$ could be anywhere inside an $\epsilon$-ball around ${s}_{adv}$: ${s}_{0}\in {B}_{p}({s}_{adv},\epsilon)$. The agent evaluates each action by calculating the worst-case Q-value under all the possible true states. The optimal action, ${a}^{*}$, is defined here as one with the highest Q-value under the worst-case perturbation,

$${a}^{*}=\underset{{a}_{j}}{\mathrm{arg}\,\mathrm{max}}\underset{s\in {B}_{p}({s}_{adv},\epsilon)}{\mathrm{min}}{Q}_{l}(s,{a}_{j})=\underset{{a}_{j}}{\mathrm{arg}\,\mathrm{max}}\,{Q}_{L}({s}_{adv},{a}_{j}), \tag{5}$$

using the outcome of the certification process, ${Q}_{L}$, as defined in Eq. (1).
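The difference between the nominal and the robust selection rule can be seen on a toy example. The Q-values and lower bounds below are made-up numbers chosen only to show a case where the two rules disagree; `argmax` is a hypothetical helper.

```python
def argmax(values):
    """Index of the largest entry (first one wins on ties)."""
    return max(range(len(values)), key=lambda j: values[j])


# Hypothetical values for three actions at the observed state s_adv.
q_at_observation = [0.90, 0.80, 0.60]   # Q(s_adv, a_j)
q_lower_bound = [-0.20, 0.55, 0.50]     # certified Q_L(s_adv, a_j)

a_nominal = argmax(q_at_observation)    # trusts the observation
a_robust = argmax(q_lower_bound)        # max-min rule over the certified bounds

print("nominal action:", a_nominal)
print("robust action:", a_robust)
```

Here the first action looks best at the observation but has a poor worst case inside the ball, so the robust rule prefers the second action, whose value is nearly as high and much less sensitive to the perturbation.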

### 4.3 Adapting robustness certification to deep RL

To solve Eq. (5) when $Q(s,a)$ is represented by a DNN, we adapt the formulation from [Weng_2018]. Most works on adversarial examples, including [Weng_2018], focus on defending against adversaries on image inputs, in which all channels have the same scale, e.g., black/white images with intensities in $[0,255]$. More generally, however, input channels could be on different scales, e.g., joint torques, velocities, positions. Although not often mentioned, these sensor readings can also be prone to adversarial attacks if transmitted over an insecure messaging framework, e.g., ROS, or can accidentally act as adversarial inputs through sensor failure or noise. Hence, this work extends [Weng_2018] to certify robustness for variable-scale inputs, enabling its use by the broader robotics and machine learning community.

To do so, we compute the lower bound ${Q}_{L}({s}_{adv},{a}_{j})$ for all states inside the $\epsilon$-ball ${B}_{p}({s}_{adv},\epsilon)$ around ${s}_{adv}$ similar to [Weng_2018], but with vector $\epsilon$ (instead of scalar $\epsilon$):

$$\begin{aligned}
{Q}_{L}({s}_{adv},{a}_{j}) &=\underset{s\in {B}_{p}({s}_{adv},\epsilon)}{\mathrm{min}}\left({A}_{j,:}^{(0)}s+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)})\right) && \text{(6)}\\
&=\left(\underset{s\in {B}_{p}({s}_{adv},\epsilon)}{\mathrm{min}}{A}_{j,:}^{(0)}s\right)+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)}) && \text{(7)}\\
&=\left(\underset{y\in {B}_{p}(0,1)}{\mathrm{min}}{A}_{j,:}^{(0)}(y\circ \epsilon)\right)+{A}_{j,:}^{(0)}{s}_{adv}+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)}) && \text{(8)}\\
&=\left(\underset{y\in {B}_{p}(0,1)}{\mathrm{min}}(\epsilon\circ {A}_{j,:}^{(0)})y\right)+{A}_{j,:}^{(0)}{s}_{adv}+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)}) && \text{(9)}\\
&=-{||\epsilon\circ {A}_{j,:}^{(0)}||}_{q}+{A}_{j,:}^{(0)}{s}_{adv}+{b}_{j}^{(m)}+\sum _{k=1}^{m-1}{A}_{j,:}^{(k)}({b}^{(k)}-{H}_{:,j}^{(k)}), && \text{(10)}
\end{aligned}$$

with $\circ $ denoting element-wise multiplication. From Eq. (7) to Eq. (8), substitute $s:=y\circ \epsilon+{s}_{adv}$ to shift and re-scale the observation to within the unit ball around zero, $y\in {B}_{p}(0,1)$. The minimization in Eq. (9) reduces to a negative $q$-norm in Eq. (10) by the definition of the dual norm, ${||z||}_{*}=\sup \{{z}^{T}y:||y||\le 1\}$, and the fact that the ${l}_{q}$ norm is the dual of the ${l}_{p}$ norm for $p,q\in [1,\infty ]$ (with $1/p+1/q=1$): minimizing ${z}^{T}y$ over the unit ball equals $-\sup_{||y||\le 1}{(-z)}^{T}y=-{||z||}_{q}$. The closed form in Eq. (10) is inserted into Eq. (5) to return the best action.
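The dual-norm step can be checked numerically for the $p=\infty$ case, where the perturbation set is a box and the minimum of a linear function is attained at one of its corners. The sketch below compares the closed form $-{||\epsilon\circ a||}_{q}+a\,{s}_{adv}$ against brute-force enumeration of the corners; the function names and numbers are illustrative assumptions.

```python
import itertools
import math


def dual_exponent(p):
    """Return q with 1/p + 1/q = 1, using the convention 1/inf = 0."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1.0)


def lq_norm(v, q):
    if math.isinf(q):
        return max(abs(x) for x in v)
    return sum(abs(x) ** q for x in v) ** (1.0 / q)


def worst_case_linear(a, s_adv, eps, p):
    """Closed form of min over s = s_adv + y * eps, ||y||_p <= 1, of a . s."""
    scaled = [e * ai for e, ai in zip(eps, a)]   # eps (element-wise) a
    nominal = sum(ai * si for ai, si in zip(a, s_adv))
    return nominal - lq_norm(scaled, dual_exponent(p))


def brute_force_linf(a, s_adv, eps):
    """Enumerate the corners of the box s_adv +/- eps (valid for p = inf)."""
    best = math.inf
    for signs in itertools.product((-1.0, 1.0), repeat=len(a)):
        s = [si + sg * e for si, sg, e in zip(s_adv, signs, eps)]
        best = min(best, sum(ai * x for ai, x in zip(a, s)))
    return best


# Hypothetical row of A^(0) and a per-dimension epsilon (zero = trusted input).
a = [2.0, -1.0, 0.5]
s_adv = [1.0, 0.0, -2.0]
eps = [0.1, 0.5, 0.0]
print(worst_case_linear(a, s_adv, eps, math.inf), brute_force_linf(a, s_adv, eps))
```

Setting an entry of the vector $\epsilon$ to zero removes that input dimension from the worst-case term, which is how inputs on different scales, or inputs that are trusted, can be handled within one formula.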

## 5 Experimental Results

The robustness against adversaries and noise during execution is evaluated in simulations for collision avoidance among pedestrians and the cartpole domain. In both domains, increasing magnitudes of noise or adversarial attacks are added onto the observations, which reduces the reward of an agent following a nominal non-robust DQN policy. The added-on defense with robustness parameter, $\epsilon$, increases the robustness of the policy to the introduced perturbations and increases the performance.

### 5.1 Adversarially robust collision avoidance

A nominal DQN policy was trained in the environment described in Section 3.2. To evaluate the learned policy’s robustness to deviations of the input in an $\epsilon$-ball, ${B}_{p}({s}_{0},\epsilon)$, around the true state, ${s}_{0}$, the observations of the environment agent’s position are perturbed during testing by added uniform noise $\sim \mathbb{U}([-\sigma ,\sigma ])$ or by an adversarial attack. The adversarial attack is a fast gradient sign method with target (FGST) [Kurakin_2017b] and approximates the adversary from Eq. (4). FGST crafts the state ${\widehat{s}}_{adv}$ on the $\epsilon$-ball’s perimeter that maximizes the Q-value for the nominally worst action ${\mathrm{arg}\mathrm{min}}_{a}Q({s}_{0},a)$. Specifically, ${\widehat{s}}_{adv}$ is picked along the direction of lowest softmax-cross-entropy loss, $\mathbb{L}$, between a one-hot encoding ${y}_{adv}$ of the worst action and the nominal Q-values, ${y}_{nom}$:

$$\begin{aligned}
{\widehat{s}}_{adv}&={s}_{0}-{\epsilon}_{adv}\,\text{sign}({\nabla}_{s}\mathbb{L}({y}_{adv},{y}_{nom})),\\
{y}_{adv}&=[\mathbb{1}\{{a}_{i}={\mathrm{arg}\,\mathrm{min}}_{a}Q({s}_{0},a)\}]\in {\mathbb{R}}^{|\mathbb{A}|},\\
{y}_{nom}&=[Q({s}_{0},{a}_{i})]\in {\mathbb{R}}^{|\mathbb{A}|}.
\end{aligned}$$
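The attack can be sketched end to end on a model whose gradient is available in closed form. The example below assumes a linear Q-function, $Q(s)=Ws+b$, purely so that the gradient of the softmax cross-entropy can be written by hand as ${W}^{T}(\text{softmax}(Q)-{y}_{adv})$; the experiments in the text use a DNN instead, and all names and numbers here are hypothetical.

```python
import math


def q_values(W, b, s):
    return [sum(w * x for w, x in zip(row, s)) + bi for row, bi in zip(W, b)]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def sign(x):
    return (x > 0) - (x < 0)


def fgst_attack(W, b, s0, eps_adv):
    """Targeted fast-gradient-sign step toward the nominally worst action."""
    q = q_values(W, b, s0)
    target = min(range(len(q)), key=lambda i: q[i])       # argmin_a Q(s0, a)
    y_adv = [1.0 if i == target else 0.0 for i in range(len(q))]
    # Gradient of the cross-entropy w.r.t. the logits is softmax(q) - y_adv.
    dloss_dq = [p - y for p, y in zip(softmax(q), y_adv)]
    # Chain rule through the linear model: grad_s = W^T dloss_dq.
    grad_s = [sum(W[i][k] * dloss_dq[i] for i in range(len(q)))
              for k in range(len(s0))]
    # Step against the gradient to lower the loss of the target action.
    return [s - eps_adv * sign(g) for s, g in zip(s0, grad_s)]


# Hypothetical 2-action model where each action's value tracks one input.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
s0 = [1.0, 0.0]
s_hat = fgst_attack(W, b, s0, eps_adv=0.6)
print(s_hat, q_values(W, b, s_hat))
```

In this toy case the perturbed observation makes the nominally worst action appear best, which is the behaviour the adversary in Eq. (4) aims for; with a DNN the same step would use a gradient obtained by automatic differentiation.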

As expected, the nominal DQN policy, $\epsilon=0$, is not robust to perturbed inputs: increasing the magnitude of the adversarial or noisy perturbation, ${\epsilon}_{adv}$ or $\sigma $, drastically (1) increases the average number of collisions (as seen in Figure 3 at $\epsilon=0$) and (2) decreases the average reward (as seen in Figure 3). The number of collisions and the reward are reported per run and have been averaged over $5\times 100$ episodes in the stochastic environment. Every set of $500$ episodes constitutes one data point, $*$, and has been initialized with the same $5$ random seeds.

Next, we demonstrate the effectiveness of the proposed add-on defense, called Certified Adversarially-Robust Reinforcement Learning (CARRL). The number of collisions decreases with an increasing robustness parameter $\epsilon$ under varying magnitudes of noise or adversarial attack, as seen in Figure 3. As a result, the reward increases with an increasing robustness parameter $\epsilon$ under varying magnitudes of perturbations, as seen in Figure 3. As expected, the importance of the proposed defense is highest under strong perturbations, as seen in the steep slopes of the curves for ${\epsilon}_{adv}=0.23$ m and $\sigma =0.55$ m. Interestingly, the CARRL policy only marginally reduces the reward under no perturbations, $\sigma =0,{\epsilon}_{adv}=0$.

Since the CARRL agent selects actions more conservatively than a nominal DQN agent, it is able to successfully reach its goal instead of colliding as the nominal DQN agent does under noisy or adversarial perturbations. Interestingly, the reward drops significantly for $\epsilon\gtrsim 0.1$, because the agent “runs away” from the obstacle and never reaches the goal, as also seen in Figure 6. This is likely due to the relatively limited exploration of the full state space while training the Q-network: $\epsilon\gtrsim 0.1$ yields an $\epsilon$-ball around ${s}_{adv}$ that is too large and contains states far from the training data, causing the learned Q-function to be inaccurate, which breaks CARRL’s assumption of a perfectly learned Q-function. However, even with a perfectly learned Q-function, the agent could run away or stop when all possible actions could lead to a collision, similar to the freezing robot problem [Trautman2010].

Figure 4 offers further intuition on choosing $\epsilon$. It shows a linear correlation between the attack magnitude ${\epsilon}_{adv}$ and the best defending $\epsilon$ from Figure 3, i.e., the $\epsilon$ that maximizes the reward under the given attack magnitude, ${\epsilon}_{best}={\mathrm{arg}\mathrm{max}}_{\epsilon\in [{\epsilon}_{min},{\epsilon}_{max}]}R(\epsilon,{\epsilon}_{adv})$. Such a correlation cannot be observed under sensor noise, as seen in Figure 4, because an adversary chooses an input state on the perimeter of the $\epsilon$-ball, whereas uniform noise samples from inside the $\epsilon$-ball. The $p=\infty $ norm has been chosen for robustness against uniform noise and the FGST attack, as position observations could be anywhere within a 2-D box around the true state. The modularity in $\epsilon$ allows CARRL to capture uncertainties of varying scales in the input space, e.g., $\epsilon$ is here non-zero for position and zero for other inputs, but could additionally be set non-zero for velocities. $\epsilon$ can further be adapted online to account for, e.g., time-varying sensor noise.

The CARRL policy is able to run in real time; querying the policy took $\sim 20$ ms. One forward pass with certified bounds took $\sim 2$ ms, compared to $\sim 1$ ms for a forward pass of our nominal DQN. In our implementation, we inefficiently query the network once for each of the $11$ actions. The runtime could be reduced by parallelizing the action queries and should remain low for larger networks, as [Weng_2018] shows that the runtime scales linearly with the network size.

Visualization of the CARRL policy in particular scenarios offers additional intuition on the resulting policy. In Figure 5, an agent (orange) with radius $0.5$ m (circle) observes a dynamic obstacle (blue) with added uniform noise $\sigma =0.4$ m (not illustrated) on the position observation. The nominal DQN agent in Figure 5 is not robust to the noise and collides with the obstacle. The CARRL policy in Figure 5, however, is robust to the small perturbation in the input space and successfully avoids the dynamic obstacle. The resulting trajectories for several $\epsilon$ values are shown in Figure 6, for the same uniform noise addition. With increasing $\epsilon$ (toward the right), the CARRL agent accounts for increasingly large worst-case position perturbations. Accordingly, the agent avoids the obstacle increasingly conservatively, i.e., selects actions that leave more safety distance from the obstacle.

The non-dueling DQN used two $64$-unit layers with hyperparameters: learning rate $2.05\mathrm{e}{-4}$, $\epsilon$-greedy exploration fraction $0.497$, final $\epsilon$-greedy value $0.054852$, buffer size $152\mathrm{e}3$, $4\mathrm{e}5$ training steps, and target network update frequency $10\mathrm{e}3$. The hyperparameters were found by running 100 iterations of Bayesian optimization with Gaussian Processes [Jasper_2012] on the maximization of the training reward.

### 5.2 Generalization to the cartpole domain

Experiments in cartpole [barto1983neuronlike] show that the increased robustness to noise can generalize to another domain. The reward is the time that a pole is successfully balanced (capped at $200$ steps). The reward of a nominal DQN drops from $199$ to $141$ under uniform noise with $\sigma =0.25$ added to all observations. CARRL, however, considers the worst-case state, i.e., a state in which the pole is the closest to falling, resulting in a policy that recovers a reward of $180$ with $\epsilon=0.05$, under the same noise conditions. Hyperparameters for a $2$-layer, $4$-unit network were found via Bayesian optimization.

## 6 Conclusion

This work adapted deep RL algorithms for application in safety-critical domains, by proposing an add-on certified defense to address existing failures under adversarially perturbed observations and sensor noise. The proposed extension of robustness certification tools from computer vision into a deep RL formulation enabled efficient calculation of a lower bound on Q-values, given the observation uncertainty. These guaranteed lower bounds were used to modify the action selection rule to provide maximum performance under worst-case observation perturbations. The resulting policy (added onto trained DQN networks) was shown to improve robustness to adversaries and sensor noise, causing fewer collisions in a collision avoidance domain and higher reward in cartpole. Future work will extend the guarantees to continuous action spaces and experiment with robotic hardware.

#### Acknowledgments

This work is supported by Ford Motor Company. The authors greatly thank Tsui-Wei (Lily) Weng for providing code for the Fast-Lin algorithm and insightful discussions.
