### Abstract

We consider the problem of using reinforcement learning to train adversarialagents for automatic testing and falsification of cyberphysical systems, suchas autonomous vehicles, robots, and airplanes. In order to produce usefulagents, however, it is useful to be able to control the degree ofadversariality by specifying rules that an agent must follow. For example, whentesting an autonomous vehicle, it is useful to find maximally antagonistictraffic participants that obey traffic rules. We model dynamic constraints ashierarchically ordered rules expressed in Signal Temporal Logic, and show howthese can be incorporated into an agent training process. We prove that ouragent-centric approach is able to find all dangerous behaviors that can befound by traditional falsification techniques while producing modular andreusable agents. We demonstrate our approach on two case studies from theautomotive domain.

### Quick Read (beta)

# Automatic Testing and Falsification with

Dynamically Constrained Reinforcement Learning

###### Abstract

We consider the problem of using reinforcement learning to train adversarial agents for automatic testing and falsification of cyberphysical systems, such as autonomous vehicles, robots, and airplanes. In order to produce useful agents, however, it is useful to be able to control the degree of adversariality by specifying rules that an agent must follow. For example, when testing an autonomous vehicle, it is useful to find maximally antagonistic traffic participants that obey traffic rules. We model dynamic constraints as hierarchically ordered rules expressed in Signal Temporal Logic, and show how these can be incorporated into an agent training process. We prove that our agent-centric approach is able to find all dangerous behaviors that can be found by traditional falsification techniques while producing modular and reusable agents. We demonstrate our approach on two case studies from the automotive domain.

## I Introduction

When developing cyberphysical systems such as autonomous vehicles,
drones, or aircraft, it is important to have a robust testing strategy
that finds critical bugs before the system is put into production.
Falsification techniques exist to find simulations in which the system
under test fails to satisfy its target specification. These
*falsification traces* can be generated from a bounded set of
inputs. Aside from these input bounds, it can be difficult to
constrain the falsification traces to satisfy dynamic constraints such
as traffic rules on vehicles or regulation on drones and aircraft.

For example, in the case of autonomous driving, it is useful to have testing scenarios with agents that automatically learn to induce the autonomous vehicle under test to make a mistake that results in a collision or other undesirable behavior. In this case, we are typically not interested in the maximally adversarial vehicle, e.g. a vehicle driving the wrong way on the freeway actively attempting to hit everything. Instead, we seek to constrain the behavior of the test agent depending on a desired level of difficulty of the testing regime. We may, for example, stipulate that the adversarial agent may not drive backwards, may not stop on the freeway, and must obey speed limits.

To constrain the adversarial agents, we can organize the dynamic
constraints in hierarchically organized sets, called *rulebooks*
[censi_liability_2019]. The hierarchy of these sets reflects the
relative importance of the rules. These rulebooks can define specific
legal requirements, e.g. do not drive the wrong way and do not run red
lights. They may also encode cultural or normative customary behavior
such as the ‘Pittsburgh left“ or “California Stop“. Each such dynamic
constraint is modeled as a specification in Signal Temporal Logic (STL)
[MalerNickovic2004].

In addition to the dynamic constraints, the adversaries are given a target
specification. This specification is an STL property that the vehicle
under test must attempt to maintain, and the adversaries attempt to
falsify. This problem is known as *falsification*. We show that our
that our algorithm is guaranteed to find bugs of the system under test
when they exist, and we demonstrate our approach on two case studies
from autonomous driving.

## II Related Work

There is extensive related work in falsification of cyberphysical
systems.
Falsification considers a system and a specification. The target specification is given in terms
of the output of the system, and the goal is to find a sequence of
inputs that induce a violation of the target specification. In
existing work on falsification using reinforcement learning algorithms
[akazaki_falsification_2018], in which a single, monolithic falsifier uses reinforcement learning, but to our knowledge this is the
first work in which *multiple* falsification engines is packaged as agents
in the simulation, interacting with the system under test.
Also, use of STL formulas combined with reinforcement
learning agents allows a system designer to specify constraints such as traffic rules.

Work in safe reinforcement [bouton_safe_2019] learning considers the problem of training agents with dynamic constraints. However, the conventional approaches typically use a model checker in the loop. This model checker is run at each time step to determine which actions preserve the dynamic constraints, and the RL agent is only allowed to choose from those actions. In contrast, our work allows the agent to explore and naturally learn to respect the dynamic constraints. Although the agent is not guaranteed to satisfy the dynamic constraints, it would be straightforward to use an SMT solver such as dReal in a counterexample-guided refinement loop until a formal proof can be obtained.

Existing work on *hierarchical rulebooks*
[censi_liability_2019] describes a model for specifications on an
autonomous vehicle as sets of rules that vary in importance, ranging
from basic collision-avoidance through traffic rules and comfort
requirements to local customs. However, our work provides an
implementation of this idea as signal temporal logic
constraints, and our algorithm describes how to train an agent that
behaves adversarially within the bounds of rulebooks.

## III Background

Our goal is to test an autonomous vehicle, called an “ego”, vehicle in simulation. The autonomous vehicle is being tested against a specification, and the simulation contains other agents that seek to cause the ego vehicle to falsify its requirement. We use reinforcement learning to train agents that behave in a constrained adversarial fashion to cause the ego to violate its requirement. The adversarial agents are constrained so that we are able to provide specifications, such as traffic rules. In this way, we are able to control the level of adversarial behavior, which enables having a spectrum of agents that exhibit different levels of difficult behavior. The constraints are expressed as logical scaffolds for the learning agent, and prioritized hierarchically.

### III-A Reinforcement learning

Reinforcement learning provides a general framework and class of algorithms to train goal-driven agents [sutton_reinforcement_2018].

A reinforcement learning problem consists of a set of states $S$, a set of actions $A$, and a set of reward values $R$. At each time step $t$, the agent considers the current state $s\in S$ and chooses an action $a\in A$. After taking this action, it receives a reward $r$.

We model the environment of the reinforcement learning
problem by a conditional distribution on next states and rewards, given
the current state and the agent action, $p({s}^{\prime},r|a,s)$. This
probability distribution is called the *dynamics* of
the reinforcement learning problem.

The goal of the agent is to learn a policy that maximizes the expected
reward over multiple episodes. A policy is modeled as a conditional
distribution over actions given the current state, $\pi (a|s)$. This
policy represents the probability of taking action $a$ at state
$s$. For a fixed policy $\pi $, the *state-action value function*, also
known as the Q function, quantifies the agent’s belief of the quality of action $a$ at state
$s$.

Many techniques exist to find the policy that maximizes expected returns over time.
In our application, however, the precise decision-making procedure of the vehicle
under test is not known to the adversaries, and for this reason it is not
possible to compute the expectation with the explicit probability distribution.
these cases, we instead use Monte Carlo-style techniques to
*sample* transitions of the system, estimate the quality of actions
in a given state, and improve the quality of the policy over time. Two
such techniques that we will use in this work are Q-learning and deep
Q-learning.

We sample the initial state of each episode with a probability distribution $\mu (s)$ that is nonzero at all states. Thus, if the algorithm runs long enough, all states will eventually be selected as the initial state. Additionally, we will fix the policies of the adversarial agents to be $\u03f5$-soft. This means that for each state $s$ and every action $a$, $\pi (a|s)\ge \u03f5$, where $\u03f5>0$ is a parameter. Taken together, random sampling of initial states and $\u03f5$-soft policies ensure that the agent performs sufficient exploration and avoids converging prematurely to local optima. In fact, it ensures that if the algorithm runs long enough, the global optimum will eventually be found, as all trajectories have nonzero probability.

Q-Learning: Q-learning is an algorithm used for reinforcement learning that does not require knowledge of the environment that the agent is interacting with. The agent maintains a table whose rows correspond to the states of the system and whose columns correspond to the actions. For a state-action pair $(s,a)$, the entry at row $s$ and column $a$ represents the quality $q(s,a)$ that the agent has currently calculated. The table is initialized randomly. At each time step $t$, the agent considers the current state value ${s}_{t}$ and, for each state $a\in A$, uses the table to judge the quality of each action from this state, $q({s}_{t},a)$. Then, based on this judgment, it selects an action ${a}_{t}$ based on the $\u03f5$-greedy policy described above.

Next, at time step $t+1$, the agent observes the reward received ${r}_{t+1}$ as well as the new state ${s}_{t+1}$, and it uses this information to update its beliefs about its previous behavior via the update equation

$q({s}_{t},{a}_{t})\leftarrow $ | $q({s}_{t},{a}_{t})+$ | ||

$\alpha \left[{r}_{t+1}+\gamma \underset{a}{\mathrm{max}}q({s}_{t+1},a)-q({s}_{t},{a}_{t})\right]$ |

where $\alpha $ is a learning rate parameter.

Deep Q-learning: In deep Q-learning
[mnih_playing_2013], the table $q(s,a)$ is approximated by a
neural network, $q(s,a,w)$, where $w$ are the network parameters.
Deep
Q-learning observes states and selects actions similarly to Q-learning, but it additionally uses *experience replay*, in which the agent stores
previously observed tuples of states, actions, next states, and rewards.
At each time step, the agent updates its q-function with the currently
observed experience as well as with a batch of experiences sampled
randomly from the experience replay buffer. The agent then updates its approximation network by gradient descent on

$${({y}_{t}-q({s}_{t},{a}_{t},w))}^{2}$$ |

where

$${y}_{t}={r}_{t+1}+\gamma \underset{{a}^{\prime}}{\mathrm{max}}q({s}_{t+1},a,w).$$ |

In the case that ${s}_{t}$ is a terminal state, it is common to assume that all transitions are such that ${s}_{t+1}={s}_{t}$ and ${r}_{t}=0$.

### III-B Signal Temporal Logic

Temporal logics are powerful tools to express specifications, and enable
a wide array of V&V applications, including testcase generation,
falsification, formal verification, and runtime monitoring. Applications
for these languages include robotics
[KressGazitEtAl2009, Wongpiromsarn2012, WongpiromsarnTopcuEtAl2011],
systems biology [BartocciBortolussiNenzi2013], and network planning
[KaramanFrazzoli2011].
Signal Temporal Logic (STL) [Maler2004]
enables specifications *real-valued signals*
and can be applied to many continuous and hybrid systems, such as automotive applications.

STL formulas are defined over predicates of the form $$, where $s$ is a timed trace (signal), $f:{\mathbb{R}}^{n}\to \mathbb{R}$ is a function and $c\in \mathbb{R}$. STL formulas are written using the following grammar:

$I:=$ | $\mathrm{}(a,b)|(a,b]|[a,b)|[a,b]$ | ||

$\varphi :=$ | $$ | ||

$\mathrm{}|{\mathcal{E}}_{I}\varphi |{\mathcal{A}}_{I}\varphi |\varphi {\mathcal{U}}_{I}\psi $ |

where $$ is a predicate, and the logical operators ($\mathrm{\neg},\wedge ,\vee $) have their typical meanings. In addition, $\mathcal{E}$ (eventually), $\mathcal{A}$ (always), $\mathcal{U}$ (until) and $\mathcal{T}$ (then) are temporal operators. We can achieve $f(s)>c$ by applying a negation on $$. The temporal operators have an associated time interval $I$ where $$. For ease of notation, $I$ is dropped from the grammar when $a=0,b=\mathrm{\infty}$.

STL formulas are evaluated over time series data, packaged in a data
structure that we call a *timed trace*. A timed trace $s$ consists
of an ordered sequence of states and their associated time, $s=({\mathbf{x}}_{0},{t}_{0}),\mathrm{\dots},({\mathbf{x}}_{n},{t}_{n})$ where $$
and ${\mathbf{x}}_{i}\in {\mathbb{R}}^{n}$. To refer to the value of the trace
at a given time, we use the notation, $s({t}_{i})={\mathbf{x}}_{i}$. We use
the notation${s}_{i}=(s,{t}_{i})$ to refer to the *tail* of trace $s$,
i.e. the trace that contains all of the same data $s$ from time ${t}_{i}$
onwards. The *Boolean semantics* of STL formulas over a trace $(s,t)$ are
defined recursively as follows.

$(s,t)$ | $$ | $$ | ||

$(s,t)$ | $\vDash \mathrm{\neg}\varphi $ | $\iff \mathrm{\neg}((s,t)\vDash \varphi )$ | ||

$(s,t)$ | $\vDash \varphi \wedge \psi $ | $\iff ((s,t)\vDash \varphi )\wedge ((s,t)\vDash \psi )$ | ||

$(s,t)$ | $\vDash \varphi \vee \psi $ | $\iff ((s,t)\vDash \varphi )\vee ((s,t)\vDash \psi )$ | ||

$(s,t)$ | $\vDash {\mathcal{E}}_{I}\varphi $ | $\iff \exists {t}^{\prime}\in I\oplus ts.t.(s,{t}^{\prime})\vDash \varphi $ | ||

$(s,t)$ | $\vDash {\mathcal{A}}_{I}\varphi $ | $\iff \forall {t}^{\prime}\in I\oplus ts.t.(s,{t}^{\prime})\vDash \varphi $ | ||

$(s,t)$ | $\vDash \varphi {\mathcal{U}}_{I}\psi $ | $\iff \exists {t}^{\prime}\in I\oplus ts.t.((s,{t}^{\prime})\vDash \psi )\wedge $ | ||

$\mathrm{\hspace{1em}\hspace{1em}}((s,t)\vDash {\mathcal{A}}_{[0,{t}^{\prime}]}\varphi )$ |

For a timed trace $(s,t)$ starting at time $t$, satisfying $\mathcal{A}\varphi $ means $\varphi $ is always true for the entire sequence. A trace satisfies $\mathcal{E}\varphi $ if $\varphi $ is true at least once during the sequence. The until operator states that the left formula is true until the right formula becomes true. Since STL specifications are defined recursively, temporal operators can be composed.

## IV Problem Statement

In a fixed scenario, an ego vehicle must drive while satisfying an STL specification $\psi $. Several other
agents ${a}_{1},\mathrm{\dots},{a}_{n}$ called *adversaries* are also in the
simulation. The adversaries seek to cause the ego to violate it
specification, while conforming their behavior to a prioritized rulebook
whose rules are expressed as STL formulas. For convenience, we shift
our perspective to the point of the view of the adversaries, who, in
falsifying $\psi $, must satisfy the *negation* of $\psi $, which
we call ${\varphi}_{target}$.

Formally, a scenario containing $n$ adversarial agents has an observable state vector $s$ that can be decomposed into the components

$$s={\left[\begin{array}{cccc}\hfill {s}_{ego}\hfill & \hfill {s}_{adv}^{(1)}\hfill & \hfill \mathrm{\dots}\hfill & \hfill {s}_{adv}^{(n)}\hfill \end{array}\right]}^{T},$$ |

where ${s}_{ego}$ is the state of the ego vehicle and ${s}_{adv}^{(k)}$ is the state of the $k$-th adversarial agent. In addition to the observable state, each agent may have an unobservable state, ${z}_{ego}$ for the ego vehicle and ${z}_{adv}^{(k)}$ for each adversarial agent. To simplify the treatment, we describe adversarial agents without internal state, but the addition of appropriate update equations is straightforward.

Note that an adversarial agent is not limited to correspond to vehicles, and can in fact control any state in the scenario, such as a traffic light that we wish to control in a constrained adversarial manner, such as pedestrians or traffic lights.

At each time step $t$, all agents simultaneously evaluate the observable state $s[t]$ and select an action according to their control policies. The ego selects an action

$${a}_{ego}[t+1]={f}_{ego}(s[t],{z}_{ego}[t])$$ |

as well as a next value for its internal state

$${z}_{ego}[t+1]={g}_{ego}(s[t],{z}_{ego}[t]).$$ |

The adversarial agents select their action ${a}_{adv}^{(k)}$ by sampling from a probability distribution

$${a}_{adv}^{(k)}[t+1]\sim {\pi}_{adv}^{(k)}({a}_{adv}^{(k)}[t+1]|s[t],{z}_{adv}^{(k)}[t]).$$ |

After all agents have selected their actions, the (possibly stochastic) simulator updates the observable state by an unknown function $s[t+1]={f}_{simulator}(s[t],a[t]).$ where $a[t]$ is the vector that concatenates the actions of all agents.

### IV-A Scaffolding the learning process

Each adversarial agent $k$ is given a *target* STL formula
${\varphi}_{t}^{(k)}$ as well as a rulebook of acceptable behavior.
Each rule in this rulebook is given as an STL formula. The rulebook for
agent $k$ is given by $(\{{c}_{j}^{(k)}\},\u2aaf)$, where
$\u2aaf$ is the relation that models the relative priority between
constraints. If constraint ${c}_{i}^{(k)}$ has lower priority than
constraint ${c}_{j}^{(k)}$, then ${c}_{i}^{(k)}\u2aaf{c}_{j}^{(k)}$ and
${c}_{j}^{(k)}\u22e0{c}_{i}^{(k)}$. Similarly, if the constraints have the
same priority, then ${c}_{i}^{(k)}\u2aaf{c}_{j}^{(k)}$ and ${c}_{j}^{(k)}\u2aaf{c}_{i}^{(k)}$.
Assume that the rules of the
rulebook have been sorted into sets with the same importance. For
example, let ${M}_{i}^{(k)}$ and ${I}_{j}^{(k)}$ be two such sets. If $$, we say that ${M}_{i}^{(k)}$ has less importance than ${M}_{j}^{(k)}$.
Then, for every ${c}_{i}^{(k)},{c}_{j}^{(k)}\in {M}_{i}^{(k)}$, we have that
${c}_{i}^{(k)}\u2aaf{c}_{j}^{(k)}$ and ${c}_{j}^{(k)}\u2aaf{c}_{i}^{(k)}$.
Similarly, for each ${c}_{i}^{(k)}\in {M}_{i}^{(k)}$ and every ${c}_{j}^{(k)}\in {M}_{j}^{(k)}$, we have that ${c}_{i}^{(k)}\u2aaf{c}_{j}^{(k)}$.

For each set ${M}_{i}^{(k)}$, we associate a hyperparameter ${\lambda}_{i}^{(k)}>0$. This is the amount of punishment that the agent will receive for violating a constraint in ${M}_{i}^{(k)}$. Furthermore, there is a hyperparameter ${\lambda}_{t}^{(k)}>0$ that represents the reward that agent $k$ receives when it attains its target. If the maximum length of an episode is fixed to be $N$, it is possible to give relationships between these parameters. We want to prevent the agent from “hacking” its rulebook, violating constraints because the punishment is less than the reward of attaining its target. The maximum reward that the agent could attain by meeting its target at all $N$ time steps is $N{\lambda}_{t}^{(k)}$. Then, we require that for the lowest priority group of constraints ${M}_{0}$, ${\lambda}_{0}>(N{\lambda}_{t}^{(k)}$. Similarly, the higher priority groups should have values that discourage the agent from violating those constraints more than the lower priority rules.

To compute the reward signal, we log the state trace from the initial time $0$ to the present time. We represent this state trace by ${s}_{0:t}$. For an STL formula $\varphi $, let $\mathcal{I}(\varphi ,{s}_{0:t})$ be the indicator function of $\varphi $ over the trace at time $t$. This function is equal to $1$ if $({s}_{0:t},t)\vDash \varphi $ (i.e. the state trace satisfies $\varphi $ at time $t$) and zero otherwise. Then, the reward signal can be computed as

${r}^{(k)}[t+1]$ | $={\lambda}_{t}^{(k)}(1-\mathcal{I}({\varphi}_{t}^{(k)}))$ | |||

$-{\displaystyle \sum _{j}}{\displaystyle \sum _{i=1}^{|{M}_{j}|}}{\lambda}_{j}^{(k)}\mathcal{I}({c}_{i}^{(k)},{s}_{0:t})$ | (1) |

This means that the agent is rewarded by an amount ${\lambda}_{t}^{(k)}$ for causing a violation of the target specification ${\varphi}_{t}$, and it is punished by an amount ${\lambda}_{j}^{(k)}$ for violating a constraint ${c}_{j}^{(k)}\in {M}_{j}$.

###### Theorem 1 (Completeness)

Let $N$ be the maximum episode length. If a satisfying trace of ${\varphi}_{t}^{\mathrm{(}k\mathrm{)}}$ of length $$ exists, agent $k$ will eventually find it.

This follows as we have required that agent policies be $\u03f5$-soft. For simplicity, suppose that there is only one adversarial agent. Since every action at every state is taken with probability at least $\u03f5>0$, then any $N$-step sequence of actions has probability at least ${\u03f5}^{N}>0$. Therefore, if the algorithm runs long enough, the trace will eventually be found.

## V Case Studies

We consider two case studies. In the first, an ego vehicle is following an adversarial agent on a single-lane freeway. In the second scenario, the ego vehicle is driving on the freeway and an adversarial agent performs a cut-in maneuver. In both cases, the adversary learns to cause a collision with the ego subject to traffic constraints automatically by exploring the state space. Experiments were performed with the CARLA simulator [Dosovitskiy17] on a Razer with the Intel Core i7 2.6 GHz processor and 16GB RAM.

### V-A Driving in lane

In this experiment, two vehicles are driving on a single lane freeway. The lead vehicle is an adversarial agent, and the follow vehicle is using an adaptive cruise control policy that we wish to test. This vehicle under test is called the “ego” vehicle. The throttle of the ego vehicle ${\theta}_{ego}$ is calculated by

$$ | (2) |

where ${\theta}_{min}$ and ${\theta}_{max}$ are saturation bounds, and $PD$ is a proportional-derivative control law given by

$$PD={K}_{p}(d-{d}_{set})+{K}_{d}({v}_{adv}-{v}_{ego}).$$ | (3) |

Here, $d$ is the distance between the front bumper of the two vehicles. ${d}_{set}$ is a setpoint distance that the vehicle tries to maintain, ${v}_{ego}$ is the velocity of the ego, ${v}_{adv}$ is the velocity of the adversary, and ${K}_{p}$ and ${K}_{d}$ are proportional and derivative gains, respectively.

The objective formula for the adversarial agent is described by the STL formula

$${\varphi}_{t}={\mathcal{E}}_{[0,T]}(d\le {d}_{min})$$ | (4) |

where $T$ is the maximum duration of an episode and ${d}_{min}$ is the minimum safe distance between the two vehicles. In other words, the objective of the adversary is to violate the safety distance in time less than $T$.

The constraints that the adversarial agent must obey are

${c}_{1}=\mathcal{A}({v}_{adv}\le {v}_{lim})$ | (5) | ||

${c}_{2}=\mathcal{A}({v}_{adv}\ge {v}_{min}),$ | (6) |

where ${v}_{lim}$ is the speed limit and ${v}_{min}>0$ is a minimum velocity to prevent the adversary from coming to a complete stop, since stopping on the freeway is a traffic rule violation.

We set the reward parameters ${\lambda}_{t}=10$ and ${\lambda}_{1}={\lambda}_{2}=100$. The reward can be calculated as in Equation 1 depending on whether it attained its objective and whether it violated its constraints. The safety distance is ${d}_{min}=5.02m$. The distance is computed between the two front bumpers. This represents a car length of $5m$, plus a small safety margin. At each time step, the adversarial agent may select an acceleration. The acceleration space has been discretized to contain $3$ possible actions. The state space consists of $d,{v}_{ego},{v}_{adv}$, the distance and velocities of the vehicles.

#### V-A1 Results using Q-table

We first consider tabular Q-learning. Each episode is chosen from a random initial state, and over time the adversary is able to induce a collision as training proceeds.

Figure 4 shows two episodes from the same initial state. In the first episode, the adversary fails to induce a collision, whereas in the second episode it succeeds.

Figure 5 shows adversary improvement on another starting condition. These two conditions have same initial speed but different initial distances. We can see that the agent learns to react with a different trajectory to cause a collision. When initial distance large, the adversary learns to not brake too fast. This makes the slope of the speed curve more gentle, luring the ego into complacence and causing a collision.

When initial distance is small, the adversary learns to brake sooner and harder, making the slope of speed curve more steep and causing collision.

epoch | success | episode | success rate (%) | sim time (s) |
---|---|---|---|---|

6 | 76 | 106 | 71.71 | 9115.97 |

8 | 149 | 206 | 72.35 | 16963.58 |

24 | 281 | 371 | 75.76 | 33986.82 |

We found that, with less strict constraints, for example for a larger safety margin the adversarial agent trying to falsify, the higher success rate can it get. Table I shows a glance of training procedure for ${d}_{min}=5.5m$.

Not only look at success condition, we also counting situations of: Q1: Success not achieved, constraints violated, Q2: Success achieved, constraints violated, Q3: No success, no constraints violated, Q4: Success achieved, no constraint violations, during the training. Figure 4 shows the Quad plot for Q-table. Which we can see, with minimum speed limit, it will either succeed or fail, but not succeed but violation constrain or find case that have no succeed and no violation yet shown in figure 5.

If we make the lower speed limit close to zero, the system will catch behaviors of finding no constraint violation, no success.

The total simulation for 488 episodes was 40623.52 seconds. The average episode duration was 83.24 seconds.

#### V-A2 Results using Neural Network

We then consider using Neural Network with a replay buffer[Mnih2013PlayingAW]. In the neural network case, we do not need to discretize the state space.

The quad plot is shown in figure 6. Performance of neural network increase over time.

The simulation time for 853 episodes was 24806.76 seconds, with an average time of 29.08 seconds per episode.

#### V-A3 Comparison between Q-table and Neural Network

From the section above and table II we see the average run time of the Neural Network case is less than the Q-table. Storing and reloading Q-table across different simulations is more computationally expensive than a portable Neural Network. For large feasible search spaces, a fine grained Q-table can reach 1GB of size. Note that the episode number is different because within an epoch, a simulation exit can happen.

Episode | Success | Success Rate (%) | Time (s) | |
---|---|---|---|---|

Q-table | 233 | 130 | 55.79 | 17420.58 |

NN | 237 | 130 | 54.85 | 6964.63 |

### V-B Lane Change Maneuvers

In this experiment, two vehicles are driving on a two-lane freeway. The ego vehicle is controlled together by a coupled controller consisting of cruise and collision-avoidance controllers. The ego vehicle predicts future adversary positions based on the current state and switches between the behaviors. The adversarial agent is on the left lane and has the goal of merging right to cause the ego vehicle to collide with it. To prevent cases where the adversary merges into the side of the ego vehicle, we add a constraint that the adversary should always be longitudinally in front of the ego car.

$${\text{Controller}}_{ego}=\{\begin{array}{ccc}\text{Cruising}\hfill & \hfill \text{if}\hfill & D>{d}_{safety}\hfill \\ \text{Aviod Collision}\hfill & \hfill \text{if}\hfill & D\le {d}_{safety}\hfill \end{array},$$ | (7) |

The look ahead distance is calculated by:

$$D={d}_{lat}-{v}_{lat\mathrm{\_}adversary}*\tau $$ | (8) |

where ${d}_{lat}$ is lateral distance between adversary and ego car, ${v}_{lat\mathrm{\_}adversary}$ is the current lateral velocity of the adversary, $\tau $ is the look ahead time.

Episode | Success count | Success rate (%) | Simulation time (s) |
---|---|---|---|

162 | 97 | 60.63 | 6797.20 |

192 | 117 | 61.58 | 7645.74 |

219 | 139 | 64.06 | 8289.18 |

247 | 157 | 64.08 | 9337.53 |

From table III we see the performance of the adversary is improving through time. The behavior of adversary can potentially help improve not only ego’s each single controller, but also ego’s policy of switching between cruising and avoid collision controller.

## VI Conclusions and Future Work

We have described a technique to automatically test and falsify complex systems by training dynamically constrained RL agents. Our approach can find all of the counterexample traces that a monolithic falsifier can find, and it comes with the additional advantage of being able to re-use pretrained adversarial agents in other testing scenarios. In future work, we will explore the use of an SMT solver to check that adversarial agents indeed satisfy their dynamic constraints, and as part of a counterexample-guided retraining process in case they do not. Furthermore, we will explore the use of signal clustering techniques to distinguish different categories of counterexample traces. We believe it will be useful to test engineers to see a few categories that may correspond to the same bug rather than to a large number of counterexample traces.