Automatic Testing and Falsification with Dynamically Constrained Reinforcement Learning

  • 2019-10-30 03:18:10
  • Xin Qin, Nikos Aréchiga, Andrew Best, Jyotirmoy Deshmukh
  • 2

Abstract

We consider the problem of using reinforcement learning to train adversarialagents for automatic testing and falsification of cyberphysical systems, suchas autonomous vehicles, robots, and airplanes. In order to produce usefulagents, however, it is useful to be able to control the degree ofadversariality by specifying rules that an agent must follow. For example, whentesting an autonomous vehicle, it is useful to find maximally antagonistictraffic participants that obey traffic rules. We model dynamic constraints ashierarchically ordered rules expressed in Signal Temporal Logic, and show howthese can be incorporated into an agent training process. We prove that ouragent-centric approach is able to find all dangerous behaviors that can befound by traditional falsification techniques while producing modular andreusable agents. We demonstrate our approach on two case studies from theautomotive domain.

 

Quick Read (beta)

Automatic Testing and Falsification with
Dynamically Constrained Reinforcement Learning

Xin Qin1, Nikos Aréchiga2, Andrew Best2, Jyotirmoy Deshmukh1 1 Xin Qin and Jyotirmoy Deshmukh are with the University of Southern California {xinqin, jdeshmuk}@usc.edu2 Nikos Aréchiga and Andrew Best are with Toyota Research Institute,
{nikos.arechiga, andrew.best}@tri.global
Abstract

We consider the problem of using reinforcement learning to train adversarial agents for automatic testing and falsification of cyberphysical systems, such as autonomous vehicles, robots, and airplanes. In order to produce useful agents, however, it is useful to be able to control the degree of adversariality by specifying rules that an agent must follow. For example, when testing an autonomous vehicle, it is useful to find maximally antagonistic traffic participants that obey traffic rules. We model dynamic constraints as hierarchically ordered rules expressed in Signal Temporal Logic, and show how these can be incorporated into an agent training process. We prove that our agent-centric approach is able to find all dangerous behaviors that can be found by traditional falsification techniques while producing modular and reusable agents. We demonstrate our approach on two case studies from the automotive domain.

(a) Case Study I: Driving in lane with lead vehicle.
(b) Case Study II: Left vehicle merges in front.
Fig. 1: Simulation environments for case studies

I Introduction

When developing cyberphysical systems such as autonomous vehicles, drones, or aircraft, it is important to have a robust testing strategy that finds critical bugs before the system is put into production. Falsification techniques exist to find simulations in which the system under test fails to satisfy its target specification. These falsification traces can be generated from a bounded set of inputs. Aside from these input bounds, it can be difficult to constrain the falsification traces to satisfy dynamic constraints such as traffic rules on vehicles or regulation on drones and aircraft.

For example, in the case of autonomous driving, it is useful to have testing scenarios with agents that automatically learn to induce the autonomous vehicle under test to make a mistake that results in a collision or other undesirable behavior. In this case, we are typically not interested in the maximally adversarial vehicle, e.g. a vehicle driving the wrong way on the freeway actively attempting to hit everything. Instead, we seek to constrain the behavior of the test agent depending on a desired level of difficulty of the testing regime. We may, for example, stipulate that the adversarial agent may not drive backwards, may not stop on the freeway, and must obey speed limits.

To constrain the adversarial agents, we can organize the dynamic constraints in hierarchically organized sets, called rulebooks [censi_liability_2019]. The hierarchy of these sets reflects the relative importance of the rules. These rulebooks can define specific legal requirements, e.g. do not drive the wrong way and do not run red lights. They may also encode cultural or normative customary behavior such as the ‘Pittsburgh left“ or “California Stop“. Each such dynamic constraint is modeled as a specification in Signal Temporal Logic (STL) [MalerNickovic2004].

In addition to the dynamic constraints, the adversaries are given a target specification. This specification is an STL property that the vehicle under test must attempt to maintain, and the adversaries attempt to falsify. This problem is known as falsification. We show that our that our algorithm is guaranteed to find bugs of the system under test when they exist, and we demonstrate our approach on two case studies from autonomous driving.

II Related Work

There is extensive related work in falsification of cyberphysical systems. Falsification considers a system and a specification. The target specification is given in terms of the output of the system, and the goal is to find a sequence of inputs that induce a violation of the target specification. In existing work on falsification using reinforcement learning algorithms [akazaki_falsification_2018], in which a single, monolithic falsifier uses reinforcement learning, but to our knowledge this is the first work in which multiple falsification engines is packaged as agents in the simulation, interacting with the system under test. Also, use of STL formulas combined with reinforcement learning agents allows a system designer to specify constraints such as traffic rules.

Work in safe reinforcement [bouton_safe_2019] learning considers the problem of training agents with dynamic constraints. However, the conventional approaches typically use a model checker in the loop. This model checker is run at each time step to determine which actions preserve the dynamic constraints, and the RL agent is only allowed to choose from those actions. In contrast, our work allows the agent to explore and naturally learn to respect the dynamic constraints. Although the agent is not guaranteed to satisfy the dynamic constraints, it would be straightforward to use an SMT solver such as dReal in a counterexample-guided refinement loop until a formal proof can be obtained.

Existing work on hierarchical rulebooks [censi_liability_2019] describes a model for specifications on an autonomous vehicle as sets of rules that vary in importance, ranging from basic collision-avoidance through traffic rules and comfort requirements to local customs. However, our work provides an implementation of this idea as signal temporal logic constraints, and our algorithm describes how to train an agent that behaves adversarially within the bounds of rulebooks.

III Background

Our goal is to test an autonomous vehicle, called an “ego”, vehicle in simulation. The autonomous vehicle is being tested against a specification, and the simulation contains other agents that seek to cause the ego vehicle to falsify its requirement. We use reinforcement learning to train agents that behave in a constrained adversarial fashion to cause the ego to violate its requirement. The adversarial agents are constrained so that we are able to provide specifications, such as traffic rules. In this way, we are able to control the level of adversarial behavior, which enables having a spectrum of agents that exhibit different levels of difficult behavior. The constraints are expressed as logical scaffolds for the learning agent, and prioritized hierarchically.

III-A Reinforcement learning

Reinforcement learning provides a general framework and class of algorithms to train goal-driven agents [sutton_reinforcement_2018].

A reinforcement learning problem consists of a set of states S, a set of actions A, and a set of reward values R. At each time step t, the agent considers the current state sS and chooses an action aA. After taking this action, it receives a reward r.

We model the environment of the reinforcement learning problem by a conditional distribution on next states and rewards, given the current state and the agent action, p(s,r|a,s). This probability distribution is called the dynamics of the reinforcement learning problem.

The goal of the agent is to learn a policy that maximizes the expected reward over multiple episodes. A policy is modeled as a conditional distribution over actions given the current state, π(a|s). This policy represents the probability of taking action a at state s. For a fixed policy π, the state-action value function, also known as the Q function, quantifies the agent’s belief of the quality of action a at state s.

Many techniques exist to find the policy that maximizes expected returns over time. In our application, however, the precise decision-making procedure of the vehicle under test is not known to the adversaries, and for this reason it is not possible to compute the expectation with the explicit probability distribution. these cases, we instead use Monte Carlo-style techniques to sample transitions of the system, estimate the quality of actions in a given state, and improve the quality of the policy over time. Two such techniques that we will use in this work are Q-learning and deep Q-learning.

We sample the initial state of each episode with a probability distribution μ(s) that is nonzero at all states. Thus, if the algorithm runs long enough, all states will eventually be selected as the initial state. Additionally, we will fix the policies of the adversarial agents to be ϵ-soft. This means that for each state s and every action a, π(a|s)ϵ, where ϵ>0 is a parameter. Taken together, random sampling of initial states and ϵ-soft policies ensure that the agent performs sufficient exploration and avoids converging prematurely to local optima. In fact, it ensures that if the algorithm runs long enough, the global optimum will eventually be found, as all trajectories have nonzero probability.

Q-Learning: Q-learning is an algorithm used for reinforcement learning that does not require knowledge of the environment that the agent is interacting with. The agent maintains a table whose rows correspond to the states of the system and whose columns correspond to the actions. For a state-action pair (s,a), the entry at row s and column a represents the quality q(s,a) that the agent has currently calculated. The table is initialized randomly. At each time step t, the agent considers the current state value st and, for each state aA, uses the table to judge the quality of each action from this state, q(st,a). Then, based on this judgment, it selects an action at based on the ϵ-greedy policy described above.

Next, at time step t+1, the agent observes the reward received rt+1 as well as the new state st+1, and it uses this information to update its beliefs about its previous behavior via the update equation

q(st,at) q(st,at)+
α[rt+1+γmaxaq(st+1,a)-q(st,at)]

where α is a learning rate parameter.

Deep Q-learning: In deep Q-learning [mnih_playing_2013], the table q(s,a) is approximated by a neural network, q(s,a,w), where w are the network parameters. Deep Q-learning observes states and selects actions similarly to Q-learning, but it additionally uses experience replay, in which the agent stores previously observed tuples of states, actions, next states, and rewards. At each time step, the agent updates its q-function with the currently observed experience as well as with a batch of experiences sampled randomly from the experience replay buffer. The agent then updates its approximation network by gradient descent on

(yt-q(st,at,w))2

where

yt=rt+1+γmaxaq(st+1,a,w).

In the case that st is a terminal state, it is common to assume that all transitions are such that st+1=st and rt=0.

III-B Signal Temporal Logic

Temporal logics are powerful tools to express specifications, and enable a wide array of V&V applications, including testcase generation, falsification, formal verification, and runtime monitoring. Applications for these languages include robotics [KressGazitEtAl2009, Wongpiromsarn2012, WongpiromsarnTopcuEtAl2011], systems biology [BartocciBortolussiNenzi2013], and network planning [KaramanFrazzoli2011]. Signal Temporal Logic (STL) [Maler2004] enables specifications real-valued signals and can be applied to many continuous and hybrid systems, such as automotive applications.

STL formulas are defined over predicates of the form f(s)<c, where s is a timed trace (signal), f:n is a function and c. STL formulas are written using the following grammar:

I:=  (a,b)|(a,b]|[a,b)|[a,b]
ϕ:=  true|f(s)<c|¬ϕ|ϕψ|ϕψ
 |Iϕ|𝒜Iϕ|ϕ𝒰Iψ

where f(s)<c is a predicate, and the logical operators (¬,,) have their typical meanings. In addition, (eventually), 𝒜 (always), 𝒰 (until) and 𝒯 (then) are temporal operators. We can achieve f(s)>c by applying a negation on f(s)<c. The temporal operators have an associated time interval I where 0a<b. For ease of notation, I is dropped from the grammar when a=0,b=.

STL formulas are evaluated over time series data, packaged in a data structure that we call a timed trace. A timed trace s consists of an ordered sequence of states and their associated time, s=(𝐱0,t0),,(𝐱n,tn) where ti-1<ti and 𝐱in. To refer to the value of the trace at a given time, we use the notation, s(ti)=𝐱i. We use the notationsi=(s,ti) to refer to the tail of trace s, i.e. the trace that contains all of the same data s from time ti onwards. The Boolean semantics of STL formulas over a trace (s,t) are defined recursively as follows.

(s,t) f(s(t))<c f(s(t))<c
(s,t) ¬ϕ ¬((s,t)ϕ)
(s,t) ϕψ ((s,t)ϕ)((s,t)ψ)
(s,t) ϕψ ((s,t)ϕ)((s,t)ψ)
(s,t) Iϕ tIts.t.(s,t)ϕ
(s,t) 𝒜Iϕ tIts.t.(s,t)ϕ
(s,t) ϕ𝒰Iψ tIts.t.((s,t)ψ)
  ((s,t)𝒜[0,t]ϕ)

For a timed trace (s,t) starting at time t, satisfying 𝒜ϕ means ϕ is always true for the entire sequence. A trace satisfies ϕ if ϕ is true at least once during the sequence. The until operator states that the left formula is true until the right formula becomes true. Since STL specifications are defined recursively, temporal operators can be composed.

IV Problem Statement

In a fixed scenario, an ego vehicle must drive while satisfying an STL specification ψ. Several other agents a1,,an called adversaries are also in the simulation. The adversaries seek to cause the ego to violate it specification, while conforming their behavior to a prioritized rulebook whose rules are expressed as STL formulas. For convenience, we shift our perspective to the point of the view of the adversaries, who, in falsifying ψ, must satisfy the negation of ψ, which we call ϕtarget.

Formally, a scenario containing n adversarial agents has an observable state vector s that can be decomposed into the components

s=[segosadv(1)sadv(n)]T,

where sego is the state of the ego vehicle and sadv(k) is the state of the k-th adversarial agent. In addition to the observable state, each agent may have an unobservable state, zego for the ego vehicle and zadv(k) for each adversarial agent. To simplify the treatment, we describe adversarial agents without internal state, but the addition of appropriate update equations is straightforward.

Note that an adversarial agent is not limited to correspond to vehicles, and can in fact control any state in the scenario, such as a traffic light that we wish to control in a constrained adversarial manner, such as pedestrians or traffic lights.

At each time step t, all agents simultaneously evaluate the observable state s[t] and select an action according to their control policies. The ego selects an action

aego[t+1]=fego(s[t],zego[t])

as well as a next value for its internal state

zego[t+1]=gego(s[t],zego[t]).

The adversarial agents select their action aadv(k) by sampling from a probability distribution

aadv(k)[t+1]πadv(k)(aadv(k)[t+1]|s[t],zadv(k)[t]).

After all agents have selected their actions, the (possibly stochastic) simulator updates the observable state by an unknown function s[t+1]=fsimulator(s[t],a[t]). where a[t] is the vector that concatenates the actions of all agents.

IV-A Scaffolding the learning process

Each adversarial agent k is given a target STL formula ϕt(k) as well as a rulebook of acceptable behavior. Each rule in this rulebook is given as an STL formula. The rulebook for agent k is given by ({cj(k)},), where is the relation that models the relative priority between constraints. If constraint ci(k) has lower priority than constraint cj(k), then ci(k)cj(k) and cj(k)ci(k). Similarly, if the constraints have the same priority, then ci(k)cj(k) and cj(k)ci(k). Assume that the rules of the rulebook have been sorted into sets with the same importance. For example, let Mi(k) and Ij(k) be two such sets. If i<j, we say that Mi(k) has less importance than Mj(k). Then, for every ci(k),cj(k)Mi(k), we have that ci(k)cj(k) and cj(k)ci(k). Similarly, for each ci(k)Mi(k) and every cj(k)Mj(k), we have that ci(k)cj(k).

For each set Mi(k), we associate a hyperparameter λi(k)>0. This is the amount of punishment that the agent will receive for violating a constraint in Mi(k). Furthermore, there is a hyperparameter λt(k)>0 that represents the reward that agent k receives when it attains its target. If the maximum length of an episode is fixed to be N, it is possible to give relationships between these parameters. We want to prevent the agent from “hacking” its rulebook, violating constraints because the punishment is less than the reward of attaining its target. The maximum reward that the agent could attain by meeting its target at all N time steps is Nλt(k). Then, we require that for the lowest priority group of constraints M0, λ0>(Nλt(k). Similarly, the higher priority groups should have values that discourage the agent from violating those constraints more than the lower priority rules.

To compute the reward signal, we log the state trace from the initial time 0 to the present time. We represent this state trace by s0:t. For an STL formula ϕ, let (ϕ,s0:t) be the indicator function of ϕ over the trace at time t. This function is equal to 1 if (s0:t,t)ϕ (i.e. the state trace satisfies ϕ at time t) and zero otherwise. Then, the reward signal can be computed as

r(k)[t+1] =λt(k)(1-(ϕt(k)))
-ji=1|Mj|λj(k)(ci(k),s0:t) (1)

This means that the agent is rewarded by an amount λt(k) for causing a violation of the target specification ϕt, and it is punished by an amount λj(k) for violating a constraint cj(k)Mj.

Theorem 1 (Completeness)

Let N be the maximum episode length. If a satisfying trace of ϕt(k) of length M<N exists, agent k will eventually find it.

This follows as we have required that agent policies be ϵ-soft. For simplicity, suppose that there is only one adversarial agent. Since every action at every state is taken with probability at least ϵ>0, then any N-step sequence of actions has probability at least ϵN>0. Therefore, if the algorithm runs long enough, the trace will eventually be found.

V Case Studies

We consider two case studies. In the first, an ego vehicle is following an adversarial agent on a single-lane freeway. In the second scenario, the ego vehicle is driving on the freeway and an adversarial agent performs a cut-in maneuver. In both cases, the adversary learns to cause a collision with the ego subject to traffic constraints automatically by exploring the state space. Experiments were performed with the CARLA simulator [Dosovitskiy17] on a Razer with the Intel Core i7 2.6 GHz processor and 16GB RAM.

V-A Driving in lane

In this experiment, two vehicles are driving on a single lane freeway. The lead vehicle is an adversarial agent, and the follow vehicle is using an adaptive cruise control policy that we wish to test. This vehicle under test is called the “ego” vehicle. The throttle of the ego vehicle θego is calculated by

θego={θmaxifPD>θmaxPDifθminPDθmaxθminifPD<θmin, (2)

where θmin and θmax are saturation bounds, and PD is a proportional-derivative control law given by

PD=Kp(d-dset)+Kd(vadv-vego). (3)

Here, d is the distance between the front bumper of the two vehicles. dset is a setpoint distance that the vehicle tries to maintain, vego is the velocity of the ego, vadv is the velocity of the adversary, and Kp and Kd are proportional and derivative gains, respectively.

The objective formula for the adversarial agent is described by the STL formula

ϕt=[0,T](ddmin) (4)

where T is the maximum duration of an episode and dmin is the minimum safe distance between the two vehicles. In other words, the objective of the adversary is to violate the safety distance in time less than T.

The constraints that the adversarial agent must obey are

c1=𝒜(vadvvlim) (5)
c2=𝒜(vadvvmin), (6)

where vlim is the speed limit and vmin>0 is a minimum velocity to prevent the adversary from coming to a complete stop, since stopping on the freeway is a traffic rule violation.

We set the reward parameters λt=10 and λ1=λ2=100. The reward can be calculated as in Equation 1 depending on whether it attained its objective and whether it violated its constraints. The safety distance is dmin=5.02m. The distance is computed between the two front bumpers. This represents a car length of 5m, plus a small safety margin. At each time step, the adversarial agent may select an acceleration. The acceleration space has been discretized to contain 3 possible actions. The state space consists of d,vego,vadv, the distance and velocities of the vehicles.

V-A1 Results using Q-table

We first consider tabular Q-learning. Each episode is chosen from a random initial state, and over time the adversary is able to induce a collision as training proceeds.

Figure 4 shows two episodes from the same initial state. In the first episode, the adversary fails to induce a collision, whereas in the second episode it succeeds.

Figure 5 shows adversary improvement on another starting condition. These two conditions have same initial speed but different initial distances. We can see that the agent learns to react with a different trajectory to cause a collision. When initial distance large, the adversary learns to not brake too fast. This makes the slope of the speed curve more gentle, luring the ego into complacence and causing a collision.

Fig. 2: Initial Condition 1:

When initial distance is small, the adversary learns to brake sooner and harder, making the slope of speed curve more steep and causing collision.

Fig. 3: Initial Condition 2
Fig. 4: Quad plot for Q-table.
epoch success episode success rate (%) sim time (s)
6 76 106 71.71 9115.97
8 149 206 72.35 16963.58
24 281 371 75.76 33986.82
TABLE I: Training With Looser Constrain

We found that, with less strict constraints, for example for a larger safety margin the adversarial agent trying to falsify, the higher success rate can it get. Table I shows a glance of training procedure for dmin=5.5m.

Not only look at success condition, we also counting situations of: Q1: Success not achieved, constraints violated, Q2: Success achieved, constraints violated, Q3: No success, no constraints violated, Q4: Success achieved, no constraint violations, during the training. Figure 4 shows the Quad plot for Q-table. Which we can see, with minimum speed limit, it will either succeed or fail, but not succeed but violation constrain or find case that have no succeed and no violation yet shown in figure 5.

If we make the lower speed limit close to zero, the system will catch behaviors of finding no constraint violation, no success.

Fig. 5: Quad plot for Q-table II

The total simulation for 488 episodes was 40623.52 seconds. The average episode duration was 83.24 seconds.

V-A2 Results using Neural Network

We then consider using Neural Network with a replay buffer[Mnih2013PlayingAW]. In the neural network case, we do not need to discretize the state space.

The quad plot is shown in figure 6. Performance of neural network increase over time.

Fig. 6: Quad plot for DQN

The simulation time for 853 episodes was 24806.76 seconds, with an average time of 29.08 seconds per episode.

V-A3 Comparison between Q-table and Neural Network

From the section above and table II we see the average run time of the Neural Network case is less than the Q-table. Storing and reloading Q-table across different simulations is more computationally expensive than a portable Neural Network. For large feasible search spaces, a fine grained Q-table can reach 1GB of size. Note that the episode number is different because within an epoch, a simulation exit can happen.

Episode Success Success Rate (%) Time (s)
Q-table 233 130 55.79 17420.58
NN 237 130 54.85 6964.63
TABLE II: Comparison at end of Epoch 14

V-B Lane Change Maneuvers

In this experiment, two vehicles are driving on a two-lane freeway. The ego vehicle is controlled together by a coupled controller consisting of cruise and collision-avoidance controllers. The ego vehicle predicts future adversary positions based on the current state and switches between the behaviors. The adversarial agent is on the left lane and has the goal of merging right to cause the ego vehicle to collide with it. To prevent cases where the adversary merges into the side of the ego vehicle, we add a constraint that the adversary should always be longitudinally in front of the ego car.

Controllerego={CruisingifD>dsafetyAviod CollisionifDdsafety, (7)

The look ahead distance is calculated by:

D=dlat-vlat_adversary*τ (8)

where dlat is lateral distance between adversary and ego car, vlat_adversary is the current lateral velocity of the adversary, τ is the look ahead time.

Episode Success count Success rate (%) Simulation time (s)
162 97 60.63 6797.20
192 117 61.58 7645.74
219 139 64.06 8289.18
247 157 64.08 9337.53
TABLE III: Case II

From table III we see the performance of the adversary is improving through time. The behavior of adversary can potentially help improve not only ego’s each single controller, but also ego’s policy of switching between cruising and avoid collision controller.

VI Conclusions and Future Work

We have described a technique to automatically test and falsify complex systems by training dynamically constrained RL agents. Our approach can find all of the counterexample traces that a monolithic falsifier can find, and it comes with the additional advantage of being able to re-use pretrained adversarial agents in other testing scenarios. In future work, we will explore the use of an SMT solver to check that adversarial agents indeed satisfy their dynamic constraints, and as part of a counterexample-guided retraining process in case they do not. Furthermore, we will explore the use of signal clustering techniques to distinguish different categories of counterexample traces. We believe it will be useful to test engineers to see a few categories that may correspond to the same bug rather than to a large number of counterexample traces.

References