We consider the problem of using reinforcement learning to train adversarialagents for automatic testing and falsification of cyberphysical systems, suchas autonomous vehicles, robots, and airplanes. In order to produce usefulagents, however, it is useful to be able to control the degree ofadversariality by specifying rules that an agent must follow. For example, whentesting an autonomous vehicle, it is useful to find maximally antagonistictraffic participants that obey traffic rules. We model dynamic constraints ashierarchically ordered rules expressed in Signal Temporal Logic, and show howthese can be incorporated into an agent training process. We prove that ouragent-centric approach is able to find all dangerous behaviors that can befound by traditional falsification techniques while producing modular andreusable agents. We demonstrate our approach on two case studies from theautomotive domain.
Quick Read (beta)
Automatic Testing and Falsification with
Dynamically Constrained Reinforcement Learning
We consider the problem of using reinforcement learning to train adversarial agents for automatic testing and falsification of cyberphysical systems, such as autonomous vehicles, robots, and airplanes. In order to produce useful agents, however, it is useful to be able to control the degree of adversariality by specifying rules that an agent must follow. For example, when testing an autonomous vehicle, it is useful to find maximally antagonistic traffic participants that obey traffic rules. We model dynamic constraints as hierarchically ordered rules expressed in Signal Temporal Logic, and show how these can be incorporated into an agent training process. We prove that our agent-centric approach is able to find all dangerous behaviors that can be found by traditional falsification techniques while producing modular and reusable agents. We demonstrate our approach on two case studies from the automotive domain.
When developing cyberphysical systems such as autonomous vehicles, drones, or aircraft, it is important to have a robust testing strategy that finds critical bugs before the system is put into production. Falsification techniques exist to find simulations in which the system under test fails to satisfy its target specification. These falsification traces can be generated from a bounded set of inputs. Aside from these input bounds, it can be difficult to constrain the falsification traces to satisfy dynamic constraints such as traffic rules on vehicles or regulation on drones and aircraft.
For example, in the case of autonomous driving, it is useful to have testing scenarios with agents that automatically learn to induce the autonomous vehicle under test to make a mistake that results in a collision or other undesirable behavior. In this case, we are typically not interested in the maximally adversarial vehicle, e.g. a vehicle driving the wrong way on the freeway actively attempting to hit everything. Instead, we seek to constrain the behavior of the test agent depending on a desired level of difficulty of the testing regime. We may, for example, stipulate that the adversarial agent may not drive backwards, may not stop on the freeway, and must obey speed limits.
To constrain the adversarial agents, we can organize the dynamic constraints in hierarchically organized sets, called rulebooks [censi_liability_2019]. The hierarchy of these sets reflects the relative importance of the rules. These rulebooks can define specific legal requirements, e.g. do not drive the wrong way and do not run red lights. They may also encode cultural or normative customary behavior such as the ‘Pittsburgh left“ or “California Stop“. Each such dynamic constraint is modeled as a specification in Signal Temporal Logic (STL) [MalerNickovic2004].
In addition to the dynamic constraints, the adversaries are given a target specification. This specification is an STL property that the vehicle under test must attempt to maintain, and the adversaries attempt to falsify. This problem is known as falsification. We show that our that our algorithm is guaranteed to find bugs of the system under test when they exist, and we demonstrate our approach on two case studies from autonomous driving.
II Related Work
There is extensive related work in falsification of cyberphysical systems. Falsification considers a system and a specification. The target specification is given in terms of the output of the system, and the goal is to find a sequence of inputs that induce a violation of the target specification. In existing work on falsification using reinforcement learning algorithms [akazaki_falsification_2018], in which a single, monolithic falsifier uses reinforcement learning, but to our knowledge this is the first work in which multiple falsification engines is packaged as agents in the simulation, interacting with the system under test. Also, use of STL formulas combined with reinforcement learning agents allows a system designer to specify constraints such as traffic rules.
Work in safe reinforcement [bouton_safe_2019] learning considers the problem of training agents with dynamic constraints. However, the conventional approaches typically use a model checker in the loop. This model checker is run at each time step to determine which actions preserve the dynamic constraints, and the RL agent is only allowed to choose from those actions. In contrast, our work allows the agent to explore and naturally learn to respect the dynamic constraints. Although the agent is not guaranteed to satisfy the dynamic constraints, it would be straightforward to use an SMT solver such as dReal in a counterexample-guided refinement loop until a formal proof can be obtained.
Existing work on hierarchical rulebooks [censi_liability_2019] describes a model for specifications on an autonomous vehicle as sets of rules that vary in importance, ranging from basic collision-avoidance through traffic rules and comfort requirements to local customs. However, our work provides an implementation of this idea as signal temporal logic constraints, and our algorithm describes how to train an agent that behaves adversarially within the bounds of rulebooks.
Our goal is to test an autonomous vehicle, called an “ego”, vehicle in simulation. The autonomous vehicle is being tested against a specification, and the simulation contains other agents that seek to cause the ego vehicle to falsify its requirement. We use reinforcement learning to train agents that behave in a constrained adversarial fashion to cause the ego to violate its requirement. The adversarial agents are constrained so that we are able to provide specifications, such as traffic rules. In this way, we are able to control the level of adversarial behavior, which enables having a spectrum of agents that exhibit different levels of difficult behavior. The constraints are expressed as logical scaffolds for the learning agent, and prioritized hierarchically.
III-A Reinforcement learning
Reinforcement learning provides a general framework and class of algorithms to train goal-driven agents [sutton_reinforcement_2018].
A reinforcement learning problem consists of a set of states , a set of actions , and a set of reward values . At each time step , the agent considers the current state and chooses an action . After taking this action, it receives a reward .
We model the environment of the reinforcement learning problem by a conditional distribution on next states and rewards, given the current state and the agent action, . This probability distribution is called the dynamics of the reinforcement learning problem.
The goal of the agent is to learn a policy that maximizes the expected reward over multiple episodes. A policy is modeled as a conditional distribution over actions given the current state, . This policy represents the probability of taking action at state . For a fixed policy , the state-action value function, also known as the Q function, quantifies the agent’s belief of the quality of action at state .
Many techniques exist to find the policy that maximizes expected returns over time. In our application, however, the precise decision-making procedure of the vehicle under test is not known to the adversaries, and for this reason it is not possible to compute the expectation with the explicit probability distribution. these cases, we instead use Monte Carlo-style techniques to sample transitions of the system, estimate the quality of actions in a given state, and improve the quality of the policy over time. Two such techniques that we will use in this work are Q-learning and deep Q-learning.
We sample the initial state of each episode with a probability distribution that is nonzero at all states. Thus, if the algorithm runs long enough, all states will eventually be selected as the initial state. Additionally, we will fix the policies of the adversarial agents to be -soft. This means that for each state and every action , , where is a parameter. Taken together, random sampling of initial states and -soft policies ensure that the agent performs sufficient exploration and avoids converging prematurely to local optima. In fact, it ensures that if the algorithm runs long enough, the global optimum will eventually be found, as all trajectories have nonzero probability.
Q-Learning: Q-learning is an algorithm used for reinforcement learning that does not require knowledge of the environment that the agent is interacting with. The agent maintains a table whose rows correspond to the states of the system and whose columns correspond to the actions. For a state-action pair , the entry at row and column represents the quality that the agent has currently calculated. The table is initialized randomly. At each time step , the agent considers the current state value and, for each state , uses the table to judge the quality of each action from this state, . Then, based on this judgment, it selects an action based on the -greedy policy described above.
Next, at time step , the agent observes the reward received as well as the new state , and it uses this information to update its beliefs about its previous behavior via the update equation
where is a learning rate parameter.
Deep Q-learning: In deep Q-learning [mnih_playing_2013], the table is approximated by a neural network, , where are the network parameters. Deep Q-learning observes states and selects actions similarly to Q-learning, but it additionally uses experience replay, in which the agent stores previously observed tuples of states, actions, next states, and rewards. At each time step, the agent updates its q-function with the currently observed experience as well as with a batch of experiences sampled randomly from the experience replay buffer. The agent then updates its approximation network by gradient descent on
In the case that is a terminal state, it is common to assume that all transitions are such that and .
III-B Signal Temporal Logic
Temporal logics are powerful tools to express specifications, and enable a wide array of V&V applications, including testcase generation, falsification, formal verification, and runtime monitoring. Applications for these languages include robotics [KressGazitEtAl2009, Wongpiromsarn2012, WongpiromsarnTopcuEtAl2011], systems biology [BartocciBortolussiNenzi2013], and network planning [KaramanFrazzoli2011]. Signal Temporal Logic (STL) [Maler2004] enables specifications real-valued signals and can be applied to many continuous and hybrid systems, such as automotive applications.
STL formulas are defined over predicates of the form , where is a timed trace (signal), is a function and . STL formulas are written using the following grammar:
where is a predicate, and the logical operators () have their typical meanings. In addition, (eventually), (always), (until) and (then) are temporal operators. We can achieve by applying a negation on . The temporal operators have an associated time interval where . For ease of notation, is dropped from the grammar when .
STL formulas are evaluated over time series data, packaged in a data structure that we call a timed trace. A timed trace consists of an ordered sequence of states and their associated time, where and . To refer to the value of the trace at a given time, we use the notation, . We use the notation to refer to the tail of trace , i.e. the trace that contains all of the same data from time onwards. The Boolean semantics of STL formulas over a trace are defined recursively as follows.
For a timed trace starting at time , satisfying means is always true for the entire sequence. A trace satisfies if is true at least once during the sequence. The until operator states that the left formula is true until the right formula becomes true. Since STL specifications are defined recursively, temporal operators can be composed.
IV Problem Statement
In a fixed scenario, an ego vehicle must drive while satisfying an STL specification . Several other agents called adversaries are also in the simulation. The adversaries seek to cause the ego to violate it specification, while conforming their behavior to a prioritized rulebook whose rules are expressed as STL formulas. For convenience, we shift our perspective to the point of the view of the adversaries, who, in falsifying , must satisfy the negation of , which we call .
Formally, a scenario containing adversarial agents has an observable state vector that can be decomposed into the components
where is the state of the ego vehicle and is the state of the -th adversarial agent. In addition to the observable state, each agent may have an unobservable state, for the ego vehicle and for each adversarial agent. To simplify the treatment, we describe adversarial agents without internal state, but the addition of appropriate update equations is straightforward.
Note that an adversarial agent is not limited to correspond to vehicles, and can in fact control any state in the scenario, such as a traffic light that we wish to control in a constrained adversarial manner, such as pedestrians or traffic lights.
At each time step , all agents simultaneously evaluate the observable state and select an action according to their control policies. The ego selects an action
as well as a next value for its internal state
The adversarial agents select their action by sampling from a probability distribution
After all agents have selected their actions, the (possibly stochastic) simulator updates the observable state by an unknown function where is the vector that concatenates the actions of all agents.
IV-A Scaffolding the learning process
Each adversarial agent is given a target STL formula as well as a rulebook of acceptable behavior. Each rule in this rulebook is given as an STL formula. The rulebook for agent is given by , where is the relation that models the relative priority between constraints. If constraint has lower priority than constraint , then and . Similarly, if the constraints have the same priority, then and . Assume that the rules of the rulebook have been sorted into sets with the same importance. For example, let and be two such sets. If , we say that has less importance than . Then, for every , we have that and . Similarly, for each and every , we have that .
For each set , we associate a hyperparameter . This is the amount of punishment that the agent will receive for violating a constraint in . Furthermore, there is a hyperparameter that represents the reward that agent receives when it attains its target. If the maximum length of an episode is fixed to be , it is possible to give relationships between these parameters. We want to prevent the agent from “hacking” its rulebook, violating constraints because the punishment is less than the reward of attaining its target. The maximum reward that the agent could attain by meeting its target at all time steps is . Then, we require that for the lowest priority group of constraints , . Similarly, the higher priority groups should have values that discourage the agent from violating those constraints more than the lower priority rules.
To compute the reward signal, we log the state trace from the initial time to the present time. We represent this state trace by . For an STL formula , let be the indicator function of over the trace at time . This function is equal to if (i.e. the state trace satisfies at time ) and zero otherwise. Then, the reward signal can be computed as
This means that the agent is rewarded by an amount for causing a violation of the target specification , and it is punished by an amount for violating a constraint .
Theorem 1 (Completeness)
Let be the maximum episode length. If a satisfying trace of of length exists, agent will eventually find it.
This follows as we have required that agent policies be -soft. For simplicity, suppose that there is only one adversarial agent. Since every action at every state is taken with probability at least , then any -step sequence of actions has probability at least . Therefore, if the algorithm runs long enough, the trace will eventually be found.
V Case Studies
We consider two case studies. In the first, an ego vehicle is following an adversarial agent on a single-lane freeway. In the second scenario, the ego vehicle is driving on the freeway and an adversarial agent performs a cut-in maneuver. In both cases, the adversary learns to cause a collision with the ego subject to traffic constraints automatically by exploring the state space. Experiments were performed with the CARLA simulator [Dosovitskiy17] on a Razer with the Intel Core i7 2.6 GHz processor and 16GB RAM.
V-A Driving in lane
In this experiment, two vehicles are driving on a single lane freeway. The lead vehicle is an adversarial agent, and the follow vehicle is using an adaptive cruise control policy that we wish to test. This vehicle under test is called the “ego” vehicle. The throttle of the ego vehicle is calculated by
where and are saturation bounds, and is a proportional-derivative control law given by
Here, is the distance between the front bumper of the two vehicles. is a setpoint distance that the vehicle tries to maintain, is the velocity of the ego, is the velocity of the adversary, and and are proportional and derivative gains, respectively.
The objective formula for the adversarial agent is described by the STL formula
where is the maximum duration of an episode and is the minimum safe distance between the two vehicles. In other words, the objective of the adversary is to violate the safety distance in time less than .
The constraints that the adversarial agent must obey are
where is the speed limit and is a minimum velocity to prevent the adversary from coming to a complete stop, since stopping on the freeway is a traffic rule violation.
We set the reward parameters and . The reward can be calculated as in Equation 1 depending on whether it attained its objective and whether it violated its constraints. The safety distance is . The distance is computed between the two front bumpers. This represents a car length of , plus a small safety margin. At each time step, the adversarial agent may select an acceleration. The acceleration space has been discretized to contain possible actions. The state space consists of , the distance and velocities of the vehicles.
V-A1 Results using Q-table
We first consider tabular Q-learning. Each episode is chosen from a random initial state, and over time the adversary is able to induce a collision as training proceeds.
Figure 4 shows two episodes from the same initial state. In the first episode, the adversary fails to induce a collision, whereas in the second episode it succeeds.
Figure 5 shows adversary improvement on another starting condition. These two conditions have same initial speed but different initial distances. We can see that the agent learns to react with a different trajectory to cause a collision. When initial distance large, the adversary learns to not brake too fast. This makes the slope of the speed curve more gentle, luring the ego into complacence and causing a collision.
When initial distance is small, the adversary learns to brake sooner and harder, making the slope of speed curve more steep and causing collision.
|epoch||success||episode||success rate (%)||sim time (s)|
We found that, with less strict constraints, for example for a larger safety margin the adversarial agent trying to falsify, the higher success rate can it get. Table I shows a glance of training procedure for .
Not only look at success condition, we also counting situations of: Q1: Success not achieved, constraints violated, Q2: Success achieved, constraints violated, Q3: No success, no constraints violated, Q4: Success achieved, no constraint violations, during the training. Figure 4 shows the Quad plot for Q-table. Which we can see, with minimum speed limit, it will either succeed or fail, but not succeed but violation constrain or find case that have no succeed and no violation yet shown in figure 5.
If we make the lower speed limit close to zero, the system will catch behaviors of finding no constraint violation, no success.
The total simulation for 488 episodes was 40623.52 seconds. The average episode duration was 83.24 seconds.
V-A2 Results using Neural Network
We then consider using Neural Network with a replay buffer[Mnih2013PlayingAW]. In the neural network case, we do not need to discretize the state space.
The quad plot is shown in figure 6. Performance of neural network increase over time.
The simulation time for 853 episodes was 24806.76 seconds, with an average time of 29.08 seconds per episode.
V-A3 Comparison between Q-table and Neural Network
From the section above and table II we see the average run time of the Neural Network case is less than the Q-table. Storing and reloading Q-table across different simulations is more computationally expensive than a portable Neural Network. For large feasible search spaces, a fine grained Q-table can reach 1GB of size. Note that the episode number is different because within an epoch, a simulation exit can happen.
|Episode||Success||Success Rate (%)||Time (s)|
V-B Lane Change Maneuvers
In this experiment, two vehicles are driving on a two-lane freeway. The ego vehicle is controlled together by a coupled controller consisting of cruise and collision-avoidance controllers. The ego vehicle predicts future adversary positions based on the current state and switches between the behaviors. The adversarial agent is on the left lane and has the goal of merging right to cause the ego vehicle to collide with it. To prevent cases where the adversary merges into the side of the ego vehicle, we add a constraint that the adversary should always be longitudinally in front of the ego car.
The look ahead distance is calculated by:
where is lateral distance between adversary and ego car, is the current lateral velocity of the adversary, is the look ahead time.
|Episode||Success count||Success rate (%)||Simulation time (s)|
From table III we see the performance of the adversary is improving through time. The behavior of adversary can potentially help improve not only ego’s each single controller, but also ego’s policy of switching between cruising and avoid collision controller.
VI Conclusions and Future Work
We have described a technique to automatically test and falsify complex systems by training dynamically constrained RL agents. Our approach can find all of the counterexample traces that a monolithic falsifier can find, and it comes with the additional advantage of being able to re-use pretrained adversarial agents in other testing scenarios. In future work, we will explore the use of an SMT solver to check that adversarial agents indeed satisfy their dynamic constraints, and as part of a counterexample-guided retraining process in case they do not. Furthermore, we will explore the use of signal clustering techniques to distinguish different categories of counterexample traces. We believe it will be useful to test engineers to see a few categories that may correspond to the same bug rather than to a large number of counterexample traces.