Multi-Vehicle Mixed-Reality Reinforcement Learning for Autonomous Multi-Lane Driving

  • 2019-11-26 17:08:40
  • Rupert Mitchell, Jenny Fletcher, Jacopo Panerati, Amanda Prorok


Autonomous driving promises to transform road transport. Multi-vehicle and multi-lane scenarios, however, present unique challenges due to constrained navigation and unpredictable vehicle interactions. Learning-based methods, such as deep reinforcement learning, are emerging as a promising approach to automatically design intelligent driving policies that can cope with these challenges. Yet, the process of safely learning multi-vehicle driving behaviours is hard: while collisions (and their near-avoidance) are essential to the learning process, directly executing immature policies on autonomous vehicles raises considerable safety concerns. In this article, we present a safe and efficient framework that enables the learning of driving policies for autonomous vehicles operating in a shared workspace, where the absence of collisions cannot be guaranteed. Key to our learning procedure is a sim2real approach that uses real-world online policy adaptation in a mixed-reality setup, where other vehicles and static obstacles exist in the virtual domain. This allows us to perform safe learning by simulating (and learning from) collisions between the learning agent(s) and other objects in virtual reality. Our results demonstrate that, after only a few runs in mixed-reality, collisions are significantly reduced.




Department of Computer Science and Technology, University of Cambridge
{rmjm3, jlf60, jp872, asp45}


Additional Key Words and Phrases: Multi-robot systems; Machine learning for robotics; Reinforcement learning; Autonomous vehicles; Reality gap; Sim2real




The deployment of automated and autonomous vehicles presents us with transformational opportunities for road transport. To date, the number of companies working on this technology is substantial and growing (CBS, 2018). Opportunities reach beyond single-vehicle automation: by enabling groups of vehicles to jointly agree on maneuvers and navigation strategies, real-time coordination promises to improve overall traffic throughput, road capacity, and passenger safety (Dressler et al., 2014; Ferreira et al., 2010). However, driving in multi-vehicle and multi-lane settings remains a challenging research problem, due to unpredictable vehicle interactions (e.g., non-cooperative cars, unreliable communication), hard workspace limitations (e.g., lane topographies), and constrained platform dynamics (e.g., steering kinematics, driver comfort).

Learning-based methods, such as deep reinforcement learning, have proven effective at designing robot control policies for an increasing number of tasks in single-vehicle systems, for applications such as navigation (Khan et al., 2019), flight (Molchanov et al., 2019), and locomotion (Tan et al., 2018). Leveraging such methods for learning autonomous driving policies is emerging as a particularly promising approach (Pan et al., 2017; Shalev-Shwartz et al., 2016; Kuderer et al., 2015). Yet, the process of safely learning autonomous driving involves unique challenges, since the decision models often used in robotics do not lend themselves naturally to the multi-vehicle domain, due to the unpredictable behaviour of other agents. The unapologetic nature of the trial-and-error process in reinforcement learning compounds the difficulty of ensuring functional safety.

These adversities call for learning that first takes place in simulation, before transferring to the real world (Miglino et al., 1995; Shah et al., 2018). This transfer, often referred to as sim2real, is challenging due to discrepancies between conditions in simulation and the real world (such as vehicle dynamics and sensor data) (Peng et al., 2018; James et al., 2019; Chebotar et al., 2019). Despite substantial advances in this field, the problem of executing immature policies directly on an autonomous vehicle still raises considerable safety concerns. These concerns are exacerbated when multiple autonomous vehicles share the same workspace, risking collisions and irreparable damage. At the same time, the act of colliding (or nearly colliding) is essential to the learning process, enabling future policy roll-outs to incorporate these critical experiences. How are we to provide safe multi-vehicle learning experiences, without forgoing the realism of high-fidelity training data? There is a dearth of work that addresses this challenge.

Figure 1. Mixed-reality multi-vehicle multi-lane traffic circuit including one real DeepRacer robot and twelve virtual ones, in beige. Four static virtual vehicles are rendered in blue. The colliding virtual vehicle is rendered in red.

Our goal in this paper is to develop a safe and efficient framework that allows us to learn driving policies for autonomous vehicles operating in a shared workspace, where collision-freeness cannot be guaranteed. Towards this end, we learn an end-to-end policy for vehicle navigation on a multi-lane track that is shared with other moving vehicles and static obstacles. The learning is based on a model-free method embedded in a distributed training mechanism that we tailor for mixed-reality compatibility. Key to our learning procedure is a sim2real approach that uses real-world online policy adaptation in a mixed-reality setup, where obstacles (vehicles and objects) exist in the virtual domain. This allows us to perform safe learning by simulating (and learning from) collisions between the learning agent(s) and other objects in virtual reality. We apply our framework to a multi-vehicle setup consisting of one real vehicle, and several simulated vehicles (as shown in Figure 1). Experiments show that a significant performance improvement can be obtained after just a few runs in mixed-reality, reducing the number of collisions and increasing reward collection. To the best of our knowledge, this is the first demonstration of mixed-reality reinforcement learning for multi-vehicle applications.

Training in simulation before transferring learned policies to the real world provides the benefits of safety and facilitated data collection. Several methods alleviate the difficulty of bridging the reality gap: (i) parameter estimation, which estimates parameters of the real system to achieve a more realistic simulation (Lowrey et al., 2018; Tan et al., 2018), (ii) iterative data collection, which learns distributions of dynamics parameters in an iterative manner (Christiano et al., 2016; Chebotar et al., 2019), and (iii) domain randomization, which trains over a distribution of the system dynamics for policies that are more robust against simulator discrepancies from reality (Peng et al., 2018; Muratore et al., 2018; James et al., 2019; Tobin et al., 2017). Although these methods contribute significantly to closing the reality gap, the problem of guaranteeing safe policy execution still persists. Moreover, it often proves hard to accommodate all situations the robot may encounter in the real world, where unexpected conditions are the norm. To ease this challenge, researchers have proposed methods for continuous online adaptation in model-based reinforcement learning (Fu et al., 2016; Gu et al., 2016). The aim of this approach is to learn an approximate model and then adapt it at test time. However, this can still lead to safety concerns when there is a mismatch between what the model is trained for, and how it is used at test-time. More recent approaches, such as meta-learning, strive to overcome this challenge (Nagabandi et al., 2019). The commonality of all these approaches, however, is their focus on single-robot systems in isolated work-spaces; guaranteeing safe online-learning in shared workspaces is still an open problem.

The idea of exploiting mixed (and augmented) reality for robotics applications was originally introduced as a tool to facilitate development and prototyping. Early work experiments with virtual humanoids amongst real obstacles (Stilman et al., 2005), leveraging the setup to rapidly prototype and test humanoid sub-components. Chen et al. (Chen et al., 2009) use augmented reality to obtain a coherent display of visual feedback during interactions between a real robot and virtual objects. More recently, mixed reality has gained importance in shared human-robot environments (Williams et al., 2018), where combinations of physical and virtual environments can provide safer ways to test interactions, “… by also allowing a gradual transition of the system components into shared physical environments” (Hoenig et al., 2015). The introduction of mixed reality to support reinforcement learning has barely been considered. In (Mohammadi et al., 2019), Mohammadi et al. present an approach for online continuous deep reinforcement learning for a reach-to-grasp task in a mixed-reality environment. Although targets exist in the physical world, the learning procedure is carried out in simulation (using real data), before actions are transferred and executed on the actual robot.

The particularity of our work is that we focus on multi-robot settings, where inter-robot interactions contribute significantly to the learning process but cannot be executed directly on multiple real platforms without incurring repeated damage. Our mixed-reality framework not only helps bridge the reality gap that still stymies progress in reinforcement learning for robotics, but is also especially well suited to the multi-vehicle application considered in this work.

We consider a multi-vehicle system composed of N vehicles on a multi-lane (closed) traffic circuit with M lanes. Each vehicle in the system has a unique target velocity, vt, i.e., vehicles aim to travel at potentially different speeds. The circuit is obstructed by K obstacles (static vehicles). In order to maintain target speeds and avoid collisions, vehicles must learn to change lanes and execute overtaking maneuvers (we do not enforce a rule regarding which side a vehicle may overtake on). An image of our three-lane setup is shown in Figure 1, for 17 vehicles (one of which is real, those in blue are static).

Assumptions. We are especially interested in a vehicle’s high-level decision-making process that involves lane changes and speed modulation. We, therefore, consider the availability of a low-level controller that executes reliable trajectory following, allowing the vehicle to remain in the centre of its current lane. To facilitate the low-level control task, we represent a lane by a sequence of cubic Bezier curves, continuous up to their first derivative (i.e. having no sharp corners). Vehicles are provided reliable (essentially noise-free) positioning information (e.g., through a motion capture system). We also assume the ability of basic local communication, such that the desired velocity of each neighboring vehicle is available to the high-level controller. This neighborhood includes the six nearest vehicles within a vision radius, rv.

Goal. Our goal is to learn a high-level control policy that allows vehicles to drive as closely as possible to their target velocities, while avoiding collisions with other vehicles.

Our multi-vehicle system is based on a physical vehicle, the DeepRacer robot (Balaji et al., 2019), for which we also develop a virtual counterpart. This platform, its dynamics, and control model are detailed below.

The DeepRacer is a 1/18th scale car with a 4MP camera, 4-wheel drive and Ackermann steering. It sports an Intel Atom processor, 4GB of memory, and 32GB of storage. It runs Ubuntu 16.04 LTS and ROS Kinetic Kame. The on-board computer and motors are powered by 13600mAh and 1100mAh batteries, respectively.

The DeepRacer was originally designed as a platform for vision-based reinforcement learning, with training carried out in simulation only. This differs from our aim, which includes online training but focuses only on non-vision-based, high-level decision-making. Therefore, we modified the platform to make it more suited to our goal. The default ROS launch script was replaced, so that the DeepRacer does not run a ROS master but relies on one running on a different device, thereby allowing more than one DeepRacer to be controlled simultaneously. We implemented a new ROS node to communicate with the DeepRacer’s servo node to set turning and throttle values. Adding this node also meant that communication to the DeepRacer could be done via UDP, reducing latency. Finally, a custom, non-reflective case was designed to allow the integration of the robot with a motion tracking system.

The DeepRacer has Ackermann steering geometry. We approximate its kinematics by the bicycle model, with motion equations:

$$\begin{aligned}
\dot{x} &= v_c \cos\xi, \\
\dot{y} &= v_c \sin\xi, \\
\dot{\xi} &= L^{-1} v_c \tan\phi_s,
\end{aligned} \tag{1}$$

where $\phi_s$ is the steering angle, $v_c$ is the forward speed, $\xi$ is the heading, and $L$ is the vehicle’s wheel base. These equations are numerically integrated in our simulation via the Euler method to obtain the position of the DeepRacer at each time step. For the purpose of collision detection in mixed-reality, the DeepRacer was modeled by a bounding box of similar size to its physical dimensions (30 cm × 20 cm). Virtual vehicles are modeled identically.
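As an illustration of the integration step described above, the following minimal sketch advances the bicycle model by one explicit Euler step. The function name, the wheel base of 0.16 m, and the speed are illustrative assumptions, not values from our setup; the 0.02 s step corresponds to a 50 Hz update rate.

```python
import math

def euler_step(x, y, heading, v_c, phi_s, wheel_base, dt):
    """Advance the bicycle model of Eq. (1) by one explicit Euler step."""
    x_new = x + v_c * math.cos(heading) * dt
    y_new = y + v_c * math.sin(heading) * dt
    heading_new = heading + (v_c / wheel_base) * math.tan(phi_s) * dt
    return x_new, y_new, heading_new

# Hypothetical values: drive straight at 1 m/s for one second at 50 Hz.
state = (0.0, 0.0, 0.0)
for _ in range(50):
    state = euler_step(*state, v_c=1.0, phi_s=0.0, wheel_base=0.16, dt=0.02)
```

With zero steering the vehicle simply moves along its initial heading; a non-zero steering angle changes the heading at a rate proportional to the forward speed.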

We segregate the vehicle’s driving strategy into two levels: a high-level controller that is responsible for (i) lane-change decisions and (ii) velocity modulation, and a low-level controller that acts upon this information to track desired lanes at desired speeds. The objective of our learning is the high-level control policy only. We assume the existence of background traffic that is deployed with a fixed high-level driving strategy.

Low-level control. Two low-level controllers are used for lateral and longitudinal control. A PID controller onboard the DeepRacer maintains the robot’s forward velocity at the value requested by the high-level controller. The steering angle $\phi_s$ of the DeepRacer is set by a PD controller, keeping the robot on the trajectory chosen by the high-level controller. The onboard velocity controller receives a desired velocity $v_c$ from the high-level controller and pose information from the motion tracking system; it calculates velocity and acceleration towards the desired trajectory. These are used in the PID controller, which outputs a throttle value to the motors. This allows the DeepRacer to travel at the speed requested by the high-level controller regardless of external factors such as how discharged the battery is.
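For readers unfamiliar with PID control, a generic textbook controller can be sketched as follows. This is not the onboard controller itself: the class name, the gains, and the absence of output limits or anti-windup are all assumptions made for illustration.

```python
class PID:
    """Generic PID controller; gains are hypothetical placeholders."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, error, dt):
        """Return the control output for the current error sample."""
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no derivative estimate on the first sample
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical usage: the error is (requested speed - measured speed).
velocity_pid = PID(kp=2.0, ki=0.0, kd=0.0)
throttle = velocity_pid.update(error=0.5, dt=0.02)
```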

The objective of the steering angle controller is to minimise the perpendicular distance, $\delta$, between the robot and the desired trajectory. For small deviations, the angle of the robot’s heading with respect to the trajectory, $\psi$, is proportional to $\frac{d\delta}{ds}$, and the steering angle of the robot, $\phi_s$, is proportional to $\frac{d^2\delta}{ds^2}$, where $s$ is the travelled distance. This permits a controller of the form $\phi_s = -g\delta - g d \tan\psi + l\kappa$, where $\kappa$ is the curvature of the trajectory at the nearest point and $g$ and $d$ are gain and damping factors, respectively. The use of $\tan\psi$ in place of $\psi$ causes the robot to continue to converge to the desired trajectory even for larger deviations, without affecting its behaviour for small deviations. Since the controller uses derivatives with respect to $s$ rather than $t$ directly, it behaves the same regardless of how the high-level controller changes the robot’s speed.

High-level control policy. While the low-level controller is capable of maintaining a specified velocity and following the centre of a chosen lane, we use a high-level control algorithm to decide when to accelerate or decelerate and when to change lanes. This high-level policy is the learnable policy (described in Section 2) applied to the agent vehicle.

Background traffic. For realistic (virtual) background traffic we use a hard-coded algorithm, following the work in (Hyldmar et al., 2019). This controller has both longitudinal and lateral control components. The longitudinal component is based on the Intelligent Driver Model (IDM) proposed in (Treiber et al., 2000). Using this control method, a vehicle’s forward acceleration is a function of its current velocity, vc, its gap s to the vehicle in front, and the rate at which it is approaching the vehicle in front, Δv:

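$$a_{\mathrm{IDM}} = \alpha\left[1 - \left(\frac{v_c}{v_t}\right)^{\delta} - \left(\frac{s^*(v_c,\Delta v)}{s}\right)^{2}\right], \tag{2}$$

where $s^*$ is a function determining the desired minimum gap to the preceding vehicle and $v_t$ is a target velocity. This gap is defined as:

$$s^*(v_c,\Delta v) = s_0 + T v_c + \frac{v_c\,\Delta v}{2\sqrt{\alpha\beta}}, \tag{3}$$

where $T$, $\alpha$, $\beta$, $s_0$, and $v_t$ are parameters, and $s_0$ is the jam distance, i.e., the distance that cars in a queue leave between each other.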
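The IDM acceleration of Eqs. (2) and (3) can be sketched in a few lines. The desired gap uses the square-root form of the standard IDM (Treiber et al., 2000). Function names are hypothetical, and the default exponent of 4 is a commonly used value assumed here for illustration; it is not a parameter reported in this work.

```python
import math

def desired_gap(v_c, dv, s0, T, alpha, beta):
    """Desired minimum gap to the preceding vehicle, Eq. (3)."""
    return s0 + T * v_c + v_c * dv / (2.0 * math.sqrt(alpha * beta))

def idm_acceleration(v_c, v_t, gap, dv, s0, T, alpha, beta, delta=4.0):
    """Forward acceleration of Eq. (2); delta=4.0 is an assumed default."""
    s_star = desired_gap(v_c, dv, s0, T, alpha, beta)
    return alpha * (1.0 - (v_c / v_t) ** delta - (s_star / gap) ** 2)

# Hypothetical parameters: a stationary car on an (almost) empty road.
a_free = idm_acceleration(0.0, 1.0, gap=1e9, dv=0.0,
                          s0=0.1, T=1.0, alpha=0.5, beta=0.5)
```

On a free road the acceleration approaches $\alpha$ at standstill and falls to zero as the vehicle reaches its target velocity; when the actual gap equals the desired gap at standstill, the acceleration is also zero.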

The lateral component of this high level controller, responsible for lane changes, is based on the MOBIL controller proposed in (Kesting et al., 2007). The MOBIL strategy is designed to maximise the current vehicle’s freedom to accelerate while also considering the interests of nearby vehicles, and maintaining safety. To determine the effect of a lane change on the current vehicle’s own acceleration, the MOBIL controller considers the effect (Δaself) the new gap to the next vehicle would have on the chosen acceleration by its longitudinal control algorithm, IDM. The MOBIL controller similarly calculates the effect a proposed lane change would have on the chosen accelerations of nearby vehicles, assuming they were also using IDM. It then compares the expected benefit to a threshold value ΔaT to determine whether or not to change lane:

$$\Delta a_{\mathrm{self}} + p\,(\Delta a_n + \Delta a_o) > \Delta a_T, \tag{4}$$

where $\Delta a_n$ and $\Delta a_o$ are the effects on the new and old following vehicles, and $p$ is a politeness factor. Safety is maintained by adding the condition that the MOBIL controller does not force the new follower vehicle to decelerate at a rate greater than a safety limit, $\beta_n$. Since we do not enforce a rule regarding which side vehicles may overtake on, the MOBIL controller considers changing lanes in both directions, and takes the better option if both surpass the threshold $\Delta a_T$.
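The lane-change rule of Eq. (4) and the safety condition can be sketched as below. All names are hypothetical; the acceleration changes are assumed to have been computed beforehand with IDM, and the safety limit is assumed to be given as a positive deceleration magnitude.

```python
def lane_change_allowed(da_self, da_new, da_old, politeness, threshold,
                        new_follower_accel, safe_decel):
    """Eq. (4) combined with the safety condition on the new follower."""
    if new_follower_accel < -safe_decel:
        return False  # the new follower would have to brake too hard
    return da_self + politeness * (da_new + da_old) > threshold

def choose_direction(left_gain, right_gain, threshold):
    """Pick the better side; a gain of None marks an unavailable or unsafe side.

    Each gain is the left-hand side of Eq. (4) for that direction.
    """
    candidates = [(gain, side)
                  for gain, side in ((left_gain, "left"), (right_gain, "right"))
                  if gain is not None and gain > threshold]
    return max(candidates)[1] if candidates else "stay"
```

For example, `choose_direction(0.3, 0.5, 0.1)` selects the right lane because its expected benefit is larger, while `choose_direction(0.05, None, 0.1)` keeps the current lane.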

As anticipated above, we wish to learn a high-level control policy letting a vehicle avoid collisions while maintaining its desired velocity. We formulate this as a sequential decision problem and solve it with an actor-critic based reinforcement learning approach. We approximate the value function V and the policy function π using the critic and actor components, respectively.

Our goal is to safely (i.e., without collisions or damage) find an optimal high-level controller, such that each vehicle (agent) drives as close as possible to its desired velocity. We formalise this high-level control problem as a reinforcement learning problem (Sutton and Barto, 2011) with state space $\mathcal{O}$ (the agent’s observations) and action space $\mathcal{A}$. $\mathcal{O}$ contains information about both the agent’s own state, $\mathcal{O}_s$, and the state of other nearby vehicles, $\mathcal{O}_o$, such that:

$$\mathcal{O} = \mathcal{O}_s \times \mathcal{O}_o. \tag{5}$$

In $\mathcal{O}_s$, an agent observes: (i) its current velocity, $v_c$; (ii) its target velocity, $v_t$; (iii) the number of lanes to its right, $l_r$; (iv) the number of lanes to its left, $l_l$; (v) its lane-changing state $s$ (i.e., whether it is changing lane or not). An element of $\mathcal{O}_s$ is thus represented as a vector of the form:

$$\mathbf{o}_s = [v_c, v_t, l_r, l_l, s] \in \mathbb{R}^5. \tag{6}$$

In $\mathcal{O}_o$, the agent observes up to six nearby vehicles (defining its neighbourhood, as introduced in the assumptions above). If there are fewer than six vehicles within radius $r_v$, the observation is padded up to six using “null” vehicles. For each nearby vehicle, $c_i$, the agent receives the relative position of $c_i$ in polar coordinates $(d_i, \theta_i)$. The agent also receives the relative lane-wise velocity of $c_i$, $v_{r_i}$, the number of lanes to $c_i$, $\Delta l_i$, and the lane-changing state of $c_i$, $s_i$. An element of $\mathcal{O}_o$ is thus represented as six vectors of the form:

$$\mathbf{o}_{o_i} = [d_i, \cos\theta_i, \sin\theta_i, v_{r_i}, \Delta l_i, s_i] \in \mathbb{R}^6. \tag{7}$$
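The construction of these observation vectors, including the padding to six neighbours, can be sketched as follows. The function names are hypothetical, and encoding a “null” vehicle as an all-zero vector is an assumption made for illustration; the encoding actually used is not specified here.

```python
import math

NULL_VEHICLE = [0.0] * 6  # assumed encoding of a "null" vehicle

def own_observation(v_c, v_t, lanes_right, lanes_left, changing_lane):
    """Five-dimensional vector of Eq. (6)."""
    return [v_c, v_t, float(lanes_right), float(lanes_left), float(changing_lane)]

def neighbour_observation(d, theta, v_rel, lane_offset, changing_lane):
    """Six-dimensional vector of Eq. (7) for one nearby vehicle."""
    return [d, math.cos(theta), math.sin(theta), v_rel,
            float(lane_offset), float(changing_lane)]

def neighbourhood(neighbours, r_v, max_neighbours=6):
    """Keep the nearest vehicles within r_v and pad with null vehicles.

    Each neighbour is a tuple (d, theta, v_rel, lane_offset, changing_lane).
    """
    visible = sorted((n for n in neighbours if n[0] <= r_v), key=lambda n: n[0])
    rows = [neighbour_observation(*n) for n in visible[:max_neighbours]]
    rows += [list(NULL_VEHICLE) for _ in range(max_neighbours - len(rows))]
    return rows

# Hypothetical example: one vehicle inside the vision radius, one outside.
rows = neighbourhood([(1.0, 0.0, 0.1, 1, 0), (5.0, 1.0, 0.0, 0, 0)], r_v=2.0)
```

Representing the bearing by its cosine and sine, as in Eq. (7), avoids the discontinuity that a raw angle would have at ±π.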

The action space, $\mathcal{A}$, contains pairs of actions drawn from a (discrete) acceleration space, $\mathcal{A}_a$, and a (discrete) lane-changing space, $\mathcal{A}_l$, such that:

$$\mathcal{A} = \mathcal{A}_a \times \mathcal{A}_l. \tag{8}$$

Set $\mathcal{A}_a$ consists of “constant acceleration”, “maintaining the current speed”, and “constant deceleration”. Set $\mathcal{A}_l$ consists of “changing lane left”, “changing lane right”, and “not at all”. The reinforcement learning reward function is designed to prevent the agent from deviating unnecessarily from its desired speed while avoiding collisions with other cars. This function is expressed as:

$$R(\mathbf{o}_s, \mathbf{o}_o) = -|v_c - v_t| - \max(p_1, p_2), \tag{9}$$

where $p_1$ and $p_2$ are proximity penalty terms defined as:

$$p_1 = \max(0,\, c_1\lambda - d_l), \tag{10}$$
$$p_2 = \max(0,\, c_2 L - d_a), \tag{11}$$

where $d_a$ is the distance to the closest nearby vehicle, $d_l$ is the distance to the closest nearby vehicle in the same lane, $\lambda$ is the distance between lanes, $L$ is the length of a vehicle, and $c_1$ and $c_2$ are parameters (see also Figure 2). These two proximity penalties exist to deter the agent from coming too close to other vehicles. While this specific formalization would admit a solution through discrete action-space methods, such as Double Q-learning (Hasselt, 2010), in the following we present a more general approach based on the actor-critic method. As a consequence, our approach can generalise to continuous action spaces as well.
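The reward of Eqs. (9) to (11) translates directly into code. The sketch below uses hypothetical argument names and illustrative values; the values of $c_1$, $c_2$, and the lane spacing used in our experiments are not implied.

```python
def reward(v_c, v_t, d_l, d_a, lane_spacing, vehicle_length, c1, c2):
    """Velocity-tracking reward with proximity penalties, Eqs. (9)-(11)."""
    p1 = max(0.0, c1 * lane_spacing - d_l)    # Eq. (10)
    p2 = max(0.0, c2 * vehicle_length - d_a)  # Eq. (11)
    return -abs(v_c - v_t) - max(p1, p2)

# Hypothetical values: on target speed with no vehicle nearby.
r_ideal = reward(1.0, 1.0, d_l=10.0, d_a=10.0,
                 lane_spacing=0.3, vehicle_length=0.3, c1=1.0, c2=1.0)
```

The reward is never positive: it reaches zero only when the agent drives exactly at its target velocity and both proximity penalties vanish.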

Figure 2. Schematics presenting the main components in the observation vectors $\mathbf{o}_s$ and $\mathbf{o}_{o_i}$ for a vehicle tackling the reinforcement learning problem described in this section.

We approximate value V(o) and policy function π(o,a) using a deep neural network containing one actor and two critics (Figure 3). From observation vectors 𝐨oi’s, the salient features of nearby cars are extracted using a sequence of four linear layers of hidden size nh with output size nf. These features are then max-pooled across nearby vehicles to get a single size nf vector of features pertaining to observed vehicles. This vector is then concatenated with the agent’s own observations 𝐨s to produce the input of the actor and critic networks.
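The pooling step can be illustrated in plain Python. This is only a sketch: the per-vehicle feature vectors are taken as given (the learned layers are omitted), the function names are hypothetical, and the concatenation order is an assumption.

```python
def max_pool(features):
    """Element-wise maximum across the per-vehicle feature vectors."""
    return [max(column) for column in zip(*features)]

def actor_critic_input(own_observation, neighbour_features):
    """Concatenate pooled neighbour features with the agent's own observation."""
    return max_pool(neighbour_features) + list(own_observation)

# Hypothetical features for two neighbours, three features each.
features = [[0.1, 0.9, -0.3], [0.4, 0.2, 0.0]]
pooled = max_pool(features)
reordered = max_pool(features[::-1])
```

Because the maximum is taken element-wise, `pooled` and `reordered` are identical: the pooled vector does not depend on the order in which nearby vehicles are listed, and its size stays fixed regardless of how many vehicles are observed.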

The actor network consists of a sequence of three linear layers of hidden and output size nh followed by two heads, each consisting of a final layer of hidden size nh and an output size of 3, followed by soft-max activation. These two heads correspond to the two discrete spaces 𝒜l and 𝒜a, i.e., lane changes and acceleration, respectively. We elect to use two critic networks, which are similarly composed of a sequence of four linear layers of hidden size nh, though this time each terminating in a one-dimensional evaluation of the value function. As proposed by Fujimoto et al. (Fujimoto et al., 2018), we consider the less extreme of the two evaluations during training to try to reduce the impact of extreme estimations of the value function in the early stages. We found these spurious estimates to be detrimental; thus, the maximum value of the two was used when updating π.

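Figure 3. Schematics of the neural network mapping observations $\mathbf{o}_s \in \mathcal{O}_s$, $\mathbf{o}_{o_i} \in \mathcal{O}_o$ to (i) actions $a_a \in \mathcal{A}_a$, $a_l \in \mathcal{A}_l$ and (ii) value function $V(\cdot)$. We detail this architecture in Subsection 2 and its training in Subsection 3.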

We develop our reinforcement learning method as an adaptation of Asynchronous Advantage Actor Critic (A3C) (Mnih et al., 2016), maintaining an approximation of the value function of a state $o$, $V(o)$, and of the policy function $\pi(a|o)$, using explicitly calculated returns over short trajectories. Returns $R_t$ from actions were calculated as

$$R_t = \sum_{i=0}^{k-t} \gamma^{i} r_{t+i} + \gamma^{k} V_{\mathrm{avg}}(o_{t+k}), \tag{12}$$

where $0 \le t < k$ for trajectory length $k$, $\gamma$ is the discount factor, and $V_{\mathrm{avg}}$ is the mean of the two value functions. The approximation of the value function was trained to minimise $A(o_t,a_t)^2$, where $A(o_t,a_t) = R_t - V(o_t)$ is the advantage function.
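A common way to compute such bootstrapped returns is a single backward pass over the trajectory. The sketch below assumes the conventional indexing, in which the rewards from step t to the end of the trajectory are discounted and the tail is bootstrapped with the mean of the two critics’ values at the final observation; this indexing may differ in detail from Eq. (12). Function names are hypothetical.

```python
def bootstrapped_returns(rewards, v1_last, v2_last, gamma):
    """Discounted returns for every step of a short trajectory.

    The tail is bootstrapped with the mean of the two critics' estimates
    at the observation following the last reward (assumed convention).
    """
    running = 0.5 * (v1_last + v2_last)
    returns = []
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def advantages(returns, values):
    """Advantage A = R_t - V(o_t) for each step."""
    return [R - v for R, v in zip(returns, values)]

# Hypothetical two-step trajectory with critic estimates 2.0 and 4.0.
example = bootstrapped_returns([1.0, 1.0], v1_last=2.0, v2_last=4.0, gamma=0.5)
```

The critic loss is then the mean of the squared advantages over the trajectory.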

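The policy function is updated using the PPO-Clip (Schulman et al., 2017) loss function:

$$L(\phi) = -\min\big(\rho_t A_\phi(o_t,a_t),\; f_c(\rho_t, 1-\epsilon, 1+\epsilon)\, A_\phi(o_t,a_t)\big), \tag{13}$$

where $\phi$ are the network parameters, subscript $\phi$ denotes the evaluation of the network using parameters $\phi$, $f_c$ is the clamp function, $\epsilon$ is a constant parameter, and $\rho_t$ is the probability ratio between the current policy and the target policy with parameters $\bar{\phi}$:

$$\rho_t = \frac{\pi_\phi(o_t,a_t)}{\pi_{\bar{\phi}}(o_t,a_t)}. \tag{14}$$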
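For a single sample, the clipped loss of Eqs. (13) and (14) can be written out in plain Python. Function names and numerical values are illustrative assumptions; in practice the same expression would be evaluated on tensors and averaged over a trajectory.

```python
def clamp(x, low, high):
    """Clamp function f_c of Eq. (13)."""
    return max(low, min(x, high))

def ppo_clip_loss(prob_new, prob_target, advantage, eps):
    """Per-sample PPO-Clip loss; prob_* are action probabilities."""
    rho = prob_new / prob_target  # probability ratio, Eq. (14)
    return -min(rho * advantage,
                clamp(rho, 1.0 - eps, 1.0 + eps) * advantage)

# Hypothetical sample: the new policy doubles the action probability.
loss_pos = ppo_clip_loss(0.6, 0.3, advantage=1.0, eps=0.2)   # clipped
loss_neg = ppo_clip_loss(0.6, 0.3, advantage=-1.0, eps=0.2)  # not clipped
```

With a positive advantage the ratio is clipped at $1+\epsilon$, which removes the incentive to move the policy further from the target; with a negative advantage the unclipped term is the smaller one, so the full penalty is retained.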

As we do not use mini-batching, the target policy that we compare against is not one computed before a current set of mini-batches (as in (Schulman et al., 2017)), but rather duplicated versions of part of the network (the shaded boxes in Figure 3) with parameters smoothed exponentially in time, $\bar{\phi}$, updated to follow the latest parameters, $\phi$, according to the rule:

$$\bar{\phi}_{t+1} = \tau\bar{\phi}_t + (1-\tau)\phi_t, \tag{15}$$

where $\tau$ is a parameter set during training. We also add to the loss function a term proportional to the negation of the policy entropy, in order to discourage premature convergence. We weight the three contributions to the total network loss with coefficients $w_a$, $w_c$, and $w_e$, corresponding to the PPO loss, the critic loss, and the entropy term, respectively.
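The smoothing rule of Eq. (15) and the weighted total loss can be sketched as follows, with parameters represented as flat lists of floats. The function names and the numerical values are hypothetical.

```python
def smooth_parameters(target, latest, tau):
    """Exponential smoothing of the target parameters, Eq. (15)."""
    return [tau * p_bar + (1.0 - tau) * p for p_bar, p in zip(target, latest)]

def total_loss(ppo_loss, critic_loss, entropy, w_a, w_c, w_e):
    """Weighted sum of the three loss terms.

    The entropy enters with a negative sign, so that higher policy
    entropy lowers the loss.
    """
    return w_a * ppo_loss + w_c * critic_loss - w_e * entropy

# Hypothetical values: with tau = 0.75 the target moves a quarter of the way.
new_target = smooth_parameters([0.0, 2.0], [1.0, 2.0], tau=0.75)
```

Values of $\tau$ close to 1 make the target policy change slowly, which plays a role comparable to holding the old policy fixed across a set of mini-batches.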

To improve speed and stability of learning, we use multiple parallel actors when pre-training a policy in simulation only. We parallelise this process on two levels. First, we use asynchronous updates, as in (Mnih et al., 2016), to allow multiple threads acting in the problem environment to send gradients to a separate thread, which updates the policy parameters and returns the new parameters (as shown in Figure 4). In addition, each actor thread simultaneously acts in multiple environments (Clemente et al., 2017) in order to take advantage of vectorisation (Figure 4). Combined, these two parallelisation strategies sped up training in purely virtual environments by a factor of 10.

Figure 4. Schematics of the distributed training approach presented in Subsection 3 for the network in Figure 3.
Figure 5. Overall schematics of the proposed multi-vehicle, mixed-reality reinforcement learning approach. Reinforcement learning of high-level driving policies is handled through PyTorch. Both virtual and real DeepRacer vehicles exist within a C++ simulation that manages the physics of the virtual cars and emulates collisions in mixed-reality. The physics of real-life DeepRacers is captured through OptiTrack’s motion capture system and fed to the simulation.

Our mixed-reality experimental setup seamlessly integrates multiple real-world and virtual components, as illustrated in Figure 5. The learning of high-level policies by DeepRacer agents, using the framework presented in the preceding sections, is performed during the concurrent execution of all these modules, i.e., in mixed-reality.

In our setup, a C++ simulation provides the environment in which reinforcement learning agents can act, observe, and learn. As such, it also contains the high-level IDM/MOBIL controllers of the background traffic vehicles. We implemented the reinforcement learning approach described in the previous section using Python and the PyTorch library. An interface between the C++ simulation and the Python interpreter was created using the Boost.Python C++ library. This interface exposes the ability to create environments as either mixed-reality or purely virtual. The simulation provides observations and reward signals to the Python implementation, according to the state of the environment. Then, it updates its state to reflect the agents’ actions, as received from the Python interpreter.

The simulated environment also contains (i) the specifications of the Bezier curves for all lanes in the track, (ii) the states of the vehicles controlled by either reinforcement learning agents or the IDM/MOBIL algorithms, and (iii) K static obstacles. These obstacles are placed far enough apart not to fully block the road, and so that there is at least one in each lane of the circuit. Their exact positions are otherwise randomised. The starting locations of the background traffic and agent vehicle are likewise randomised, along with the desired velocities vt of all vehicles. For each of the vehicles in the environment, collision detection is accomplished using bounding boxes of the same shape and size as a DeepRacer.
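One standard way to test two oriented bounding boxes for overlap is the separating axis theorem, sketched below. This is an illustrative technique, not necessarily the test used in our C++ simulation. The function names are hypothetical, and each box is assumed to be centred on the given pose, which is a simplification.

```python
import math

def box_corners(x, y, heading, length=0.3, width=0.2):
    """Corners of a 30 cm x 20 cm box centred at (x, y) (assumed centring)."""
    c, s = math.cos(heading), math.sin(heading)
    hl, hw = length / 2.0, width / 2.0
    return [(x + c * dx - s * dy, y + s * dx + c * dy)
            for dx, dy in ((hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw))]

def boxes_collide(a, b):
    """Separating axis test for two convex quadrilaterals."""
    for poly in (a, b):
        for i in range(4):
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % 4]
            ax, ay = y1 - y2, x2 - x1  # normal of the current edge
            proj_a = [ax * px + ay * py for px, py in a]
            proj_b = [ax * px + ay * py for px, py in b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False  # found an axis separating the boxes
    return True

# Hypothetical poses: two aligned boxes 20 cm apart overlap lengthwise.
hit = boxes_collide(box_corners(0.0, 0.0, 0.0), box_corners(0.2, 0.0, 0.0))
```

Because real and virtual vehicles share the same box dimensions, the same test applies to any pair of vehicles regardless of whether their poses come from the simulation or from motion capture.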

The simulation was written in C++ in order to provide higher performance, especially when pre-training a network in a purely simulated environment. To the same end, the simulation was designed to be capable of running several simultaneous virtual environments (Figure 4) in order to allow the reinforcement learning algorithm to submit multiple parallel actions and receive multiple parallel observations—thus making a more efficient use of our learning computing hardware.

As shown in Figure 5, the physical DeepRacer must interface with the simulation while training in mixed-reality. The location and pose of a real-life DeepRacer in the environment are tracked using six OptiTrack Prime 17W cameras and the Motive motion capture software. When multiple real DeepRacers are used, we distinguish them by using unique layouts of reflective markers. The position of each DeepRacer is broadcast by Motive, received by a VRPN client, and published to a ROS topic, making the data available to all nodes in our ROS environment. In order to reduce network load and increase reliability, the frequency at which poses were transmitted was restricted to 50Hz, since this was also the update rate of the physics engine in the simulation. From the perspective of the tracking system, the centre of a vehicle was defined as the centre of its rear axle. This choice preserves consistency with the simulation’s definition of the centre of a car, itself chosen for the sake of simplicity when using an Ackermann steering model. The vehicles drive on a closed-loop track made up of individual trajectories that contain no intersections and are $C^1$ continuous.

Mixed-reality plays a two-fold role in our work: (i) it fosters an agent’s learning, allowing simultaneous real and simulated training, and (ii) it provides us with better evaluation tools, through the ability to visualise the virtual and real agents’ interactions.

Learning. In the mixed-reality environment, the simulation receives live updates on the pose of the DeepRacer through the motion capture system and updates its representation of the environment state accordingly. The simulation sends commands setting the steering angle and velocity of the DeepRacer according to the actions of the high-level controller and the lateral component of the low-level controller.

The simulation is able to detect collisions between the DeepRacer and the virtual vehicles through a collision box identical to that of a virtual vehicle sharing the same pose as the real agent. From the point of view of the high-level controllers, including the reinforcement learning agent, the situation is no different from a purely virtual scenario, with the exception of the world’s physics affecting the real DeepRacer. Parallelisation of environments is unavailable when training in a mixed-reality environment, but since our implementation of A3C uses trajectories of experience with explicitly calculated returns, we substantially increase their length and generate only a small number of trajectories for each optimisation step. Each of these trajectories is created using a different random initialisation of the environment in order to provide a variety of experiences to the reinforcement learning algorithm at each optimisation step.
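One common way to test two oriented collision boxes for overlap is the separating axis theorem. The sketch below is offered as an illustration under stated assumptions, not as the simulator's actual routine: the source does not say which test is used, the function names are hypothetical, and for simplicity the pose is taken to be the centre of the box rather than the rear axle.

```python
import math


def box_corners(cx, cy, heading, length, width):
    """Corners of a rectangle centred at (cx, cy) and rotated by heading."""
    c, s = math.cos(heading), math.sin(heading)
    hl, hw = length / 2.0, width / 2.0
    offsets = ((hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw))
    return [(cx + c * dx - s * dy, cy + s * dx + c * dy) for dx, dy in offsets]


def boxes_collide(corners_a, corners_b):
    """Separating axis test for two convex quadrilaterals."""
    for corners in (corners_a, corners_b):
        for i in range(4):
            x1, y1 = corners[i]
            x2, y2 = corners[(i + 1) % 4]
            # Candidate separating axis: the normal of the current edge.
            ax, ay = y1 - y2, x2 - x1
            proj_a = [ax * px + ay * py for px, py in corners_a]
            proj_b = [ax * px + ay * py for px, py in corners_b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False  # a gap exists along this axis
    return True  # no separating axis found, so the boxes overlap
```

In a mixed-reality loop, the box of the real vehicle would be rebuilt from each tracked pose and tested against every virtual vehicle and obstacle, exactly as for a purely virtual agent.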

Visualisation. To visualise the interaction between the virtual cars and the DeepRacer during our tests, we set up a fixed camera to record the full-length experiments. From the simulation environment, we collect pose data for both the virtual and real cars and compute whether any vehicle is currently experiencing collisions. These data are processed through a Python script importing Blender’s API. At each timestep, we insert an animation keyframe of a vehicle model in the pose specified by the previously recorded data and a colour determined by whether the vehicle is (i) a fixed obstacle (blue), (ii) a moving vehicle (beige), or (iii) a vehicle currently in collision (red). In a separate scene, the DeepRacer alias is also animated using the same procedure. These two scenes are then composited together using Z-buffer values so that, when the DeepRacer is in front of a virtual vehicle, the area obscured by the DeepRacer is transparent. The output can then be overlaid on top of the test footage to create the effect that the real and virtual vehicles are interacting.
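The Z-buffer compositing rule can be sketched per pixel in plain Python. This is not Blender's compositor and does not use Blender's API; it only illustrates the masking logic. The function name and the conventions (depth is the distance from the camera, infinity means no geometry at that pixel) are assumptions made for the example.

```python
INF = float("inf")  # assumed marker for "no geometry at this pixel"


def virtual_layer_alpha(virtual_depth, alias_depth):
    """Per-pixel alpha of the virtual-vehicle layer.

    virtual_depth and alias_depth are equally sized 2D lists of camera
    distances for the virtual vehicles and for the DeepRacer alias.
    Returns 1 where a virtual vehicle should be drawn and 0 where the
    layer must stay transparent so that the real footage shows through.
    """
    alpha = []
    for v_row, a_row in zip(virtual_depth, alias_depth):
        row = []
        for v, a in zip(v_row, a_row):
            # Draw only where a virtual vehicle exists and is not hidden
            # behind the DeepRacer alias.
            row.append(1 if v < INF and v <= a else 0)
        alpha.append(row)
    return alpha
```

Multiplying the rendered virtual layer by this mask before overlaying it on the camera footage leaves the real DeepRacer visible wherever it stands in front of a virtual vehicle.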

To demonstrate the effectiveness of our mixed-reality setup in training agents capable of collision-free driving, we performed experiments on a 3-lane track (M=3, see Figure 1) with lanes λ=30cm wide. The track itself fits a 3.5m×2.2m area, with a lap length of roughly 16.4 metres, i.e., about 50 times the size of a DeepRacer (L=32cm). Our experiments include N=13 (1 real, 12 virtual) vehicles and K=4 virtual obstacles. The low-level control parameters g and d were set to 3 and 0.4, respectively. For the learning parameters, we selected γ=0.9, τ=0.7, ϵ=0.1, k=128, wa=10, wc=1, we=0.003, nh=64, nf=8, c1=0.833, and c2=2.81. For the actor and critics, we used learning rates of 2e-4 and 2e-3, respectively. Our results are summarized in Figures 6, 7, and 8 as well as by additional footage available on the Prorok Lab YouTube channel.

Figure 6. Evolution during training of (i) the number of collisions per minute (top plot, lower is better) and (ii) the average reward collected by the training agent, over a sliding window of 8000 frames (bottom plot, higher is better).
Figure 7. Empirical distributions at test time of (i) the number of collisions per scenario (top plot, left is best) and (ii) the total collected reward per scenario (bottom plot, right is best) before (blue) and after (red) training in mixed-reality.
Figure 8. Plots of track positions (y axis) against time (x axis) of four static obstacles (horizontal lines), twelve virtual vehicles, and one real-life DeepRacer (thicker line). The colormap captures the velocities of all cars. The red dots represent collisions incurred by the DeepRacer. The top and bottom plots compare behaviours recorded before and after mixed-reality training.

First, we want to assess the soundness of our approach by evaluating how well training fares—in terms of incurred collisions and collected reward. This is shown in Figure 6, where the two plots describe the evolution over time (measured in frames, i.e., the steps in which an agent receives one set of observations and takes one action) of: (i) the number of collisions per minute (top plot of Figure 6); and (ii) the average collected reward (bottom plot of Figure 6). Successful training is reflected in a general downward slope of the top plot (fewer collisions) and, conversely, a general upward slope of the bottom plot (greater reward).
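The sliding-window averaging used for curves such as those in Figure 6 can be sketched as follows. The function name is hypothetical, and the handling of the first frames (averaging over however many frames are available so far) is an assumption, since the source only states the window size of 8000 frames.

```python
from collections import deque


def sliding_window_mean(values, window):
    """Running mean of the last `window` values, one output per input."""
    if window <= 0:
        raise ValueError("window must be positive")
    buffer = deque()
    total = 0.0
    means = []
    for value in values:
        buffer.append(value)
        total += value
        if len(buffer) > window:
            # Drop the oldest frame once the window is full.
            total -= buffer.popleft()
        means.append(total / len(buffer))
    return means
```

With per-frame rewards as input and `window=8000`, the output corresponds to a smoothed reward curve. A per-frame collision indicator can be smoothed the same way, although converting it to collisions per minute additionally requires the frame rate, which is not assumed here.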

Second, we want to quantify the effectiveness of mixed-reality training at test time. This is shown in Figure 7. The top and bottom plots refer, once more, to collisions and collected reward, respectively. Each of the two plots compares two density distributions of these performance metrics: one before (in blue) and one after (in red) training in mixed-reality. As our simulation environment is partially randomised, the word scenario refers to all the data gathered from a single instantiation. On the top plot, we can observe a left-shift (from blue to red, i.e., from before to after) of the collisions’ density distribution, that is, fewer collisions occurring after mixed-reality training. On the bottom plot, conversely, a right-shift reflects the improved ability of the agent, trained in mixed-reality, to collect reward.

Finally, Figure 8 presents a qualitative comparison of how a DeepRacer agent’s behaviour changes before (top) and after (bottom) mixed-reality training. The x axis in Figure 8 shows the passing of time (in seconds) while the y axis captures the position of a vehicle along the track (in metres). Four blue horizontal lines represent obstacles (i.e., static virtual vehicles) on the track. All other (13) lines represent moving vehicles, the thicker one being the DeepRacer agent. A colormap is used to encode the speed (in metres per second) of each vehicle. Red dots indicate collisions between the real-life DeepRacer and either a virtual obstacle or vehicle. Collisions are visibly rarer after mixed-reality training. Footage of the mixed-reality experiments in Figure 8 is also available.

The training stability and effectiveness of the proposed approach are reported in Figure 6: in the top plot, one can observe early improvements, i.e., a reduction in the number of collisions during training. This is followed by two periods of worsening performance (around frames 20,000 and 30,000), and then a more consistent downward trend (from frame 35,000 on). The early improvements and performance deterioration (until frame 25,000) may be explained by the choice of hyper-parameters. Our learning rates aimed at aggressive policy changes. That is, an agent would have been, at first, too eager to learn how to accelerate excessively and collect more reward, resulting in more early collisions. The bottom plot, presenting the collection of reward during training, shows a distinct mirroring (x axis symmetry) of the top plot. This is consistent with what we would expect, that is, fewer collisions leading to higher reward.

Figure 7 demonstrates the performance of our methodology at test time. In the top plot, we observe that the density distribution of collisions is significantly shifted to the left after mixed-reality training—indicating that our learning approach can effectively reduce collisions. The after-training distribution is also narrower, suggesting reduced variance and uncertainty. The bottom plot presents the slightly more trivial result that reinforcement learning training does, indeed, lead to improved reward collection. Nonetheless, at test time, this is evidence of the ability of our approach to generalize.

The qualitative results in Figure 8 demonstrate how the learning agent’s behaviour changes before and after mixed-reality training. In the top plot, a DeepRacer that has not yet been trained in mixed-reality collides remarkably often, with nearly every obstacle. This collision-prone behaviour may be due to the reduced responsiveness of the real DeepRacer hardware, compared to the simulated vehicle, which makes it harder for the agent to stop in time or avoid other vehicles. After training in mixed-reality, collisions are almost completely eliminated. In the bottom plot of Figure 8, we can also observe virtual agents (IDM/MOBIL background traffic) either (i) overtaking the learning agent in the longer gaps between obstacles or (ii) piling up behind it in more constrained regions of the road, e.g., when the agent is cautiously approaching two nearby obstacles. Interestingly, traffic (e.g., between 50 s and 80 s in the bottom plot of Figure 8) is likely exacerbated by the fact that IDM/MOBIL agents would be willing to give the agent room to accelerate instead of overtaking it; yet, the agent proceeds at a reduced speed. While the learning agent is less dangerous after training, its unexpected prudence can mislead the other driving agents, which are not capable of learning, and reduce throughput.
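The piling-up behaviour of the background traffic follows from the car-following rule itself. As a hedged illustration, the function below implements the Intelligent Driver Model acceleration of Treiber et al. (2000), in the common variant that clips the dynamic part of the desired gap at zero. All default parameter values are placeholders chosen for a small-scale track and are assumptions, not the values used in the experiments.

```python
import math


def idm_acceleration(v, gap, dv, v0=1.0, T=1.0, s0=0.1,
                     a_max=1.0, b=1.5, delta=4.0):
    """Intelligent Driver Model acceleration (all defaults are placeholders).

    v:    speed of the following vehicle (m/s)
    gap:  bumper-to-bumper distance to the leader (m), must be positive
    dv:   approach rate, i.e. follower speed minus leader speed (m/s)
    v0:   desired speed, T: desired time headway, s0: minimum gap,
    a_max: maximum acceleration, b: comfortable deceleration.
    """
    # Desired gap grows with speed and with the rate of approach.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term minus interaction term.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

A follower behind a slow leader sees a small `gap` and a positive `dv`, so the interaction term dominates and the vehicle brakes. This is consistent with the queues that form behind a cautious learning agent when no lane change is available.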

Finally, it is important to observe that the simulation performance of the agents we transferred into our framework was still characterised by relatively high entropy. This choice was made to minimise the risk of overfitting to the simulation environment and let agents adapt more quickly to the mixed-reality setup. While we cannot say whether additional simulation-only training would have benefited or hurt the agents transferring to mixed-reality, our results support the idea that this approach led to quick and effective real-world adaptation. In future developments of our framework, we will investigate continuous action spaces and more sample-efficient off-policy reinforcement learning methods, e.g., (Haarnoja et al., 2018), which might allow for better performance without the need for a substantial increase in data gathering.

This work presented a mixed-reality framework for safe and efficient reinforcement learning of driving policies in multi-vehicle systems. Our learning algorithm was trained using a distributed mechanism specifically tailored to suit the needs of our mixed-reality setup. We demonstrated successful online policy adaptation in an experimental setup involving one real vehicle and sixteen virtual vehicles. Our results showed that mixed-reality learning is able to provide significant performance improvements, leading to a reduction of collisions in the learned policies.

The particularity of our system is that it focuses on multi-robot settings, where interactions with other dynamic objects contribute significantly to the learning process, but cannot be executed directly on multiple real platforms without incurring repeated damage. The proposed framework is a first of its kind: beyond providing specific benefits to the application at hand, it also helps bridge the reality gap that still stymies progress in reinforcement learning for robotics at large. Future work will consider multiple learning agents using on-board sensing (e.g., vision), and how our mixed-reality setup enables their gradual introduction into mutually shared spaces.

This work was supported by the Engineering and Physical Sciences Research Council (grant EP/S015493/1). Their support is gratefully acknowledged. The DeepRacer robots used in this work were a gift to Amanda Prorok from AWS. Their support is gratefully acknowledged. This article solely reflects the opinions and conclusions of its authors and not AWS or any other Amazon entity.


  • Balaji et al. (2019) Bharathan Balaji, Sunil Mallya, Sahika Genc, Saurabh Gupta, Leo Dirac, Vineet Khare, Gourav Roy, Tao Sun, Yunzhe Tao, Brian Townsend, et al. 2019. DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning. arXiv preprint arXiv:1911.01562 (2019).
  • CBS (2018) CBS. 2018. CBS Insights Research Brief. (2018). (Accessed August 15, 2018).
  • Chebotar et al. (2019) Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. 2019. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 8973–8979.
  • Chen et al. (2009) Ian Yen-Hung Chen, Bruce MacDonald, and Burkhard Wunsche. 2009. Mixed reality simulation for mobile robots. In 2009 IEEE International Conference on Robotics and Automation. IEEE, 232–237.
  • Christiano et al. (2016) Paul Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. 2016. Transfer from simulation to real world through learning deep inverse dynamics model. arXiv preprint arXiv:1610.03518 (2016).
  • Clemente et al. (2017) Alfredo V. Clemente, Humberto Nicolás Castejón Martínez, and Arjun Chandra. 2017. Efficient Parallel Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1705.04862 (2017).
  • Dressler et al. (2014) Falko Dressler, Hannes Hartenstein, Onur Altintas, and Ozan Tonguz. 2014. Inter-vehicle communication: Quo vadis. IEEE Communications Magazine 52, 6 (2014), 170–177.
  • Ferreira et al. (2010) Michel Ferreira, Ricardo Fernandes, Hugo Conceição, Wantanee Viriyasitavat, and Ozan K Tonguz. 2010. Self-organized traffic control. In Proceedings of the seventh ACM international workshop on VehiculAr InterNETworking. ACM, 85–90.
  • Fu et al. (2016) Justin Fu, Sergey Levine, and Pieter Abbeel. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4019–4026.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. 2018. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 1587–1596.
  • Gu et al. (2016) Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. 2016. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning. 2829–2838.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290 (2018).
  • Hasselt (2010) Hado van Hasselt. 2010. Double Q-learning. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2 (NIPS’10). Curran Associates Inc., USA, 2613–2621.
  • Hoenig et al. (2015) Wolfgang Hoenig, Christina Milanes, Lisa Scaria, Thai Phan, Mark Bolas, and Nora Ayanian. 2015. Mixed reality for robotics. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5382–5387.
  • Hyldmar et al. (2019) Nicholas Hyldmar, Yijun He, and Amanda Prorok. 2019. A Fleet of Miniature Cars for Experiments in Cooperative Driving. IEEE International Conference Robotics and Automation (ICRA) (2019).
  • James et al. (2019) Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. 2019. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12627–12637.
  • Kesting et al. (2007) Arne Kesting, Martin Treiber, and Dirk Helbing. 2007. General Lane-Changing Model MOBIL for Car-Following Models. Transportation Research Record 1999, 1 (2007), 86–94.
  • Khan et al. (2019) Arbaaz Khan, Chi Zhang, Shuo Li, Jiayue Wu, Brent Schlotfeldt, Sarah Y Tang, Alejandro Ribeiro, Osbert Bastani, and Vijay Kumar. 2019. Learning safe unlabeled multi-robot planning with motion constraints. arXiv preprint arXiv:1907.05300 (2019).
  • Kuderer et al. (2015) Markus Kuderer, Shilpa Gulati, and Wolfram Burgard. 2015. Learning driving styles for autonomous vehicles from demonstration. In 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2641–2646.
  • Lowrey et al. (2018) Kendall Lowrey, Svetoslav Kolev, Jeremy Dao, Aravind Rajeswaran, and Emanuel Todorov. 2018. Reinforcement learning for non-prehensile manipulation: Transfer from simulation to physical system. In 2018 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR). IEEE, 35–42.
  • Miglino et al. (1995) Orazio Miglino, Henrik Hautop Lund, and Stefano Nolfi. 1995. Evolving mobile robots in simulated and real environments. Artificial life 2, 4 (1995), 417–434.
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1602.01783 (2016).
  • Mohammadi et al. (2019) Hadi Beik Mohammadi, Mohammad Ali Zamani, Matthias Kerzel, and Stefan Wermter. 2019. Mixed-Reality Deep Reinforcement Learning for a Reach-to-grasp Task. In International Conference on Artificial Neural Networks. Springer, 611–623.
  • Molchanov et al. (2019) Artem Molchanov, Tao Chen, Wolfgang Hönig, James A. Preiss, Nora Ayanian, and Gaurav S. Sukhatme. 2019. Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors. arXiv preprint arXiv:1903.04628 (2019).
  • Muratore et al. (2018) Fabio Muratore, Felix Treede, Michael Gienger, and Jan Peters. 2018. Domain randomization for simulation-based policy optimization with transferability assessment. In Conference on Robot Learning. 700–713.
  • Nagabandi et al. (2019) Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. 2019. Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning. arXiv preprint arXiv:1803.11347 (2019).
  • Pan et al. (2017) Xinlei Pan, Yurong You, Ziyan Wang, and Cewu Lu. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952 (2017).
  • Peng et al. (2018) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2018. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1–8.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Shah et al. (2018) Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. 2018. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics. Springer, 621–635.
  • Shalev-Shwartz et al. (2016) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295 (2016).
  • Stilman et al. (2005) Michael Stilman, Philipp Michel, Joel Chestnutt, Koichi Nishiwaki, Satoshi Kagami, and James Kuffner. 2005. Augmented reality for robot development and experimentation. Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU-RI-TR-05-55 2, 3 (2005).
  • Sutton and Barto (2011) Richard S Sutton and Andrew G Barto. 2011. Reinforcement learning: An introduction. (2011).
  • Tan et al. (2018) Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. 2018. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332 (2018).
  • Tobin et al. (2017) J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 23–30.
  • Treiber et al. (2000) Martin Treiber, Ansgar Hennecke, and Dirk Helbing. 2000. Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 62 (Aug 2000), 1805–1824. Issue 2.
  • Williams et al. (2018) Tom Williams, Daniel Szafir, Tathagata Chakraborti, and Heni Ben Amor. 2018. Virtual, augmented, and mixed reality for human-robot interaction. In Companion of the 2018 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 403–404.