End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances

  • 2019-11-25 12:34:26
  • Marin Toromanoff, Emilie Wirbel, Fabien Moutarde


Reinforcement Learning (RL) aims at learning an optimal behavior policy from its own experiments and not rule-based control methods. However, there is no RL algorithm yet capable of handling a task as difficult as urban driving. We present a novel technique, coined implicit affordances, to effectively leverage RL for urban driving thus including lane keeping, pedestrians and vehicles avoidance, and traffic light detection. To our knowledge we are the first to present a successful RL agent handling such a complex task especially regarding the traffic light detection. We demonstrate the effectiveness of our method by being one of the top teams of the camera only track of the CARLA challenge.



End-to-End Model-Free Reinforcement Learning
for Urban Driving using Implicit Affordances

Marin Toromanoff
MINES ParisTech, Valeo DAR, Valeo.ai

Emilie Wirbel
Valeo Driving Assistance Research

Fabien Moutarde
Center for Robotics, MINES ParisTech, PSL

Reinforcement Learning (RL) aims at learning an optimal behavior policy from its own experiments and not rule-based control methods. However, there is no RL algorithm yet capable of handling a task as difficult as urban driving. We present a novel technique, coined implicit affordances, to effectively leverage RL for urban driving thus including lane keeping, pedestrians and vehicles avoidance, and traffic light detection. To our knowledge we are the first to present a successful RL agent handling such a complex task especially regarding the traffic light detection. We demonstrate the effectiveness of our method by being one of the top teams of the camera only track of the CARLA challenge.

1 Introduction

Urban driving is probably one of the hardest situations to solve for autonomous cars, particularly regarding interactions at intersections with traffic lights, crossing pedestrians and cars travelling in different possible lanes. Solving this task is still an open problem, and it seems difficult to handle such complex and highly variable situations with a classic rule-based approach. This is why a significant part of the state of the art in autonomous driving [Liang, Codevilla, CILRS] focuses on end-to-end systems, i.e. learning a driving policy from data without relying on hand-crafted rules.

Imitation learning (IL) [Pomerleau1989a] aims to reproduce the behavior of an expert (a human driver for autonomous driving) by learning to mimic the control the human driver applied in the same situation. This leverages the massive amount of data annotated with human driving that most automotive manufacturers and suppliers can obtain relatively easily. On the other hand, as the human driver is always in an almost perfect situation, IL algorithms suffer from a distribution mismatch, i.e. the algorithm never encounters failure cases during training and thus does not react appropriately in those conditions. Techniques to augment the database with such failure cases do exist, but they are currently mostly limited to lane keeping and lateral control [Bojarski2016d, toromanoff2018end].

Deep Reinforcement Learning (DRL), on the other hand, lets the algorithm learn by itself by providing a reward signal for each action taken by the agent, and thus does not suffer from distribution mismatch. This reward can be sparse and does not describe exactly what the agent should have done, only how good the action taken is locally. The final goal of the agent is to maximize the sum of accumulated rewards, so the agent needs to reason about sequences of actions rather than instantaneous ones. One of the major drawbacks of DRL is that it can need an order of magnitude more data than supervised learning to converge, which can lead to difficulties when training large networks with many parameters. Moreover, many RL algorithms rely on a replay buffer [Lillicrap2015, mnih2015human, Rainbow] to learn from past experience, but such buffers can limit the size of the input used (e.g. the size of the image). That is why neural networks and image sizes in DRL are usually tiny compared to the ones used in supervised learning, and they may not be expressive enough to solve tasks as complicated as urban driving. Therefore current DRL approaches to autonomous driving are applied to simpler cases, e.g. only steering control for lane keeping [Kendall] or going as fast as possible in racing games [Mnih, jaritz2018end]. Another drawback of DRL, shared with IL, is that the algorithm appears as a black box from which it is difficult to understand how the decision was taken.

A promising way to address both data efficiency (particularly for DRL) and the black box problem is to use privileged information as auxiliary losses, also called affordances in some recent papers [Chen, Sauer]. The idea is to train a network to predict high-level information such as semantic segmentation maps, distance to the center of the lane, traffic light state, etc. This prediction can then be used in several ways: by a classic controller as in Sauer et al. [Sauer], as an auxiliary loss helping to find better features for the main imitation loss as in Mehta et al. [Mehta], or in a model-based RL approach as in the very recent work of Pan et al. [Pan2019a], while also providing some interpretable feedback on how the decision was taken.

In this work, we present our RL approach for end-to-end urban driving from vision, including lane keeping, traffic light detection, pedestrian and vehicle avoidance, and handling intersections with incoming traffic. To achieve this we introduce a new technique that we call implicit affordances. The idea is to split the training into two phases: first, an encoder backbone (Resnet-18 [Resnet]) is trained to predict affordances such as traffic light state or distance to the center of the lane. Then the output features of this encoder are used as the RL state instead of the raw images. Therefore the RL signal is only used to train the last part of the network. Moreover, the features rather than the raw images are stored in the replay memory, which requires approximately 20 times less memory. We demonstrated the performance of our method by being among the top teams of the "Camera Only" track in the CARLA Autonomous Driving Challenge [carlaChallenge]. To our knowledge we are the first to show a successful RL agent on urban driving, particularly with traffic light handling.

We summarize our main contributions below:

  • The first RL agent successfully driving from vision in an urban environment, including intersection management and traffic light detection.

  • A new technique, called implicit affordances, allowing replay-memory-based RL to be trained with a much larger network and input size than most networks used in previous RL works.

  • Extensive parameter and ablation studies of implicit affordances and reward shaping.

  • A showcase of the capability of our method by being among the top teams of the "Camera Only" track in the CARLA Autonomous Driving Challenge.

2 Related Work

2.1 End-to-End Autonomous Driving with RL

As RL relies on trial and error, most RL work applied to autonomous cars is conducted in simulation, both for safety reasons and for data efficiency. One of the most widely used simulators is TORCS [Wymann2015], an open-source and simple to use racing game. Researchers used it to test their new actor-critic algorithm to control a car with discrete actions in Mnih et al. [Mnih] and with continuous actions in Lillicrap et al. [Lillicrap2015]. But as TORCS is a racing game, the goal of those works is to reach the end of the track as fast as possible, and thus they handle neither intersections nor traffic lights.

Recently, many papers used the new CARLA [Dosovitskiy] simulator, an open-source urban simulator including pedestrians, intersections and traffic lights. In the original CARLA paper [Dosovitskiy], the researchers released a driving benchmark along with one imitation learning and one RL baseline. The RL baseline used the A3C algorithm with discrete actions [Mnih] and its results were far behind the imitation baseline. Liang et al. [Liang] used RL with DDPG [Lillicrap2015] and continuous actions to fine-tune an imitation agent. But they rely mostly on imitation learning and do not explicitly explain how much improvement comes from the RL fine-tuning. Moreover, they do not handle traffic lights either.

Finally, there are still only a few RL methods applied on a real car. The first one was Learning to Drive in a Day [Kendall], in which an agent is trained directly on the real car for steering. A very recent work [Zej] also integrates RL on a real car and compares different ways of transferring knowledge learned in CARLA to the real world. Even if their studies are really interesting, their results are preliminary and applied only to a few specific real-world scenarios. Both of these works only handle the steering angle for lane keeping, and a large gap has to be crossed before reaching simultaneous throttle and steering control in an urban environment on a real car with RL.

2.2 Auxiliary Tasks and Learning Affordances

The UNREAL agent [Jaderberg] is one of the first articles to study the impact of auxiliary tasks for DRL. They showed that adding losses such as predicting incoming reward could improve data efficiency and final performance on both Atari games and labyrinth exploration.

Chen et al. [Chen] introduced affordance prediction for autonomous driving: a neural network is trained to predict high-level information such as the distance to the right, center and left part of the lane or the distance to the preceding car. They then used those affordances as input to a rule-based controller and reached good performance on the racing simulator TORCS. Sauer et al. extended this in their Conditional Affordance Learning [Sauer] paper to handle more complicated scenarios such as urban driving. To achieve that, they also predict information specific to urban driving such as the maximum allowed speed and the incoming traffic light state. Like Chen et al., they finally used this information in a rule-based controller and showed their performance on the CARLA benchmark [Dosovitskiy] for urban driving. Neither of those works includes any RL; both rely on a rule-based controller. Shortly after, Mehta et al. [Mehta] used affordances as auxiliary tasks for their imitation learning agent and showed that this improved both data efficiency and final performance. But they do not handle traffic lights and rely purely on imitation.

Finally, there are two very recent articles closely related to ours. The first one, by Gordon et al. [Gordon], introduced SplitNet, in which they explicitly decompose the learning scheme into learning features from perception tasks and using these features as input to their model-free RL agent. But their scheme is applied to a completely different task, robot navigation and scene exploration. The second one, by Pan et al. [Pan2019a], trains a network to predict high-level information such as the probability of collision or of being off-road in the near future from a sequence of observations and actions. They use this network in a model-based RL scheme, evaluating different trajectories and finally applying the one with the lowest cost. However, they use a model-based approach and do not handle traffic lights.

3 The CARLA Challenge

The CARLA Challenge [carlaChallenge] is an open competition for autonomous driving relying on the CARLA simulator. The main goal of this challenge is to give an accessible benchmark to researchers in autonomous driving. Indeed, evaluating driving systems in the real world is not feasible for most researchers as it is extremely costly. Moreover, comparing different autonomous driving schemes is difficult if they are tested in different environments and with different sensors. The CARLA Challenge allows different algorithms to be tested in the exact same conditions.

Figure 1: Sample of traffic light image (left is US, right is EU).

This competition specifically addresses the problem of urban driving. The goal is to drive in unseen maps from sensors to control, ensuring lane keeping, handling intersections with high-level navigation orders (Right, Left, Straight), handling lane changes, avoiding pedestrians and other vehicles, and finally handling US and EU traffic lights at the same time (traffic lights are positioned differently in Europe and in the US, see Figure 1). This is far more difficult than the original CARLA benchmark [Dosovitskiy], the main differences being many more environments with multi-lane roads, EU and US traffic lights at the same time, and lane change orders. The CARLA Challenge consists of 4 different tracks, the only difference being the sensors available, from cameras only to a full perception stack. We only address the "camera only" track here; in fact we used only a single frontal camera for all this work.

4 Method

4.1 RL Setup

4.1.1 Value-based Reinforcement Learning: Rainbow-IQN Ape-X

There are two main families of model-free RL: value-based and policy-based methods. We chose value-based RL as it is the current state of the art on Atari [Rainbow] and is known to be more data efficient than policy-based methods. However, it has the drawback of handling only discrete actions. We describe in this work how we handled this discretization of actions. A comparison between value-based RL and policy-based RL (or actor-critic RL, which is a sort of combination of both) for urban driving is out of the scope of this paper but would definitely be interesting for future work. We started with the open-source implementation of Rainbow-IQN Ape-X [Rainbow, Dabney2018, Horgan] (originally for Atari) taken from the paper of Toromanoff et al. [Toromanoff]. We removed the dueling network [Dueling] from Rainbow as we found it led to the same performance while using many more parameters. The distributed version of Rainbow-IQN was mandatory for our usage: CARLA is too slow for RL and cannot generate enough data if only one instance is used. Moreover, this allowed us to train on multiple CARLA maps at the same time, generating more variability in the training data, enabling better exploration and providing an easy way to handle both US and EU traffic lights (some towns used in training were US while others were EU).

4.1.2 Reward Shaping

The reward used for training relies mostly on the waypoint API present in the latest version of CARLA (CARLA 0.9.X). This API gives access to the continuous position and orientation of waypoints for all lanes in the current town, which is fundamental to decide what path the agent has to follow. Moreover, this API provides the different possibilities at each intersection. At the beginning of an episode, the agent is initialized on a random waypoint in the city; the optimal trajectory the agent should follow can then be computed using the waypoint API. When arriving at an intersection, we randomly choose a possible maneuver (Left, Straight or Right) and the corresponding order is given to the agent. The reward relies on three main components: desired speed, desired position and desired rotation.

Figure 2: Desired speed according to environment. The desired speed adapts to the situation, getting lower when approaching a red light, going back to maximum speed when the traffic light turns green, and getting lower again when arriving behind an obstacle. The speed reward is maximum when the vehicle speed is equal to the desired speed.

The desired speed reward is maximum (and equal to 1) when the agent is at the desired speed, and decreases linearly to 0 if the agent speed is lower or higher. The desired speed, illustrated in Figure 2, adapts to the situation: when the agent arrives near a red traffic light, the desired speed decreases linearly to 0 as the agent gets closer to the traffic light, and goes back to the maximum allowed speed when the light turns green. The same principle is used when arriving behind an obstacle, pedestrian, bicycle or vehicle. The desired speed is set to a constant maximum speed (here 40 km/h) in all other situations.
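The speed term described above can be sketched in a few lines of Python. This is only an illustration: the slow-down distance `slow_down_dist_m` and the slope of the reward falloff (reaching 0 when the speed error equals the maximum speed) are assumptions, since the text only states that both the desired speed and the reward vary linearly.

```python
MAX_SPEED_KMH = 40.0  # constant maximum desired speed stated in the text


def desired_speed(dist_to_stop_m=None, slow_down_dist_m=30.0):
    """Desired speed in km/h.

    dist_to_stop_m: distance to the next red light or obstacle, or None
    when the road ahead is free. slow_down_dist_m is a hypothetical
    distance at which the desired speed starts to decrease.
    """
    if dist_to_stop_m is None or dist_to_stop_m >= slow_down_dist_m:
        return MAX_SPEED_KMH
    # Linear ramp: 0 km/h at the stop point, MAX_SPEED_KMH at the ramp start.
    return MAX_SPEED_KMH * max(dist_to_stop_m, 0.0) / slow_down_dist_m


def speed_reward(speed_kmh, target_kmh):
    """1 at the desired speed, decreasing linearly to 0 (assumed slope)."""
    error = abs(speed_kmh - target_kmh)
    return max(0.0, 1.0 - error / MAX_SPEED_KMH)
```

With these assumed parameters, an agent 15 m from a red light has a desired speed of 20 km/h, and driving at 40 km/h there would yield a speed reward of 0.5.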

The second part of the reward, the desired position, decreases with the distance from the middle of the lane (we compute this distance using the waypoints mentioned above). This reward is at its maximum, 0, when the agent is exactly in the middle of the lane and goes to -1 when reaching a maximum distance from the lane center, Dmax. When the agent is further than Dmax, the episode terminates. For all our experiments, Dmax was set to 2 meters: this is the distance from the middle of the lane to the border. Other termination conditions are colliding with anything, running a red light and being stuck for no reason (i.e. neither behind an obstacle nor stopped at a red traffic light). For all those termination conditions, the agent receives a reward of -1.
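A minimal sketch of the position term and the termination rule follows. The linear shape between 0 and -1 and the choice to replace (rather than add to) the shaped reward on termination are assumptions; the function names are hypothetical.

```python
D_MAX_M = 2.0  # maximum lateral distance stated in the text


def position_reward(lateral_dist_m):
    """0 at the lane center, -1 at D_MAX_M (linear shape is an assumption)."""
    return -min(abs(lateral_dist_m), D_MAX_M) / D_MAX_M


def is_terminal(lateral_dist_m, collided, ran_red_light, stuck):
    """Episode ends on lane departure, collision, red light run or being stuck."""
    return bool(
        abs(lateral_dist_m) > D_MAX_M or collided or ran_red_light or stuck
    )


def step_reward(shaped_reward, terminal):
    """Assumption: on termination the agent receives -1 instead of the shaped reward."""
    return -1.0 if terminal else shaped_reward
```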

With only the two previous reward components, we observed that the trained agents were not driving straight, as oscillations near the center of the lane gave almost the same amount of reward as driving straight. That is why we added our third reward component, desired rotation. This reward decreases with the difference in angle between the agent and the orientation of the nearest waypoint of the optimal trajectory (see Figure 3 for details). Ablation studies on the reward shaping can be found in Section 5.3.
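The rotation term needs a heading difference that wraps correctly around 360 degrees. The sketch below shows one way to compute it; the normalization angle `max_angle_deg` and the linear shape are assumptions, as the text does not give the exact scaling.

```python
def angle_diff_deg(agent_yaw_deg, waypoint_yaw_deg):
    """Signed smallest difference between two headings, in [-180, 180)."""
    return (agent_yaw_deg - waypoint_yaw_deg + 180.0) % 360.0 - 180.0


def rotation_reward(agent_yaw_deg, waypoint_yaw_deg, max_angle_deg=90.0):
    """0 when aligned with the waypoint, -1 at max_angle_deg (assumed scale)."""
    diff = abs(angle_diff_deg(agent_yaw_deg, waypoint_yaw_deg))
    return -min(diff, max_angle_deg) / max_angle_deg
```

The wrap-around matters near the 0/360 boundary: headings of 350 and 10 degrees differ by 20 degrees, not 340.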

Figure 3: Lateral distance and angle difference for lateral and angle reward computation. The difference is measured between the ideal waypoint (in green) and the current agent position (in red).

4.1.3 Handling Discrete Actions

As mentioned above, standard value-based RL algorithms such as DQN [mnih2015human], Rainbow [Rainbow] and Rainbow-IQN [Toromanoff] require discrete actions. In our first trials, we had issues with agents oscillating and failing to stay in lane. The main reason for this failure is that we did not use enough discrete actions, particularly for the steering angle (only 5 actions at first). Better results can be obtained by using more steering actions, such as 9 or 27 different steering values. Throttle is less of an issue: 3 different values for throttle are used, plus one for brake. This leads to a total of 36 (9×4) or 108 (27×4) actions for our experiments. We also tried to predict the derivative of the steering angle: the network prediction is used to update the previous steering (which is given as input) instead of being used directly as the current steering.
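The action counts above come from a Cartesian product of steering values and longitudinal settings. A possible construction is sketched below; the evenly spaced steering values and the specific throttle and brake levels are hypothetical, only the counts (9 or 27 steering values, 3 throttle values plus 1 brake) come from the text.

```python
from itertools import product


def build_actions(n_steer=9, max_steer=1.0):
    """Return a list of (steering, throttle, brake) tuples.

    Steering values are evenly spaced in [-max_steer, max_steer]; the
    throttle levels and the full-brake value are hypothetical.
    """
    steer_values = [
        -max_steer + 2.0 * max_steer * i / (n_steer - 1) for i in range(n_steer)
    ]
    # 3 throttle settings plus 1 brake setting, as described in the text.
    longitudinal = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (0.0, 1.0)]
    return [(s, t, b) for s, (t, b) in product(steer_values, longitudinal)]
```

Calling `build_actions(9)` gives 36 actions and `build_actions(27)` gives 108, matching the two configurations used in the experiments.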

We also use a really simple yet effective trick: we can reach more fine-grained actions by bagging multiple predictions and averaging them. To do so, we simply use consecutive snapshots of the same training run, which avoids any additional training and comes for free. This consistently improved behavior, reducing oscillations by a large margin and also giving better final performance. Furthermore, as the encoder is frozen and can therefore be shared, the computational overhead of averaging multiple snapshots of the same training is almost negligible (less than 10% of the total forward time for averaging 3 predictions). Therefore, all our reported results are obtained by averaging 3 consecutive snapshots of the same training (for example, the result at 10M steps is the bagging of snapshots at 8M, 9M and 10M).
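One way to read this bagging trick is sketched below: each snapshot picks its greedy discrete action, and the resulting control values are averaged, which yields values between the discrete levels. The text does not specify exactly which quantity is averaged, so this interpretation, like the function names, is an assumption.

```python
def greedy_action(q_values, actions):
    """Return the (steering, throttle, brake) tuple with the highest Q-value."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return actions[best]


def bagged_control(q_values_per_snapshot, actions):
    """Average the greedy controls chosen by each snapshot."""
    picks = [greedy_action(q, actions) for q in q_values_per_snapshot]
    n = len(picks)
    return tuple(sum(p[k] for p in picks) / n for k in range(3))


# Toy example with 3 steering actions and 3 snapshots.
actions = [(-0.5, 1.0, 0.0), (0.0, 1.0, 0.0), (0.5, 1.0, 0.0)]
q_per_snapshot = [[0.1, 0.9, 0.2], [0.1, 0.3, 0.8], [0.0, 1.0, 0.5]]
steer, throttle, brake = bagged_control(q_per_snapshot, actions)
```

In this toy example two snapshots choose a steering of 0.0 and one chooses 0.5, so the averaged steering is a value that no single discrete action provides.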

Figure 4: Network architecture. A Resnet-18 [Resnet] encoder is used in a conditional network [Codevilla] with a Rainbow-IQN [Toromanoff] RL training (hence the IQN network [Dabney2018] and noisy fully connected layers [NoisyNetwork])

4.2 Implicit Affordances

4.2.1 Network Architecture

Most model-free RL works with images as input train a particularly small network [Dosovitskiy, Rainbow] compared to networks commonly used in supervised learning [VGG, Resnet]. One of the larger networks used for model-free RL on Atari is the large architecture from IMPALA [IMPALA], which consists of 15 convolutional layers and 1.6 million parameters; as a comparison, our architecture has 18 convolutional layers and 30M parameters. Moreover, IMPALA used more than 1B frames whereas we used only 20M. The most common architecture (e.g. [Mnih, Dabney2018]) is the one introduced in the original DQN paper [mnih2015human], taking an 84×84 grayscale image as input. Our first observation was that the traffic light state (particularly for US traffic lights, which are further away) could not be seen on such small images. Therefore a larger input size was chosen (around 35 times bigger than the one used in DQN): 4×288×288×3, obtained by concatenating 4 consecutive frames as a simple and standard [mnih2015human, Dosovitskiy] way to add some temporality to the input. We chose this size as it was the smallest one we tested on which we still had good accuracy on traffic light detection (using conventional supervised training). We chose Resnet-18 [Resnet] as a relatively small network (compared to the ones used in supervised training) to ensure a small inference time. Indeed, RL needs a lot of data to converge, so each step must be as fast as possible to reduce the overall training time. However, even if Resnet-18 is among the smallest networks used for supervised learning, it contains around 140 times more weights in its convolutional layers than the standard network from DQN [mnih2015human]. Moreover, Resnet-18 incorporates most of the state-of-the-art advances in supervised learning, such as residual connections and batchnorm [Batchnorm]. Finally, we used a conditional network as in Codevilla et al. [Codevilla] to handle 6 different maneuvers: follow lane, left/right/straight, change lane left/right. The full network architecture is described in Figure 4.
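The conditional (branched) design can be illustrated with a toy sketch in which each maneuver owns a separate output head on top of shared features, and the navigation order selects which head is evaluated. The plain linear heads with random weights below are a stand-in chosen for brevity, not the IQN and noisy layers of the actual architecture; all names are hypothetical.

```python
import random

MANEUVERS = (
    "follow_lane", "left", "right", "straight",
    "change_lane_left", "change_lane_right",
)


def make_branches(feature_dim, n_actions, seed=0):
    """One toy linear head (n_actions x feature_dim weights) per maneuver."""
    rng = random.Random(seed)
    return {
        m: [[rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
            for _ in range(n_actions)]
        for m in MANEUVERS
    }


def branch_q_values(features, maneuver, branches):
    """Evaluate only the head that matches the current navigation order."""
    weights = branches[maneuver]
    return [sum(w * f for w, f in zip(row, features)) for row in weights]
```

The same features therefore produce different action values depending on the order given at the intersection.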

4.2.2 Supervised Phase: Affordances Learning

How to train a larger network with larger images for RL?

Using a larger network and input size raises two major issues. The first one is that such a network is much slower and harder to train. Indeed, it is well known that training a DRL agent is really data consuming even with tiny networks. The second issue is the replay memory. One of the major advantages of value-based RL over policy-based methods is that it is off-policy, meaning the data used for learning can come from another policy. That is why the use of a replay memory is standard in value-based RL [mnih2015human, Rainbow], allowing for better data efficiency. But images 35 times bigger make it hard to store as many transitions: usually 1M transitions are stored, which corresponds to 6GB for 84×84 images and would thus be 210GB for 288×288×3 images, which is impractical.
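The factor of 35 and the order of magnitude of the memory figures can be checked with a short calculation. It assumes one frame stored per transition at one byte per value (uint8); the exact totals quoted above depend on what is actually stored and on the unit convention used.

```python
def replay_bytes(n_transitions, height, width, channels=1, bytes_per_value=1):
    """Bytes needed to store one frame per transition (uint8 by default)."""
    return n_transitions * height * width * channels * bytes_per_value


small = replay_bytes(1_000_000, 84, 84)        # grayscale DQN frames
large = replay_bytes(1_000_000, 288, 288, 3)   # RGB frames used here
ratio = large / small                          # roughly 35
```

Under these assumptions the small buffer takes about 7 GB and the large one about 249 GB, which is consistent with the ratio of roughly 35 used in the text.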

Our main idea is to pre-train the convolutional encoder part of the network to predict some high-level information and then freeze it while training the RL agent. The intuition is that the RL signal is too weak to train the whole network but can be used to train only the fully connected part. Moreover, this solves the replay memory issue, as we can now store the features directly in the replay memory instead of the raw images. We call this scheme implicit affordances because the RL agent does not use the explicit predictions but only has access to the implicit features (i.e. the features from which our initial supervised network predicts the explicit affordances).
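The storage side of this scheme can be sketched as a replay memory that encodes each image once at insertion time and keeps only the features. The class below is a minimal illustration with hypothetical names; `toy_encoder` is a trivial stand-in for the frozen Resnet-18, and prioritized sampling as used in Rainbow is omitted.

```python
import random
from collections import deque


class FeatureReplay:
    """Replay memory that stores encoder features instead of raw images."""

    def __init__(self, encoder, capacity):
        self.encoder = encoder            # frozen: never updated during RL
        self.buffer = deque(maxlen=capacity)

    def add(self, image, action, reward, next_image, done):
        # Encode once at insertion time; the raw images are then discarded.
        self.buffer.append(
            (self.encoder(image), action, reward, self.encoder(next_image), done)
        )

    def sample(self, batch_size, rng=random):
        """Uniformly sample stored transitions."""
        return rng.sample(list(self.buffer), batch_size)


def toy_encoder(image):
    """Hypothetical stand-in for the frozen encoder: mean of each image row."""
    return tuple(sum(row) / len(row) for row in image)
```

Because the encoder is frozen, features stored early in training remain valid later, which would not hold if the encoder weights kept changing.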

Viewpoints Augmentation
Figure 5: Why data augmentation is needed for training the encoder: RL agents trajectories (right) might deviate from the lane center, which leads to semantic segmentation with much more varied lane marking positions than what can be encountered if training only from autopilot data (left).

The data for the supervised phase is collected while driving with an existing autopilot in the CARLA simulator. However, this autopilot always stays in the middle of the lane, so the frozen pre-trained encoder does not generalize well during RL training, particularly when the agent starts to deviate from the middle of the lane: with an encoder trained on data collected only from autopilot driving, RL agent performance was poor. This is the same distribution mismatch issue as in imitation learning, and the intuition behind it is illustrated in Figure 5. To make the whole training work, we added viewpoint augmentation by moving the camera around the autopilot. With this augmentation the encoder performs much better while the RL agent drives and explores, and we found it was mandatory to obtain good performance during the RL training phase.

Which high level semantic information/affordances to predict?

The simplest idea to pre-train our encoder would be to use an auto-encoder [VAE], i.e. compressing the images by trying to reconstruct the full image from a smaller feature space. This was used in the work Learning to Drive in a Day [Kendall] and allowed for faster training on their real car. We thought this would not work for our harder use case, particularly regarding traffic light detection. Indeed, the traffic light state occupies only a few pixels in the image (red or green), but those pixels are the most relevant for the driving behavior.

To ensure that there is relevant signal in the features used as RL state, we chose to rely on high-level semantic information available in CARLA. We used 2 main losses for our supervised phase: traffic light state (binary classification) and semantic segmentation. Indeed, all relevant information except the traffic light state is contained in our semantic segmentation. We used 6 classes for the semantic mask: moving obstacles, traffic lights, road markers, road, sidewalk and background. We also predict some other affordances to help the supervised training, such as the distance to the incoming traffic light, whether we are in an intersection or not, the distance from the middle of the lane and the relative rotation to the road. The last two estimations are made possible by our viewpoint augmentation (without it the autopilot is always perfectly in the middle of the lane with no rotation). Our supervised training with all our losses is represented in Figure 6. Ablation studies to estimate the impact of these affordance estimations are presented in Section 5.2.

Figure 6: Decoder and losses used to train the encoder: semantic segmentation, traffic light (presence, state, distance), intersection presence, lane position (distance and rotation)

5 Experiments and Ablation Studies

5.1 Defining a Common Test Situation and a Metric for Comparison

We first define a common set of scenarios and a metric to allow fair comparison. Indeed, the CARLA challenge maps are not publicly available and the old CARLA benchmark is only available on a deprecated version of CARLA (0.8.X) whose rendering and physics differ from the version of CARLA used in the CARLA challenge (0.9.X). Moreover, as mentioned above, this CARLA benchmark is a much simpler task than the CARLA challenge.

Defining test scenarios

We chose the hardest environment among the available CARLA maps. Town05 includes the biggest urban district, is mainly multi-lane and US style: the traffic lights are on the opposite side of the road and much harder to detect. We also randomly spawn pedestrians crossing the road ahead of our agent to verify that our models brake in these situations. We additionally set changing weather to make the task as hard as possible. This way, even with single-town training, we have a challenging setup. Single-town training is necessary to run all our experiments and ablation studies in a reasonable time. All those experiments were made with 20M iterations on CARLA, with 3 actors (so around 6.6M steps for each actor) and with a framerate of 10 FPS. Thus 20M steps is equivalent to around 20 days of simulated driving (as a comparison, the most standard budget [mnih2015human, Dabney2018] used to train RL on Atari games is 200M frames, corresponding to around 40 days, and can go to more than 5 years of game time in some papers [R2D2, Horgan]). We defined 10 scenarios of urban situations, each one consisting of 10 consecutive intersections over the whole Town05 environment. We also defined some highway scenarios, but we found those cases were much easier and thus less discriminative: for example our best model goes off-road less than once every 100 km in highway situations. The highway scenarios were mostly used for evaluating the oscillations of our different agents.
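The conversion from step counts to simulated time is simple arithmetic and can be checked as follows. The 10 FPS figure for CARLA comes from the text; the 60 FPS rate used for Atari is an assumption based on the usual emulator frame rate.

```python
def simulated_days(n_steps, fps):
    """Convert a number of simulation steps at a given frame rate to days."""
    return n_steps / fps / 86_400  # 86,400 seconds per day


carla_days = simulated_days(20_000_000, 10)    # about 23 days
atari_days = simulated_days(200_000_000, 60)   # about 39 days (assumed 60 FPS)
steps_per_actor = 20_000_000 / 3               # about 6.7M steps per actor
```

Both results agree with the rounded figures of around 20 and around 40 days quoted above.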

Defining a metric to compare different model and ablation studies

We tested our models 10 times on each scenario, varying the weather conditions and resetting the position of all other agents. Contrary to the training phase, we only terminate the episode when the agent goes off-road, as this allows us to keep track of the number of infractions encountered. Our main metric is the average percentage of intersections successfully crossed (Inters., higher is better); for example, 50% completion corresponds to a mean of 5 intersections crossed in each scenario. We also keep track of the percentage of traffic lights passed without infraction (TL, higher is better) and the percentage of pedestrians passed without collision (Ped., higher is better). Note that the last two are slightly less relevant, as a non-moving car will never run a red traffic light nor hit a pedestrian. That is why our main metric for comparison is the mean percentage of intersections crossed, and we use the traffic light and pedestrian metrics for more fine-grained comparison. We also introduced a measure for oscillations that we use in Section 5.3: the mean absolute rotation between the agent and the road along the episode (Osc., lower is better).
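The two main measures can be written down directly from their definitions. The sketch below assumes that each run reports the number of intersections crossed and a list of relative yaw angles sampled along the episode; the function names and the data layout are hypothetical.

```python
def intersection_score(crossed_per_run, intersections_per_scenario=10):
    """Average percentage of intersections crossed over all runs."""
    total = len(crossed_per_run) * intersections_per_scenario
    return 100.0 * sum(crossed_per_run) / total


def oscillation_score(relative_yaw_deg):
    """Mean absolute rotation between agent and road along an episode."""
    return sum(abs(a) for a in relative_yaw_deg) / len(relative_yaw_deg)
```

For instance, runs crossing 5 intersections out of 10 on average give a score of 50%, as in the example from the text.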

5.2 Ablations Studies on the Supervised Phase

In this section, we will detail our ablation studies concerning the supervised learning phase of affordances. The RL setup is exactly the same to ensure fair comparison.

Figure 7: Evolution of agent performance with training steps and choice of the encoder behavior. The first group of encoders (solid lines) have frozen weights, the second group (dashed) are trained only by the RL signal (stopped earlier because the performance is clearly lower). Some experiments are averaged over multiple seeds (see Supplementary Materials for details on stability).

First, some experiments are conducted without any supervised phase, i.e. training the whole network from scratch in the RL phase. Three different architectures are compared: the initial network from DQN with 84×84 images, a simple upgrade of the DQN network which takes 288×288×3 images as input and finally our model with the Resnet-18 encoder.

Figure 7 shows that without affordance learning, agents fail to learn and do not even manage to pass one intersection on average (less than 10% of intersections crossed). Moreover, it is important to note that training the bigger image encoder (respectively the full Resnet-18) took 50% (resp. 200%) more time than training with our implicit affordances scheme, even when counting the time used for the supervised phase. That is why we stopped these experiments after 10M steps. These networks also require much more memory, because full images are stored in the replay memory. As expected, these experiments indicate that training a large network using only the RL signal is hard.

Encoder used       Inters.   TL      Ped.
Random             0%        NA      NA
No TL state        33.4%     80%     82%
No segmentation    41.6%     96.5%   63%
All affordances    61.9%     97.6%   76%
Table 1: Comparison of agent performance with regards to encoder training loss (random weights, trained without traffic light loss, without semantic segmentation loss, or with all affordance losses)

The second stage of experiments concerned the Resnet-18 encoder training. First, as a sanity check, the encoder is frozen to random features. Then, either the traffic light state or the segmentation is removed from the loss in the supervised phase. These experiments show the interest of predicting the traffic light state and the semantic segmentation in our supervised training. The performance of the corresponding agents is illustrated in Figure 7.

Table 1 shows that removing the traffic light state has a huge impact on the final performance. As expected, the RL agent using an encoder trained without the traffic light loss runs more red traffic lights. It is interesting to note that its success rate is still much better than a random choice (which would give a 25% success rate on traffic lights, because traffic lights are green only 25% of the time). This means that the agent still manages to detect some traffic light state signal in the features. We guess that, as the semantic segmentation includes a traffic light class (but not its actual state), the features contain some information about the traffic light state. Removing the semantic segmentation loss from the encoder training also has an impact on final performance. As expected, performance on pedestrian collisions is worse than with any other training, meaning the network has trouble detecting pedestrians and vehicles (this information is only contained in the semantic map).

5.3 Ablation Studies on the RL Setup

For a fair comparison, the same pre-trained encoder is used for all experiments, trained with all affordances mentioned in Section 4.2.2. The encoder used here is the same one as for the CARLA challenge, and has been trained on slightly more data and for more epochs than the encoders used for the previous ablation study.

Two experiments are conducted with different rewards to measure the impact of the reward shaping. In the first one (constant desired speed), the desired speed is not adapted to the situation: the agent needs to understand only from the termination signal that it must brake at red traffic lights and avoid collisions. In the second experiment, the angle reward component is removed to see the impact of this reward on oscillations. Two different settings for actions are also evaluated. First, the derivative of the steering angle is predicted instead of the current steering. Finally, the steering angle discretization is studied, decreasing from 27 to 9 absolute steering values. Results are summarized in Table 2.

Input/output | Inters. | TL | Ped. | Osc.
Constant desired speed | 50.3% | 31% | 42% | 1.51°
No angle reward | 64.7% | 99% | 77.7% | 1.39°
27 steering values (derivative) | 64.5% | 98.7% | 85.1% | 1.64°
9 steering values (absolute) | 74.4% | 98.5% | 84.6% | 0.88°
27 steering values (absolute) | 75.8% | 98.3% | 81.6% | 0.84°
Table 2: Performance comparison according to the steering angle discretization used and the reward shaping

The most interesting result of these experiments is the one from Constant desired speed. Indeed, the agent fails totally at braking for both red traffic lights and pedestrian crossings: its performance is much worse than that of any other agent. The agent trained with a constant desired speed runs 70% of traffic lights, which is very close to a random choice. It also collides with 60% of pedestrians. This experiment shows how important the speed reward component is for learning a braking behaviour.

Surprisingly, we found that predicting the derivative of the steering results in more oscillations, even more than when removing the desired rotation reward component. Finally, using 9 or 27 different steering values did not have any significant impact, and both of these agents reach the best performance with low oscillation.
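The two action parameterizations compared in Table 2 can be sketched as follows. This is a minimal illustration, not our actual implementation: the normalized steering range [-1, 1], the uniform spacing of the bins and the `max_delta` scale of the derivative variant are all assumptions, and the function names are hypothetical.

```python
from typing import List

def steering_bins(n: int, limit: float = 1.0) -> List[float]:
    """n uniformly spaced steering values in [-limit, limit] (assumed layout)."""
    return [limit * (2 * i / (n - 1) - 1) for i in range(n)]

def apply_absolute(action: int, bins: List[float]) -> float:
    """Absolute mode: the action index directly selects the steering value."""
    return bins[action]

def apply_derivative(action: int, bins: List[float], current: float,
                     max_delta: float = 0.1) -> float:
    """Derivative mode: the action selects a change that is added to the
    current steering; the result is clipped to the valid range.
    max_delta is a hypothetical scale, not a value from the paper."""
    new_steering = current + bins[action] * max_delta
    return max(-1.0, min(1.0, new_steering))

bins27 = steering_bins(27)
bins9 = steering_bins(9)

# Same action index, two interpretations.
print(apply_absolute(20, bins27))
print(apply_derivative(20, bins27, current=0.3))
```

In the derivative mode the steering applied at each step depends on the whole history of previous actions, which is one possible reason why small errors accumulate into oscillations; this interpretation is a guess rather than something we measured.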

5.4 Generalization on Unseen Towns

Finally, we ran some generalization experiments, as this was the actual setting of the CARLA challenge. For this purpose, we trained on 3 different towns at the same time (one with EU traffic lights and the 2 others with US traffic lights) and tested on 2 unseen towns (one EU and one US). We also test our best single-town agent as a generalization baseline.

Training | Unseen EU Town | Unseen US Town
Only Town05 | 2.4% | 42.6%
Multi town | 58.4% | 36.2%
Table 3: Generalization performance.

We can see that performance on the unseen EU town is really poor for the agent trained only on a single US town, confirming the value of training on both EU and US towns at the same time. On the unseen US town, the performance is roughly similar for both trainings. These experiments show that our method generalizes to unseen environments. A video of this agent performing on unseen environments can be found at https://www.youtube.com/watch?v=YlCJ84VO3cU.

5.5 Comparison on CARLA Benchmark

Very recently, Learning by Cheating [Chena] released an open-source re-implementation of the CARLA benchmark on the newest version of CARLA. With such limited time, we could only test our best agent, i.e. the agent trained on one EU town and two US towns. It is important to note that we trained all our agents in CARLA 0.9.5, whereas the re-implementation of the benchmark relies on CARLA 0.9.6, with highly different rendering as mentioned in the paper of Chen et al. [Chena]. We also did not have time to change our training setup regarding the weather conditions, so we only report results on the train and test towns under training weather (train town results can be found in the Supplementary).

CoRL2017 (test town)
Task | RL | CAL | CILRS | LBC | Ours
Straight | 74 | 93 | 96 | 100 | 95
One turn | 12 | 82 | 84 | 100 | 88
Navigation | 3 | 70 | 69 | 98 | 94
Nav. dynamic | 2 | 41 | 66 | 99 | 91

NoCrash (test town)
Task | LBC | Ours
Empty | 100 | 90
Regular | 94 | 84
Dense | 51 | 60
Table 4: Success rate comparison (in % for each task and scenario, more is better) with baselines [Dosovitskiy, Sauer, CILRS, Chena]

Even if our training setup is really different and thus cannot be directly compared to other methods, we think our results give a good idea of our method's performance, especially knowing that we trained on another version of CARLA with different rendering and that our agent also handles multi-lane towns and US traffic lights. The new LBC [Chena] baseline is the only one outperforming our agent on the hardest task of the CoRL2017 benchmark (i.e. Nav. dynamic). Our agent also has the best performance on the hardest situation, the Dense scenario of the NoCrash benchmark. A video of our agent tested on the Dense scenario of the NoCrash benchmark can be found at https://www.youtube.com/watch?v=YlCJ84VO3cU.

6 Conclusion

In this work, we introduced implicit affordances as a new method allowing replay-memory-based RL to be trained with bigger networks and input sizes. We presented the first successful RL agent for end-to-end urban driving from vision, including traffic light detection, and validated our design choices with ablation studies. We showcased our performance by being among the top teams of the “Camera Only” track of the CARLA challenge. In future work, it could be interesting to apply our implicit affordances scheme to policy-based or actor-critic methods and to train our affordance encoder on real images in order to apply this method on a real car.


The authors would like to thank Mustafa Shukor for his valuable time and his help with training some of our encoders.


Appendix A Supplementary materials: Implementation details

In this section, we will detail the hyper-parameters and the architecture of both the Supervised and the Reinforcement Learning training.

A.1 Supervised phase of affordances training: architecture and hyper-parameters

Our encoder architecture is mainly based on Resnet-18 [Resnet] with two main differences. First, we changed the first convolutional layer to take 12 channels as input (we stack 4 RGB frames). Secondly, we changed the kernel size of the downsample convolutional layers from 1x1 to 2x2. Indeed, as mentioned in the ENet paper [Enet]: “When downsampling, the first 1x1 projection of the convolutional branch is performed with a stride of 2 in both dimensions, which effectively discards 75% of the input. Increasing the filter size to 2x2 allows to take the full input into consideration, and thus improves the information flow and accuracy.” We also removed the last two layers: the average pooling layer and the final fully connected layer. Finally, we added a last downsample layer taking 512x7x7 feature maps as input and outputting our RL state of size 512x4x4.
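The 75% figure quoted above, and the 7x7 to 4x4 reduction of the last layer, can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names are hypothetical, and the kernel size, stride and padding used for the final downsample layer (3, 2, 1) are an assumption, since only its input and output sizes are given here.

```python
def input_coverage(size: int, kernel: int, stride: int) -> float:
    """Fraction of a size x size input that is read by a kernel x kernel
    filter slid with the given stride (no padding)."""
    touched = set()
    for top in range(0, size - kernel + 1, stride):
        for left in range(0, size - kernel + 1, stride):
            for dy in range(kernel):
                for dx in range(kernel):
                    touched.add((top + dy, left + dx))
    return len(touched) / (size * size)

def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard output-size formula of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# 1x1 kernel with stride 2 reads a quarter of the input, 2x2 reads all of it.
print(input_coverage(8, kernel=1, stride=2))
print(input_coverage(8, kernel=2, stride=2))

# One possible setting mapping 7x7 feature maps to 4x4 (assumed values).
side = conv_out(7, kernel=3, stride=2, padding=1)
print(side, 512 * side * side)
```

With a 4x4 output and 512 channels, the flattened RL state contains 512 · 4 · 4 = 8192 values.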

For the loss computation, we apply a weight of 10 to the part of the loss related to traffic light state detection, and a weight of 1 to all other losses.
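As a minimal sketch of this weighting, the total supervised loss can be written as a weighted sum of the per-task losses. The weights (10 and 1) are the ones stated above; the task names and the numerical loss values are hypothetical placeholders chosen only for illustration.

```python
from typing import Dict

# Weight of 10 for the traffic light state loss, 1 for every other loss.
LOSS_WEIGHTS = {"traffic_light_state": 10.0}
DEFAULT_WEIGHT = 1.0

def total_loss(losses: Dict[str, float]) -> float:
    """Weighted sum of the per-task losses."""
    return sum(LOSS_WEIGHTS.get(name, DEFAULT_WEIGHT) * value
               for name, value in losses.items())

# Hypothetical per-task loss values for one batch (not measured values).
example_losses = {
    "traffic_light_state": 0.2,
    "semantic_segmentation": 0.5,
    "traffic_light_distance": 0.1,
}
print(total_loss(example_losses))
```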

Table 5: Supervised training hyperparameters
Parameter | Value
Learning rate | 5×10⁻⁵, eps 3×10⁻⁴ (Adam)
Batch size | 32
Epochs | 20

For the semantic decoder, each layer consists of an upsample layer with nearest neighbor interpolation, followed by 2 convolutional layers with batchnorm. The heads for all the other losses are built with fully connected layers with one hidden layer of size 1024. See Table 5 for more details on other hyper-parameters used in the supervised phase.

To train our encoder, we used a dataset of around 1M frames with associated ground-truth labels (e.g. semantic segmentation, traffic light state and distance). This dataset was collected mainly in 2 cities of the CARLA [Dosovitskiy] simulator: Town05 (US) and Town02 (EU).

A.2 Reinforcement Learning phase: architecture and hyper-parameters

In all our RL trainings, we used our encoder trained on affordance learning as a frozen image encoder: the actual RL state is the 8192 features (512x4x4) coming from this frozen encoder. We then give this state to one fully connected layer of size 8192x1024. Then, from these 1024 features concatenated with the 4 previous speed and steering angle values, we use a gated network to handle the different orders, as presented in CIL [Codevilla]. All 6 heads have the same architecture but different weights; each is made of 2 fully connected layers with one hidden layer of size 512.
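The gating mechanism can be illustrated with the toy sketch below, in which the current order simply selects which head is evaluated. It is not our network: each head is replaced by a fixed random linear map to keep the example dependency-free, and the input size, the number of actions and all names are hypothetical. Only the number of heads (6) comes from the description above.

```python
import random
from typing import Callable, List

Head = Callable[[List[float]], List[float]]

def make_head(in_size: int, n_actions: int, seed: int) -> Head:
    """Toy stand-in for one head: a fixed random linear map (the real
    heads are 2 fully connected layers with a hidden layer of size 512)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_size)]
               for _ in range(n_actions)]

    def head(features: List[float]) -> List[float]:
        return [sum(w * f for w, f in zip(row, features)) for row in weights]

    return head

def gated_forward(heads: List[Head], features: List[float], order: int) -> List[float]:
    """Evaluate only the head selected by the current navigation order."""
    return heads[order](features)

N_ORDERS = 6     # number of heads, from the text
IN_SIZE = 8      # hypothetical, kept tiny for readability
N_ACTIONS = 3    # hypothetical

heads = [make_head(IN_SIZE, N_ACTIONS, seed=i) for i in range(N_ORDERS)]
features = [0.1 * i for i in range(IN_SIZE)]

q_order0 = gated_forward(heads, features, order=0)
q_order1 = gated_forward(heads, features, order=1)
print(q_order0)
print(q_order1)
```

All heads receive the same shared features, so the order only changes which set of weights produces the action values.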

Table 6: RL training hyperparameters for our Single Town and Multi-Town experiments: all parameters not mentioned come from the open-source implementation of Rainbow-IQN [Toromanoff].
Parameter | Single Town / Multi-Town
Learning rate | 5×10⁻⁵, eps 3×10⁻⁴ (Radam)
Batch size | 32
Memory capacity | 90 000 / 450 000
Number of actors | 3 / 9
Number of steps | 20M (23 days) / 50M (57 days)
Synchro. actors/learner | Yes / No

All hyperparameters used in our Rainbow-IQN training are the same as those used in the open-source implementation [Toromanoff], except for the replay memory size and the optimiser. We use the very recent Radam [Radam] optimiser, as it gives consistent improvements on standard supervised training. Some comparisons were made with the Adam optimiser but did not show any significant difference. For all our Single Town experiments, we used Town05 (US) as the environment. For our Multi-Town training, we used Town02 (EU), Town04 (US) and Town05 (US). Table 6 details the hyper-parameters used in our RL training.

Appendix B Experiments

B.1 Stability study

One RL training of 20M steps took more than one week on an Nvidia 1080 Ti. That is why we had neither the time nor the computational resources to run an extensive stability study for all our experiments. Moreover, evaluating our saved snapshots also took time: around 2 days to evaluate performance at every million steps, as in Figure 7 of the main paper. Still, we performed multiple runs for 3 experiments presented in Table 1: No TL state, No segmentation and All affordances. We evaluated those seeds at 10M and at 20M steps; the results (mean and standard deviation) can be found in Table 7.

Encoder used | Inters. (10M steps) | Nb seeds (10M steps) | Inters. (20M steps) | Nb seeds (20M steps)
No TL state | 17.9% ± 7.3 | 6 | 27% ± 5.7 | 5
No segmentation | 27.7% ± 9.3 | 5 | 41.7% ± 0.1 | 2
All affordances | 24.9% ± 8.2 | 6 | 64.4% ± 2.5 | 2
Table 7: Mean and standard deviation of agent performance with regard to the encoder training loss (trained without traffic light loss, without semantic segmentation loss, or with all affordance losses)

Even if we only have a few different runs, those stability experiments support the claim that our trainings are roughly stable and our results are significant. At 20M steps, the “best” seed of No TL state performs worse than both seeds of No segmentation. More importantly, both seeds of No segmentation perform much worse than both seeds of All affordances.
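The kind of summary reported in Table 7, and the seed-by-seed comparison used in the argument above, can be sketched as follows. The per-seed scores below are hypothetical numbers chosen for illustration, not our raw results, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev
from typing import List, Tuple

def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-seed scores (in %)."""
    return mean(scores), stdev(scores)

def all_seeds_worse(worse: List[float], better: List[float]) -> bool:
    """True if every seed of `worse` scores below every seed of `better`."""
    return max(worse) < min(better)

# Hypothetical per-seed intersection rates (NOT the paper's raw data).
runs = {
    "config A": [40.0, 44.0],
    "config B": [60.0, 70.0],
}

for name, scores in runs.items():
    m, s = summarize(scores)
    print(f"{name}: {m:.1f}% ± {s:.1f}")

print(all_seeds_worse(runs["config A"], runs["config B"]))
```

Checking that the ranges of two configurations do not overlap is a weaker statement than a proper statistical test, which would require more seeds than we could afford.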

B.2 Additional experiments

We made one experiment, 4 input one output, to measure the impact of predicting only one semantic segmentation instead of predicting 4 at the same time. Indeed, we stack 4 frames as our input and we thought that training with all 4 semantic segmentations would give more information to learn from. We also tried to remove temporality from the input: taking only one frame as input and thus predicting only one semantic segmentation (One input one output). Finally, we made an experiment, U-net Skip connection, in which we used a standard U-net like architecture [Unet] for the semantic prediction. Indeed, we did not use skip connections in any of our other experiments, to prevent the semantic information from flowing through these skip connections. Our intuition was that the semantic information might not be present in our final RL state (the last feature maps of size 4x4) if skip connections were used.

The results of these 3 experiments are described in Table 8.

Encoder used | Inters. | TL | Ped.
One input one output | 29.6% | 95% | 85%
4 input one output | 64.3% | 93.8% | 70.7%
U-net Skip connection | 58.6% | 95% | 69.8%
All affordances | 64.4% | 98.1% | 76.2%
Table 8: Additional experiments to study the impact of temporality both as input and as output of our supervised phase, as well as an experiment with skip connections for the semantic prediction (U-net like skip connections [Unet]).

We can see from these results that using only one frame as input has a large impact on the final performance (going from 64% of intersections crossed with our standard scheme All affordances to 29% when using only one image as input). The impact of predicting only one semantic segmentation instead of 4 is marginal on our main metric (Inters.), but we can see that the performance on traffic lights (TL) and on pedestrians (Ped.) is slightly lower. Finally, the impact of using U-net like skip connections seems to be relatively small on the number of intersections crossed. However, there is still a difference with our normal system, particularly on the pedestrians metric.

In conclusion, those additional experiments confirmed our intuitions: first, to add temporality both as input and output of our encoder, and secondly, not to use standard U-net skip connections in our semantic segmentation decoder, to prevent semantic information from flowing away from our final RL state. However, the impact of these choices is relatively small and we ran only one seed, which may not be representative enough.

B.3 Description of our test scenario

Each of our scenarios is defined by a starting waypoint and 10 orders, one for each intersection to cross. An example of one of our 10 scenarios can be found in Figure 8. We also spawn 50 vehicles in the whole of Town05 while testing. Finally, we randomly spawn pedestrians ahead of the agent every 20 to 30 seconds.

Figure 8: Sample of one of our scenarios in Town05. The blue point is the starting point, the red one is the destination.
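The scenario description above can be captured in a small data structure, sketched below. The field names, the waypoint representation, the order labels and the uniform sampling of the pedestrian delay are all hypothetical choices made for illustration; only the numbers (10 orders, 50 vehicles, a pedestrian roughly every 20 to 30 seconds) come from the text.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Scenario:
    start_waypoint: Tuple[float, float, float]   # hypothetical (x, y, yaw)
    orders: List[str]                            # one order per intersection
    n_vehicles: int = 50                         # vehicles spawned in the town
    pedestrian_delay_s: Tuple[float, float] = (20.0, 30.0)

    def __post_init__(self) -> None:
        if len(self.orders) != 10:
            raise ValueError("a scenario needs exactly 10 orders")

    def next_pedestrian_delay(self, rng: random.Random) -> float:
        """Delay in seconds before the next pedestrian is spawned
        (uniform sampling is an assumption)."""
        low, high = self.pedestrian_delay_s
        return rng.uniform(low, high)

# Hypothetical order labels, for illustration only.
scenario = Scenario(
    start_waypoint=(0.0, 0.0, 90.0),
    orders=["left", "straight", "right", "straight", "left",
            "right", "straight", "left", "right", "straight"],
)
delay = scenario.next_pedestrian_delay(random.Random(0))
print(scenario.n_vehicles, round(delay, 1))
```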

B.4 Comparison on CARLA Benchmark: Train town results (Town02)

CoRL2017 (train town)
Task | RL | CAL | CILRS | Ours
Straight | 89 | 100 | 96 | 95
One turn | 34 | 97 | 92 | 100
Navigation | 14 | 92 | 95 | 98
Nav. dynamic | 7 | 83 | 92 | 99

NoCrash (train town)
Task | Ours
Empty | 97
Regular | 89
Dense | 38
Table 9: Success rate comparison (in % for each task and scenario, more is better) with baselines [Dosovitskiy, Sauer, CILRS]

As mentioned in the main paper, we did not have time to adapt our training setup to the very recently released implementation [Chena] of the CARLA benchmark on the newer version of CARLA (0.9.6), particularly regarding the weather conditions. Actually, we only had time to test our CARLA Challenge agent (i.e. our multi-town training) on this benchmark. That is why we report our results as Training Weather: in our training setup, we varied the weather conditions as much as possible and did not hold any out for test time. Moreover, as we trained our agent on Town02, we used the Town01 scenarios as Test Town and the Town02 scenarios as Train Town. Therefore, our results cannot be directly compared, as our training and testing setup differs from the standard one used in other papers [Dosovitskiy, Sauer, CILRS, Chena]. Still, we think this gives a reasonable idea of our agent's performance, especially knowing that our agent also deals with harder tasks present in the CARLA challenge [carlaChallenge], such as handling multi-lane towns and US/EU traffic lights at the same time. We gave our results on Test Town in the main paper and we give our results on Train Town in Table 9.

B.5 Training infrastructure

The training of the agents was split over several computers and GPUs, containing in total:

  • 3 Nvidia Titan X and 1 Nvidia Titan V (training computer)

  • 1 Nvidia 1080 Ti (local workstation)

  • 2 Nvidia 1080 (local workstations)

  • 3 Nvidia 2080 (training computer)