Dynamic Interaction-Aware Scene Understanding for Reinforcement Learning in Autonomous Driving

  • 2019-09-30 10:59:11
  • Maria Huegle, Gabriel Kalweit, Moritz Werling, Joschka Boedecker

Abstract

The common pipeline in autonomous driving systems is highly modular and includes a perception component which extracts lists of surrounding objects and passes these lists to a high-level decision component. In this case, leveraging the benefits of deep reinforcement learning for high-level decision making requires special architectures to deal with multiple variable-length sequences of different object types, such as vehicles, lanes or traffic signs. At the same time, the architecture has to be able to cover interactions between traffic participants in order to find the optimal action to be taken. In this work, we propose the novel Deep Scenes architecture, that can learn complex interaction-aware scene representations based on extensions of either 1) Deep Sets or 2) Graph Convolutional Networks. We present the Graph-Q and DeepScene-Q off-policy reinforcement learning algorithms, both outperforming state-of-the-art methods in evaluations with the publicly available traffic simulator SUMO.

 


Maria Huegle (1), Gabriel Kalweit (1), Moritz Werling (2) and Joschka Boedecker (1,3)
(1) Dept. of Computer Science, University of Freiburg, Germany. {hueglem,kalweitg,jboedeck}@cs.uni-freiburg.de
(2) BMW Group, Unterschleissheim, Germany. [email protected]
(3) Cluster of Excellence BrainLinks-BrainTools, Freiburg, Germany.


I INTRODUCTION

In autonomous driving scenarios, the number of traffic participants and lanes surrounding the agent can vary considerably over time. Common autonomous driving systems use modular pipelines, where a perception component extracts a list of surrounding objects and passes this list to other modules, including localization, mapping, motion planning and high-level decision making components. Classical rule-based decision-making systems are able to deal with variable-sized object lists, but are limited in terms of generalization to unseen situations or are unable to cover all interactions in dense traffic. Since Deep Reinforcement Learning (DRL) methods can learn decision policies from data and off-policy methods can improve from previous experience, they offer a promising alternative to rule-based systems. In the past years, DRL has shown promising results in various domains [1, 2, 3, 4, 5]. However, classical DRL architectures like fully-connected or convolutional neural networks (CNNs) are limited in their ability to deal with variable-sized, structured inputs or to model interactions between objects.

Prior works on reinforcement learning for autonomous driving that used fully-connected network architectures and fixed-sized inputs [6, 7, 5, 8, 9] are limited in the number of vehicles that can be considered. CNNs using occupancy grids [10, 11] are limited to their initial grid size. Recurrent neural networks are useful to cover temporal context, but for a fixed time step they cannot handle a variable number of objects in a way that is permutation-invariant w.r.t. the input order. In [12], limitations of these architectures are shown and a more flexible architecture based on Deep Sets [13] is proposed for off-policy reinforcement learning of lane-change maneuvers, outperforming traditional approaches in evaluations with the open-source simulator SUMO.

In this paper, we propose to use Graph Networks [14] as an interaction-aware input module in reinforcement learning for autonomous driving. We employ the structure of Graphs in off-policy DRL and formalize the Graph-Q algorithm. In addition, to cope with multiple object classes of different feature representations, such as different vehicle types, traffic signs or lanes, we introduce the formalism of Deep Scenes, that can extend Deep Sets and Graph Networks to fuse multiple variable-sized input sets of different feature representations. Both of these can be used in our novel DeepScene-Q algorithm for off-policy DRL. Our main contributions are:

  1. Using Graph Convolutional Networks to model interactions between vehicles in DRL for autonomous driving.

  2. Extending existing set input architectures for DRL to deal with multiple lists of different object types.

II RELATED WORK

Graph Networks are a class of neural networks that can learn functions on graphs as input [15, 16, 17, 18, 19] and can reason about how objects in complex systems interact. They can be used in DRL to learn state representations [20, 21, 22, 17], e.g. for inference and control of physical systems with bodies (objects) and joints (relations). In autonomous driving, Graph Networks were used for supervised traffic prediction while modeling traffic participant interactions [23], where vehicles were modeled as objects and interactions between them as relations. Another type of interaction-aware network architecture, Interaction Networks, was proposed to reason about how objects in complex systems interact [18]. A vehicle behavior interaction network that captures vehicle interactions was presented in [24]. In [25], a convolutional social pooling component was proposed using a CNN to model spatial connections between vehicles for vehicle trajectory prediction.

Fig. 1: Scheme of DeepScene-Q, using (a) Deep Sets and (b) Graphs. Both architectures combine multiple variable-length object lists in a scene, here a traffic sign $s_1$, lanes $l_1, l_2$ and vehicles $x_1, x_2$. The modules $\phi_i$, $\rho$ and $Q$ are fully-connected networks. As permutation invariant pooling operator, we use the sum. The vector $x^{\text{static}}$ includes static features and $q$ the action value output.

III PRELIMINARIES

We model the task of high-level decision making for autonomous driving as a Markov Decision Process (MDP), where the agent is following a policy $\pi$ in an environment in a state $s_t$, applying a discrete action $a_t \sim \pi$ to reach a successor state $s_{t+1}$ according to a transition model. In every time step $t$, the agent receives a reward $r_t$, e.g. for driving as close as possible to a desired velocity. The agent tries to maximize the discounted long-term return $R(s_t) = \sum_{i \geq t} \gamma^{i-t} r_i$, where $\gamma \in [0,1]$ is the discount factor. In this work, we use Q-learning [26]. The Q-function $Q^{\pi}(s_t, a_t) = \mathbf{E}_{a_{i>t} \sim \pi}[R(s_t) \mid a_t]$ represents the value of following a policy $\pi$ after applying action $a_t$. The optimal policy can be inferred from the optimal action-value function $Q^*$ by maximization over actions.
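As a small worked illustration of the return and of greedy action selection, the following sketch evaluates the discounted sum for a finite reward sequence. The reward values, the discount factor and the function names are hypothetical and chosen only for the example.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence r_t, r_{t+1}, ... starting at step t."""
    ret = 0.0
    # Accumulate backwards so that every earlier step applies one more factor of gamma.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def greedy_action(q_values):
    """Index of the action with the largest Q-value (ties resolve to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical rewards for three consecutive steps and hypothetical Q-values
# for the three lateral actions (keep lane, left, right).
print(discounted_return([1.0, 0.5, 0.25], gamma=0.5))
print(greedy_action([0.2, 0.7, 0.1]))
```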

III-A Q-Function Approximation

We use DQN [1] to estimate the optimal Q-function by a function approximator $Q$, parameterized by $\theta^{Q}$. It is trained in an offline fashion on minibatches sampled from a fixed replay buffer with transitions collected by a driver policy $\hat{\pi}$. As loss, we use $L(\theta^{Q}) = \frac{1}{b} \sum_i (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$ with targets $y_i = r_i + \gamma \max_a Q'(s_{i+1}, a \mid \theta^{Q'})$, where $Q'$ is a target network, parameterized by $\theta^{Q'}$, and $(s_i, a_i, s_{i+1}, r_i)|_{0 \leq i \leq b}$ is a randomly sampled minibatch from the replay buffer. For the target network, we use a soft update, i.e. $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau) \theta^{Q'}$ with update step-size $\tau \in [0,1]$. Further, we use a variant of Double-Q-learning [27] which is based on two Q-network pairs and uses the minimum of the predictions for the target calculation, similar as in [28].
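The target, loss and soft update can be sketched in a few lines of plain Python. This is a minimal illustration, not the training code used in the paper: parameters are flat lists of floats, and taking the element-wise minimum of two target networks before the maximization over actions is one plausible reading of the Double-Q variant described above. All function names are hypothetical.

```python
def td_target(reward, gamma, q_next_a, q_next_b):
    """Target y = r + gamma * max_a min(Q'_A(s', a), Q'_B(s', a)).

    q_next_a and q_next_b hold the per-action predictions of two target networks
    for the successor state (assumed combination rule, see the lead-in).
    """
    pessimistic = [min(qa, qb) for qa, qb in zip(q_next_a, q_next_b)]
    return reward + gamma * max(pessimistic)


def mse_loss(targets, predictions):
    """Mean squared error over a minibatch of size b."""
    b = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / b


def soft_update(target_params, online_params, tau):
    """New target parameters: tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


y = td_target(reward=1.0, gamma=0.5, q_next_a=[1.0, 3.0], q_next_b=[2.0, 2.0])
print(y, mse_loss([y], [1.5]), soft_update([0.0], [1.0], tau=0.25))
```

With a small step-size such as the one used later in the experiments, the target parameters move only slightly towards the online parameters in each step.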

III-B Deep Sets

A network $Q_{\mathcal{DS}}$ can be trained to estimate the Q-function for a state representation $s = (X^{\text{dyn}}, x^{\text{static}})$ and action $a$. The representation consists of a static input $x^{\text{static}}$ and a dynamic, variable-length input set $X^{\text{dyn}} = [x^1, .., x^{\text{seq len}}]$, where $x^j|_{1 \leq j \leq \text{seq len}}$ are feature vectors for surrounding vehicles in sensor range. In [12], it was proposed to use Deep Sets to handle this input representation, where the Q-network consists of three network modules $\phi$, $\rho$ and $Q$. The representation of the dynamic input set is computed by $\Psi(X^{\text{dyn}}) = \rho\big(\sum_{x \in X^{\text{dyn}}} \phi(x)\big)$, which makes the Q-function permutation invariant w.r.t. the order of the dynamic input [13]. Static feature representations $x^{\text{static}}$ are fed directly to the Q-module, and the Q-values can be computed by $Q_{\mathcal{DS}} = Q(\Psi(X^{\text{dyn}}) \,||\, x^{\text{static}})$, where $||$ denotes a concatenation of two vectors. The Q-learning algorithm is called DeepSet-Q [12].
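The permutation invariance of the sum pooling can be checked with a tiny example. The sketch below uses single-layer ReLU modules with hand-picked, hypothetical weights in place of the trained networks, and plain Python lists instead of a deep learning framework.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


# Hypothetical weights: phi maps 2 input features to 3, rho maps 3 to 2.
PHI_W, PHI_B = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]], [0.0, 0.0, 0.0]
RHO_W, RHO_B = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]], [0.0, 0.25]


def phi(x):
    return relu(linear(PHI_W, PHI_B, x))


def rho(z):
    return relu(linear(RHO_W, RHO_B, z))


def deep_set_representation(x_dyn):
    """rho applied to the sum of phi(x) over all objects; any list length works."""
    pooled = [0.0] * len(PHI_W)
    for x in x_dyn:
        pooled = [p + e for p, e in zip(pooled, phi(x))]
    return rho(pooled)


def q_module_input(x_dyn, x_static):
    """Concatenate the encoded set with the static features."""
    return deep_set_representation(x_dyn) + list(x_static)


vehicles = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]]
print(deep_set_representation(vehicles))
print(deep_set_representation(list(reversed(vehicles))))  # same result in any order
```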

IV METHODS

IV-A Deep Scene-Sets

To overcome the limitation of DeepSet-Q to one variable-sized list of the same object type, we propose a novel architecture, Deep Scene-Sets, that is able to deal with $K$ input sets $X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}$, where every set has variable length. A combined, permutation invariant representation of all sets can be computed by

$$\Psi(X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}) = \rho\Big(\sum_k \sum_{x \in X^{\text{dyn}_k}} \phi^k(x)\Big),$$

where $1 \leq k \leq K$. The output vectors $\phi^k(\cdot) \in \mathbb{R}^F$ of the neural network modules $\phi^k$ have the same length $F$. We additionally propose to share the parameters of the last layer for the different $\phi$ networks. Then, $\phi^k(\cdot)$ can be seen as a projection of all input objects to the same encoded object space. We combine the encoded objects of different types by the sum (or other permutation invariant pooling operators, such as max) and use the network module $\rho$ to create an encoded scene, which is a fixed-sized vector. The encoded scene is concatenated to $x^{\text{static}}$ and the Q-values can be computed by $Q_{\mathcal{D}} = Q(\Psi(X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}) \,||\, x^{\text{static}})$. We call the corresponding Q-learning algorithm DeepScene-Q, shown in Algorithm 2 (Option 1) and Fig. 1 (a).
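To make the fusion of several object lists concrete, the sketch below encodes two object types with different feature lengths into the same two-dimensional space and sums them. The encoder weights, the object type names and the identity placeholder for ρ are assumptions for illustration; the shared last encoder layer proposed above is not modeled here.

```python
def encode(weights, x):
    """Single-layer ReLU encoder that projects one object into the shared space."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


# Hypothetical encoders: vehicles carry 3 features, lanes carry 2; both map to F = 2.
PHI = {
    "vehicle": [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]],
    "lane": [[0.5, 0.5], [1.0, -1.0]],
}
F = 2


def encode_scene(object_lists, rho=lambda z: z):
    """Sum the encodings of all objects of all types, then apply rho.

    object_lists maps a type name to a (possibly empty) list of feature vectors.
    rho defaults to the identity as a placeholder for a trained network.
    """
    pooled = [0.0] * F
    for kind, objects in object_lists.items():
        for x in objects:
            pooled = [p + e for p, e in zip(pooled, encode(PHI[kind], x))]
    return rho(pooled)


scene = {"vehicle": [[1.0, 2.0, 2.0]], "lane": [[1.0, 1.0], [2.0, 0.0]]}
print(encode_scene(scene))  # fixed-size vector regardless of the list lengths
```

Because every type is first projected to the same length F, the number of lists and the number of objects per list can change between time steps without changing the size of the encoded scene.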

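    initialize $Q_{\mathcal{G}} = (\phi, \rho, H, Q)$ and $Q'_{\mathcal{G}} = (\phi', \rho', H', Q')$, set replay buffer
    for optimization step o = 1, 2, … do
        get minibatch $(s_i, a_i, (X^{\text{dyn}}_{i+1}, x^{\text{static}}_{i+1}), r_{i+1})$ from the replay buffer
        foreach transition do
            foreach object $x^j_{i+1}$ in $X^{\text{dyn}}_{i+1}$ do
                $(\phi'_{i+1})^j = \phi'(x^j_{i+1})$
            compute $H'^{(L)}_{i+1}$ by GCN with $H'^{(0)}_{i+1} = [(\phi'_{i+1})^1, \dots, (\phi'_{i+1})^{\text{seq len}}]$
            get $\rho'_{i+1} = \rho'\big(\sum_j (H'^{(L)}_{i+1})^j\big)$
            $y_i = r_{i+1} + \gamma \max_a Q'(\rho'_{i+1}, x^{\text{static}}_{i+1}, a)$
        perform a gradient step on loss: $\frac{1}{b} \sum_i (Q_{\mathcal{G}}(s_i, a_i) - y_i)^2$
        update target network by: $\theta^{Q'_{\mathcal{G}}} \leftarrow \tau \theta^{Q_{\mathcal{G}}} + (1-\tau) \theta^{Q'_{\mathcal{G}}}$
Algorithm 1: Graph-Q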

IV-B Graphs

In the Deep Set architecture, relations between vehicles are not explicitly modeled and have to be inferred in $\rho$. We extend this approach by using Graph Networks, considering graphs as input. Graph Convolutional Networks (GCNs) [14] operate on graphs defined by a set of node features $X^{\text{dyn}} = [x^1, .., x^{\text{seq len}}]$ and a set of edges represented by an adjacency matrix $A$. The propagation rule of the GCN is $H^{(l)} = \sigma\big(D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}} H^{(l-1)} W^{(l-1)}\big)$ with $1 \leq l \leq L$, where we set $H^{(0)} = [\phi(x^1), \dots, \phi(x^{\text{seq len}})]$ using an encoder module similar as in the Deep Sets approach. $\tilde{A} \in \mathbb{R}^{N \times N}$ is an adjacency matrix with added self-connections, $D_{i,i} = \sum_j \tilde{A}_{i,j}$, $\sigma$ the activation function, $H^{(l)} \in \mathbb{R}^{N \times F}$ hidden layer activations and $W^{(l)}$ the learnable matrix of the $l$-th layer. The dynamic input representation can be computed from the last layer $L$ of the GCN: $\Psi(X^{\text{dyn}}) = \rho\big(\sum_{x \in X^{\text{dyn}}} H^{(L)}\big)$, i.e. the node activations of the last layer are summed over all nodes, where $\phi$ is a neural network and the output vector $\phi(\cdot) \in \mathbb{R}^F$ has length $F$. The Q-values can be computed by $Q_{\mathcal{G}} = Q(\Psi(X^{\text{dyn}}) \,||\, x^{\text{static}})$. We call the corresponding Q-learning algorithm Graph-Q, see Algorithm 1.
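The propagation rule can be traced by hand on a very small graph. The following sketch implements one GCN layer with dense Python lists, assuming ReLU as the activation σ; the function names are hypothetical, and a practical implementation would use a sparse graph library instead.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W) on dense nested lists.

    adjacency: N x N edge weights without self-connections
    features:  N x F_in node features H
    weights:   F_in x F_out matrix W
    """
    n = len(adjacency)
    # Add self-connections, then compute the degree of every node.
    a_tilde = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    # Symmetric normalization: entry (i, j) is divided by sqrt(d_i * d_j).
    norm = [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    f_in, f_out = len(weights), len(weights[0])
    aggregated = [[sum(norm[i][k] * features[k][c] for k in range(n))
                   for c in range(f_in)] for i in range(n)]
    return [[max(0.0, sum(aggregated[i][c] * weights[c][o] for c in range(f_in)))
             for o in range(f_out)] for i in range(n)]


def sum_pool(node_activations):
    """Permutation invariant pooling: sum every feature column over all nodes."""
    return [sum(column) for column in zip(*node_activations)]


# Two connected vehicles with one encoded feature each and a 1 x 1 weight matrix.
h = gcn_layer([[0.0, 1.0], [1.0, 0.0]], [[2.0], [4.0]], [[1.0]])
print(h, sum_pool(h))
```

In this two-node example every normalized entry equals 0.5, so both nodes end up with the average of the two input features. A node without neighbours keeps its own feature, which corresponds to the Deep Sets case without modeled interactions.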

IV-C Deep Scene-Graphs

The graph representation can be extended to deal with multiple variable-length lists of different object types $X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}$ by using $K$ encoder networks. As node features, we use $H^{(0)} = [\Phi^1, \dots, \Phi^K]$ and $\Phi^k = [\phi^k(x^1), \dots, \phi^k(x^{\text{seq len}_k})]$ for $1 \leq k \leq K$, and compute the dynamic input representation from the last layer of the GCN:

$$\Psi(X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}) = \rho\Big(\sum_k \sum_{x \in X^{\text{dyn}_k}} H^{(L)}\Big),$$

with $1 \leq k \leq K$. Similar to the Deep Scene-Sets architecture, $\phi^k$ are neural network modules with output vector length $F$ and parameter sharing in the last layer. To create a fixed vector representation, we combine all node features by the sum into an encoded scene. The Q-values can be computed by $Q_{\mathcal{D}} = Q(\Psi(X^{\text{dyn}_1}, \dots, X^{\text{dyn}_K}) \,||\, x^{\text{static}})$. This module can replace the Deep Scene-Sets module in DeepScene-Q as shown in Algorithm 2 (Option 2) and in Fig. 1 (b).

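    initialize $Q_{\mathcal{D}} = (\phi^1, \dots, \phi^K, \rho, H, Q)$ and $Q'_{\mathcal{D}} = (\phi'^1, \dots, \phi'^K, \rho', H', Q')$, set replay buffer
    for optimization step o = 1, 2, … do
        get minibatch $(s_i, a_i, (X^{\text{dyn}_1}_{i+1}, \dots, X^{\text{dyn}_K}_{i+1}, x^{\text{static}}_{i+1}), r_{i+1})$ from the replay buffer
        foreach transition do
            foreach object type $k \in (1, \dots, K)$ do
                foreach object $x^j_{i+1}$ in $X^{\text{dyn}_k}_{i+1}$ do
                    $(\phi'^k_{i+1})^j = \phi'^k(x^j_{i+1})$
            Set (Option 1):
                get $\rho'_{i+1} = \rho'\big(\sum_k \sum_j (\phi'^k_{i+1})^j\big)$
            Graph (Option 2):
                compute $H'^{(L)}_{i+1}$ by GCN with $H'^{(0)}_{i+1} = [\Phi^1, \dots, \Phi^K]$ and $\Phi^k = [(\phi'^k_{i+1})^1, \dots, (\phi'^k_{i+1})^{\text{seq len}_k}]$
                get $\rho'_{i+1} = \rho'\big(\sum_k \sum_j (H'^{(L)}_{i+1})^j\big)$
            $y_i = r_{i+1} + \gamma \max_a Q'(\rho'_{i+1}, x^{\text{static}}_{i+1}, a)$
        perform a gradient step on loss and update target network as in Algorithm 1.
Algorithm 2: DeepScene-Q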

IV-D Graph Construction

We propose two different strategies to construct bidirectional edge connections between vehicles for Graphs and Deep Scene-Graphs representations:

  1. Close agent connections: Connect the agent vehicle to its direct leader and follower in its own and the left and right neighboring lanes ($6 \cdot 2$ edges).

  2. All close vehicles connections: Connect all vehicles to their leader and follower in their own and the left and right lanes ($K \cdot 6 \cdot 2$ edges for $K$ surrounding vehicles).

Edge weights are computed by the inverse absolute distance between two vehicles, as shown in [23]. A fully-connected graph is avoided due to computational complexity.
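Both construction strategies can be sketched with one helper that finds the leader and follower of a vehicle in its own and the two adjacent lanes. The vehicle representation as `(lane index, longitudinal position)` tuples, the function names and the example positions are assumptions; the sketch ignores the wrap-around of a circular track, and a vehicle at exactly the same longitudinal position counts as neither leader nor follower.

```python
def neighbours(idx, vehicles):
    """Closest leader and follower of vehicle idx in its own, left and right lane."""
    lane, pos = vehicles[idx]
    found = []
    for target_lane in (lane - 1, lane, lane + 1):
        ahead = [(p - pos, j) for j, (l, p) in enumerate(vehicles)
                 if j != idx and l == target_lane and p > pos]
        behind = [(pos - p, j) for j, (l, p) in enumerate(vehicles)
                  if j != idx and l == target_lane and p < pos]
        if ahead:
            found.append(min(ahead)[1])
        if behind:
            found.append(min(behind)[1])
    return found


def build_edges(vehicles, sources):
    """Bidirectional edges from every source vehicle to its neighbours.

    Returns a dict (i, j) -> weight with weight = 1 / |longitudinal distance|.
    """
    edges = {}
    for i in sources:
        for j in neighbours(i, vehicles):
            weight = 1.0 / abs(vehicles[i][1] - vehicles[j][1])
            edges[(i, j)] = weight
            edges[(j, i)] = weight
    return edges


# Index 0 is the agent; entries are (lane index, longitudinal position in m).
vehicles = [(1, 0.0), (1, 10.0), (1, 30.0), (0, -5.0), (2, 20.0)]
close_agent = build_edges(vehicles, sources=[0])                  # strategy 1
all_close = build_edges(vehicles, sources=range(len(vehicles)))   # strategy 2
print(len(close_agent), len(all_close))
```

The first strategy yields at most six neighbours for the agent, each connected in both directions, while the second repeats the same search for every vehicle and therefore always contains the edges of the first.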

IV-E MDP Formulation

The feature representations of the surrounding cars and lanes are shown in Section V-B. The action space $\mathcal{A}$ consists of a discrete set of three possible actions in lateral direction: keep lane, left lane-change and right lane-change. Acceleration and collision avoidance are controlled by low-level controllers that are fixed and not updated during training. Maintaining a safe distance to the preceding vehicle is handled by an integrated safety module, as proposed in [11, 5]. If the chosen lane-change action is not safe, the agent keeps the lane. The reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is defined as: $r(s,a) = 1 - \frac{|v_{\text{current}}(s) - v_{\text{desired}}(s)|}{v_{\text{desired}}(s)} - p_{lc}(a)$, where $v_{\text{current}}$ and $v_{\text{desired}}$ are the actual and desired velocity of the agent, and $p_{lc}$ is a penalty for choosing a lane-change action, which keeps the number of lane changes low for additional comfort.
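The reward is simple enough to write down directly. In the sketch below, the action names and the size of the lane-change penalty are hypothetical, since the text does not state the value of the penalty.

```python
LANE_CHANGE_ACTIONS = ("left", "right")


def reward(v_current, v_desired, action, lane_change_penalty=0.01):
    """r(s, a) = 1 - |v_current - v_desired| / v_desired - p_lc(a).

    lane_change_penalty is a hypothetical value; it only applies to the
    two lane-change actions and is zero for keeping the lane.
    """
    p_lc = lane_change_penalty if action in LANE_CHANGE_ACTIONS else 0.0
    return 1.0 - abs(v_current - v_desired) / v_desired - p_lc


print(reward(10.0, 10.0, "keep"))   # driving at the desired velocity
print(reward(5.0, 10.0, "keep"))    # driving at half the desired velocity
print(reward(10.0, 10.0, "left", lane_change_penalty=0.25))
```

The reward is at most 1, reached when the agent keeps its lane at exactly the desired velocity, and decreases linearly with the relative velocity deviation.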

Driver Type         | maxSpeed | lcCooperative | accel/decel | length       | lcSpeedGain
agent driver        | 10       | -             | 2.6/4.5     | 4.5          | -
passenger drivers 1 | 𝒰(8,12)  | 0.2           | 2.6/4.5     | 𝒰(4,5)       | 𝒰(5,10)
passenger drivers 2 | 𝒰(5,9)   | 1.0           | 2.6/4.5     | 𝒰(4,5)       | 𝒰(5,10)
passenger drivers 3 | 𝒰(3,7)   | 0.8           | 2.6/4.5     | 𝒰(4,5)       | 𝒰(5,10)
truck drivers       | 𝒰(2,4)   | 0.4           | 1.3/2.25    | 𝒰(9.5,14.5)  | 𝒰(0,3)
motorcycle drivers  | 𝒰(7,11)  | 0.2           | 3.0/5.0     | 𝒰(2,3)       | 𝒰(15,20)
TABLE I: SUMO parameters for different driver types. In each scenario, trucks and motorcycles are sampled with 10% and 5% probability, passenger cars and their driver types are sampled uniformly for the remaining number of vehicles.

V EXPERIMENTAL SETUP

We use the open-source SUMO [29] traffic simulation to learn lane-change maneuvers.

V-A Scenarios

Highway

To evaluate and show the advantages of Graph-Q, we use the 1000m circular highway environment shown in [12] with three continuous lanes and one object class (passenger cars). To train our agents, we used a dataset with 500,000 transitions.

Fast Lanes

To evaluate the performance of DeepScene-Q, we use a more complex scenario with a variable number of lanes, shown in Fig. 2. It consists of a 1000m circular highway with three continuous lanes and additional fast lanes in two 250m sections. At the end of lanes, vehicles slow down and stop until they can merge into an ongoing lane. The agent receives information about additional lanes in the form of traffic signs starting 200m before every lane start or end. Further, different vehicle types are included, i.e. cars, trucks and motorcycles with different lengths and behaviors. For simplicity, we use the same feature representation for all vehicle classes. As dataset, we collected 500,000 transitions in the same manner as for the Highway environment.

Fig. 2: Fast Lanes scenario in SUMO. The agent (blue) is overtaking other vehicles (red) on the fast lane and has to merge before the lane ends.

V-B Input Features

In the Highway scenario, we use the same input features as proposed in [12]. For the Fast Lanes scenario, the input features used for vehicle $i$ are:

  • relative distance: $dr_i = (p_i - p_{\text{agent}}) / d_{\text{max}}$,
    where $p_{\text{agent}}$, $p_i$ are longitudinal positions in a curvilinear coordinate system of the lane.

  • relative velocity: $dv_i = (v_i - v_{\text{agent}}) / v_{\text{allowed}}$

  • relative lane index: $dl_i = l_i - l_{\text{agent}}$,
    where $l_i$, $l_{\text{agent}}$ are lane indices.

  • vehicle length: $len_i / 10.0$

The state representation for lane $j$ is:

  • lane start and end: distances (km) to lane start and end

  • lane valid: lane currently passable

  • relative lane index: $dl_j = l_j - l_{\text{agent}}$,
    where $l_j$, $l_{\text{agent}}$ are lane indices.

For the agent, the normalized velocity $v_{\text{current}} / v_{\text{desired}}$ is included, where $v_{\text{current}}$ and $v_{\text{desired}}$ are the current and desired velocity of the agent. Passenger cars, trucks and motorcycles use the same feature representation. When the agent reaches a traffic sign indicating a starting (ending) lane, the lane features get updated until the start (end) of the lane.
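The vehicle features listed above can be computed as in the following sketch. The dictionary keys, the function names and the numeric values in the example are assumptions for illustration; the sensor range of 80m is taken from the training setup, while the allowed velocity has to be supplied by the caller because no value is given in the text. Positions are treated as plain longitudinal coordinates, without the wrap-around of a circular track.

```python
def vehicle_features(vehicle, agent, v_allowed, d_max=80.0):
    """Feature vector [dr, dv, dl, normalized length] of one surrounding vehicle."""
    dr = (vehicle["pos"] - agent["pos"]) / d_max       # relative distance
    dv = (vehicle["vel"] - agent["vel"]) / v_allowed   # relative velocity
    dl = float(vehicle["lane"] - agent["lane"])        # relative lane index
    return [dr, dv, dl, vehicle["length"] / 10.0]


def in_sensor_range(vehicle, agent, d_max=80.0):
    """Only vehicles within d_max of the agent become part of the dynamic input."""
    return abs(vehicle["pos"] - agent["pos"]) <= d_max


agent = {"pos": 0.0, "vel": 10.0, "lane": 1}
others = [
    {"pos": 40.0, "vel": 8.0, "lane": 2, "length": 4.5},
    {"pos": 200.0, "vel": 9.0, "lane": 1, "length": 12.0},  # outside sensor range
]
x_dyn = [vehicle_features(v, agent, v_allowed=10.0)
         for v in others if in_sensor_range(v, agent)]
print(x_dyn)
```

The resulting list has one entry per visible vehicle and therefore changes its length from step to step, which is the variable-sized input the set and graph architectures are designed for.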

V-C Training & Evaluation Setup

All agents are trained off-policy on datasets collected by a rule-based agent with the integrated SUMO safety module enabled, performing random lane changes to the left or right whenever possible. For training, traffic scenarios with a random number of $n \in (30, 60)$ vehicles for Highway and with $n \in (30, 90)$ vehicles for Fast Lanes are used. Evaluation scenarios vary in the number of vehicles $n \in (30, 35, \dots, 90)$. For each fixed $n$, we evaluate 20 scenarios with different a priori randomly sampled positions and driver types for each vehicle, to smooth the high variance.

In SUMO, we set the time step length to 0.5s. The action step length of the reinforcement learning agents is 2s and the lane change duration is 2s. Desired time headway $\tau$ and minimum gap are 0.5s and 2m. All vehicles have no desire to keep right (lcKeepRight=0.0). The sensor range of the agent is $d_{\text{max}} = 80$m. LC2013 is used as lane-change controller for all other vehicles. To simulate traffic conditions as realistically as possible, different driver types are used with parameters shown in Table I.

Social CNN        | VBIN              | GCN
Input(B×80×5)     | Input(B×15)       | Input(B×seq×3)
ϕ: FC(20), FC(80) | ϕ: FC(20), FC(80) | ϕ: FC(20), FC(80)
16×Conv2D(3×1)    | concat(·)         | 1×GCN(80)
32×Conv2D(3×1)    | ρ: FC(80), FC(20) | sum(·)
(all columns) concat(·, Input(B×3))
(all columns) FC(100)*, FC(100), Linear(3)

Deep Scene-Sets              | Deep Scene-Graphs
(both columns) Input(B×seq0×4) and Input(B×seq1×4)
ϕ0: FC(20), FC(80), FC(80)** | ϕ0: FC(20), FC(80), FC(80)**
ϕ1: FC(20), FC(80), FC(80)** | ϕ1: FC(20), FC(80), FC(80)**
sum(·)                       | 1×GCN(80)
ρ: FC(80), FC(80)            | sum(·)
(both columns) concat(·, Input(B×3))
(both columns) FC(100), FC(100), Linear(3)
TABLE II: Network architectures. FC(·) are fully-connected layers. The CNN uses strides of (2×1). (*) For VBIN FC(200). (**) Parameters of the last layers are shared.

V-D Comparative Analysis

Each network is trained with a batch size of 64 and optimized by Adam [30] with a learning rate of $10^{-4}$. As activation function, we use Rectified Linear Units (ReLU) in all hidden layers of all architectures. The target networks are updated with a step-size of $\tau = 10^{-4}$. All network architectures, including the baselines, were optimized using Random Search with the same budget of 20 training runs. We preferred Random Search over Grid Search, since it has been shown to result in better performance using budgets in this range [31]. The Deep Sets architecture and hyperparameter-optimized settings for all encoder networks are used from [12]. The network architectures are shown in Table II. Graph-Q is compared to two other interaction-aware Q-learning algorithms that use input modules originally proposed for supervised vehicle trajectory prediction. To support our architecture choices for the Deep Scene-Sets, we compare to a modification with separate $\rho$ networks. Since we do not focus on including temporal context, we adapt recurrent layers to fully-connected layers in all baselines. We use the following baselines:

Rule-Based Controller

Naive, rule-based agent controller, that uses the SUMO lane change model LC2013.

Convolutional Social Pooling (SocialCNN)

In [25], a social tensor is created by learning latent vectors of all cars by an encoder network and projecting them to a grid map in order to learn spatial dependencies.

Vehicle Behaviour Interaction Networks (VBIN)

In [24], instead of summarizing the output vectors as in the Deep Sets approach, the vectors are concatenated, which results in a limitation to a fixed number of cars. We consider the 6 vehicles surrounding the agent (leader and follower on own, left and right lane).

Multiple ρ-networks

Deep Scene architecture where all object types are processed separately by using $K$ different $\rho$-network modules. The $K$ resulting output vectors are concatenated as $\big[\rho^1\big(\sum_{x \in X^{\text{dyn}_1}} \phi^1(x)\big), \dots, \rho^K\big(\sum_{x \in X^{\text{dyn}_K}} \phi^K(x)\big)\big]$ and fed into the Q-network module.

V-E Implementation Details & Hyperparameter Optimization

All networks were trained for $1.25 \cdot 10^6$ optimization steps. The Random Search configuration space is shown in Table III. For all approaches except VBIN, we used the same $\phi$ and $Q$ architectures. Due to stability issues, we adapted these parameters for VBIN. For SocialCNN, we used the optimized grid from [12] with a size of 80×5. The GCN architectures were implemented using the PyTorch Geometric library [32].

Architecture      | Parameter               | Configuration Space
Encoders          | ϕ: num layers           | 1, 2, 3
                  | ϕ: hidden/output dims   | 5, 20, 80, 100
Deep Sets         | ρ: num layers           | 1, 2, 3
                  | ρ: hidden/output dims   | 5, 20, 100
GCN               | num GCN layers          | 1, 2, 3
                  | hidden and output dim   | 20, 80
                  | use edge weights        | True, False
SocialCNN         | CONV: num layers        | 2, 3
                  | kernel sizes            | ([7,3,2],[2,1])
                  | strides                 | ([2,1],[2,1])
                  | filters                 | 8, 16, 32
VBIN              | ϕ: output dim           | 20, 80
                  | ρ: hidden dim           | 20, 80, 160, 200
                  | Q: hidden dim           | 100, 200
Deep Scene-Sets   | ρ: output dim           | 20, 80
                  | shared parameters       | True, False
Deep Scene-Graphs | use ρ network           | True, False
                  | ρ: output dim           | 20, 80
                  | shared parameters       | True, False
TABLE III: Random Search configuration space. For every architecture, we sampled 20 configurations to find the best setting.
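Sampling from such a configuration space takes only a few lines. The sketch below draws 20 configurations for the GCN architecture from the encoder and GCN rows of Table III; the dictionary keys are hypothetical names for the table rows, and the seed is fixed only to make the example reproducible.

```python
import random

# Encoder and GCN rows of the configuration space (key names are hypothetical).
GCN_SPACE = {
    "phi_num_layers": [1, 2, 3],
    "phi_dims": [5, 20, 80, 100],
    "num_gcn_layers": [1, 2, 3],
    "gcn_dim": [20, 80],
    "use_edge_weights": [True, False],
}


def sample_configurations(space, budget, seed=0):
    """Draw `budget` configurations, choosing every parameter uniformly at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(budget)]


configs = sample_configurations(GCN_SPACE, budget=20)
print(configs[0])
```

Each sampled configuration would then be trained and evaluated once, and the best-performing setting kept.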
Fig. 3: Mean performance and standard deviation in the Highway scenario over 10 training runs for Graph-Q with all close vehicle connections, the Deep Sets [12] and two other interaction-aware Q-function input modules (left), and Graph-Q using the two proposed graph construction strategies (right). The number of vehicles indicates the traffic intensity, from light to dense traffic.
Fig. 4: Mean performance and standard deviation in the Fast Lanes scenario over 10 training runs for Deep Scene-Sets, Deep Scene-Graphs and the rule-based controller from SUMO (left), and different architecture choices of the Deep Scenes (right). The number of vehicles indicates the traffic intensity.

VI RESULTS

The results for the Highway scenario are shown in Fig. 3. Graph-Q using the GCN input representation (with all close vehicle connections) outperforms VBIN and Social CNN. Further, the GCN input module yields a better performance compared to Deep Sets in all scenarios except very light traffic with rare interactions between vehicles. While the Social CNN architecture has a high variance, VBIN shows a better and more robust performance and also outperforms the Deep Sets architecture in high traffic scenarios. This underlines the importance of interaction-aware network modules for autonomous driving, especially in urban scenarios. However, VBIN is still limited to fixed-sized input, and additional gains can be achieved by combining both variable input and interaction-aware methods as in Graph Networks. To verify that the shown performance increases are significant, we performed a t-test, using the 90-car scenarios as an example:

  • Independence of the mean performances of DeepSet-Q and Graph-Q is highly significant (<0.001) with a p-value of 0.0011.

  • Independence of the mean performances between Graph-Q and VBIN is significant (<0.1) with a p-value of 0.0848. Graph-Q is additionally more flexible and can consider a variable number of surrounding vehicles.

Fig. 3 (right) shows the performance of the two graph construction strategies. A graph built with connections for all close vehicles outperforms a graph built with close agent connections only. However, the performance increase is only slight, which indicates that interactions with the direct neighbors of the agent are most important.

The evaluation results for Fast Lanes are shown in Fig. 4 (left). The vehicles controlled by the rule-based controller rarely use the fast lane. In contrast, our agent learns to drive on the fast lane as much as possible (39.0% of the driving time). We assume that the Deep Scene-Sets slightly outperform the Deep Scene-Graphs because the agent has to deal with fewer interactions than in the Highway scenario. Finally, we compare Deep Scene-Sets to a basic Deep Sets architecture with a fixed feature representation. Using the exact same lane features (if necessary filled with dummy values), both architectures show similar performance. However, the performance collapse for the Deep Sets agent considering only its own, left and right lane shows that the ability to deal with an arbitrary number of lanes (or other object types) can be very important in certain situations. Due to its limited lane representation, the Deep Sets (closest lanes) agent is not able to see the fast lane and is thus significantly slower. Fig. 4 (right) shows an ablation study, comparing the performance of the Deep Scene-Sets with and without shared parameters in the last layer of the encoder networks. Using shared parameters in the last layer leads to a slight increase in robustness and performance, and outperforms the architecture with separate $\rho$ networks.

VII CONCLUSION

In this paper, we propose Graph-Q and DeepScene-Q, interaction-aware reinforcement learning algorithms that can deal with variable input sizes and multiple object types in the problem of high-level decision making for autonomous driving. We showed that interaction-aware neural networks, and among them especially GCNs, can boost the performance in dense traffic situations. The Deep Scene architecture overcomes the limitation of fixed-sized inputs and can deal with multiple object types by projecting them into the same encoded object space. The ability to deal with objects of different types is necessary especially in urban environments. In the future, this approach could be extended by devising algorithms that dynamically adapt the graph structure of GCNs to the current traffic conditions. Based on our results, it would be promising to omit graph edges in light traffic, essentially falling back to the Deep Sets approach, while it is beneficial to model more interactions with increasing traffic density.

References

  • [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [3] M. Watter, J. T. Springenberg, J. Boedecker, and M. A. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2746–2754.
  • [4] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research, vol. 17, pp. 39:1–39:40, 2016.
  • [5] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning,” 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2156–2162, 2018.
  • [6] P. Wolf, K. Kurzer, T. Wingert, F. Kuhnt, and J. M. Zöllner, “Adaptive behavior generation for autonomous driving using deep reinforcement learning with compact semantic states,” CoRR, vol. abs/1809.03214, 2018. [Online]. Available: http://arxiv.org/abs/1809.03214
  • [7] B. Mirchevska, M. Blum, L. Louis, J. Boedecker, and M. Werling, “Reinforcement learning for autonomous maneuvering in highway scenarios.” 11. Workshop Fahrerassistenzsysteme und automatisiertes Fahren.
  • [8] M. Nosrati, E. A. Abolfathi, M. Elmahgiubi, P. Yadmellat, J. Luo, Y. Zhang, H. Yao, H. Zhang, and A. Jamil, “Towards practical hierarchical reinforcement learning for multi-lane autonomous driving,” 2018 NIPS MLITS Workshop, 2018.
  • [9] M. Kaushik, V. Prasad, M. Krishna, and B. Ravindran, “Overtaking maneuvers in simulated highway driving using deep reinforcement learning,” 06 2018, pp. 1885–1890.
  • [10] M. Mukadam, A. Cosgun, and K. Fujimura, “Tactical decision making for lane changing with deep reinforcement learning,” NIPS Workshop on Machine Learning for Intelligent Transportation Systems, 2017.
  • [11] L. Fridman, B. Jenik, and J. Terwilliger, “DeepTraffic: Driving Fast through Dense Traffic with Deep Reinforcement Learning,” arXiv e-prints, p. arXiv:1801.02805, Jan. 2018.
  • [12] M. Huegle, G. Kalweit, B. Mirchevska, M. Werling, and J. Boedecker, “Dynamic input for deep reinforcement learning in autonomous driving,” IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.
  • [13] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.   Curran Associates, Inc., 2017, pp. 3391–3401. [Online]. Available: http://papers.nips.cc/paper/6931-deep-sets.pdf
  • [14] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” CoRR, vol. abs/1609.02907, 2016. [Online]. Available: http://arxiv.org/abs/1609.02907
  • [15] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” Trans. Neur. Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1109/TNN.2008.2005605
  • [16] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” CoRR, vol. abs/1812.08434, 2018. [Online]. Available: http://arxiv.org/abs/1812.08434
  • [17] A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. A. Riedmiller, R. Hadsell, and P. Battaglia, “Graph networks as learnable physics engines for inference and control,” CoRR, vol. abs/1806.01242, 2018. [Online]. Available: http://arxiv.org/abs/1806.01242
  • [18] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu, “Interaction networks for learning about objects, relations and physics,” CoRR, vol. abs/1612.00222, 2016. [Online]. Available: http://arxiv.org/abs/1612.00222
  • [19] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. F. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, Ç. Gülçehre, F. Song, A. J. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, and R. Pascanu, “Relational inductive biases, deep learning, and graph networks,” CoRR, vol. abs/1806.01261, 2018. [Online]. Available: http://arxiv.org/abs/1806.01261
  • [20] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song, “Learning combinatorial optimization algorithms over graphs,” CoRR, vol. abs/1704.01665, 2017. [Online]. Available: http://arxiv.org/abs/1704.01665
  • [21] J. B. Hamrick, K. R. Allen, V. Bapst, T. Zhu, K. R. McKee, J. B. Tenenbaum, and P. W. Battaglia, “Relational inductive bias for physical construction in humans and machines,” CoRR, vol. abs/1806.01203, 2018. [Online]. Available: http://arxiv.org/abs/1806.01203
  • [22] J. Jiang, C. Dun, and Z. Lu, “Graph convolutional reinforcement learning for multi-agent cooperation,” CoRR, vol. abs/1810.09202, 2018. [Online]. Available: http://arxiv.org/abs/1810.09202
  • [23] F. Diehl, T. Brunner, M. Truong-Le, and A. Knoll, “Graph neural networks for modelling traffic participant interaction,” CoRR, vol. abs/1903.01254, 2019. [Online]. Available: http://arxiv.org/abs/1903.01254
  • [24] W. Ding, J. Chen, and S. Shen, “Predicting vehicle behaviors over an extended horizon using behavior interaction network,” CoRR, vol. abs/1903.00848, 2019. [Online]. Available: http://arxiv.org/abs/1903.00848
  • [25] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” CoRR, vol. abs/1805.06771, 2018. [Online]. Available: http://arxiv.org/abs/1805.06771
  • [26] C. J. C. H. Watkins and P. Dayan, “Q-learning,” in Machine Learning, 1992, pp. 279–292.
  • [27] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461
  • [28] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2018, pp. 1582–1591. [Online]. Available: http://proceedings.mlr.press/v80/fujimoto18a.html
  • [29] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker-Walz, “Recent development and applications of sumo - simulation of urban mobility,” International Journal On Advances in Systems and Measurements, vol. 3&4, 12 2012.
  • [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
  • [31] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, pp. 281–305, Feb. 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2188385.2188395
  • [32] M. Fey and J. E. Lenssen, “Fast graph representation learning with pytorch geometric,” CoRR, vol. abs/1903.02428, 2019. [Online]. Available: http://arxiv.org/abs/1903.02428