Abstract
Autonomous driving decisionmaking is a great challenge due to the complexityand uncertainty of the traffic environment. Combined with the rulebasedconstraints, a Deep QNetwork (DQN) based method is applied for autonomousdriving lane change decisionmaking task in this study. Through the combinationof highlevel lateral decisionmaking and lowlevel rulebased trajectorymodification, a safe and efficient lane change behavior can be achieved. Withthe setting of our state representation and reward function, the trained agentis able to take appropriate actions in a realworldlike simulator. Thegenerated policy is evaluated on the simulator for 10 times, and the resultsdemonstrate that the proposed rulebased DQN method outperforms the rulebasedapproach and the DQN method.
Quick Read (beta)
Lane Change Decisionmaking through Deep Reinforcement Learning with Rulebased Constraints ^{†}^{†}thanks: This work is supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61803371, No. 61573353, No. 61533017 and No. 61873268. This work is also supported by Noah’s Ark Lab, Huawei Technologies.
Abstract
Autonomous driving decisionmaking is a great challenge due to the complexity and uncertainty of the traffic environment. Combined with the rulebased constraints, a Deep QNetwork (DQN) based method is applied for autonomous driving lane change decisionmaking task in this study. Through the combination of highlevel lateral decisionmaking and lowlevel rulebased trajectory modification, a safe and efficient lane change behavior can be achieved. With the setting of our state representation and reward function, the trained agent is able to take appropriate actions in a realworldlike simulator. The generated policy is evaluated on the simulator for 10 times, and the results demonstrate that the proposed rulebased DQN method outperforms the rulebased approach and the DQN method.
I Introduction
IA Background
Autonomous driving is widely regarded as one of the most important ways to alter transportation systems, for example, it has the potential to eliminate traffic accidents that are almost entirely caused by human improper operations [1]. Typically, an autonomous driving system consists of four modules: an environment perception module, a decisionmaking module, a control module, and an actuator mechanism module [2]. The task of the decisionmaking module is to make an appropriate driving decision according to the sensor information from the perception module, and then plan a drivable path to the control module. Reasonable decisionmaking in a variety of complex environments is a great challenge, and it is impossible to enumerate coping strategies in various situations. Therefore, a method that can learn a suitable behavior from its own experiences would be desirable. Over the past few years, deep reinforcement learning (DRL) has made breakthroughs in many fields of artificial intelligence (AI), such as games [3, 4, 5, 6] and autonomous driving [7, 8]. Deep learning has a strong ability in learning of representations and generalization of matching patterns from the raw data like images and videos [9], while reinforcement learning has a good capacity of decisionmaking based on lowdimension features [10]. Therefore, DRL algorithms are very effective in tasks requiring both feature representation and policy learning, e.g., autonomous driving lane change decisionmaking.
IB Related Work
To the best of our knowledge, the methods for the autonomous driving decisionmaking task can be classified into two main categories: rulebased and learningbased. While rulebased methods have achieved some successful applications, the learningbased approaches have also demonstrated their capability in recent years.
Most traditional techniques are based on manually designed rules and typically rely on state machines to switch between predefined decision behaviors [11]. For instance, the “Boss” developed by CMU [12] determined the triggering of lane changing behavior according to certain rules and a preset threshold. Similarly, researchers at Stanford [13] used a series of reward designs and a finite state machine to determine the trajectory and steering instructions of driverless cars. Nevertheless, how to make a reliable decision is still a problem for the rulebased methods owing to their poor ability to generalize to unknown situations [11].
As a core technology of AI, learningbased approaches can provide more advanced and safe decisionmaking algorithms for autonomous driving. In 1989, ALVINN (Autonomous Land Vehicle in a Neural Network) [14] started an early attempt on endtoend driving by training a neural network to map from camera images to the steering angle. Since then, learningbased autonomous driving decisionmaking algorithms have been studied extensively. Recently, NVIDIA researchers [15] trained a deep convolutional neural network (CNN) to map raw images of three frontfacing cameras directly to the steering command in a real vehicle. During training, the images were fed into a CNN after random shift and rotation. Then the network outputted a steering command which was compared to the desired command. And the error between the two steering angles was used to adjust the network parameters through back propagation. The trained model was able to handle the task of lane and road following such as driving on a gravel road.
In addition to endtoend supervised learning, reinforcement learning has also been widely used in the autonomous driving decisionmaking task since it has the capability of dealing with timesequential problems and seeking optimal policies for longterm objectives. Wolf et al. [16] presented a method for teaching an agent to drive a car in a simulation environment by using DQN. The input of the network was the frontfacing camera image with a shape of 48$\times $27, and the action space, i.e. the network output, was discrete—5 actions were defined and each corresponding to a different steering angle. The reward function depended on the distance from the lane center and some relative items (such as the error of the angle between the vehicle and the center line). Hoel et al. [17] also used DQN to deal with the problem of vehicle speed and lane change decisionmaking in a simulation environment. Different from previous work mentioned above, the Qnetwork input in [17] was defined as a onedimensional vector of the relative position, speed, and lane of surrounding vehicles, rather than frontfacing images. Two different agents were defined with the discrete action space. The first agent considered only the lateral lane change control, while the second considered both the lane change and speed control. The result showed that the second agent outperforms the former when CNNs were both utilized. However, how to guarantee the decisions’ safety of the learningbased method should be further considered.
IC Overview
This paper presents the details of a DQN algorithm with the rulebased constraints for autonomous driving lane change decisionmaking in a realworldlike simulation environment. The rest of this paper is constructed as follows. The preliminaries and methodology are introduced in Sect. 2. Next, the used simulator and our MDP formulation are described in Sect. 3. In Sect. 4, we first give an overview of our simulation framework and then explain the setting of relevant parameters and training details. Finally, the discussion and conclusions are given in Sect. 5.
II Methodology
IIA Markov Decision Process and Reinforcement Learning
A Markov decision process (MDP) provides a mathematical framework for modeling decisionmaking in situations where outcomes are partly random and partly under the control of a decision maker. The policy of an MDP has the Markov property, which means that the conditional probability distribution of the future state of a random process depends entirely on the current state when the present state and all the past states are given [18].
An MDP is a 4tuple $M=\u27e8S,A,{P}_{sa},R\u27e9$ [18], where

•
$S$ is a set of states, and ${s}_{i}\in S$ is the state in time step $i$.

•
$A$ is a set of actions, and ${a}_{i}\in A$ is the action in time step $i$.

•
${P}_{sa}$ is the probability that action $a\in A$ in current state $s\in S$ will lead to next state. In particular, $p({s}^{\prime}s,a)$ is the probability that action $a$ in current state $s$ will lead to state ${s}^{\prime}$.

•
$R$ is the reward function. Particularly, $r({s}^{\prime}s,a)$ is the expected immediate reward received after transitioning from state $s$ to state ${s}^{\prime}$, due to action $a$.
The agent interacts with the environment at each of a sequence of discrete time steps, $t=0,1,2,\mathrm{\dots}$. At each time step $t$, the agent perceives the current state ${s}_{t}\in S$ of the environment, and then it selects a doable action ${a}_{t}\in A$ based on the state and executes the action ${a}_{t}$. After the action carries out in the environment, the agent receives a numerical reward ${r}_{t}\in R$ and the next state ${s}_{t+1}$.
The goal of reinforcement learning is to learn an optimal policy which maps from environmental states to agent’s actions by maximizing the cumulative reward of actions taken from the environment [18]. The policy of time step $t$ denoted by ${\pi}_{t}$ maps from environmental states to probabilities of selecting each possible action, and ${\pi}_{t}(as)$ is the probability that ${a}_{t}=a$ if ${s}_{t}=s$. To obtain the optimal policy ${\pi}^{*}$, reinforcement learning typically uses a method of maximizing a total expected return $G$ that is a cumulative sum of immediate rewards $r$ received over the long run. At time step $t$, ${G}_{t}$ is defined as
$${G}_{t}=\sum _{k=0}^{\mathrm{\infty}}{\gamma}^{k}{r}_{t+k},$$  (1) 
where $\gamma \in (0,1]$ is a discount factor.
IIB Qlearning and Deep QNetwork
Qlearning is an offpolicy TD control algorithm in reinforcement learning [19]. Basically, a stateaction value function ${Q}^{\pi}(s,a)$ for policy $\pi $ is defined as
$${Q}^{\pi}(s,a)={\mathbb{E}}_{\pi}\left[\sum _{k=0}^{\mathrm{\infty}}{\gamma}^{k}{r}_{t+k}\right{s}_{t}=s,{a}_{t}=a].$$  (2) 
It is used to evaluate the particular policy $\pi $, and the Qlearning algorithm tries to find the optimal stateaction value function ${Q}^{*}(s,a)$ defined as
$${Q}^{*}(s,a)=\underset{\pi}{max}{Q}^{\pi}(s,a),$$  (3) 
which corresponding to the optimal policy ${\pi}^{*}$ [18].
The value function ${Q}^{*}(s,a)$ follows the Bellman optimality equation
$${Q}^{*}(s,a)=\mathbb{E}\left[r+\gamma \underset{{a}^{\prime}\in A}{max}{Q}^{*}({s}^{\prime},{a}^{\prime})s,a\right].$$  (4) 
Based on the optimal value function ${Q}^{*}(s,a)$, we can determine the optimal policy by finding one or more actions that maximize the value function at each state[18].
When the states are discrete and finite, the Qfunction can be easily formulated in a tabular form. But in many practical applications, for example, lane change decisionmaking task, the state space of them is very large or even continuous, using the Qlearning algorithm will lead to dimension disaster. Therefore, tabular Qlearning algorithm does not applicable to the learning problem of continuous state space and continuous action space.
DQN algorithm proposed in [3] can be used to handle the problem of dimension disaster. In the DQN algorithm, multilayer neural network (i.e. the Qnetwork) is used to approximate the value function. The approximate value function is denoted by $Q(s,a;{\theta}_{i})$, in which ${\theta}_{i}$ are the weights of the Qnetwork at iteration $i$. The agent’s experiences ${e}_{t}=({s}_{t},{a}_{t},{r}_{t},{s}_{t+1})$ at each time step $t$ stored in a data set ${D}_{t}=\{{e}_{1},{e}_{2},\mathrm{\cdots},{e}_{t}\}$ are used to train the Qnetwork. At iteration $i$, a minibatch of experiences $M$ are randomly sampled from ${D}_{t}$ to update the parameters ${\theta}_{i}$ with the loss function
$${L}_{i}({\theta}_{i})={\mathbb{E}}_{M}\left[{\left(r+\gamma \underset{{a}^{\prime}}{max}Q({s}^{\prime},{a}^{\prime};{\theta}_{i}^{})Q(s,a;{\theta}_{i})\right)}^{2}\right],$$  (5) 
where ${\theta}_{i}^{}$ are the network parameters used to compute the target at iteration $i$. The target network parameters ${\theta}_{i}^{}$ are only updated using the Qnetwork parameters ${\theta}_{i}$ every $C$ steps. For computational convenience, the stochastic gradient descent method is commonly used to optimize the loss function.
III Problem Formulation
IIIA Simulator
In this study, we utilize a threelane highway traffic simulator which is proposed by one of Udacity selfdriving car projects for our simulation, see https://github.com/udacity/CarNDPathPlanningProject. A screenshot of the simulator is illustrated in Fig. 1.
Only three lanes on the right side of the road are considered in this paper. The dark car in the middle lane shown in Fig. 1 is controlled by our algorithm, namely, the selfdriving car, and there are up to 12 interference cars in the simulation environment. The localization, speed and sensor fusion data (include other cars’ localization and speed) of the selfdriving car are provided by the simulator. There is also a sparse map list of waypoints around the highway. The maximum speed of the car cannot exceed 50 MPH. The car should try to drive as fast as possible and avoid collisions with other cars. It can try to change lanes safely when there is a front car blocking. Also, the car should not experience total acceleration over 10 m/s${}^{2}$ and jerk greater than 10 m/s${}^{3}$, and it should be able to make one complete loop around the 6946m highway.
IIIB MDP Framework
We regard the autonomous driving decisionmaking process as an MDP, in which the selfdriving car interacts with the environment including surrounding vehicles and lanes. We define the state, action, and reward function for driving policy learning as follows.
1) State
As mentioned, the information offered by the simulator includes the position and speed of the selfdriving car and other cars. Specifically, it includes cars’ $x$, $y$ position in map coordinates and $s$, $d$ position in frenet coordinates, the selfdriving car’s yaw angle in the map and speed in MPH, and the other cars’ $x$, $y$ speed in m/s. In this paper, the $s$, $d$ positions and speeds of each vehicle are used to represent the state of the simulation environment. We unite the unit of speed to MPH and convert traffic conditions to a grid form.
As shown in Fig. 2, we use a $45\times 3$ matrix as the state representation in this study. The whole matrix corresponds to traffic situation within the range of 60 meters in front and 30 meters behind the selfdriving car in the three lanes on the right side of the road. Since cars are approximately 5.5 meters in length and each row in the matrix spans 2 meters longitudinal, one car can occupy 4 cells in extreme cases. Therefore, we fill the 4 cells corresponding an individual car with the car’s normalized speed (within $[0,0.5)$), and the normalized values of speed are positive to selfdriving car (the red block shown in Fig. 2) while other cars’ are negative (the blue blocks shown in Fig. 2). Where there is no car, the corresponding cells are filled with a uniform value (set to $1$ in this study).
2) Action
Decision  Action 

${a}_{0}$  Stay in current lane 
${a}_{1}$  Change lanes to the left 
${a}_{2}$  Change lanes to the right 
When it comes to an autonomousdrivingdecisionmaking problem, both longitudinal speed and lateral lane change decisions need to be considered. In this paper, we focus on lane change decisionmaking with reinforcement learning, and longitudinal speed control is implemented in a rulebased way. We define three actions in this study, and the action space (given in Table I) of the agent is then discrete.
3) Reward
Typically, safety and efficiency are the main concerns for a lane change process. In terms of safety, autonomous cars should be able to avoid collisions while driving, and they also need to travel in prescribed lanes. In order to force the autonomous car to drive only in the prescribed lanes, we limit it to only drive in the three lanes on the right side of the road. That means when a decision to change lanes to the left (${a}_{1}$ in Table I) is made while the car is on the left lane (or ${a}_{2}$ on the right lane), it stays in current lane, but a penalizing reward ${r}_{ch1}=5$ will be given to the agent. To avoid the collisions, a large penalizing reward ${r}_{co}=10$ is then given to the agent if a lane changing decision results in the selfdriving car collides.
For efficiency, the selfdriving car should try to meet the requirements of driving as fast as possible without exceeding the maximum speed limit, and not changing lanes too often. In order to make it travel faster, we define the following reward according to the speed of the car
$${r}_{v}=\lambda \cdot \left(v{v}_{ref}\right),$$  (6) 
where $v$ denotes the car’s average speed in MPH within one decision period since last decisionmaking, and ${v}_{ref}$ is reference speed while $\lambda $ is a normalizing coefficient. Considering the speed of the selfdriving car is between 0 and 50 MPH (less than 50 MPH), we set ${v}_{ref}=25$ and $\lambda =0.04$, so that the value of reward is normalized to the range of $[1,1)$. To avoid the meaningless lane change behavior, a negative reward ${r}_{ch2}=3$ is given if the car chooses to change lanes while there is no car in front of it. In addition, a small negative reward ${r}_{ch3}=0.3$ is added to the basic reward ${r}_{v}$ when lane change happens without collision. In general, our reward function goes as
$$r=\{\begin{array}{cc}{r}_{co}\hfill & \text{collision happens}\hfill \\ {r}_{ch1}\hfill & \text{illegal lane change}\hfill \\ {r}_{ch2}\hfill & \text{invalid lane change}\hfill \\ \lambda \cdot \left(v{v}_{ref}\right)+{r}_{ch3}\hfill & \text{legal lane change}\hfill \\ \lambda \cdot \left(v{v}_{ref}\right)\hfill & \text{normal drive}\hfill \end{array}.$$  (7) 
4) Rulebased constraints
To ensure the absolute security of the lane changing behavior, we add the rulebased constraints based on the planning trajectories and the others’ predicted trajectories. When the highlevel decision maker determines an action, the lowlevel controller can predict the trajectories of the ego car and the surrounding cars in the intend lane based on this action. The trajectory of the ego car is the planning driving trajectory by path planning. And surrounding vehicles’ trajectories are made by assuming they maintain the current speed and keep in the current lane. If there is a moment when the distance between the predicted trajectory of the ego car and that of surrounding vehicles is less than the predefined threshold value, the decision made by the highlevel decision maker is potentially dangerous. This action will be canceled and the ego car stays in the current lane. In other words, when the highlevel decision maker makes a dangerous decision, the lowlevel decision maker will modify it according to the surrounding environment to guarantee safety.
IV Simulation and Results
IVA Simulation Setup
The implementation of our simulation experiment consists of three parts, as shown in Fig. 3. The first part is the simulator (introduced in Section 3), which outputs the environment information and receives a specified path. The second part is controller and planner, which is responsible for controlling speed and planning path. The third part is the DQN algorithm, which is in charge of highlevel lane change decisionmaking.
In this study, we aim at deep reinforcement learning based lateral lane change decisionmaking. In Fig. 3, the middle block representing lowlevel controller shows the rulesbased speed controller in the left half and the path planner in the right half while data processing part is omitted. The longitudinal speed control is implemented by using a rulebased approach, while path planning is implemented by using spline interpolation according to the provided waypoints combined with the potential target points corresponding to the lane change decision result. Simultaneously, the controller also acts as a lowerlevel modifier to revise the higherlevel decisions.
The interaction between the three parts in Fig. 3 is as follows: the simulator provides environmental information including vehicle speed and position to the controller. After preliminary processing (such as calculating reward value) of the information, it transmits the processed information to the decision maker. The decision maker first obtains the grid form state based on this information, and then DQN computes an action according to the current state and returns it to the planner. The planner finally outputs an appropriate path to the simulator according to the action combined with speed control. The following presentation focuses on the decisionmaking part.
IVB Training
1) Architecture of artificial neural network
In recent years, CNNs that inspired by biological processes [20] have become one of the research focuses of many scientific fields, especially in the field of machine learning (ML). Since CNNs avoid the complex preprocessing of images and can receive inputs from the raw images directly, they have been widely used in many visionrelated domains. Benefits from locally connected and sharedweights architecture and translation invariance characteristics, the amount of parameters in the network is vastly reduced. In this study, a CNN architecture is used as our Qnetwork (as shown in Fig. 4). The input2 shown in Fig. 4 is a 3$\times $1 onedimensional vector, defined as
$$\begin{array}{c}{s}_{1}\text{Normalized ego vehicle speed}\hfill \\ {s}_{2}\{\begin{array}{cc}1\hfill & \text{if there is a lane to the left}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}.\hfill \\ {s}_{3}\{\begin{array}{cc}1\hfill & \text{if there is a lane to the right}\hfill \\ 0\hfill & \text{otherwise}\hfill \end{array}\hfill \end{array}$$ 
2) Training details
In terms of reinforcement learning, the experience is of great importance for an agent to learn a good policy. As mentioned in Sec. 2, the agent’s experiences are stored in an experience pool, and during training, random samples from it are used to update the network parameters. Thus, the experience of the agent determines its performance. By applying experience replay, the correlations between the samples are broken and the variance of the updates is reduced [4]. In order to enable the agent to learn more effective policies, we divide the experience pool into two parts, one is to store the experience of keeping in lane action and the other is to store the experience of lane changing action. Then the network is updated by sampling half batch size experiences from each of these two parts.
Since the appropriate training samples are needed, the agent must face the balance between exploration and exploitation. On the one hand, the agent need explores more and discovers some unknown states and actions because some of them may bring a higher reward. On the other hand, it is necessary for the agent to use the knowledge it has learned to choose actions. This will ensure that the experience is effectively used and the convergence of the learning process is accelerated. The tradeoff between exploration and exploitation in this study is handled by following an $\epsilon $greedy policy. The main idea of the $\epsilon $greedy algorithm is to randomly select an action from the action space with the probability of $\epsilon $, and to select the current optimal action according to the greedy method with the probability of $1\epsilon $. When $\epsilon =0$, only the current optimal action is selected, and when $\epsilon =1$, the action is completely randomly selected. In this study, $\epsilon $ is not a constant value, but it will slowly decline with the iteration goes, that is,
$$\epsilon =\mathrm{max}({\epsilon}_{0}\cdot \lambda _{decay}{}^{step},{\epsilon}_{\mathrm{min}}).$$  (8) 
We set ${\epsilon}_{0}=1.0$, ${\lambda}_{decay}=0.99985$ and ${\epsilon}_{\mathrm{min}}=0.03$.
IVC Results and Evaluation
The simulator runs at a realworld time, and it takes about 6 minutes for the ego car to complete a circle in the simulation environment, which is seen as an episode of our training process. And our training process is composed of 100 episodes. When one episode is completed, that is, the ego vehicle returns to the initial point, we restart the simulator for the next episode of training. The average speed and the number of lane changing times of one specific episode during training for the proposed rulebase DQN are shown in Fig. 5.
As can be seen from Fig. 5, the agent generally achieves an ideal behavior. As the training progresses, the average speed of the ego vehicle increases in general while the frequency of lane change reduces significantly. These trends indicate that the agent gradually learned effective lane changing policy during the training process, and achieved faster speed with fewer lane changing behaviors. At the same time, it can be seen that a smaller number of lane changing times usually corresponds to a slower average speed, which may be because the ego car is blocked and cannot change lanes.
We evaluate the performance of the trained rulebased DQN agent by comparing its average speed and average lane changing times to other different approaches, i.e. a random policy with collision avoidance constraints, a rulebased policy and a DQNbased policy. For each method, we test 10 times in the simulation environment and then calculate its average speed, the average number of lane changing times and the safety rate. The agent chooses actions randomly when it is following the randomaction policy. For the rulebased policy, the agent makes a lane change decision when the distance between the ego car and the very front car in its lane is less than 20 meters, and simultaneously the front car in the neighbor lane is further to the ego car. It has a tendency to switch to the left when both sides are available. All four methods use the same longitudinal controller and lowlevel corrector. The results are summarized in Table II, where $\overline{v}$ denotes the average speed and ${c}_{ch}$ denotes the number of lane changing times of one test episode. At the same time, the safety rate is the ratio of the number of test episodes without collisions to the total number of test episodes. In terms of both average $\overline{v}$ and average ${c}_{ch}$, the proposed rulebased DQN is superior to others. In fact, DQN methods can improve the average speed compared to the rulebased methods, and lane changing occurs less frequently for the proposed algorithm. Note that the rulebased DQN policy has a higher safety rate than the DQNbased policy, which can guarantee the safety of the lane changing decision. This suggests that our approach achieves a more efficient and safe policy than the others.
avg $\overline{v}$ (MPH)  avg ${c}_{ch}$  safety rate  

randomaction policy  44.59  152.60  0.6 
rulebased policy  45.22  8.40  0.6 
DQNbased policy  46.16  37.40  0.2 
rulebased DQN policy  46.99  8.80  0.8 
V Conclusion and Future Work
In this paper, we present the details of deep reinforcement learning algorithm with rulebased constraints to handle the problem of highlevel lane change decision making. First, a realworldlike simulation environment is used. Different from other simulation environments, it needs an explicit driving path, and it covers both dense and sparse traffic conditions. Second, through the combination of highlevel lateral decisionmaking and rulebased lowlevel longitudinal control and trajectory modification, safe and efficient driving behavior can be achieved. Third, the rulebased DQN is used to map the representation of the surrounding environment to the lateral decision. Due to the setting of our state representation and reward function, the trained agent is able to take appropriate actions in different scenarios. The generated policy is evaluated on the simulator for 10 times. The results demonstrate the effectiveness of the proposed method.
There still exist some shortcomings in this work that need to be improved in the future. For example, the historical trajectory information of vehicles can be taken into account as part of the state representation. And also, more advanced deep reinforcement learning algorithms can be adopted.
References
 [1] D. J. Fagnant and K. Kockelman. “Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations.” Transportation Research Part A: Policy and Practice 77 (2015): 167181.
 [2] D. Li, et al. “Reinforcement learning and deep learning based lateral control for autonomous driving.” arXiv preprint arXiv:1810.12778 (2018).
 [3] V. Mnih, et al. “Playing atari with deep reinforcement learning.” arXiv preprint arXiv:1312.5602 (2013).
 [4] V. Mnih, et al. “Humanlevel control through deep reinforcement learning.” Nature 518.7540 (2015): 529.
 [5] D. Silver, et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529.7587 (2016): 484.
 [6] D. Silver, et al. “Mastering the game of Go without human knowledge.” Nature 550.7676 (2017): 354.
 [7] D. Zhao, Z. Xia, and Q. Zhang. “Modelfree optimal control based intelligent cruise control with hardwareintheloop demonstration [research frontier].” IEEE Computational Intelligence Magazine 12.2 (2017): 5669.
 [8] A. E. Sallab, et al. “Deep reinforcement learning framework for autonomous driving.” Electronic Imaging 2017.19 (2017): 7076.
 [9] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning.” Nature 521.7553 (2015): 436.
 [10] D. Zhao, et al. “Review of deep reinforcement learning and discussions on the development of computer Go.” Control Theory and Applications 33.6 (2016): 701717.
 [11] W. Schwarting, J. AlonsoMora, and D. Rus. “Planning and decisionmaking for autonomous vehicles.” Annual Review of Control, Robotics, and Autonomous Systems 1 (2018): 187210.
 [12] C. Urmson, et al. “Autonomous driving in urban environments: Boss and the urban challenge.” Journal of Field Robotics 25.8 (2008): 425466.
 [13] M. Montemerlo, et al. “Junior: The stanford entry in the urban challenge.” Journal of field Robotics 25.9 (2008): 569597.
 [14] D. A. Pomerleau. “ALVINN: An autonomous land vehicle in a neural network.” Advances in Neural Information Processing Systems 1989.
 [15] M. Bojarski, et al. “End to end learning for selfdriving cars.” arXiv preprint arXiv:1604.07316 (2016).
 [16] P. Wolf, et al. “Learning how to drive in a real world simulation with deep Qnetworks.” IEEE Intelligent Vehicles Symposium. IEEE, 2017.
 [17] C.J. Hoel, K. Wolff, and L. Laine. “Automated speed and lane change decision making using deep reinforcement learning.” arXiv preprint arXiv:1803.10056 (2018).
 [18] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [19] C. J. Watkins and P. Dayan. “Qlearning.” Machine learning 8.34 (1992): 279292.
 [20] M. Matsugu, et al. “Subject independent facial expression recognition with robust face detection using a convolutional neural network.” Neural Networks 16.56 (2003):555559.