Safe, Efficient, and Comfortable Velocity Control based on Reinforcement Learning for Autonomous Driving
ABSTRACT
A model used for velocity control during car following was proposed based on deep reinforcement learning (RL). To fulfill the dual objectives of imitating human drivers and optimizing driving performance, a reward function was developed by referencing human driving data and combining driving features related to safety, efficiency, and comfort. With this reward function, the RL agent learns to control vehicle speed in a way that maximizes cumulative reward, through trial and error in a simulation environment. A total of 1,341 car-following events extracted from the Next Generation Simulation (NGSIM) dataset were used to train the model. Car-following behavior produced by the model was compared with that observed in the empirical NGSIM data to demonstrate the model’s ability to follow a lead vehicle safely, efficiently, and comfortably. Results show that the model is capable of safe, efficient, and comfortable velocity control in that it 1) has a smaller percentage (8%) of dangerous minimum time-to-collision values (< 5 s) than human drivers in the NGSIM data (35%); 2) can maintain efficient and safe headways in the range of 1 s to 2 s; and 3) can follow the lead vehicle comfortably with smooth acceleration. The results indicate that the proposed approach could contribute to the development of better autonomous driving systems.
Keywords: Car Following, Autonomous Driving, Velocity Control, Reinforcement Learning, NGSIM, Deep Deterministic Policy Gradient (DDPG)
I INTRODUCTION
Car following is the most frequent driving scenario. The main task of car following is controlling vehicle velocity to keep safe and comfortable following gaps. Autonomous car-following velocity control promises to reduce drivers’ workload, improve traffic safety, and increase road capacity [369980:8164317, zhu2018modeling].
Driver models are critical elements of velocity control systems [wang2016drivers, wang2016development]. In general, driver models related to car following have been established with two approaches: rule-based and supervised learning [369980:8164318, feiyue]. The rule-based approach mainly refers to traditional car-following models, such as the Gazis-Herman-Rothery model [369980:8164259] and the intelligent driver model [369980:8164320]. The supervised learning approach relies on data, typically provided through human demonstration, to approximate the relationship between car-following state and acceleration.
Both approaches aim to emulate human drivers’ car-following behavior. However, solely imitating human driving behavior may not be the best solution for autonomous driving. First, human drivers may not drive in an optimal way [chai2015fuzzy]. Second, users may not want their autonomous vehicles to drive the way they do [369980:8164261]. Third, driving should be optimized with respect to safety, efficiency, and comfort, in addition to imitating human drivers.
To address this problem, we propose a car-following model for autonomous velocity control based on deep reinforcement learning (RL). This model not only tries to emulate human drivers but also directly optimizes driving safety, efficiency, and comfort by learning through trial and error in interaction with a simulation environment.
Specifically, the deep deterministic policy gradient (DDPG) algorithm [369980:8164262], which performs well in continuous control, was used to learn an actor network together with a critic network. The actor is responsible for policy generation: outputting following-vehicle accelerations based on speed, relative speed, and spacing. The critic is responsible for policy improvement: updating the actor’s policy parameters in the direction of performance improvement.
To evaluate the proposed model, real-world driving data collected in the Next Generation Simulation (NGSIM) project [369980:8164263] were used to train the model. Car-following behavior simulated by the DDPG model was then compared with that observed in the empirical NGSIM data to demonstrate the model’s ability to follow a leading vehicle safely, efficiently, and comfortably.
II BACKGROUND
IIA Car Following
Car-following models describe the movements of a following vehicle (FV) in response to the actions of the lead vehicle (LV) [zhu2018modeling]. They are essential components of microscopic traffic simulation [brackstone1999car] and serve as theoretical references for autonomous car-following systems [Wei].
Since the first car-following model was proposed in 1953 [pipes1953operational], numerous models have been developed, for example, the Gazis-Herman-Rothery (GHR) model [gazis1961nonlinear], the intelligent driver model (IDM) [treiber2000congested], the optimal velocity model [bando1995dynamical], and the models proposed by Helly [helly1959simulation], Gipps [gipps1981behavioural], and Wiedemann [wiedemann1974simulation]. For a detailed review and the historical development of the subject, see Brackstone and McDonald [brackstone1999car] and Saifuzzaman and Zheng [saifuzzaman2014incorporating].
IIB Reinforcement Learning
Reinforcement learning addresses sequential decision-making problems by letting an RL agent interact with an environment. At time step $t$, the agent observes a state $s_t$ and chooses an action $a_t$ from an action space $A$ according to a policy $\pi(a_t|s_t)$ that maps states to actions. The environment then gives a reward $r_t$ to the agent and transitions to the next state $s_{t+1}$. This process continues until a terminal state is reached, after which the agent restarts. The agent aims to maximize the discounted, accumulated reward $R_t=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$, with the discount factor $\gamma\in(0,1]$ [369980:8164264]. In general, there are two types of RL methods: value-based and policy-based [369980:8164265].
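As a small illustration of the return defined above, the following sketch computes $R_t$ for a finite reward sequence. Truncating the infinite sum at the end of an episode is an assumption made for the example, and the function name is hypothetical; the default discount factor of 0.99 is the value reported later in the hyperparameter table.

```python
def discounted_return(rewards, gamma=0.99):
    """Return R_t = sum_k gamma**k * r_{t+k} for a finite list of rewards."""
    total = 0.0
    # Work backwards so that each step applies R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma=0.5`, three unit rewards give 1 + 0.5 + 0.25 = 1.75.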
IIB1 Value-Based Reinforcement Learning
A value function measures the quality of a state or a state-action pair. The action value $Q^{\pi}(s,a)=E[R_t|s_t=s,a_t=a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$. It represents how good it is to take action $a$ in state $s$. Value-based RL methods aim to estimate the action-value function from historical experience. $Q$-learning is a typical value-based RL method. Beginning with a random $Q$-function, the agent keeps updating its $Q$-values based on the Bellman equation [369980:8164265].
$$Q(s,a)=E\left[r+\gamma \underset{{a}^{\prime}}{\mathrm{max}}Q({s}^{\prime},{a}^{\prime})\right]$$  (1) 
The intuition is that the maximum future reward for state $s$ and action $a$ is the immediate reward $r$ plus the discounted maximum future reward for the next state $s'$. Based on the estimated $Q$-values, the optimal policy is to take the action with the highest $Q(s,a)$ to obtain the maximum expected future reward.
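The Bellman update can be sketched for a small discrete problem as follows. This is a generic tabular $Q$-learning step, not the method used in this study; the step size `alpha` and the dictionary-based table are assumptions introduced for illustration.

```python
from collections import defaultdict


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


def greedy_action(q, s, actions):
    """Pick the action with the highest estimated Q-value in state s."""
    return max(actions, key=lambda a: q[(s, a)])


# Unvisited state-action pairs default to a Q-value of 0.
q_table = defaultdict(float)
```

Because the greedy step takes a maximum over all actions, this form only applies directly when the action set is discrete.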
IIB2 Policy-Based Reinforcement Learning
Unlike value-based methods, policy-based methods try to improve the policy $\pi(a|s;\theta)$ directly, by updating its parameters $\theta$ with gradient ascent on $E[R_t]$. A typical policy-based method is REINFORCE, which updates the policy parameters $\theta$ in the direction $\nabla_{\theta}\log\pi(a_t|s_t;\theta)R_t$ [369980:8164264].
To reduce the variance of policy gradients and increase learning speed, an actor-critic method is usually adopted. An actor-critic algorithm uses two learners: the actor (policy) and the critic (value function). The actor determines which action to take, and the critic tells the actor how good the action was and how it should adjust the policy [369980:8164266].
IIC Deep Reinforcement Learning
Deep reinforcement learning refers to reinforcement learning algorithms that use neural networks to approximate the value function $V(s;\theta)$, the policy $\pi(a|s;\theta)$, or the system model.
IIC1 Deep Q-Network
Instead of computing $Q(s,a)$ for each state-action pair, deep $Q$-learning uses a neural network as a function approximator to estimate the action-value function [369980:8164267]. The action with the maximum $Q(s,a)$ value is selected. Deep Q-networks (DQN) work well with discrete action spaces but fail in continuous action spaces, as in our case. To address this, Lillicrap et al. [369980:8164262] developed an algorithm called deep deterministic policy gradient (DDPG). DDPG adds an actor-critic mechanism to DQN and can be used for continuous control problems.
IIC2 Deep Deterministic Policy Gradient
DDPG uses two separate networks to approximate the actor and critic respectively [369980:8164262]. The critic network with weights $\theta^{Q}$ is responsible for estimating the action-value function $Q(s,a|\theta^{Q})$. The actor network with weights $\theta^{\mu}$ is responsible for explicitly representing the agent’s policy $\mu(s|\theta^{\mu})$. As in DQN, experience replay and target networks are adopted in DDPG to facilitate stable and robust learning.

• Experience replay: A replay buffer is used to avoid learning from sequentially generated, correlated experience samples. The replay buffer is a finite-sized cache $D$ that stores transitions $(s_t,a_t,r_t,s_{t+1})$ sampled from the environment. It is continually updated by replacing old samples with new ones. At each time step, the actor and critic networks are trained on random minibatches of transitions from the replay buffer.

• Target network: Target networks are used to provide target values for the main networks, to avoid divergence of the algorithm [369980:8164267]. Two target networks, $Q'(s,a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$, were created for the main critic and actor networks respectively. They have the same architecture as the main networks but different network parameters $\theta'$. The parameters of the target networks are updated by letting them slowly track the main networks: $\theta'=\tau\theta+(1-\tau)\theta'$ with $\tau\ll 1$. In this way, the target values are constrained to update slowly, greatly enhancing the stability of learning.
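The two mechanisms above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the class and function names are hypothetical, and network parameters are represented as a flat list of floats purely for the example. The default capacity, batch size, and $\tau$ follow the values reported in the hyperparameter table.

```python
import random
from collections import deque


class ReplayBuffer:
    """Finite-sized cache of (s_t, a_t, r_t, s_{t+1}) transitions."""

    def __init__(self, capacity=7000):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling without replacement breaks temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def soft_update(target_params, main_params, tau=0.001):
    """Return theta' = tau * theta + (1 - tau) * theta', element-wise."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target_params, main_params)]
```

With $\tau=0.001$, each call moves the target parameters only 0.1% of the way toward the main parameters, which is what keeps the target values slowly varying.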
The full DDPG algorithm is listed in Algorithm I. It begins with initializing the replay buffer and the actor, critic and corresponding target networks. At each time step, an action $a$ is taken according to the exploratory policy. Then, the reward ${r}_{t}$ and new state ${s}_{t+1}$ are observed and stored in the replay memory D. The critic is trained with minibatches sampled from the replay memory. Afterward, the actor is updated by performing a gradient ascent step on the sampled policy gradient. Finally, the target networks with weights ${\theta}^{{Q}^{\prime}}$ and ${\theta}^{{\mu}^{\prime}}$ are updated to slowly track the actor and critic networks.
Algorithm I: DDPG training procedure
1. Randomly initialize critic $Q(s,a|\theta^{Q})$ and actor $\mu(s|\theta^{\mu})$ networks with weights $\theta^{Q}$ and $\theta^{\mu}$
2. Initialize target networks $Q'(s,a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$ with weights $\theta^{Q'}\leftarrow\theta^{Q}$ and $\theta^{\mu'}\leftarrow\theta^{\mu}$
3. Set up empty replay buffer $D$
4. for episode = 1 to M do
5.     Begin with a random process $N$ for action exploration
6.     Observe initial car-following state: initial gap, follower speed, and relative speed
7.     for t = 1 to T do
8.         Calculate reward $r_t$
9.         Choose follower acceleration $a_t=\mu(s_t|\theta^{\mu})+N_t$ based on current actor network and exploration noise $N_t$
10.         Implement acceleration $a_t$ and transfer to new state $s_{t+1}$ based on kinematic point-mass model
11.         Save transition $(s_t,a_t,r_t,s_{t+1})$ into replay buffer $D$
12.         Sample random minibatch of $N$ transitions $(s_i,a_i,r_i,s_{i+1})$ from $D$
13.         Set $y_i=r_i+\gamma Q'(s_{i+1},\mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
14.         Update critic through minimizing loss: $L=\frac{1}{N}\sum_{i}(y_i-Q(s_i,a_i|\theta^{Q}))^{2}$
15.         Update actor policy using sampled policy gradient: $\nabla_{\theta^{\mu}}J\approx\frac{1}{N}\sum_{i}\nabla_{a}Q(s,a|\theta^{Q})|_{s=s_i,a=\mu(s_i)}\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})|_{s_i}$
16.         Update target networks: $\theta^{Q'}=\tau\theta^{Q}+(1-\tau)\theta^{Q'}$ and $\theta^{\mu'}=\tau\theta^{\mu}+(1-\tau)\theta^{\mu'}$
17.     end for
18. end for
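The critic update in the algorithm (computing the targets $y_i$ and the mean squared loss) can be illustrated with the sketch below. The networks are passed in as plain Python callables, which is an assumption made to keep the example self-contained; in practice they would be neural networks trained by gradient descent, and the function names here are hypothetical.

```python
def critic_targets(batch, target_actor, target_critic, gamma=0.99):
    """Compute y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each transition."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for (_s, _a, r, s_next) in batch
    ]


def critic_loss(batch, targets, critic):
    """Mean squared error between the targets y_i and Q(s_i, a_i)."""
    errors = [
        (y - critic(s, a)) ** 2
        for (s, a, _r, _s_next), y in zip(batch, targets)
    ]
    return sum(errors) / len(errors)
```

Note that the targets are computed with the slowly updated target networks, while the loss is evaluated with the main critic.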
III DATA PREPARATION
Vehicle trajectory data from the Next Generation Simulation (NGSIM) project [369980:8164263] were used. Specifically, this study used data retrieved from eastbound I-80 in the San Francisco Bay Area in Emeryville, CA, on April 13, 2005, as shown in Fig. 1. The study area was around 500 meters (1,640 feet) long and comprised six freeway lanes, including a high-occupancy vehicle (HOV) lane. A total of 45 minutes of data are available in the full dataset, divided into three 15-minute periods: 4:00 p.m. to 4:15 p.m.; 5:00 p.m. to 5:15 p.m.; and 5:15 p.m. to 5:30 p.m. These periods cover the buildup of congestion, or the transition between uncongested and congested traffic states, and full congestion during the peak period. The data provide precise location information for each vehicle at a sampling rate of 10 Hz. To enhance data quality, the reconstructed NGSIM I-80 data [369980:8164268] were used.
Car-following events were extracted by applying a car-following filter as described in Wang et al. [wang2018capturing]. A car-following event was defined as follows:

• The leading and following vehicles stay in the same lane;
• The duration of the event is longer than 15 s, ensuring that the car following persisted long enough to be analyzed.
A total of 1,341 car-following events were extracted and used in this study.
IV FEATURES FOR REWARD FUNCTION
In this section, features that capture the relevant objectives of car-following velocity control are proposed, with the final aim of constructing a proper reward function.
IVA Safety
Safety should be the most important element of autonomous car following. Time to collision (TTC) was used to represent safety. As a widely used safety indicator, TTC represents the time left before two vehicles collide. It is computed as:
$$TTC(t)=-\frac{S_{n-1,n}(t)}{\Delta V_{n-1,n}(t)}$$  (2) 
where $S_{n-1,n}$ is the following gap and $\Delta V_{n-1,n}$ is the relative speed (lead vehicle speed minus following vehicle speed), so that TTC is positive when the gap is closing.
TTC is inversely related to crash risk (smaller TTC values correspond to higher crash risks and vice versa) [369980:8164269]. To apply TTC as a feature reflecting safety, a safety limit (a lower bound of TTC) should be determined. However, different thresholds (from 1.5 s to 5 s) are reported in the literature [369980:8164269]. To address this problem, we determined the safety limit based on the NGSIM empirical data. Fig. 2 shows the cumulative distribution of TTC values in car-following events extracted from the NGSIM data. A 7-second safety limit, corresponding to the 10th percentile of the TTC distribution, was chosen. The TTC feature was then constructed as:
$$F_{TTC}=\begin{cases}\log(TTC/7) & 0\le TTC\le 7\\ 0 & \text{otherwise}\end{cases}$$  (3) 
In this way, if TTC is less than 7 s, the TTC feature is negative. As TTC approaches zero, the TTC feature approaches negative infinity, which represents a severe punishment for near-crash situations.
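The TTC feature can be sketched as follows. Two points are assumptions of this example rather than statements from the study: the logarithm is taken to be the natural logarithm, and TTC is reported as infinite when the gap is not closing, so that such steps fall into the "otherwise" branch. The relative speed is taken as lead speed minus follower speed, and the function names are hypothetical.

```python
import math


def time_to_collision(gap, relative_speed):
    """TTC = -S / dV, where dV = lead speed - follower speed.

    Returns math.inf when the gap is not closing (dV >= 0); this convention
    is an assumption of the sketch.
    """
    if relative_speed >= 0:
        return math.inf
    return -gap / relative_speed


def ttc_feature(ttc, limit=7.0):
    """log(TTC / limit) for 0 <= TTC <= limit, and 0 otherwise."""
    if 0 <= ttc <= limit:
        # log(0) is undefined in Python, so the limiting value is returned.
        return math.log(ttc / limit) if ttc > 0 else -math.inf
    return 0.0
```

For example, a 14 m gap closing at 2 m/s gives a TTC of exactly 7 s and a feature value of 0, while halving the TTC to 3.5 s gives a feature value of about −0.69.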
IVB Efficiency
Time headway was used to measure driving efficiency. It is defined as the time elapsed between the arrival of the lead vehicle (LV) and the following vehicle (FV) at a designated point. Keeping a short headway within the safety bounds can improve traffic flow efficiency because short headways correspond to large roadway capacities [369980:8164270].
Legal or recommended time headways differ between countries. In the U.S., several driver training programs state that it is difficult to follow a vehicle safely with a headway of less than 2 s. In Germany, the recommended time headway is 1.8 s, and fines are imposed when the time headway is less than 0.9 s. In Sweden, the police use a time headway of 1 s as a threshold for imposing fines [369980:8164269].
This study determined the appropriate time headway based on the empirical NGSIM data. Fig. 3 presents the distribution of time headway in all 1,341 extracted car-following events. A lognormal distribution was fitted to the data. The lognormal distribution is the probability distribution of a random variable whose logarithm is normally distributed. Its probability density function is:
$$f_{\text{lognorm}}(x|\mu,\sigma)=\frac{1}{x\sigma\sqrt{2\pi}}e^{-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}};\quad x>0$$  (4) 
where $x$ is the distribution variable (time headway in this study), and $\mu$ and $\sigma$ are the mean and standard deviation of $\ln x$, respectively. Based on the empirical data, the estimated $\mu$ and $\sigma$ were 0.4226 and 0.4365 respectively.
A headway feature was constructed as the probability density value of the estimated headway lognormal distribution:
$$F_{headway}=f_{\text{lognorm}}(headway|\mu=0.4226,\sigma=0.4365)$$  (5) 
According to this headway feature, headways around 1.3 seconds correspond to large feature values (about 0.65), while headways that are too long or too short correspond to low feature values. In this way, efficient headways are encouraged while unsafe or overly long headways are discouraged.
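The headway feature can be checked numerically with the short sketch below, which evaluates the lognormal density with the estimated parameters. Returning 0 for non-positive headways is an assumption of the example (the density is only defined for $x>0$), and the function names are hypothetical.

```python
import math

MU, SIGMA = 0.4226, 0.4365  # estimated parameters reported in the text


def lognormal_pdf(x, mu=MU, sigma=SIGMA):
    """Lognormal probability density; 0 is returned for x <= 0 (assumption)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def headway_feature(headway):
    """Headway feature: density of the fitted lognormal at the given headway."""
    return lognormal_pdf(headway)


# The density peaks at the mode exp(mu - sigma**2), roughly 1.26 s.
mode = math.exp(MU - SIGMA ** 2)
```

Evaluating `headway_feature(mode)` gives about 0.66, which agrees with the statement that headways around 1.3 s yield feature values of about 0.65.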
IVC Comfort
Jerk, defined as the change rate of acceleration, was used to measure driving comfort because it has a strong influence on the comfort of the passengers [369980:8164271]. A jerk feature was constructed as:
$$F_{jerk}=\frac{jerk^{2}}{3600}$$  (6) 
The squared jerk was divided by a base value (3600) to scale the feature into the range [0, 1]. The base value was determined as follows:

1. The sampling interval of the data is 0.1 s;
2. The acceleration is bounded between −3 and 3 m/s$^{2}$, based on the observed FV acceleration in all the car-following events;
3. Therefore, the largest jerk value is $\frac{3-(-3)}{0.1}=60\ m/s^{3}$, which gives 3600 when squared.
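The scaling argument in the list above can be reproduced with a short sketch. Jerk is approximated here by a finite difference of consecutive accelerations over the 0.1 s sampling interval; the function names are hypothetical.

```python
def jerk(accel_prev, accel_curr, dt=0.1):
    """Finite-difference jerk (m/s^3) between two consecutive accelerations."""
    return (accel_curr - accel_prev) / dt


def jerk_feature(jerk_value, base=3600.0):
    """Squared jerk scaled by the base value, so the result lies in [0, 1]
    whenever the absolute jerk is at most 60 m/s^3."""
    return jerk_value ** 2 / base
```

A jump from −3 to 3 m/s$^{2}$ within one step gives the extreme jerk of 60 m/s$^{3}$ and hence the maximum feature value of 1.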
V PROPOSED APPROACH
Since vehicle acceleration is a continuous variable, the deep deterministic policy gradient (DDPG) algorithm [369980:8164262] was used. This section explains the approach used to learn a velocity control strategy with DDPG.
VA State and Action
At a given time step $t$, the state of a car-following process is described by the FV speed $V_n(t)$, spacing $S_{n-1,n}(t)$, and relative speed $\Delta V_{n-1,n}(t)$. The action is the longitudinal acceleration of the FV, $a_n(t)$. Given the state and action at time step $t$, the next-step state is updated by a kinematic point-mass model:
$$\begin{array}{l}V_n(t+1)=V_n(t)+a_n(t)\cdot\Delta T\\ \Delta V_{n-1,n}(t+1)=V_{n-1}(t+1)-V_n(t+1)\\ S_{n-1,n}(t+1)=S_{n-1,n}(t)+\frac{\Delta V_{n-1,n}(t)+\Delta V_{n-1,n}(t+1)}{2}\cdot\Delta T\end{array}$$  (7) 
where $\Delta T$ is the simulation time interval, set to 0.1 s in this study, and $V_{n-1}$ is the velocity of the lead vehicle (LV), which is provided as an external input.
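One step of the kinematic point-mass update can be written as the sketch below. The function and argument names are hypothetical; the relative speed is lead speed minus follower speed, and the gap is advanced with the average of the old and new relative speeds.

```python
def step(v_follow, gap, rel_speed, accel, v_lead_next, dt=0.1):
    """Advance the car-following state by one time step of length dt.

    v_follow    -- follower speed V_n(t)
    gap         -- spacing S_{n-1,n}(t)
    rel_speed   -- relative speed dV_{n-1,n}(t) = lead speed - follower speed
    accel       -- follower acceleration a_n(t)
    v_lead_next -- lead vehicle speed V_{n-1}(t+1), supplied externally
    """
    v_follow_next = v_follow + accel * dt
    rel_speed_next = v_lead_next - v_follow_next
    # Trapezoidal integration of the relative speed over the step.
    gap_next = gap + 0.5 * (rel_speed + rel_speed_next) * dt
    return v_follow_next, gap_next, rel_speed_next
```

For instance, a follower at 10 m/s accelerating at 1 m/s$^{2}$ behind a lead vehicle holding 10 m/s closes a 20 m gap by 5 mm in the first 0.1 s step.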
VB Simulation Setup
To enable the RL agent to learn from trial and error, a simple car-following simulation environment was implemented. The simulation is initialized with the empirically observed following vehicle speed, spacing, and relative speed: $V_n(t=0)=V_n^{data}(t=0)$, $S_{n-1,n}(t=0)=S_{n-1,n}^{data}(t=0)$, and $\Delta V_{n-1,n}(t=0)=\Delta V_{n-1,n}^{data}(t=0)$. The RL agent is then used to compute the acceleration $a_n(t)$. Given the acceleration, future FV velocity, relative speed, and spacing are generated iteratively based on (7). Once a car-following event reaches its end, the state is reinitialized with the empirical data of the next event.
VC Reward function
The reward function, r(s, a), serves as a training signal to encourage or discourage behaviors in the context of a desired task. For the task of autonomous car following, a reward function was established based on a linear combination of the features constructed in section IV:
$$r=w_1F_{TTC}+w_2F_{headway}-w_3F_{jerk}$$  (8) 
where $w_1$, $w_2$, and $w_3$ are the coefficients of the features, all set to 1 in the current study.
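The linear combination can be sketched as a one-line function. The sketch assumes that the jerk feature enters with a negative sign, since it is a non-negative penalty scaled to [0, 1]; the function name is hypothetical and the unit weights follow the text.

```python
def reward(f_ttc, f_headway, f_jerk, w1=1.0, w2=1.0, w3=1.0):
    """Step reward as a weighted combination of the three features.

    f_ttc is non-positive (safety penalty), f_headway is a density value
    (efficiency bonus), and f_jerk is non-negative (comfort penalty), so the
    jerk term is subtracted. The sign of the jerk term is an assumption.
    """
    return w1 * f_ttc + w2 * f_headway - w3 * f_jerk
```

With a TTC feature and jerk feature of 0 and a headway feature near its peak of 0.65, the step reward is about 0.65, close to the converged value reported in the training results.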
VD Network Architecture
The actor and critic were each represented by a neural network. The input of the actor network is the state at time step $t$, $s_t=(V_n(t),\Delta V_{n-1,n}(t),S_{n-1,n}(t))$. Its output is the FV’s acceleration $a_n(t)$. The input of the critic network is a state-action pair $(s_t,a_t)$. Its output is a scalar $Q$-value $Q(s_t,a_t)$.
Fig. 4 presents the architectures of the actor and critic networks [zhu2018human]. Each consists of three layers: an input layer, an output layer, and a hidden layer with 30 neurons. Deeper neural networks with more than one hidden layer were also tested, but they did not perform significantly better.
For the hidden layer, the Rectified Linear Unit (ReLU) activation function, $f(x)=\max(0,x)$, was used. ReLU can accelerate the convergence of network parameter optimization [369980:8164272]. For the output layer of the actor network, a $\tanh$ activation function was used. The $\tanh$ function maps real-valued numbers to the range [−1, 1] and thus can be used to bound the output accelerations between −3 and 3 m/s$^{2}$.
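The actor's forward pass can be sketched in plain Python as follows. The weight initialization, the multiplication of the $\tanh$ output by 3 to reach the acceleration bounds, and all function names are assumptions made for illustration; only the layer sizes and activation functions come from the description above.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map W x + b for one fully connected layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def actor_forward(state, hidden, output, max_accel=3.0):
    """Map state (V_n, dV_{n-1,n}, S_{n-1,n}) to an acceleration in [-3, 3] m/s^2."""
    h = [max(0.0, z) for z in dense(state, hidden)]  # ReLU hidden layer
    (z_out,) = dense(h, output)                      # single output unit
    return max_accel * math.tanh(z_out)              # tanh in [-1, 1], then scaled


rng = random.Random(0)
hidden_layer = init_layer(3, 30, rng)   # 3 state inputs, 30 hidden neurons
output_layer = init_layer(30, 1, rng)   # 1 output: acceleration
```

Because $\tanh$ saturates, the output stays within the acceleration bounds regardless of how large the state values are.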
VE Network Update and Hyper Parameters
The network parameters were updated with the Adam optimization algorithm [369980:8164273]. The critic network was updated by minimizing the loss function $L=\frac{1}{N}\sum_{i}(y_i-Q(s_i,a_i|\theta^{Q}))^{2}$; the actor network was updated according to the gradient $\nabla_{\theta^{\mu}}J\approx\frac{1}{N}\sum_{i}\nabla_{a}Q(s,a|\theta^{Q})|_{s=s_i,a=\mu(s_i)}\nabla_{\theta^{\mu}}\mu(s|\theta^{\mu})|_{s_i}$ [369980:8164262].
The hyperparameters (parameters set prior to the training process) adopted are presented in Table I. These values were determined according to Lillicrap et al. [369980:8164262] and by performing tests on a randomly sampled training dataset.
Table I: Hyperparameters
| Hyperparameter | Value | Description |
| --- | --- | --- |
| Learning rate | 0.001 | The learning rate used by Adam |
| Discount factor | 0.99 | Discount factor $\gamma$ used in the Q-learning update |
| Minibatch size | 32 | Number of training cases over which each stochastic gradient descent (SGD) update is computed |
| Replay memory size | 7000 | Number of training samples in the replay memory |
| Soft target update $\tau$ | 0.001 | The update rate of target networks |
VF Exploration Noise of Action
An exploration policy was constructed by adding noise sampled from a noise process to the original actor policy. As suggested by Lillicrap et al. [369980:8164262], an Ornstein-Uhlenbeck process [369980:8164274] with $\theta=0.15$ and $\sigma=0.2$ was used. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, generating temporally correlated values centered around zero. The temporally correlated noise enables the agent to explore well in a physical environment that has momentum.
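A discretized Ornstein-Uhlenbeck process can be sketched as follows. The Euler discretization, the unit step size `dt=1.0`, the zero long-run mean, and the class name are assumptions of this example; only $\theta=0.15$ and $\sigma=0.2$ come from the text.

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x <- x + theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta = theta
        self.sigma = sigma
        self.mu = mu
        self.dt = dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Restart the process at its long-run mean, e.g. at the start of an episode."""
        self.x = self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x
```

The drift term pulls each sample back toward zero, so consecutive noise values are correlated rather than independent, which suits a system where acceleration changes accumulate into speed and gap.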
VG Training the DDPG Velocity Control Model
Of the 1,341 extracted car-following events, 70% (938) were used for training and 30% were used for testing. At the training stage, the RL agent sequentially simulates all the car-following events in the training data. Whenever a car-following event terminates and a new event is to be simulated, the state of the agent is initialized with the empirical data of the new one.
The training was repeated for 60 episodes, and the RL agent that generated the maximum average step reward on the testing data was selected. Fig. 5 shows the change of average step reward with respect to training episode. As can be seen, the performance of the DDPG model starts to converge when the training episode reaches 20. When the model converges, the agent receives a reward value of about 0.64. This is achieved by selecting actions in a way that keeps the TTC and jerk feature values near 0 and obtains the maximum headway feature value (0.65). It should be noted that the model has similar performance on training and testing data, demonstrating that it can generalize well to new data.
VI RESULTS
In this section, car-following behavior observed in the empirical NGSIM data and that simulated by the DDPG model are compared to demonstrate the model’s ability to follow a leading vehicle safely, efficiently, and comfortably. The DDPG model produces the following vehicle trajectories by taking the leading vehicle trajectories as input.
VIA Safe Driving
Driving safety was evaluated based on the minimum TTC during a car-following event. Fig. 6 shows the cumulative distributions of minimum TTC for NGSIM empirical data and DDPG simulation. Nearly 35% of NGSIM minimum TTCs were lower than 5 s, while only about 8% of DDPG minimum TTCs were lower than 5 s. This means that car-following behavior generated by the DDPG model is much safer than drivers’ behavior observed in the NGSIM data.
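The evaluation metric used here, the share of events whose minimum TTC falls below 5 s, can be sketched as follows. The function names and the data layout (one pair of gap and relative-speed sequences per event) are hypothetical, and steps in which the gap is not closing are skipped, which is an assumption of the sketch.

```python
import math


def min_ttc(gaps, rel_speeds):
    """Smallest TTC over the steps of one event in which the gap is closing.

    rel_speeds holds lead speed minus follower speed; math.inf is returned
    when the gap never closes during the event.
    """
    ttcs = [-s / dv for s, dv in zip(gaps, rel_speeds) if dv < 0]
    return min(ttcs, default=math.inf)


def share_dangerous(events, threshold=5.0):
    """Fraction of events whose minimum TTC is below the threshold (seconds)."""
    flags = [min_ttc(gaps, rel_speeds) < threshold for gaps, rel_speeds in events]
    return sum(flags) / len(flags)
```

Applied to the simulated and observed trajectories separately, this kind of computation yields the two percentages compared in the text.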
To illustrate the safe driving of the DDPG model, a car-following event was randomly chosen from the NGSIM dataset. Fig. 7 shows the observed speed, spacing, and acceleration, and the corresponding ones generated by the DDPG model. The driver in the NGSIM data drove in a way that produced very small inter-vehicle spacing, while the DDPG model kept a safe following gap of around 10 m.
VIB Efficient Driving
Time headway during the car-following process was used to evaluate driving efficiency. Time headway was calculated at every time step of a car-following event, and the distribution of these time headways is shown in Fig. 8.
As can be seen, the DDPG model produced car-following trajectories that always maintained a time headway in the range of 1 s to 2 s, while the NGSIM data had a much wider range of time headways (0 s to 6 s). This included some dangerous headways of less than 1 s and some inefficient headways of more than 3 s. Therefore, it can be concluded that the DDPG model is able to follow the leading vehicle with an efficient and safe time headway.
VIC Comfortable Driving
Driving comfort was evaluated based on jerk values during car following. Similar to time headway, jerk was calculated for every time step of a car-following event. Fig. 9 presents the histograms of jerk values during car following.
The DDPG model clearly produced trajectories with lower jerk values. First, the DDPG trajectories had a narrower jerk distribution range (−5 to 5 m/s${}^{3}$) than the NGSIM data (−10 to 10 m/s${}^{3}$). Second, jerk values were centered more closely around zero in the DDPG simulation trajectories than in the NGSIM data. As smaller absolute values of jerk correspond to more comfortable driving, it can be concluded that the DDPG model can control vehicle velocity in a more comfortable way than human drivers in the NGSIM data.
To illustrate the comfortable driving of the DDPG model, a car-following event was randomly chosen from the NGSIM dataset. Fig. 10 shows the observed speed, spacing, acceleration, and jerk, and the corresponding ones generated by the DDPG model. The driver in the NGSIM data drove with frequent acceleration changes and large jerk values, while the DDPG model maintained a nearly constant acceleration and produced low jerk values.
To summarize, the DDPG model demonstrated the capability of safe, efficient, and comfortable driving in that it 1) had a small percentage of dangerous minimum TTC values (less than 5 seconds); 2) could maintain efficient and safe headways within the range of 1 s to 2 s; and 3) followed the leading vehicle comfortably with smooth acceleration.
VII DISCUSSION AND CONCLUSION
A model used for velocity control during autonomous car following was proposed based on deep RL. The model uses the deep deterministic policy gradient (DDPG) algorithm to learn from trial and interaction, with a reward function signaling how the RL agent performs. The reward function was developed by referencing human driving data and combining driving features related to safety, efficiency, and comfort. By doing this, the model not only tries to imitate human drivers’ behavior but also directly optimizes driving safety, efficiency, and comfort. Results show that, compared to human drivers in the real world, the proposed DDPG car-following model demonstrated a better capability of safe, efficient, and comfortable driving.
The proposed model can be further extended in the following aspects:

1. More objectives can be added, such as energy-saving driving;
2. The weights of the objectives can be adjusted to reflect users’ individual preferences;
3. In the current study, a linear function was used to combine the different objectives (features). More complicated reward function forms, such as a nonlinear function, can be adopted to express more complex reward mechanisms.
Although this study used networks with only one hidden layer, the key idea of deep reinforcement learning methods is fully exploited. Moreover, the networks can be easily extended to deeper ones once more input variables are provided.
This study can be further improved by designing better experience replay mechanisms. Experience replay lets RL agents remember and reuse experiences from the past. In the currently adopted DDPG algorithm, experience transitions were uniformly sampled, without considering their significance [369980:8164275]. In future work, prioritized experience replay can be used to replay important transitions more frequently, and therefore to learn more efficiently.
To sum up, this study used deep RL to learn how to control vehicle velocity during car following in a safe, efficient, and comfortable way. Real-world human driving data from the NGSIM study were used to train the model. Car-following behavior produced by the model was then compared with that observed in the empirical NGSIM data to evaluate the model’s performance. Results show that the proposed model demonstrated the capability of safe, efficient, and comfortable driving, and may even perform better than human drivers. The results indicate that reinforcement learning methods could contribute to the development of autonomous driving systems.
ACKNOWLEDGEMENTS
This study was sponsored by the Chinese National Science Foundation (51522810).