### Abstract

Predicting the road traffic speed is a challenging task due to differenttypes of roads, abrupt speed changes, and spatial dependencies between roads,which requires the modeling of dynamically changing spatial dependencies amongroads and temporal patterns over long input sequences. This paper proposes anovel Spatio-Temporal Graph Attention (STGRAT) that effectively captures thespatio-temporal dynamics in road networks. The features of our approach mainlyinclude spatial attention, temporal attention, and spatial sentinel vectors.The spatial attention takes the graph structure information (e.g., distancebetween roads) and dynamically adjusts spatial correlation based on roadstates. The temporal attention is responsible for capturing traffic speedchanges, while the sentinel vectors allow the model to retrieve new featuresfrom spatially correlated nodes or preserve existing features. The experimentalresults show that STGRAT outperforms existing models, especially in difficultconditions where traffic speeds rapidly change (e.g., rush hours). Weadditionally provide a qualitative study to analyze when and where STGRATmainly attended to make accurate predictions during a rush-hour time.

### Quick Read (beta)

# STGRAT: A Spatio-Temporal Graph Attention Network for Traffic Forecasting

###### Abstract

Predicting the road traffic speed is a challenging task due to different types of roads, abrupt speed changes, and spatial dependencies between roads,
which requires the modeling of dynamically changing spatial dependencies among roads and temporal patterns over long input sequences.
This paper proposes a novel Spatio-Temporal Graph Attention (STGRAT) that effectively captures the spatio-temporal dynamics in road networks. The features of our approach mainly include spatial attention, temporal attention, and spatial sentinel vectors.
The spatial attention takes the graph structure information (e.g., distance between roads) and dynamically adjusts spatial correlation based on road states.
The temporal attention is responsible for capturing traffic speed changes, while the sentinel vectors allow the model to retrieve new features from spatially correlated nodes or preserve existing features.
The experimental results show that STGRAT outperforms existing models, especially in difficult conditions where traffic speeds rapidly change (e.g., rush hours).
We additionally provide a qualitative study to analyze *when* and *where* STGRAT mainly attended to make accurate predictions during a rush-hour time.

STGRAT: A Spatio-Temporal Graph Attention Network for Traffic Forecasting

Cheonbok Park^{1}, Chunggi Lee^{2}, Hyojin Bahng^{1}, Taeyun won^{1},
Kihwan Kim^{2}, Seungmin Jin^{2}, Sungahn Ko^{2}, Jaegul Choo^{1}
^{1} Korea University , ^{2} UNIST

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

Predicting traffic speed is an important task, which can highly impact our daily life. However, this task is challenging, as a prediction method needs not only to find innate spatial dependencies among roads, but also to understand how their dependencies change over time and impact other roads’ traffic conditions. For example, when a road is congested, there is a high chance that its neighboring roads are also congested. Moreover, roads in residential areas tend to have different traffic patterns compared to those in industry complexes (?).

Numerous deep learning models (?; ?; ?; ?) have been proposed for traffic speed prediction based on graph convolution neural network (GCNN) with recurrent neural network (RNN), outperforming conventional approaches (?). Recently, Li et al. (?) have proposed diffusion convolution recurrent neural network (DCRNN) that combines diffusion convolution (?) with RNN and demonstrates improved prediction accuracy. Graph WaveNet (?) adapts diffusion convolution, incorporates a self-adaptive adjacency matrix, and uses dilated convolution for achieving a state-of-the-art performance.

However, prior models have weaknesses. First, existing models (e.g., DCRNN (?), STGCN (?), and Graph WaveNet (?)) assume *fixed* spatial dependencies among roads.
In other words, they compute spatial dependencies once and use that computation result all the time without considering dynamically changing traffic conditions.
GaAN (?) applies different spatial correlation results across roads by utilizing an attention mechanism.
However, GaAN does not consider traffic flow directions and overall graph structure information (e.g., distances between nodes), which can play an important role in deciding which road to attend.
Second, there are models that use recurrent neural networks (RNNs) for temporal modeling (e.g., DCRNN (?) GaAN (?)).
However, RNNs cannot directly access past features in long input sequences, which implies a limitation in capturing long temporal dependencies (?).

In this work, we propose a novel Spatio-Temporal Graph Attention Network (STGRAT) for predicting traffic speed, entirely based on a self-attention mechanism. STGRAT uses spatial and temporal attention for efficiently capturing spatio-temporal dependencies within a road network. Unlike aforementioned models that use RNNs for temporal modeling, STGRAT can directly access distant features of input sequences without any restriction by using temporal attention. The spatial attention observes speed changes to model spatial dependencies across roads, along with graph structure information, such as distances and hops between nodes and traffic flow directions. To further enhance prediction accuracy, we add ‘spatial sentinel’ key and value vectors to the spatial attention, motivated from sentinel mixture model(?; ?). This is introduced as we find that existing attention techniques force a model to always retrieve new information from neighbor nodes, even in unnecessary cases. The sentinel vectors allow STGRAT to avoid such unnecessary attention and focus on existing encoded features instead. The experiment results indicate that STGRAT achieves state-of-the-art performance. We also confirm that STGRAT is better than existing models at predicting traffic speeds in situations where the road speeds are abruptly changing (e.g., rush hours). Lastly, we present the heatmaps of the attention modules which reveal where and when STGRAT attends for making predictions.

The contributions of this work include 1) STGRAT, effectively capturing both spatial and temporal dynamics of input sequences, 2) a newly adapted self-attention module with the sentinel vectors that help the model decide which feature to use, 3) quantitative comparison results against existing and state-of-the-art models, and 4) the qualitative study which describes how STGRAT uses the attention information that changes over time and traffic conditions.

## Related work

In this section, we review previous approaches regarding traffic prediction and attention models.

### Approaches for Short-Term Traffic Forecasting

Many deep learning models exist for predicting short-term traffic conditions by modeling spatial and temporal dependencies of the road traffic. The graph convolution neural network (GCNN) (?) has been popular for spatial relationship modeling. The GCNN views sensor node networks as a type of graph and aggregates neighboring nodes’ information into features based on convolution coefficients. These coefficients are computed by spatial information (e.g., distance between nodes). Recurrent neural networks (RNNs) or its variants are often combined with the encoded spatial relationship to model temporal dependencies (e.g., road speed sequences) (?; ?).

As modeling spatial correlation is a key factor for improving prediction performance, many researchers propose new approaches for effective spatial correlation modeling. Li et al. (?) presents the diffusion convolution recurrent neural network (DCRNN) which combines diffusion convolution (?) and recurrent neural networks for modeling spatial and temporal dependencies. Graph WaveNet (?) also adapts diffusion convolution in spatial modeling, but it is different from DCRNN, as it 1) considers both connected and unconnected nodes in the modeling process and 2) uses dilated convolution to learn long sequences of data.

While effective, existing approaches use constant coefficients, which are computed once and applied to all traffic conditions. However, the fixed coefficients may result in inaccuracy when spatial correlation is variable (e.g., abrupt speed changes). Compared to existing models, STGRAT improves accuracy by dynamically adjusting the coefficients of neighboring nodes based on their present states and more spatial information (e.g., distance, node connectivity, flow direction).

### Attention Models

Attention-based neural networks are widely used for sequence-to-sequence modeling, such as machine translation and natural language processing (NLP) tasks (?; ?). Vaswani et al. propose a novel self-attention network called Transformer (?) which is able to dynamically capture diverse syntactic and semantic features of the given context by using multi-head self-attention heads. The self-attention mechanism has additional advantages compared to conventional long-short-term memory (LSTM) (?) in that its process can be easily paralleled and it directly attends to related input items regardless of the coverage of the receptive field. Due to the advantages, Transformer has contributed to many other NLP tasks for improving accuracy (?; ?). Velickovic et al. (?) utilize the self-attention network for graph data, demonstrating that the attention networks outperform the GCNN model. Zhang et al. (?) propose a graph attention network, replacing the diffusion convolution operation in DCRNN (?) with the gating attention. Their experiment results show that the graph attention model does not lag behind the convolution-based model in the spatiotemporal predictions.

While previous models can be used for replacing GCNN-based spatial modeling, they all have a drawback–they do not consider the information embedded in the graph structure in deciding when and where to attend, such as distances and flow directions between nodes in their spatial dependency modeling processes. Compared to previous models, STGRAT has a novel spatial attention mechanism that can consider all of the mentioned graph structure information.

## Proposed Approach

In this section, we define the traffic forecasting problem and describe our spatio-temporal graph attention network.

### Defining Traffic Speed Forecasting Problem

We aim to predict the future traffic speed at each sensor location. We define the input graph as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathcal{A})$, where $\mathcal{V}$ is the set of all of different sensor nodes, ($|\mathcal{V}|=N$), $\mathcal{E}$ is the set of edges, and $\mathcal{A}\in {\mathbb{R}}^{N\times N}$ is a weighted adjacency matrix. The matrix $\mathcal{A}$ includes three types of information connectivity, edge weights and proximity. Connectivity indicates whether two nodes are directly connected or not. Edge weights are comprised of the distance and direction of the edges between two connected nodes. Proximity refers to the overall structure information on a given graph including connectivity, edge directions, and distances of the entire nodes.

We denote ${X}^{\left(t\right)}\in {\mathbb{R}}^{N\times 2}$ as the input feature matrix at time $t$, where $N$ is the number of nodes and $2$ is the number of features (the velocity and the timestamp). Following the problem definition in most of previous traffic forecasting frameworks, our goal is to learn a mapping function $f$ that predicts the speed of next $T$ time steps ($Y=[{X}^{\left(t+1\right)},\mathrm{\cdots},{X}^{\left(t+T\right)}]$), given the previous $T$ input speeds in a sequence ($X=[{X}^{\left(t-T+1\right)},\mathrm{\cdots},{X}^{\left(t\right)}]$) and graph $\mathcal{G}$, i.e., $Y=f(X,\mathcal{G})$. We set $T$ as 12 in this work. To solve this sequence-to-sequence learning problem, we utilize an encoder-decoder architecture, as shown in Fig. 1 which is further described in the following sections.

### Encoder Architecture

Given a sequence of observations, $X=[{X}^{1},\mathrm{\cdots},{X}^{\left(t\right)}]$, the encoder consists of spatial attention and temporal attention for predicting the future sequence $Y=[{X}^{\left(t+1\right)},\mathrm{\cdots},{X}^{\left(t+T\right)}]$. As shown in Fig. 1, a single encoder layer consists of three sequential sub-layers: the spatial attention layer, the temporal attention layer, and the point-wise feed-forward neural networks. The spatial attention layer attends to neighbor nodes spatially correlated to the center node at each time-step, while the temporal attention layer attends to individual nodes, focusing on different time steps of a given input sequence. The encoder has a skip-connection to bypass the sub-layer, and we employ layer normalization (?) and dropout after each sub-layer to improve the generalization performance. The overall encoder architecture is a stack of an embedding layer and four ($L=4$) identical encoder layers. The encoder transforms the spatial and temporal dependencies of an input signal into a hidden representation vector, later used for attention layers in the decoder.

### Embedding Layer

Unlike convolution-based GNNs (?; ?), attention-based GNNs (?; ?) mainly utilize connectivity between nodes. But conventional model do not consider proximity information in their modeling process. To incorporate the proximity information, the embedding layer in STGRAT takes a pre-trained node embedding vector generated by LINE (?). The node embedding features are used to compute spatial attention which will be further discussed in the following section.

The embedding layer additionally performs positional embedding to acquire the order of input sequences. Unlike previous methods that use a recurrent or convolutional layer for sequence modeling, we follow the positional encoding scheme of the Transformer (?). Positional encoding does not require extra training parameters. We apply residual skip connections to prevent the vanishing effect of embedded features that can occur as the number of encoder or decoder layers increases. We concatenate each node embedding result with the node features and then project the concatenated features onto ${d}_{model}$. Lastly, we add the positional encoding vector to each time step.

### Spatial attention

Fig. 2 shows the proposed spatial attention, which consists of multiple inflow attention heads (odd head indices) and outflow attention heads (even head indices). Previous attention-based GNNs (?; ?) define spatial correlation in an undirected manner. They calculate attention with respect to all neighbor nodes without considering their direction in a road network. On the other hand, our model differentiates neighbor nodes by direction, i.e., in-coming and out-going, based on the center node. To be specific, we divide the attention heads, odd indices responsible for inflow nodes and even indices for outflow nodes, which allows the model to attend to different information for each of inflow and outflow traffic.

We denote the encoder hidden states as $\mathcal{Z}=[{z}_{i},\mathrm{\cdots},{z}_{N}]$, where ${z}_{i}\in {\mathcal{R}}^{\times {d}_{model}}$ is the hidden state of the $i$-th node. We denote the set of the $i$-th node and its neighbor nodes as ${\mathcal{N}}_{i}$. We define the dimensions of the query, key and value vectors are ${d}_{q}={d}_{k}={d}_{v}={d}_{model}/H$, respectively, where $H$ is the number attention head in multi-head self-attention.

To extract diverse high-level features from multiple attention heads, we project the current node onto a query space and ${\mathcal{N}}_{i}$ onto a key and a value spaces. Each attention head’s output is defined as a weighted sum of value vectors, where the weight of each value is computed from a learned similarity function of the corresponding key with the query. However, existing self-attention methods have the constraint that the sum of weights have to be one. Hence, the query node has to attend to key-value pairs of ${\mathcal{N}}_{i}$, even in situations where any spatial correlation does not exist among them.

To prevent such unnecessary attention, we add spatial sentinel key and value vectors, which are linear transformations of a query node. For instance, if a query node does not require any information from the key-value pairs of ${\mathcal{N}}_{i}$, it will attend to the sentinel key and value vectors, i.e., stick to its existing information rather than always attending to the given key and the value information.

Thus, the output feature of the $i$-th node in the $h$-th attention head, ${o}_{i}^{h}\in {\mathcal{R}}^{{d}_{v}}$, is the weighted sum of the spatial sentinel value vector and the value vectors of ${\mathcal{N}}_{i}$:

${o}_{i}^{h}$ | $=\left(1-{\displaystyle \sum _{j={\mathcal{N}}_{i}}}{\alpha}_{i,j}\right)*\left({z}_{i}{W}_{h}^{{V}_{s}}\right)+{\displaystyle \sum _{j={\mathcal{N}}_{i}}}{\alpha}_{i,j}\left({z}_{j}{W}_{h}^{{V}_{n}}\right),$ |

where ${W}_{h}^{{V}_{s}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{v}}$ and ${W}_{h}^{{V}_{n}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{v}}$ indicate the linear transformation matrix of the sentinel value vector and the value vector of spatial attention. The attention coefficient, ${\alpha}_{i,j}$, is computed as

$$\begin{array}{cc}\hfill {\alpha}_{i,j}=\frac{exp\left({e}_{i,j}\right)}{{e}_{i,s}+{\sum}_{j={\mathcal{N}}_{i}}exp\left({e}_{i,j}\right)},& \end{array}$$ | (1) |

where ${e}_{i,j}$ indicates the energy logits, and ${e}_{i,s}$ represents the sentinel energy logit.

Energy logits are computed by using a scaled dot-product of the query vector of the $i$-th node and the key vector of the $j$-th node, i.e.,

$$\begin{array}{cc}\hfill {e}_{i,j}=\frac{\left({z}_{i}{W}_{h}^{{Q}_{N}}\right){\left({z}_{j}{W}_{h}^{{K}_{N}}\right)}^{T}}{\sqrt{{d}_{k}}}+{P}_{h}(\mathcal{A}),& \end{array}$$ | (2) |

where parameter matrices ${W}_{h}^{{V}_{s}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{q}}$ and ${W}_{h}^{{V}_{n}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{k}}$ are the linear transformation matrix of the query vector and the key vector. Moreover, to explicitly provide edge information, we include additional prior knowledge ${P}_{h}(\mathcal{A})$, called diffusion prior, based on a diffusion process in a graph. The diffusion prior indicates whether the attention head is an inflow attention or an outflow attention, defined as

$$\begin{array}{cc}\hfill {P}_{2m+1}(\mathcal{A})=\sum _{k=0}^{K}{\beta}_{h}^{k}*{({D}_{I}^{-1}{\mathcal{A}}^{T})}^{k}& \\ \hfill {P}_{2m}(\mathcal{A})=\sum _{k=0}^{K}{\beta}_{h}^{k}*{({D}_{O}^{-1}\mathcal{A})}^{k},& \end{array}$$ | (3) |

where $K$ is the number of truncation steps of the diffusion process. ${D}_{I}$ and ${D}_{O}$ are the in-coming diagonal matrix and out-going diagonal matrix respectively. $\left({D}_{O}^{-1}\mathcal{A}\right)$ and $\left({D}_{I}^{-1}\mathcal{A}\right)$ denote the out-going and the in-coming state transition matrices. ${\beta}_{h}^{k}$ is the weight of the diffusion process at step $k$ in the $h$-th attention head, which is a learnable parameter at each layer of the attention head.

The sentinel energy logit is similar to energy logits, but it excludes prior knowledge and uses a sentinel key vector instead. Hence, it is defined as follows:

$$\begin{array}{cc}\hfill {e}_{i,s}=\frac{\left({z}_{i}{W}_{h}^{{Q}_{N}}\right){\left({z}_{i}{W}_{h}^{{K}_{s}}\right)}^{T}}{\sqrt{{d}_{k}}},& \end{array}$$ | (4) |

where ${W}_{h}^{{K}_{s}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{k}}$ is a linear transformation matrix of the sentinel key vector. If ${e}_{i,s}$ is higher than ${e}_{i,j}$, the model will not attend to ${\mathcal{N}}_{i}$ nodes and only aggregate the sentinel value vector.

After computing the output features ${o}_{i}^{h}$ on each attention head, they are concatenated and projected as

$${z}_{i}^{*}=\mathrm{Concat}({o}_{i}^{1},\mathrm{\dots},{o}_{i}^{H})*{W}^{{O}_{N}},$$ | (5) |

where ${W}^{O}\in {\mathcal{R}}^{{d}_{model}\times {d}_{model}}$ is the projection layer. The projection layer helps the model to combine various aspects of spatial-correlation features and outputs of the inflow and outflow attention heads.

### Temporal Attention

There are two major differences between temporal and spatial attention–1) temporal attention does not use the sentinel vectors, and 2) temporal attention attends to important time steps of each node, while spatial attention attends to important nodes at each time step. However, temporal attention is similar to spatial attention in that it uses multi-head attention to capture the diverse representation in the query, key, and value spaces.

The temporal attention is computed by concatenating the output matrix of each attention head and projecting it by ${W}^{{O}_{T}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{model}}$:

$MultiHead$ | $=\mathrm{Concat}(hea{d}_{1},\mathrm{\dots},hea{d}_{H})*{W}^{{O}_{T}}.$ |

At each temporal attention head, ${\mathcal{Z}}_{i}=[{z}_{i}^{(t-T+1)},\mathrm{\cdots},{z}_{i}^{t}]$ is projected onto the query, key, and value spaces,

$hea{d}_{i}$ | $=\mathrm{Attention}(Q{W}_{i}^{{Q}_{T}},K{W}_{i}^{{K}_{T}},V{W}_{i}^{{V}_{T}}),$ |

where ${\mathscr{H}}_{i}$ is an input sequence of hidden state features of the $i$-th node, and ${W}_{i}^{{Q}_{T}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{q}}$,${W}_{i}^{{K}_{T}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{k}}$ and ${W}_{i}^{{V}_{T}}\in {\mathcal{R}}^{{d}_{model}\times {d}_{v}}$ indicates the linear transformation matrix of the query, key and value respectively. After projecting each time-step feature onto sub-spaces, we apply the scaled dot-product attention to obtain the attention weight at each time step. We calculate the output matrix of the weighted sum of value vectors which is defined as follows:

$Attention(Q,K,V)$ | $=\mathrm{Softmax}\left({\displaystyle \frac{Q{K}^{T}}{\sqrt{{d}_{k}}}}\right)*V.$ |

Note that the temporal attention layer can directly attend to features across time steps without any restriction in accessing information in the input sequence, which is different from previous approaches (?; ?) that cannot access features at distant time steps.

### Point-Wise Feed-Forward Networks

The last sub-layer of the encoder is the point-wise feed-forward networks, which consist of two sequential fully-connected networks:

$$FFN({z}_{i}^{t})={W}_{2}*GELU\left({W}_{1}{z}_{i}^{t}+{b}_{1}\right)+{b}_{2},$$ | (6) |

where ${W}_{1},{W}_{2}\in {\mathcal{R}}^{{d}_{model}\times {d}_{model}}$ and ${b}_{1},{b}_{2}\in {\mathcal{R}}^{{d}_{model}}$ are the weights and biases of the two layers. We use the GELU (?) activation function between the two fully-connected layers. This last sub-layer is commonly applied to each node separately, creating a high-level feature that integrates information from the two attention layers.

### Decoder Architecture

The overall structure of the decoder is similar to that of the encoder. The decoder consists of the embedding layer and four other sub-layers: the spatial attention layer, two temporal attention layers, and feed-forward neural networks. After each sub-layer, layer normalization is applied. One difference between the encoder and decoder is that the decoder contains two different temporal attention layers–masked attention layer and encoder-decoder (E-D) attention layer. The masked attention layer masks future time step inputs to restrict attention to present and past information. The encoder-decoder attention layer extracts features by using the encoded vectors from both the encoder and the masked attention layer. In this layer, the encoder-decoder attention from the encoder is used as key and value, and query features of each node are passed along the masked self-attention. Finally, the linear layer predicts the future sequence.

## Evaluation

We present our experiment results on two real-world large-scale datasets–METR-LA and PEMS-BAY. The speed data in the METR-LA dataset (?) is collected from loop detectors on the highways of Los Angeles County. We use 207 sensors’ data for the experiment (data range: 3/1/2012 to 6/30/2012). The PEMS-BAY speed data is gathered from California Transportation Agencies Performance Measurement System and 325 sensors in the dataset are chosen for our experiment (data range: 1/1/2017 to 6/1/2017). We present a comprehensive experiment result with the METR-LA dataset, as BAY area’s traffic conditions are more complicated than those in the LA area (?).

T | Metric | GCRNN | DCRNN | GaAN | STGCN | Graph WaveNet | HyperST | STGRAT | |
---|---|---|---|---|---|---|---|---|---|

METR-LA |
15 min | MAE | 2.80 | 2.73 | 2.71 | 2.88 | 2.69 | 2.71 | 2.60 |

RMSE | 5.51 | 5.27 | 5.24 | 5.74 | 5.15 | 5.23 | 5.07 | ||

MAPE | 7.5% | 7.12% | 6.99% | 7.62% | 6.90% | - | 6.61% | ||

30 min | MAE | 3.24 | 3.13 | 3.12 | 3.47 | 3.07 | 3.12 | 3.01 | |

RMSE | 6.74 | 6.40 | 6.36 | 7.24 | 6.26 | 6.38 | 6.21 | ||

MAPE | 9.0% | 8.65% | 8.56% | 9.57% | 8.37% | - | 8.15 % | ||

1 hour | MAE | 3.81 | 3.58 | 3.64 | 4.59 | 3.53 | 3.58 | 3.49 | |

RMSE | 8.16 | 7.60 | 7.65 | 9.40 | 7.37 | 7.56 | 7.42 | ||

MAPE | 10.9% | 10.43% | 10.62% | 12.70% | 10.01% | - | 10.01% | ||

Average | MAE | 3.28 | 3.14 | 3.16 | 3.64 | 3.09 | 3.13 | 3.03 | |

RMSE | 6.80 | 6.42 | 6.41 | 7.46 | 6.26 | 6.39 | 6.23 | ||

MAPE | 9.13% | 8.73% | 8.72% | 9.96% | 8.42% | - | 8.25% | ||

PEMS-Bay |
15 min | MAE | - | 1.38 | - | 1.36 | 1.30 | - | 1.29 |

RMSE | - | 2.95 | - | 2.96 | 2.74 | - | 2.71 | ||

MAPE | - | 2.9% | - | 2.9% | 2.73% | - | 2.67% | ||

30 min | MAE | - | 1.74 | - | 1.81 | 1.63 | - | 1.61 | |

RMSE | - | 3.97 | - | 4.27 | 4.52 | - | 3.69 | ||

MAPE | - | 3.9% | - | 4.17% | 3.67% | - | 3.63% | ||

1 hour | MAE | - | 2.07 | - | 2.49 | 1.95 | - | 1.95 | |

RMSE | - | 4.74 | - | 5.69 | 4.52 | - | 4.54 | ||

MAPE | - | 4.9% | - | 5.79% | 4.63% | - | 4.64% | ||

Average | MAE | - | 1.73 | - | 1.88 | 1.62 | - | 1.62 | |

RMSE | - | 3.88 | - | 4.30 | 3.65 | - | 3.65 | ||

MAPE | - | 3.9% | - | 4.28% | 3.67% | - | 3.65% |

We pre-process the data to have five-minute interval speed and timestamp data, replace missing values with zeros, and apply the z-score and the min-max normalization.
We use 70% of the data for training, 10% of the data for validation, and the rest of the data for testing.
To form a directed road network graph with sensor information, we define $\mathcal{A}$ based on the shortest distance between two nodes and the thresholded Gaussian kernel (?).
To calculate the shortest path distance between two nodes $d({v}_{i},{v}_{j})$, we apply Dijkstra’s algorithm at each node.
Our pre-processing follows the approach used in DCRNN (?) ^{1}^{1}
1
https://github.com/liyaguang/DCRNN/tree/master/data.

Model / T | 15 min | 30 min | 60 min | Average | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

MAE | RMSE | MAPE | MAE | RMSE | MAPE | MAE | RMSE | MAPE | MAE | RMSE | MAPE | |

Our Model (00-06) | 2.37 | 3.57 | 4.30% | 2.41 | 3.67 | 4.44% | 2.45 | 3.77 | 4.53% | 2.38 | 3.63 | 4.37% |

Graph WaveNet (00-06) | 2.41 | 3.57 | 4.38% | 2.45 | 3.68 | 4.50% | 2.52 | 3.75 | 4.61% | 2.44 | 3.65 | 4.47% |

Our Model (06-12) | 2.55 | 5.31 | 6.93% | 3.07 | 6.72 | 8.82% | 3.68 | 8.16 | 10.93% | 3.01 | 6.51 | 8.63% |

Graph WaveNet (06-12) | 2.67 | 5.43 | 7.55% | 3.19 | 6.79 | 9.42% | 3.73 | 8.01 | 11.28% | 3.11 | 6.55 | 9.15% |

Our Model (12-18) | 3.14 | 6.16 | 9.34% | 3.80 | 7.69 | 12.02% | 4.55 | 9.29 | 15.53% | 3.72 | 7.47 | 11.88% |

Graph WaveNet (12-18) | 3.29 | 6.23 | 10.25% | 3.80 | 7.69 | 12.41% | 4.48 | 8.85 | 15.39% | 3.77 | 7.31 | 12.28% |

Our Model (18-24) | 2.34 | 4.77 | 5.69% | 2.72 | 5.86 | 7.00% | 3.21 | 7.10 | 8.60% | 2.69 | 5.75 | 6.91% |

Graph WaveNet (18-24) | 2.41 | 4.83 | 6.09% | 2.78 | 5.87 | 7.47% | 3.19 | 6.90 | 8.99% | 2.73 | 5.71 | 7.33% |

Our Model (Impeded Interval) | 5.72 | 9.89 | 23.80% | 7.57 | 12.91 | 33.99% | 9.63 | 15.84 | 44.23% | 7.38 | 12.78 | 32.39% |

Graph WaveNet (Impeded Interval) | 6.21 | 10.21 | 27.18% | 7.87 | 12.94 | 35.68% | 9.62 | 15.50 | 45.73% | 7.64 | 12.74 | 34.87% |

DCRNN (Impeded Interval) | 6.34 | 10.68 | 27.41% | 8.19 | 13.64 | 36.85% | 10.11 | 16.43 | 47.61% | 7.95 | 13.45 | 35.93% |

### STGRAT Settings

STGRAT predicts the speeds of the next 12 time steps from present time (5-minute interval, 1-hour in total) based on the input speeds of the previous 12 time steps (5-minute interval, 1-hour in total). We train a four-layer spatio-temporal attention model ($H=4$, ${d}_{model}=128$) and use the scheduled sampling method (?) to reduce discrepancy between the training and testing phase. We also utilize an inverse sigmoid decay scheduled sampling function.

For optimization, we apply Adam-warmup optimizer (?) and set the warmup step size and batch size as 4,000 and 20 respectively. We use dropout (rate: $0.3$) (?) on the inputs of every sub-layer and on the attention weights. We initialize the parameters by using Xavier initialization (?) and use a uniform distribution $U(1,6)$ to initialize the weights of the diffusion prior. LINE (?) is used for node embedding (dimension: 64) which takes two minutes for training with 20 threads. We set the edge weight as a distance between two connected nodes, using Vector Auto-Regression (VAR) for correlation computation. Finally, we set the learning rate and training samples as 0.025 and 100 respectively.

### Experiment Results

In this section, we compare the performance of STGRAT with six baseline models, including state-of-the-art deep learning models [Graph Convolution based RNN (GCRNN), Diffusion Convolution based RNN (DCRNN) (?), Gated Attention Networks (GaAN) (?), STGCN (?), Graph WaveNet (?) and HyeperST (?)]. As we observed in our experiment that the performance of reproduced DCRNN on the METR-LA dataset is better than that in the original work (?), we use the reproduced result for comparison. Other than DCRNN, we could not achieve better performance than the reported results, so we take the original results for comparison. If results are not available, we denote as ‘-’. In the experiment, we measure the accuracy of the models using absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). The task is to predict the traffic speed 15, 30 and 60 minutes at present time. We also report average scores of three forecasting horizons on the dataset.

Table 1 shows the experiment results of both datasets, where we observe that STGRAT achieves the state-of-the-art performance in average scores. In particular, STGRAT excels at predicting speed after 15 and 30 minutes. STGRAT shows higher accuracy than GaAN, which is based on undirected spatial attention, but neglects the proximity information. STGRAT also achieves higher performance, compared to DCRNN and GraphWavenet, demonstrating that our spatial and temporal attention mechanism is more effective than that of the the two models for predicting short-term future sequences.

### Experiments with Cases

We further evaluate STGRAT in several perspectives. First, we compare the forecasting performance of the models in separate time ranges. We do this because traffic and congestion patterns in a city dynamically change based on time. For example, some roads around residential areas are congested during regular rush hours, while those around industrial complexes are congested from late night to early morning (?). Table 2 shows the experiment results with four time ranges (00:00–05:59, 06:00–11:59, 12:00–17:59, and 18:00–23:59) and we find that STGRAT shows better performance than Graph WaveNet.

There are several factors that affect road speeds, such as rush hours and vehicle accidents. When these factors occur, traffic speeds tend to decrease; and then increase as the factors are alleviated. As these factors tend to occur frequently, it is important for models to effectively capture the relevant temporal dynamics. To evaluate how STGRAT adapts to speed changes, we first extract intervals of speeds from the data where road speeds change rapidly. We use “ruptures” (?), an algorithm for computing changing points in a non-stationary signal. Fig. 3 shows the extracted intervals of two example roads where the y-axis represents speed and the x-axis represents the sequence of intervals. The numbers below each interval indicate the lengths in minutes. To find the intervals related to traffic congestion, we filter out intervals in which the slowest speeds are faster than 20 mph. After filtering, the final set of intervals for our experiment constitutes 21.23%, 28.39%, 29.38%, and 16.95% of each time range respectively. Table 2 (last row) shows the performance comparison between STGRAT, Graph WaveNet and DCRNN, and we observe that our model better captures the temporal dynamics of speed.

## Qualitative Analysis

Several methods exist for traffic forecasting, yet they rarely present interpretation of how deep learning models make decisions. In this section, we describe our model’s prediction process by analyzing the attention dynamics. Fig. 4 A1 shows Node 115’s speed changes, where the x- and y-axis represent time and speed (mph) respectively. We choose this node for demonstration due to its typical congestion pattern at evenings. As shown in Fig. 4 A1, this road was congested from 16:50 and congestion alleviated around 19:35 (i.e., two hours of congestion). The heatmaps of Fig. 4 (B1–B3) provide information for the last layer’s spatial attention in different times (B1–T1, B2–T2, B3–T3). In the heatmaps, the y-axis at each head (in-out-in-out degrees) represents 12 sequences, and the x-axis indicates the 207 nodes and the sentinel vector (last column).

First, we find that STGRAT gave more attention to six specific nodes (54, 122, 141, 168, 175, and 201) than others when the road was experiencing light traffic (T1 in Fig. 4, A1). Then, as time evolved from T1 to T2, Nodes 54 and 175 gradually received less attention, and Nodes 52, 111, 145, and 201 earned more attention, as shown in Fig. 4 B2. A notable observation at this point is that STGRAT attended to Node 145 and Node 201 (i.e., dark blue bars) more than other nodes. To investigate, we first review Node 145’s speed records (Fig. 4, A2) and find Node 145’s speed tended to precede that of Node 115 by about 30 minutes. Checking correlations between two nodes based on Vector Auto-Regression, we find that Node 115’s speed pattern was highly correlated with that of Node 145 (correlation: 0.59), while it was less correlated with other nodes (e.g., correlation: 0.48 with Node 52). Checking Node 201, we see that the correlation between Node 115 and Node 201 was also high (correlation.: 0.56). An interesting point here is that while both Node 145 and Node 201 is highly correlated with Node 115, the distances from the two nodes to Node 115 are different–0.9 miles between Node 115 and Node 145 and 7.64 miles between Node 115 and Node 201. We believe this is the result of learning spatial relationships–STGRAT chose to attend to two highly correlated nodes at different distances, when the road condition changed from light to heavy traffic.

When the traffic congestion alleviated at 19:35 (Fig. 4, T3), STGRAT attended to its sentinel vector of Head#1 (In), as shown in Fig. 4 B3. This implies that STGRAT decided to utilize the existing features extracted from 18:45 to 19:05. Note that in the sentinel vector, index 0 (top) means the time step of 55 minutes ago and index 11 (bottom) represents the present time step. STGRAT also attended to Node 125 (correlation: 0.43) for updating spatial dependency from a neighboring road (2.7 miles away from Node 115). We see a similar behavior in Head#4 (Out)–STGRAT focused on the sentinel vector, while attending to a distant Node 168 (correlation: 0.59, 7.8 miles away from Node 115).

Next, we analyze the first layer’s second head of encoder, decoder, and encoder-decoder (E-D) temporal attention (Fig. 4, D1). Note that Fig. 4 C shows how the input and output sequences are mapped with each axis of the attention heatmaps in D1–D3. We see from Fig. 4 D1 that when Node 115 was not congested (T1), the encoder’s temporal attention and decoder’s temporal attention were equally spread out across all time steps. However, it is interesting that the encoders’ temporal attention gradually became divided into two regions (top-left and bottom-right) as the road became congested, as shown in a series of the heatmaps in Fig. 4 D2. Here, the top-left region represents past unimpeded condition’s information, while the other region means current impeded condition’s information. We believe this divided attention is reasonable, as STGRAT needed to consider two possible future directions at the same time–one with static congestion and another with a changing condition, possible with a recovering speed. It is notable that the first column of the decoder’s temporal attention became darker (D2), implying that STGRAT attended recent information for making predictions in the impeded condition. Overall, STGRAT uses the attention module in an adaptive manner with evolving traffic conditions for effectively capturing temporal dynamics of traffic.

## Conclusion

In this work, we present STGRAT with a novel spatial and temporal attention for accurate traffic speed prediction. Node attention captures the spatial correlation among roads, utilizing graph structure information, while temporal attention captures the temporal dynamics of the road network by directly attending to features in long sequences. The experiment results of the METR-LA and PEMS-BAY datasets indicate that STGRAT achieves the state-of-the-art performance compared to existing methods, including recent deep learning models. We also show that STGRAT records higher accuracy than the baseline models in cases when speeds abruptly change. Lastly, we visualize when and where STGRAT attends for making predictions during traffic congestion. For future work, we plan to conduct more experiments using STGRAT with different spatio-temporal domains and datasets, such as air quality data.