Reinforcement Learning of Spatio-Temporal Point Processes

  • 2020-09-05 03:06:41
  • Shixiang Zhu, Shuang Li, Zhigang Peng, Yao Xie
  • 0

Abstract

We present a novel Neural Embedding Spatio-Temporal (NEST) point processmodel for spatio-temporal discrete event data and develop an efficientimitation learning (a type of reinforcement learning) based approach for modelfitting. Despite the rapid development of one-dimensional temporal pointprocesses for discrete event data, the study of spatial-temporal aspects ofsuch data is relatively scarce. Our model captures complex spatio-temporaldependence between discrete events by carefully design a mixture ofheterogeneous Gaussian diffusion kernels, whose parameters are parameterized byneural networks. This is the key that our model can capture intricate spatialdependence patterns and yet still lead to interpretable results as we examinemaps of Gaussian diffusion kernel parameters. The imitation learning modelfitting for NEST is more robust since it directly measures the divergencebetween the empirical distributions between the training data and themodel-generated data. Moreover, our imitation learning-based approach enjoyscomputational efficiency due to the explicit characterization of the rewardfunction related to the likelihood function; furthermore, the likelihoodfunction under our model enjoys tractable expression due to Gaussian kernelparameterization. Experiments based on real data show our method's goodperformance relative to the state-of-the-art and the good interpretability ofNEST's result.

 

Quick Read (beta)

Reinforcement Learning of
Spatio-Temporal Point Processes

Shixiang Zhu, Shuang Li, Zhigang Peng, Yao Xie
Abstract

We present a novel Neural Embedding Spatio-Temporal (NEST) point process model for spatio-temporal discrete event data and develop an efficient imitation learning (a type of reinforcement learning) based approach for model fitting. Despite the rapid development of one-dimensional temporal point processes for discrete event data, the study of spatial-temporal aspects of such data is relatively scarce. Our model captures complex spatio-temporal dependence between discrete events by carefully design a mixture of heterogeneous Gaussian diffusion kernels, whose parameters are parameterized by neural networks. This is the key that our model can capture intricate spatial dependence patterns and yet still lead to interpretable results as we examine maps of Gaussian diffusion kernel parameters. The imitation learning model fitting for NEST is more robust since it directly measures the divergence between the empirical distributions between the training data and the model-generated data. Moreover, our imitation learning-based approach enjoys computational efficiency due to the explicit characterization of the reward function related to the likelihood function; furthermore, the likelihood function under our model enjoys tractable expression due to Gaussian kernel parameterization. Experiments based on real data show our method’s good performance relative to the state-of-the-art and the good interpretability of NEST’s result.

Spatio-temporal point processes, Generative model, Imitation learning

I Introduction

Spatio-temporal event data has become ubiquitous, emerging from various applications such as social media data, crime events data, social mobility data, and electronic health records. Such data consist of a sequence of times and locations that indicate when and where the events occurred. Studying generative models for discrete events data has become a hot area in machine learning and statistics: it reveals of pattern in the data, helps us to understand the data dynamic and information diffusion, as well as serves as an important step to enable subsequent machine learning tasks.

Point process models (see [Reinhart2018] for an overview) have become a standard choice for generative models of discrete event data. In particular, the self and mutual exciting processes, also known as the Hawkes processes, are popular since they can capture past events’ influence on future events over time, space, and networks.

Despite the rapid development of one-dimensional temporal point processes models for discrete event data, the study focusing on spatial-temporal aspects of such data is relatively scarce. The original works of [Ogata1988, Ogata1998] develop the so-called ETAS model, which is still widely used, suggesting an exponential decaying diffusion kernel function. This model captures the seismic activities’ mechanism and is convenient to fit, as the kernel function is homogeneous at all locations with the same oval shape (Fig. 2). However, these classical models for spatio-temporal event data (usually statistical models in nature) tend to make strong parametric assumptions on the conditional intensity.

Fig. 1: A motivating example of seismic activities: four major earthquakes and their aftershocks occurred in New Madrid, MO., in the United States since 1811. The blue triangles represent the major earthquakes, and the dotted circles represent the estimated aftershock regions suggested by ETAS. The small red dots represent the actual aftershocks caused by the major earthquakes. We can observe that the locations of actual aftershocks are related to the geologic structure of faults, and the vanilla ETAS model may not sufficiently capture such complex spatial dependence patterns over time.
Fig. 2: To illustrate the difference in the nature of ETAS and proposed NEST models, we show two series of events sequentially generated in a square region 𝒮=[-1,+1]×[-1,+1] within the time horizon [0,T] by ETAS and NEST, respectively. The colored balls represent events generated by the corresponding model. Snapshots were taken at time t1,t2,t3,t4 indicated by the black smaller dots to show the progression of the spatial intensity λ*(t,s),s𝒮 through time for these two sequences. The brighter region in the snapshots represents a higher intensity value and is more likely to generate the next event. As self-exciting spatio-temporal point processes, the occurrence of a new event will raise the intensity in the local region instantaneously. Then the intensity of this region will decay and diffuse to the surrounding region over time.

However, in specific scenarios, the simplifying spatio-temporal model based on ETAS may lack flexibility. It does not capture the anisotropic spatial influence and cannot capture the complex spatial dependence structure. Take earthquake event data as an example, consisting of a sequence of records of seismic activities: their times and locations. The aftershocks are minor seismic activities that are trigger by the major earthquakes. According to the study11 1 https://www.nature.com/news/2009/091104/full/news.2009.1062.html, it has been shown that most of the recent seismic events occurred in New Madrid, MO, are aftershocks of four earthquakes of magnitude 7.5 in 1811. As shown in Fig. 1, the distribution of the minor seismic activities is in a complex shape (clearly not “circles” or anisotropic), and the spatial correlation between seismic activities is related to the geologic structure of faults through the complex physical mechanism and usually exhibits a heterogeneous conditional intensity. For instance, most aftershocks either occur along the fault plane or along other faults within the volume affected by the mainshock’s strain. This creates a spatial profile of the intensity function that we would like to capture through the model, such as the direction and shape of the intensity function at different locations, to provide useful information to geophysicists’ scientific study.

On the other hand, when developing spatio-temporal models, we typically want to generate some statistical interpretations (e.g., temporal correlation, spatial correlation), which may not be easily derived from a complete neural network model. Thus, generative model based on specifying conditional intensity of point process models is a popular approach. For example, recent works [Du2016, Mei2017, Li2018, Upadhyay2018, Xiao2017A, Xiao2017B, Zhu2019B] has achieved many successes in modeling temporal event data (some with marks) which are correlated in time. It remains an open question on extending this type of approach to include the spatio-temporal point processes. One challenge is how to address the computational challenge associated with evaluating the log-likelihood function (see expression (4)). This can be intractable for the general model without a carefully crafted structure since it requires the integration of the conditional intensity function in a continuous spatial and time-space. Another challenge is regarding how to develop robust model-fitting methods that do not rely too much on the modeling assumption.

In this paper, we present a novel point-process based model for spatio-temporal discrete event data and develop an efficient imitation learning (a type of reinforcement learning) based approach for model fitting. Our proposed NEST model tackles flexible representation for complex spatial dependence, interpretability, and computational efficiency, through meticulously designed neural networks with embedding capturing spatial information. We generalize the idea of using a Gaussian diffusion kernel to model spatial correlation by introducing the more flexible heterogeneous mixture of Gaussian diffusion kernels with shifts, rotations, and non-isotropic shapes. Such a model can still be efficiently represented using a handful of parameters (compared with a full neural network model such as convolutional neural networks (CNN) over space). The Gaussian diffusion kernels are parameterized by neural networks, which allows the kernels to vary continuously over locations. This is the key that our model can capture intricate spatial dependence patterns and yet still lead to interpretable results as we examine maps of Gaussian diffusion kernel parameters. As shown in Fig. 2, our model is able to represent arbitrary diffusion shape at different locations in contrast to ETAS developed by [Ogata1981, Ogata1988, Ogata1998].

Second, we develop computationally efficient approaches for fitting the NEST model based on imitation learning [Gretton2007, Li2018], and compared it with the maximum likelihood estimation method. The imitation learning model fitting for NEST is more flexible and robust since it directly measures the divergence between the empirical distributions between the training data and the model-generated data. Moreover, our imitation learning-based approach enjoys computational efficiency due to the explicit characterization of the reward function related to the likelihood function; furthermore, the likelihood function under our model enjoys tractable expression due to Gaussian kernel parameterization. Experiments based on real data show our method’s good performance relative to the state-of-the-art and NEST results’ interpretability.

The rest of the paper is organized as follows. We start by reviewing related literature. Then Section II introduces the background of spatio-temporal point processes and classic models. Section III presents our proposed NEST model. Section IV presents the imitation learning framework for mode fitting. Section V contains experiments based on synthetic and real data. Finally, Section VI concludes the paper with discussions.

I-A Related work

Existing literature on spatial-temporal point process modeling usually makes simplified parametric assumptions on the conditional intensity based on the prior knowledge of the processes. Landmark works [Ogata1998, Ogata1988] suggest an exponential decaying kernel function. This model captures seismic activities’ mechanism to a certain degree and is easy to learn, as the kernel function is homogeneous at all locations. However, in applications to other scenarios, such a parametric model may lack flexibility.

Recently, another approach to obtain generative models for point processes is based on the idea of imitation learning and reinforcement learning. Good performance has been achieved for modeling temporal point processes [Li2018] and marked temporal point processes [Upadhyay2018]. The premise is to formulate the generative model as a policy for reinforcement learning and extract policy from sequential data as if it were obtained by reinforcement learning [Richard1998] followed by inverse reinforcement learning [Andrew2000, Ho2016]. In this way, the generative model is parameterized by neural networks [Du2016, Duan2016, Mei2017, Volodymyr2016]. However, by representing the conditional intensity entirely using neural networks may lacks certain interpretability. Compressing all the information by neural networks may also miss the opportunity to consider prior knowledge about the point process. Also it remains an open question on how to extend this approach to include the spatial component as well. The spatial-temporal point processes are significantly more challenging to model than the temporal point processes only since the spatial dependence is much more complex to capture than the one-dimensional temporal dependence.

Other works such as [Short2010-1, Short2010-2] have achieved some success in modeling complicated spatial patterns of crime by considering the spatial influence as hotspots with full parametric models. Some works [Fox2016, Zipkin2016, Lewis2012, Xiao2017B] consider the event sequences as temporal point processes without incorporating spatial information leverages non-parametric approaches. As a compromise, [Zhu2019A, Mohler2011, Mei2017] seek to learn the spatial and temporal pattern jointly by multivariate point processes with discretized the spatial structure.

II Background

In this section, we revisit the definitions of the spatio-temporal point processes (STPP) and the commonly used STPP models.

II-A Spatio-temporal point processes (STPP)

STPP consists of an ordered sequence of events in time and location space. Assume our observation duration is T and the data is given by {a1,a2,,aN(T)}, which sequences of events ordered in time. Each ai is a spatio-temporal tuple ai=(ti,si), where ti[0,T) is the time of the event and si𝒮2 is the associated location of the i-th event. We denote by N(T) the number of the events in the sequence between time [0,T) and in the region 𝒮.

The joint distribution of a STPP is completely characterized by its conditional intensity function λ(t,s|t). Given the event history t={(ti,si)|ti<t}, the intensity corresponds to the probability of observing an event in an infinitesimal around (t,s):

λ(t,s|t)=limΔt,Δs0𝔼[N([t,t+Δt)×B(s,Δs))|t]Δt×|B(s,Δs)|,

where N(A) is the counting measure of events over the set A[0,T)×𝒮, B(s,Δs) denotes a Euclidean ball centered at s with radius Δs, || is the Lebesgue measure. Below, for notational simplicity, we denote the conditional intensity function λ(t,s|t) as λ*(t,s).

For instance, a type of self-exciting point processes, Hawkes processes [Hawkes1971] has been widely used to capture the mutual excitation among temporal events. Assuming that influence from past events are linearly additive for the current event, the conditional intensity function of a Hawkes process is defined as

λ(t|t)=λ0+ti<tν(t-ti),

where λ00 is the background intensity of events, ν()0 is the triggering function that captures temporal dependencies of the past events. The triggering function can be chosen in advance, for instance, in the one-dimensional case ν(t-ti)=αexp{-β(t-ti)}.

II-B ETAS model

The most commonly used kernel function for spatio-temporal point processes is the standard diffusion kernel function proposed by the Epidemic Type Aftershock-Sequences (ETAS) modeling [Musmeci1992], which was originally introduced to model the earthquake events, but now widely used in many other applications [Ogata1988, Ogata1998, Zhu2019A, Fox2016, Lewis2012, Zipkin2016]. ETAS model assumes that the influence over time and space decouples, and the influence decays exponentially over time, and over space decay only depends on distance (thus, it is a spherical model). Thus, ETAS model does not capture the anisotropic shape of kernel. This is a simplification and may not capture complex spatial dependence. ETAS model can also deal with scalar-valued marks (e.g., the magnitude of earthquakes), which we will not discuss here while focusing on capturing spatio-temporal interactions. One of the reasons that ETAS is a popular model is due to its interpretability.

III Proposed model

Fig. 3: An example of kernel used in the NEST model: σx, σy, ρ defines a Gaussian component in the heterogeneous Gaussian diffusion kernel. The right hand side is the conditional intensity at time t, where two points occurred at location (x1,y1) and (x2,y2) have triggered the two diffusions (the bright spots) with different shapes. Note that this Gaussian component is specified by parameters (mean, covariance) that vary continuously over space, which are themselves represented by neural networks.

To capture the complex and heterogenous spatial dependence in discrete events, we present a novel continuous-time and continuous-space point process model, called the Neural Embedding Spatio-Temporal (NEST) model. The NEST uses the flexible neural network structure to represent the conditional intensity’s spatial heterogeneity while retaining interpretability as a semi-parametric statistical model.

III-A Spatially heterogeneous Gaussian diffusion kernel

We start by specifying the conditional probability of the point process model, as it will uniquely specify the joint distribution of a sequence of events. First, to obtain a similar interpretation as the ETAS model [Ogata1988], we start from a similar parametric form for the conditional intensity function

λ*(t,s)=λ0+j:tj<tν(t,tj,s,sj), (1)

where λ0>0 is a constant background rate, ν is the kernel function that captures the influence of the past events t. The form of the kernel function ν determines the profile of the spatio-temporal dependence of events.

We assume the kernel function takes the form of a standard Gaussian diffusion kernel over space and decays exponential over time. To enhance the spatial expressiveness, we adopt a mixture of generalized Gaussian diffusion kernels, which is location dependent. Thus, it can capture more complicated spatio-nonhomogeneous structure. Given all past events t, we define

ν(t,t,s,s)=k=1Kϕs(k)g(t,t,s,s|Σs(k),μs(k)), (2)
t<t,s𝒮,

where {μs(k),Σs(k)} are the mean and covariance matrix parameters (which we will specify later), K is the hyper-parameter that defines the number of components of the Gaussian mixture; ϕs(k):𝒮 (form specified later) is the weight for the k-th Gaussian component that satisfies k=1Kϕs(k)=1, s𝒮. In the following discussions, we omit the superscript k for the notational simplicity.

Now each Gaussian diffusion kernel is defined as

g(t,t,s,s|Σs,μs)=Ce-β(t-t)2π|Σs|(t-t) (3)
exp{-(s-s-μs)TΣs-1(s-s-μs)2(t-t)},

where β>0 controls the temporal decay rate, C>0 is constant to control the magnitude, μs=[μx(s),μy(s)]T, and Σs denote the mean and covariance parameters of the diffusion kernel (which may vary over time t and “source” locations s𝒮); || denotes the determinant of a covariance matrix; Σs is defined as a positive semi-definite matrix

Σs=(σx2(s)ρ(s)σx(s)σy(s)ρ(s)σx(s)σy(s)σy2(s)).

The parameters μs and Σs control the shift, rotation and shape of each Gaussian component. As shown in Fig. 3, parameters σx(s),σy(s),ρ(s) may vary according to the location s and jointly control the spatial structure of diffusion at s. The μx(s),μy(s) define the offset of the center of the diffusion from the location s. Note that μx:𝒮, μy:𝒮, σx:𝒮+, σy:𝒮+, ρ:𝒮(-1,1) are the non-linear mappings that project location s to the parameters. To capture intricate spatial dependence, we represent such non-linear mappings from location to the parameters of Gaussian components (defined by (3)) using neural networks.

III-B Comparison with ETAS model

In the standard ETAS model, the kernel function is defined as a single component whose parameters do not vary over space and time: the kernel function (2) is simplified to ν(t,t,s,s)=g(t,t,s,s|Σ,0), where the spatial and temporal parameters are invariant Σdiag{σx2,σy2} and μs0. Compared with the standard Gaussian diffusion kernel used in ETAS, here we introduce additional parameters ρ,μx,μy that allows the diffusion to shift, rotate, or stretch in the space. An example of the comparison of the spatio-temporal kernels between ETAS and NEST models is presented in Fig. 2.

III-C Deep neural network representation

Recall that parameters in each Gaussian component are determined by a set of non-linear mappings {ρ(s),σx(s),σy(s),μx(s),μy(s)}. We capture these non-linear spatial dependencies using a deep neural network through a latent embedding, which we explain below.

Assume the spatial structure at one location s is summarized by an latent embedding vector 𝒉(s)d, where d is the dimension of the embedding. The parameters of a Gaussian component at location s can be represented by a single-layer neural network where the input of the network is the latent embedding 𝒉(s). This layer is specified as follows. The mean parameters are specified by

μx(s) =Cx(sigm(𝒉(s)TWμx+bμx)-1/2),
μy(s) =Cy(sigm(𝒉(s)TWμy+bμy)-1/2),

where Cx,Cy are preset constants that control the shift of the center of Gaussian components from location s, sigm(x)=1/(1+e-x) is the sigmoid function which gives an output in the range [0, 1]. The variance and the correlation parameters are specified by

σx(s) =softplus(𝒉(s)TWσx+bσx),
σy(s) =softplus(𝒉(s)TWσy+bσy),
ρ(s) =2sigm(𝒉(s)TWρ+bρ)-1,

where softplus=log(1+ex) is a smooth approximation of the ReLU function. The parameters in the network θw={Wσx,Wσy,Wμx,Wμy,Wρ} and θb={bσx,bσy,bμx,bμy,bρ} are weight-vectors and biases in the output layer of the Gaussian component. Note that we omitted the superscript k of each parameter in the discussion above for the notational simplicity, but it should be understood that each Gaussian component will have its own set of parameters. Finally, the weight of each component is given by ϕs(k), which is defined through the soft-max function

ϕs(k)=e𝒉(s)TWϕ(k)/κ=1Ke𝒉(s)TWϕ(κ).

where Wϕ(k)d is a weight vector to be learned. The latent embedding 𝒉(s) is characterized by another neural network defined as 𝒉(s)=ψ(s|θh), where ψ():2d is a fully-connected multi-layer neural network function taking spatial location s as input, θh contains the parameters in this neural network. In our experiments later, we typically use three-layer neural networks where the width of each layer is 64.

In summary, the NEST with heterogeneous Gaussian mixture diffusion kernel is jointly parameterized by θ={β,θh,{θw(k),θb(k)}k=1,,K}. The architecture is summarized in Fig. 4. In the following, we denote the conditional intensity as λθ*(s,t) defined in (1), to make the dependence on the parameter more explicit. Note that the Gaussian diffusion kernels’s parameters vary continuously over location, and are represented by flexible neural networks; this is the key that our model can capture complex spatial dependence patterns.

Fig. 4: An illustration for NEST’s neural network architecture based on a mixture of heterogeneous Gaussian diffusion kernel. Note that each Gaussian kernel is specified by neural networks, which can be viewed as summarizing the latent embedding information from data.

IV Computationally efficient learning

In this section, we define two approaches to learn parameters for the NEST model: (i) the maximum likelihood-based approach, and (ii) the imitation learning-based approach, using a policy parameterized by the conditional intensity and a non-parametric reward function based on the maximum mean discrepancy (MMD) metric. In contrast to the maximum likelihood, the imitation learning approach is more robust to model misspecification, as it measures the model’s “actual performance” by examining the divergence between the empirical distribution of the training data and the counterfeits generated by the model. Fig. 5 illustrates the basic ideas of two approaches.

IV-A Maximum likelihood approach

The model parameters can be estimated via maximum likelihood estimate (MLE) since we have the explicit form of the conditional intensity function. Given a sequence of events 𝒂={a0,a1,,an} occurred on (0,T]×𝒮 with length n, where ai=(ti,si), the log-likelihood is given by

(θ)=(i=1nλθ*(ti,si))-0T𝒮λθ*(τ,r)𝑑r𝑑τ. (4)

A crucial step to tackle the computational challenge is to evaluate the integral in (4). Here, we can obtain a closed-form expression for the likelihood function, using the following proposition. This can reduce the integral to an analytical form, which can be evaluated directly without numerical integral.

Proposition 1 (Integral of conditional intensity function).

Given ordered event times 0=t0<t1<<tn<tn+1=T, for i=0,,n,

titi+1𝒮λθ*(τ,r)𝑑r𝑑τ=λ0(ti+1-ti)|𝒮|+(1-ϵ)Cβj:tj<tiCj(e-β(ti-tj)-e-β(ti+1-tj)),

where

Cj=k=1Kϕsj(k)σx(k)(sj)σy(k)(sj)|Σsj(k)|1/2,

and the constant

ϵ=maxj:tj<ti+1titi+1𝒮g(τ,tj,r,sj)𝑑r𝑑τtiti+12g(τ,tj,r,sj)𝑑r𝑑τ.

Since spatially, the kernel g is a Gaussian concentrated around s, when 𝒮 is chosen sufficiently large, and most events si locates in the relatively interior of 𝒮, we can ignore the marginal effect and ϵ can become a number much smaller than 1. Due to the decreased activity in the region’s edges, the boundary effect is usually negligible [Ogata1998]. Define t0=0 and tn+1=T. Since

0T𝒮λθ*(τ,r)𝑑r𝑑τ=i=0n+1titi+1𝒮λθ*(τ,r)𝑑r𝑑τ,

using Proposition 1 we can write down the integral in the log-likelihood function in closed-form expression.

Finally, the optimal parameters trained the maximum-likelihood is thus obtained by θ^=argmaxθlog(θ). Due to the non-convex nature of this problem, we solve the problem by stochastic gradient descent.

IV-B Imitation learning approach

We now present a more flexible, imitation learning framework for model fitting. The major benefit of this approach is that it does not rely on the likelihood function model and thus is more robust to model misspecification. The model’s performance is evaluated by the maximum mean discrepancy (MMD) statistic, which is a non-parametric metric that measures the divergence between two empirical distributions.

(a) Maximum likelihood estimation (MLE) for NEST
(b) Imitation learning (IL) for NEST
Fig. 5: Comparison between the maximum likelihood and the imitation learning approaches. The main difference is that the MLE measure the “likelihood” of the data under a model and thus can be more susceptible to model misspecification, whereas our IL approach measures the actual divergence between the training data and the sequence generated from the model using MMD statistic without relying on model assumptions.

The setting of imitation learning is as follows. Assume a learner takes actions a:=(t,s)[0,T)×𝒮 sequentially in an environment according to certain policy, and the environment gives feedbacks (using a reward function) to the learner via observing the discrepancy between the learner actions and the demonstrations (training data) provided by an expert. In our setting, both the learner’s actions and the demonstrations are over continuous-time and continuous space, which is a distinct feature of our problem.

IV-B1 Policy parameterization

Our desired learner policy is a probability density function of possible actions given the history t. We define such function as π(t,s):[0,T)×𝒮[0,1], which assigns a probability to the next event at any possible location and time. Let the last event time before T be tn, and thus the next possible event is denoted as (tn+1,sn+1). The definition of the policy is

π(t,s)=((tn+1,sn+1)[t,t+dt)×B(s,Δs)|t).

We will show that the policy can be explicitly related to the the conditional intensity function of the point process.

Lemma 1.

A spatial temporal point process which generate new samples according to π(t,s) has the corresponding conditional intensity function

λθ*(t,s)=π(t,s)1-0t𝒮π(τ,r)𝑑τ𝑑r. (5)

From Lemma 1, we can obtain the learner policy as the following proposition:

Proposition 2 (Learner policy related to conditional intensity).

Given the conditional intensity function λθ*(t,s), the learner policy of a STPP on [0,T)×S is given by

πθ(t,s)=λθ*(t,s)exp{-tnt𝒮λθ*(τ,r)𝑑τ𝑑r}.

Thus, this naturally gives us a policy parameterization in a principled fashion: the learner policy πθ is parameterized by θ based on the proposed heterogeneous Gaussian mixture diffusion kernel in Section III. Note that using Proposition 1, which gives an explicit formula for the integral required in the exponent, the policy can be exactly evaluated even a deep neural network is included in the model.

IV-B2 Imitation learning objective

Now assume the training data is generated by an expert policy πE, where the subscript E denotes “expert”. Given a reward function r(), the goal is to find an optimal policy that maximize the expected reward

maxθJ(θ):=𝔼𝒂πθ[i=1nar(ai)], (6)

where 𝒂={a1,,anα} is one sampled roll-out from policy πθ. Note that na can be different for different roll-out samples.

IV-B3 Reward function

Consider the minimax formulation of imitation learning, which chooses the worst-case reward function that will give the maximum divergence between the rewards earned by the expert policy and the best learner policy:

maxr(𝔼ϵπ[i=1ner(ei)]-maxπθ𝒢𝔼απθ[i=1nar(ai)]),

where 𝒢 is the family of all candidate policies πθ and is the family class for reward function r in reproducing kernel Hilbert space (RKHS).

We adopt a data-driven approach to solve the optimization problem and find the worst-case reward. This is related to inverse reinforcement learning; we borrow the idea of MMD reward in [Kim2013, Gretton2007, Dziugaite2015, Li2018] and generalize it from a simple one-dimensional temporal point process to the more complex spatio-temporal setting. Suppose we are given training samples {𝒆j}, j=1,2,,ME, which are the demonstrations provided by the expert πE, where each 𝒆j={e0(j),e1(j),,enj(j)} denotes a single demonstration. Also, we are given samples generated by the learner: let the trajectories generated by the learner πθ denoted by {𝒂j}, j=1,2,,ML, where each trajectory 𝒂j={a0(j),a1(j),,anj(j)} denotes a single action trajectory. We will discuss how to generate samples in the following sub-section.

Using a similar argument as proving Theorem 1 in [Li2018] and based on kernel embedding, we can obtain analytical expression for the worst-case reward function based on samples, which is given by

r^(a)1MEj=1MEi=1njk(ei(j),a)-1MLk=1MLl=1nkk(al(k),a). (7)

where k(,) is a reproducing kernel Hilbert space (RKHS) kernel. Here we use Gaussian kernel function, which achieves good experimental results in both synthetic data and real data.

Using samples generated from learner policy, the gradient of J(θ) with respect to θ can be computed by using policy gradient with variance reduction [Li2018],

θJ(θ)1MEj=1ME[i=1nj(θlogπθ(ai)r^(ai))].

The gradient of policy θlogπθ(ai) can be computed analytically in closed-form, since it is specified by the conditional intensity in Proposition 2, and the conditional intensity is fully specified by the neural networks architecture of NEST as we discussed in the Section III.

\SetAlgoLinedinput θ,λ0,β,T,𝒮\[email protected]
output A set of events α ordered by time.\[email protected]
Initialize α=, t=0, s𝚞𝚗𝚒𝚏𝚘𝚛𝚖(𝒮).\[email protected]
\Whilet<T u,D𝚞𝚗𝚒𝚏𝚘𝚛𝚖(0,1); s𝚞𝚗𝚒𝚏𝚘𝚛𝚖(𝒮)\[email protected]
λ¯λ0+(τ,r)αν(t,τ,sn,r)\[email protected]
tt-lnu/λ¯\[email protected]
Compute λθ*(t,s) from (5)\[email protected]
\IfDλ¯>λθ*(t,s) αα{(t,s)}; sns\[email protected]
\algorithmcfname 1 Efficient thinning algorithm for STPP
(a) ground truth for synthetic data set 1
(b) ground truth for synthetic data set 2
(c) recovered result for synthetic data set 1
(d) recovered result for synthetic data set 2
Fig. 6: Simulation results on two sets of synthetic data. (5(a)): The ground truth of the Gaussian parameter σx,σy,ρ in the synthetic data set 1; (5(b)): The ground truth of the Gaussian parameter σx,σy,ρ in the synthetic data set 2; (5(c)): The recovered Gaussian parameters using the synthetic data set 1; (5(d)): The recovered Gaussian parameters using the synthetic data set 2.
TABLE I: Average MSE for five methods on two synthetic data sets.
Data set Random ETAS NEST+IL NEST+MLE RLPP
Synthetic data 1 (space-time) .2781 .0433 .0075 .0134 N/A
Synthetic data 2 (space-time) .3327 .0512 .0124 .0321 N/A
Synthetic data 1 (time-only) .1734 .0135 .0021 .0048 .0146
Synthetic data 2 (time-only) .2147 .0323 .0036 .0055 .0341

IV-B4 Sampling from STPP using thinning algorithm

An important step in the imitation learning approach is to generate samples from our proposed model, i.e., aπθ, given the history t. Here, we develop an efficient sampling strategy to achieve good computational efficiency. We need to sample a point tuple a=(t,s) according to the conditional intensity defined by (1). A default way to simulate point processes is to use the thinning algorithm [Gabriel2013, Daley2008]. However, the vanilla thinning algorithm suffers from low sampling efficiency as it needs to sample in the space |𝒮|×[0,T) uniformly with the upper limit of the conditional intensity λ¯ and only very few of candidate points will be retained in the end. In particular, given the parameter θ, the procedure’s computing complexity increases exponentially with the size of the sampling space. To improve sampling efficiency, we propose an efficient thinning algorithm summarized in Algorithm 1. The “proposal” density is a non-homogeneous STPP, whose intensity function is defined from the previous iterations. This analogous to the idea of importance sampling [Ogata1981].

V Experiments

In this section, we evaluate our approaches by conducting experiments on both synthetic and real data sets.

V-A Baselines and evaluation metrics

In this section, we compare our proposed Neural Embedding Spatio-Temporal (NEST) with a benchmark and several state-of-the-art methods in the field. These include (1) Random uniform that randomly makes actions in the action space; (2) Epidemic Type Aftershock-Sequences (ETAS) with standard diffusion kernel, which is currently the most widely used approach in spatio-temporal event data modeling. For ETAS, the parameters are estimated by maximum likelihood estimate (MLE); (3) reinforcement learning point processes model (RLPP) [Li2018] is for modeling temporal point process only, which cannot be easily generalized to spatio-temporal models. (4) NEST+IL is our approach using imitation learning; (5) NEST+MLE is our approach where the parameters are estimated by MLE.

To evaluate the performance of algorithms (i.e., various generative models), we adopt two performance metrics: (1) the average Mean Square Error (MSE) of the one-step ahead prediction (which, thus, is a prediction performance metric). The one-step ahead prediction is obtained by performing Algorithm 1 given the current intensity λ*(t,s) and the past events; (2) the maximum mean discrepancy (MMD) metric between the real observed sequences and the generated sequences from the models, as specified in Section IV-B (thus, this measures how good the generative model give a specific training method). For synthetic data, we also compare the recovered parameters against the true parameters, which is used for constructing the generator and generating the corresponding synthetic data.

TABLE II: Absolute MMD averaged per sequence for five methods on two real data sets.
Data set Random ETAS NEST+IL NEST+MLE RLPP
Robbery (space-time) 108.004 72.869 68.372 69.372 N/A
Seismic (space-time) 53.139 32.185 21.147 28.622 N/A
Robbery (time only) 126.782 90.607 82.189 83.750 83.134
Seismic (time only) 60.725 51.037 21.223 27.256 23.251
TABLE III: Average MSE for five methods on two real data sets.
Data set Random ETAS NEST+IL NEST+MLE RLPP
Robbery (space-time) .6323 .1425 .0503 .0649 N/A
Seismic (space-time) .2645 .0221 .0119 .0153 N/A
Robbery (time only) .4783 .0857 .0104 .0094 .0183
Seismic (time only) .1266 .0173 .0045 .0150 .0122

V-B Synthetic data

V-B1 Description of simulation settings

We first evaluate our method on two synthetic data sets. These two data sets are generated by NEST with artificial parameters shown in Fig. 5(a) and 5(b), respectively, where parameters in Fig. 5(a) are linearly related to their spatial locations and parameters in Fig. 5(b) are non-linearly related their spatial locations. In both data sets, there are 5,000 sequences with an average length of 191, and 80%, 20% sequences are used for training and testing. The time and location of events have been normalized to T=10 and 𝒮=[-1,+1]×[-1,+1]. As an ablation study, we specify the model with only one Gaussian component and three hidden layers in the neural network. Each layer contains 64 neurons, and randomly takes 40 sequences as a batch. The model is trained by both MLE, and IL approaches.

V-B2 Results

We report the average MSE of the one-step-ahead prediction in Table I for each method on both two synthetic datasets. As we can see from the table, our model (based on different training methods)(NEST+ML and NEXT+IL) outperform other methods on two synthetic data sets in terms of prediction error. Since the ground truth of the model is known, we also visualize the parameters in the Gaussian component in comparison to the true parameters. Fig. 5(a) and Fig. 5(a) show the true parameters that are used to construct the generators and generate two synthetic data sets. We fit our NEST model using these two synthetic data sets separately and obtain the corresponding parameters. Fig. 5(c) and Fig. 5(d) show the recovered parameters learned from two synthetic data sets. It shows that the spatial distribution of the recovered parameters is similar to the true parameters. It also confirms that our model can capture the underlying spatial pattern in both linear and non-linear parameter distributions.

V-C Real data

V-C1 Data description

Now we test our approaches on two real-world data sets: Atlanta 911 calls-for-services data (provide by the Atlanta Police Department to us under a data agreement) and Northern California seismic data [NCEDC2014]. For the ease of comparison, we normalize the space region of both data sets to the same space T×𝒮 where T=(0,10] and 𝒮=[-1,1]×[-1,1]. The detailed description of two data sets is as follows.

Atlanta 911 calls-for-service data. The 911 calls-for-service data in Atlanta from the end of 2015 to 2017 is provided by the Atlanta Police Department (this data set has been previously used to validate the crime linkage detection algorithm [Zhu2018-1, Zhu2018-2, Zhu2019A]). We extract 7,831 reported robbery from the data set since robbers usually follow particular modus operandi (M.O.), where criminal spots and times tend to have a causal relationship with each other. Each robbery report is associated with the time (accurate to the second) and the geolocation (latitude and longitude) indicating when and where the robbery occurred. We consider each series of robbery as a sequence.

Northern California seismic data. The Northern California Earthquake Data Center (NCEDC) provides public time series data [NCEDC2014] that comes from broadband, short period, strong motion seismic sensors, and GPS, and other geophysical sensors. We extract 16,401 seismic records that have a magnitude larger than 3.0 from 1978 to 2018 in Northern California and partition the data into multiple sequences every quarter. To test our model, we only take advantage of time and geolocation in the record.

V-C2 Results

TABLE IV: Averaged log-likelihood per sequence.
Data set ETAS NEST+MLE
Robbery 22.341 34.270
Seismic 138.120 224.090
(a) Robbery seq at t1 (NEST)
(b) Robbery seq at t2 (NEST)
(c) Robbery seq at t2 (ETAS)
(d) Seismic seq at t1 (NEST)
(e) Seismic seq at t2 (NEST)
(f) Seismic seq at t2 (ETAS)
Fig. 7: Snapshots of the conditional intensity for two real data sequences (crime events in Atlanta and seismic events): (6(a), 6(b), 6(c)): Snapshots of the conditional intensity for a series of robberies in Atlanta. (6(d), 6(e), 6(f)): Snapshots of the conditional intensity for a series of earthquakes in North of California. (6(b), 6(a), 6(e), 6(d)) are generated by NEST+IL (K=5) and (6(f), 6(c)) are generated by ETAS, respectively. The color depth indicate the value of intensity. The region in darker red has higher risk to have next event happened again. The dots on the maps represent the occurred events in which dark blue represents the newly happened events, and white represents events happened at very beginning.

We first quantitatively compare our NEST+MLE and ETAS by evaluating the obtained averaged log-likelihood per sequence for both real-world data sets. As shown in Table IV, our model attains a larger average log-likelihood value on both data sets in contrast to the state-of-the-art (ETAS), which confirms that our model obtains a better fit on these two sets of real data. Then we compare NEST+MLE and NEST+IL with other baseline methods by measuring their average absolute MMD values and average MSE of the one-step-ahead prediction. The average absolute MMD value can be measured by computing the pairwise distances between the observed training data and the generated sequences according to (7). The absolute MMD value indicates the level of similarity of two arbitrary distributions. If two distributions are the same, then their absolute MMD value is zero. In the experiment, we randomly pick 100 pairs of generated sequences against observed sequences from the real data sets and compute their average absolute MMD value. We summarized the results of the average MDD and average MSE in Table II and Table III, respectively. The results show that both our methods NEST+IL and NEST+MLE outperform the state-of-the-art (ETAS) regarding two metrics. In addition, we also show that our method has a competitive performance even without considering spatial information in contrast to RLPP.

(a) t15.6
(b) t26.8
(c) t37.0
(d) t47.5
Fig. 8: Snapshots of the conditional intensity of a robbery sequence at four time points. The color depth indicates the value of intensity. The region in darker red has higher risk to have robbery event happened again. It also shows that robbery events at different regions have very distinctive diffusion pattern, which may be related to the geographic feature of the city.

V-C3 Interpretable conditional intensity

To interpret the spatial dependence learned using our model, we plot the conditional intensity as a heatmap over the space at a specific time frame. For the state-of-the-art, as shown in Fig. 6(c), 6(f), we can see that the ETAS captured the general pattern of the conditional intensity over the space, where regions with more events tend to have higher intensity. Comparing with the result shown in Fig. 6(a), 6(b), 6(d), 6(e), our NEST is able to capture complex spatial pattern at different locations and the shape of the captured diffusion also differ from application to application. For 911 calls-for-service data, shown in Fig. 6(a), 6(b), the spatial influence of some robbery events diffuse to the surrounding streets and the community blocks unevenly. For seismic data, shown in Fig. 6(e), 6(d), the spatial influence of some events majorly diffuses along the earthquake fault lines.

In addition, Fig. 8 shows a series of snapshots of the conditional intensity for a sequence of robberies at different time points. Each subfigure shows that events in different urban regions have various shapes of spatial influence on their neighborhoods. For example, events near the highway or railroad tend to spread their influence along the road. In contrast, events around the Midtown area tend to spread their influence more evenly to the neighboring communities.

V-D Learning efficiency

To compare the efficiencies between different learning strategies, we perform the experiments on the same data sets and record the corresponding objective values (likelihood or MMD) over iterations. It turns out that the training process of NEST + MLE is much faster (about 1 hour fast) and more stable than NEST + IL. However, NEST + IL can generate robust samples with lower absolute MMD values in contrast to the MLE (Table II) and requires less data to attain the same level of the objective value (absolute MMD). Comparing to other baseline methods, such as [Xiao2017B, Li2018], which are usually over-parameterized on simple problems, our NEST can achieve better results using a fewer number of parameters, since the deep neural network in NEST is only used to capture the non-linear spatial pattern in the embedding space.

VI Conclusion and discussions

We have presented a generative model for spatio-temporal event data by designing heterogeneous and flexible spatio-temporal point process models whose parameters are parameterized by neural networks, representing complex-shaped spatial dependence. Different from a full neural network approach, our parameterization based on Gaussian diffusion kernels can still enjoy interpretability: such as the shape of the spatial-temporal influence kernels, which are useful for deriving knowledge for domain experts such as seismic data analysis. In addition, we present a flexible model-fitting approach based on imitation learning, a form of reinforcement learning, which does not rely on exact model specification since it is based on directly measuring the divergence between the empirical distributions of the training data and the data generated from data (in contrast to MLE, which measures the “likelihood” of data under a pre-specified model structure). The robustness of the imitation learning based training approach for NEST model is demonstrated using both synthetic and real data sets, in terms of the predictive capability (measured by MSE of the conditional intenstity function, which measure the probability for an event to occur at a location in the future given the past), and the maximum-mean-divergence (measures how realistic the data generated from the model matches to the real training data). We have observed on synthetic data and real-data that the imitation learning approach will achieve a significant performance gain than maximum likelihood and more robust for training NEST model. Our NEST can have a wide range of applications: since it is a probabilistic generative model and can be used to evaluate the likelihood of a sequence and perform anomaly detection spatio-temporal event data, as well as making predictions about future incidents given the history information.

Future work may include extending to marked spatio-temporal processes, for instance, to consider the magnitude of the earthquake in the models in the current set-up. Moreover, it will be interesting to consider how to incorporate known spatial structural information in the spatio-temporal model. For instance, in the most recent earthquake model in California, geophysicists have already considered both ETAS and fault model/distribution [Field2017].

References

Proof of Proposition 1.
titi+1𝒮λθ*(τ,r)𝑑r𝑑τ=titi+1𝒮[λ0+j:tj<τk=1Kϕsj(k)g(τ,tj,r,sj)]𝑑r𝑑τ=λ0(ti+1-ti)|𝒮|+(1-ϵ)j:tj<ti+1k=1Kϕsj(k)titi+12g(τ,tj,r,sj)𝑑r𝑑τ

where |𝒮| is the area of the space region. The last equality is breaking the integral into two parts, over set 𝒮 and its complement.

The triple integral in the second term can be written explicitly by changing variable. Let q be the radius of the oval, φ be the angle, so that we have u/σx(k)(si)=qcos(φ), v/σy(k)(si)=qsin(φ), where (u,v)2 are the Cartesian coordinate. The Jacobian matrix of this variable transformation is

[σx(k)(si)cos(φ)-σx(k)(si)qsin(φ)σy(k)(si)sin(φ)σy(k)(si)qcos(φ)]

So the determinant of the Jacobian is qσx(k)(si)σy(k)(si).

Therefore, the double integral, which is independent from q and φ, can be written as

titi+12g(τ,tj,r,sj)𝑑r𝑑τ
=  titi+102π0Ce-β(τ-tj)2π|Σsj|-1/2(τ-tj)
 exp{-q22(τ-tj)}qσx(k)(sj)σy(k)(sj)dqdφdτ
=  Cσx(k)(sj)σy(k)(sj)2π|Σsj|-1/2titi+102π0e-β(τ-tj)(τ-tj)
 exp{-q22(τ-tj)}qdqdφdτ
=  Cσx(k)(sj)σy(k)(sj)2π|Σsj|-1/2titi+102πexp{-β(τ-tj)}𝑑φ𝑑τ
=  Cσx(k)(sj)σy(k)(sj)|Σsj|-1/2titi+1exp{-β(τ-tj)}𝑑τ
=  Cσx(k)(sj)σy(k)(sj)β|Σsj|-1/2(e-β(ti-tj)-e-β(ti+1-tj)).

Let

Cj=k=1Kϕsj(k)σx(k)(sj)σy(k)(sj)|Σsj(k)|1/2,

we can have

 titi+1𝒮λθ*(τ,r)𝑑r𝑑τ=λ0(ti+1-ti)|𝒮|+
 (1-ϵ)Cβj:tj<ti+1Cj(e-β(ti-tj)-e-β(ti+1-tj)),

where the constant

ϵ=maxj:tj<ti+1titi+1𝒮g(τ,tj,r,sj)𝑑r𝑑τtiti+12g(τ,tj,r,sj)𝑑r𝑑τ.

Proof of Proposition 2.

Now we can derive a consequence of the above lemma using the following simple argument. Let

F(t)=0t𝒮π(τ,r)𝑑τ𝑑r.

Then

F˙(t)=dF(t)dt=𝒮π(t,r)𝑑r.

From (5) we obtain

𝒮λθ*(t,r)𝑑r=𝒮π(t,r)𝑑r1-0t𝒮π(τ,r)𝑑τ𝑑r=F˙(t)1-F(t)=-ddt(log(1-F(t)))

Thus,

tnt𝒮λθ*(τ,r)𝑑τ𝑑r=log(1-F(tn))-log(1-F(t)).

Since F(tn)=0, we obtain

1-F(t)=exp{-tnt𝒮λθ*(τ,r)𝑑τ𝑑r}.

Also, from (5), we can write

λθ*(t,s)=π(t,s)1-F(t).

Thus

π(t,s)=λθ*(t,s)(1-F(t))=λθ*(t,s)exp{-tnt𝒮λθ*(τ,r)𝑑τ𝑑r}.