Abstract
Temporal point process is an expressive tool for modeling event sequencesover time. In this paper, we take a reinforcement learning view whereby theobserved sequences are assumed to be generated from a mixture of latentpolicies. The purpose is to cluster the sequences with different temporalpatterns into the underlying policies while learning each of the policy model.The flexibility of our model lies in: i) all the components are networksincluding the policy network for modeling the intensity function of temporalpoint process; ii) to handle varyinglength event sequences, we resort toinverse reinforcement learning by decomposing the observed sequence into states(RNN hidden embedding of history) and actions (time interval to next event) inorder to learn the reward function, thus achieving better performance orincreasing efficiency compared to existing methods using rewards over theentire sequence such as loglikelihood or Wasserstein distance. We adopt anexpectationmaximization framework with the Estep estimating the clusterlabels for each sequence, and the Mstep aiming to learn the respective policy.Extensive experiments show the efficacy of our method againststateofthearts.
Quick Read (beta)
Reinforcement Learning with Policy Mixture Model
for Temporal Point Processes Clustering
Abstract
Temporal point process is an expressive tool for modeling event sequences over time. In this paper, we take a reinforcement learning view whereby the observed sequences are assumed to be generated from a mixture of latent policies. The purpose is to cluster the sequences with different temporal patterns into the underlying policies while learning each of the policy model. The flexibility of our model lies in: i) all the components are networks including the policy network for modeling the intensity function of temporal point process; ii) to handle varyinglength event sequences, we resort to inverse reinforcement learning by decomposing the observed sequence into states (RNN hidden embedding of history) and actions (time interval to next event) in order to learn the reward function, thus achieving better performance or increasing efficiency compared to existing methods using rewards over the entire sequence such as loglikelihood or Wasserstein distance. We adopt an expectationmaximization framework with the Estep estimating the cluster labels for each sequence, and the Mstep aiming to learn the respective policy. Extensive experiments show the efficacy of our method against stateofthearts.
Reinforcement Learning with Policy Mixture Model
for Temporal Point Processes Clustering
Weichang Wu Shanghai Jiao Tong University [email protected] Junchi Yan ^{†}^{†}thanks: Corresponding author. Shanghai Jiao Tong University [email protected] Xiaokang Yang Shanghai Jiao Tong University [email protected] Hongyuan Zha Georgia Institute of Technology [email protected]
noticebox[b]Preprint. Under review.\[email protected]
1 Introduction
Event sequences with time stamps in continuous domain are ubiquitous across different areas and applications. In ecommerce, online purchase records over time can form event sequences. In health informatics, a series of treatments taken by patient can be tracked as an event sequence. In seismology, a sequence of earthquakes can be recorded. Recognizing and understanding the structure in the event sequences is of vital importance for downstream applications.
Temporal point processes (TPP) [3] are useful tools for modeling event sequences whereby one key concept is to model the event occurrence rate over time using the conditional intensity function. Lots of literatures have been proposed for both datamining like in [39, 14] and prediction task like in [36].
In this paper, we are devoted to another important but relatively less studied scenario: event sequence clustering in the continuous time domain, which can be more challenging than the traditional (aggregated and discrete) time series clustering. Event sequence clustering can find its utility in many realworld applications. Given a number of event sequences, it is important to discover and learn the underlying clustering structure robustly. For example, the purchase records can help cluster ecommerce users into different groups to benefit a recommender system; clustering patients according to their treatment logs helps hospitals optimize medication resources.
However, despite the extensive existing works on event sequence modeling and prediction as mentioned above, event sequence clustering and especially learning mixture model of event sequences have rarely been addressed, except in [34] to our best knowledge. In [34], a parametric likelihood based latent Dirichlet allocation model is proposed for event sequences clustering using Hawkes processes. In this paper, we propose a deep reinforcement learning (RL) based EM framework for event sequence clustering using likelihoodfree temporal point processes learning, and the purpose is to discover and learn these underlying policies of experts which can be shared over similar sequences. he highlights of our work include:
1) We present a network based EM framework for TPP clustering, differing from previous work using parametric clustering models [34]. Under the EM scheme, clustering of the entire dataset and model fitting of each cluster are jointly performed rather than separated in two steps [9].
2) We take an RL view to the TPP clustering problem, whereby each cluster corresponds to one of the latent expert policies that generates the observed sequences in the cluster. Our method can be seen as a meta learning approach by adopting IRL to learn policy’s reward function. In particular, we employ generative adversarial imitation learning [11] as an efficient IRL embodiment for TPP.
3) We empirically show that our method exceeds peer models notably including a very recent one [34]. We also compare with the recent mixture GAN based (image) clustering model [37] and adapt it to temporal event sequences by using the Wasserstein distance between point processes according to [32] (see details in supplementary material), for a more fair comparison. The results show that our model outperforms significantly regarding training efficiency (one order faster), with similar performance.
2 Related Work
We review TPP methods in literature in two aspects: i) modeling of the intensity function; ii) learning objectives and algorithms. The relevance lies in that both the intensity function and learning objective relate to latent policy discovering and learning.
2.1 Temporal Point Process Intensity Modeling
Traditional TPP models are mostly developed around the design of the intensity function $\lambda (t)$ which measures the instantaneous event occurrence rate at time $t$, like Reinforced Poisson processes [23], Selfexciting process (Hawkes process) [10], Reactive point process [6], etc. An obvious limitation of these traditional models is that they all assume all the samples obey a single parametric form which is too idealistic for realworld data. This also suggests the need for learning clustered behaviors beyond single model based methods.
By contrast, in neural point process [4, 18, 33], recurrent neural network (RNN) and its variants e.g. longshort term memory (LSTM) are used for modeling the conditional intensity function over time. As such, the learning can be generally fulfilled by gradient descent and no restriction on the form of the intensity function is imposed. More recently attention mechanisms are introduced to improve the interpretability of the neural model [30]. However, in all these models, the event sequences are all fed into the model without any discrimination to the groups they may differently belong to. In fact, the behind mechanisms for generating each event sequence can be very different, thus it may be a better idea to learn one model for each cluster of similar samples for more accurate modeling.
2.2 Temporal Point Process Learning
There are alternative objectives for TPP learning for both parametric models and neural models. Traditional methods mostly follow the maximum likelihood estimation (MLE) procedure under the probabilistic framework [22]. While the MLE objective may not be the only choice. This is because we are often given only a limited number of sequences which may further contain arbitrary noises. Recent efforts have been made for devising adversarial learning based objective inspired by generative adversarial networks (GAN) [8] and especially Wasserstein GAN [1]. In [35], adversarial objective is developed in addition with MLE loss by approximating the continuous domain predictions using a discrete time series. In [32], Wasserstein distance over temporal event sequences is explicitly defined to learn a deep generative point process model for temporal events generation.
Another line of works consider the challenge for learning highdimensional TPP models, whereby the socalled infectivity matrix to be learned can be of squared size of the dimensionality. One popular technique is imposing lowrank regularizer [38] or factorization model [31] on the infectivity matrix. However, they do not explicitly deal with the sequence clustering problem. In fact, the observed dimension marker does not correspond to the underlying cluster.
The mostly related work to our method appears in [34] as it deals with a similar problem setting: grouping event sequences into different clusters and learn the respective TPP model parameters for each cluster. However, the technical approaches are completely different. First the parametric model [34] is tailored to Hawkes process while our network based model is more general; Second, the work [34] is under the Bayesian probabilistic framework while our method is likelihoodfree and incorporates both adversarial learning and inverse reinforcement learning [20] for more effective objective design beyond MLE; We show in the experiments that our method significantly outperforms [34] on realworld data. Source code will be made public available for reproducible research.
As shown in Fig. 1, this paper takes a reinforcement learning (RL) perspective on the modeling and clustering of temporal point processes for its dynamic sequence nature. Though there exist works [7, 15, 29] on (deep) RL and intervention of TPP, while little effort ([34] does not involve deep model nor RL) has been paid on TPP clustering which calls additional careful treatment on disentangling the mixture of policies. Using the language of RL, suppose there is a number of event sequences generated (with noise) by $N$ underlying expert policies, which can be reflected in the form of $N$ clusters. In this sense, we formulate the event sequence clustering task as a reinforcement learning problem whereby the purpose is to discover the unknown event generation policies, and meanwhile the learning cost function for fitting event sequences is also automatically learned from data using IRL.
3 Proposed Model
We present our approach under the expectationmaximization (EM) framework – a natural way for disentangling the clustering and fitting of each cluster of event sequences which are assumed to be generated by a mixture of policies. In expectation step, each sequence is assigned with a cluster label corresponding to its latent policy that generates this sequence. The latent policy of each cluster is learned in maximization step. The model is illustrated in in Fig. 1. The complete learning algorithm is shown in Alg. 1 which calls an IRL based subfunction GAILTPP described in Alg. 2.
3.1 EM Learning for Policy Mixture Model
Given a temporal event set $X$ with $M$ observed event sequences: $X=\{{x}_{1},{x}_{2},\mathrm{\dots},{x}_{M}\}$ and the discrete latents i.e. cluster labels $Y=\{{y}_{1},{y}_{2},\mathrm{\dots},{y}_{M}\}$ for ${y}_{i}\in \{1,2,\mathrm{\dots},N\}$, we suppose that $X$ are generated by a mixture of $N$ experts with a latent policy for each expert, parameterized by $\theta $ as a whole. The log likelihood is:
$$\mathcal{L}(\theta ;X,Y)=\mathrm{log}p(X,Y\theta ),$$ 
where $p(X,Y\theta )$ is the conditional probability of observing $X,Y$ given parameter $\theta $, and $\theta $ is determined by maximizing the marginal log likelihood of observed $X$:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathcal{L}(\theta ;X)=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathrm{log}p(X\theta ),$$ 
where $p(X\theta )=\int p(X,Y\theta )\mathit{d}Y$ is the marginal probability.
Suppose the latents $Y$ are sampled from an arbitrary valid probability distribution $q(Y)$, then a lower bound $\mathcal{F}(q,\theta )$ of the marginal log likelihood $\mathcal{L}(\theta ;X)$ can be obtained by Jensen’s inequality as:
$$\mathcal{F}(q,\theta )=\mathcal{L}(\theta ;X){D}_{KL}(qp),$$  (1) 
which means we have $\forall q(Y):\mathcal{L}(\theta ;X)\ge \mathcal{F}(q,\theta )$^{1}^{1} 1 The detailed derivation is presented in supplementary material in the EM learning convergence proof. Given randomly initialized parameter ${\theta}^{(0)}$ and arbitrary distribution ${q}^{(0)}$, we iteratively update ${q}^{(k)}$ and parameter ${\theta}^{(k)}$ by the following ExpectationMaximization procedure:
Estep: given model parameter ${\theta}^{(k)}$, update ${q}^{(k)}$ to ${q}^{(k+1)}$ by matching $q$ to posterior $p(YX,{\theta}^{(k)})$
Mstep: given ${q}^{(k+1)}$, update $\theta $: ${\theta}^{(k+1)}\leftarrow {\theta}^{(k)}+\nabla \mathcal{F}({q}^{(k+1)},\theta )$ by maximizing $\mathcal{F}({q}^{(k+1)},\theta )$
We iteratively perform Estep and Mstep until ${\theta}^{*}$, and the distribution of the hidden variable $Y$ is: ${q}^{*}(Y)=p(YX,{\theta}^{*}).$ A detailed proof for the convergence of the EM procedure is presented in the supplementary material.
Specifically, the computational components in Estep and Mstep are all implemented by neural networks as the Reinforcement Learning with Policy Mixture Model (RLPMM), including the Estep for policy clustering and Mstep for policy learning.
3.2 Expectation Step for Policy Clustering
As mentioned above, in Estep, we match the hidden variable distribution $q$ to posterior distribution $p(YX,{\theta}^{(k)})$, and fill in values of latent variables $Y$ for samples in observed data $X$ according $q(Y)$, so that we can rerecompute the expectation of $X$ given ${\theta}^{(k)}$, i.e., the likelihood function $\mathcal{L}({\theta}^{(k)})$.
We compute the hidden variable distribution $q$ as
$${q}^{(k)}=\underset{q\in {\mathscr{H}}_{k}}{\mathrm{arg}\mathrm{min}}KL(q{p}^{(k)}),$$  (2) 
where we restrict the distribution of hidden variable ${q}^{(k)}$ is in a bounded hypothesis space ${\mathscr{H}}_{k}$.
For mixture of policies, given parameter ${\theta}^{(k)}$ and observed data $X$, the posterior distribution ${p}^{(k)}$ is:
$$p({y}_{ij}{x}_{j},{\theta}^{(k)})=\frac{p({x}_{j},{y}_{ij}{\theta}^{(k)})}{p({x}_{j}{\theta}^{(k)})},$$  (3) 
where ${y}_{ij}=1$ if and only if ${x}_{j}$ is generated by the $i$th policy.
Inspired by Eq. 3, to find ${q}^{(k)}$ in Eq. 2, we train a classifier to fit the current guess of the discrete hidden variable distribution ${q}^{(k)}$ to ${p}^{(k)}$, i.e., holding the policies parameter ${\theta}^{(k)}$ fixed, train a classifier ${h}_{q}$ by data generated by learned policies. Therefore, the Estep involves line 1, 1, 1 in Alg. 1.
In practical application, ${h}_{q}$ is a $3$layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 33]. In addition, an implementation trick dealing with imbalanced classification at the beginning of the procedure is used, as presented in the supplementary material.
3.3 Maximization Step for Policy Learning
Given the hidden variable $Y$ estimated by the classifier in Estep, each event sequence $x$ in the training dataset $D$ is classified to a specific policy with discrete hidden variable ${y}_{i}$. In Mstep, dataset $D$ is divided into $N$ clusters $\{{D}_{1},{D}_{2},\mathrm{\dots},{D}_{N}\}$ to train each policy model.
Now we present an IRL based policy learning method. Differing from previous works imposing a specific form for sequence learning e.g. Wasserstein distance [32], we assume the policy reward (cost) is unknown which need be learned via IRL. The rationale is that it is nontrivial to define the temporal event sequence fitting error with varying length in contrast to the vectorlike data. To make the learning scalable to realworld data, the IRL procedure is efficiently fulfilled by a generative adversarial imitation learning scheme [11].
3.3.1 Policy network for sequence generation
The policy function ${\pi}_{\theta}$ should have the capacity to capture the complex sequential dependency pattern and stochastic nature in the point process. We adopt RNN with stochastic neurons [2] as the policy network. Here action refers to the time to next event from current event timestamp and state refers to the hidden embedding of RNN for the history. Note action $a$ is sampled from distribution $\pi (a\theta ({h}_{i1}))$ as
$${a}_{i}\sim \pi (a\theta ({h}_{i1})),$$  (4) 
where ${h}_{i}=\psi (V{a}_{i}+W{h}_{i1})$, ${a}_{i}={t}_{i}{t}_{i1}\in {\mathbb{R}}^{+}$ is the $i$th interevent time (${t}_{0}=0$), ${h}_{i}\in {\mathbb{R}}^{d}$ is hidden state of RNN encoding history before ${t}_{i}$, $\theta $ is nonlinear mapping from $\mathbb{R}$ to policy’s parameter space, $V\in {\mathbb{R}}^{d}$ and $W\in {\mathbb{R}}^{d\times d}$ are coefficients, $\psi $ is nonlinear activation function, e.g., $\psi (z)=\frac{{e}^{z}{e}^{z}}{{e}^{z}+{e}^{z}}$ as used in this paper.
There are alternatives for parameterizing the policy function, i.e. the probability density function $\pi $, as long as they satisfy the constraint that the random variable sampled from $\pi $ is positive since $a>0$, such as exponential distribution: $\pi (a\theta ({h}_{i1}))=\theta (h){e}^{\theta (h)a}$ and Rayleigh distribution: $\theta (h)a{e}^{\theta (h){a}^{2}/2}$ as used in this paper.
So far the RNN policy network with stochastic neurons is able to mimic the event generating mechanism of stochastic temporal point process by Eq. 4. Given a sequence of past events $$, the next event time is generated as ${t}_{i+1}={t}_{i}+a$, with the interevent time $a$ sampled from stochastic policy ${\pi}_{\theta}(a{s}_{t})$ as the action.
3.3.2 Inverse reinforcement learning for cost modeling
As shown in Fig.2, we use the GAIL framework to learn the cost function for reinforcement learning, in which the IRL procedure is substituted by training a Discriminator ${D}_{w}$, with the gradient of discriminator parameter $w$ given by
$${\mathbb{E}}_{{x}_{\theta}}[{\nabla}_{w}\mathrm{log}({D}_{w}(s,a))]+{\mathbb{E}}_{{x}_{E}}({\nabla}_{w}\mathrm{log}(1{D}_{w}(s,a)),$$  (5) 
where ${x}_{\theta}\sim {\pi}_{\theta}$ is the sequences sampled from learned policy ${\pi}_{\theta}$ and ${x}_{E}\sim {\pi}_{E}$ is the sequences sampled from expert’s true policy.
Given the cost function $\mathrm{log}{D}_{w}(s,a)$, the RL procedure is implemented by the policy gradient descent with gradient of policy network parameter $\theta $ given by
$${\mathbb{E}}_{{x}_{\theta}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(as)Q(s,a)]\lambda {\nabla}_{\theta}H({\pi}_{\theta}),$$  (6) 
where $Q(\overline{s},\overline{a})={\mathbb{E}}_{{x}_{\theta}}[\mathrm{log}({D}_{w}(s,a)){s}_{0}=\overline{s},{a}_{0}=\overline{a}]$. And the gradient of the causal entropy regularizer is given by
${\nabla}_{\theta}H(\theta )$  $={\nabla}_{\theta}{\mathbb{E}}_{{\pi}_{\theta}}[{\mathrm{log}}_{{\pi}_{\theta}}(as)]$  (7)  
$={\mathbb{E}}_{{\pi}_{\theta}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(as){Q}_{\mathrm{log}}(s,a)],$ 
where ${Q}_{\mathrm{log}}(\overline{s},\overline{a})={\mathbb{E}}_{{\pi}_{\theta}}[{\mathrm{log}}_{{\pi}_{\theta}}(as){s}_{0}=\overline{s},{a}_{0}=\overline{a}]$
In essence, by iteratively updating $w$ using IRL in Eq. 5, and updating $\theta $ using RL in Eq. 6, the GAIL algorithm find a saddle point $({\pi}_{\theta}^{*},{D}_{w}^{*})$ of the expression
$${\mathbb{E}}_{{\pi}_{\theta}}[\mathrm{log}(D(s,a))]+{\mathbb{E}}_{{\pi}_{E}}[\mathrm{log}(1D(s,a))]\lambda H(\pi ),$$  (8) 
which is equivalent to find the optimal policy ${\pi}_{\theta}^{*}$.
Method  Intensity function  Framework  Running pipeline  Learning model 

RLPMM  Neural networks (RNN)  Temporal point process  Joint clustering and learning  Reinforcement learning 
WGANMM [37]  Neural networks (RNN)  Temporal point process  Joint clustering and learning  Adversarial learning 
DMMHP [34]  Parametric Hawkes  Temporal point process  Joint clustering and learning  Maximum likelihood learning 
ODE/LS+GMM [13, 5]  Nonparametric Hawkes  Temporal point process  Separate clustering and learning  Feature based GMM 
VAR+GMM [9]  Not applicable  Discretized time series  Separate clustering and learning  Feature based GMM 
4 Experiments
We evaluate the performance of our Reinforcement Learning for Policy Mixture Model (RLPMM) on both synthetic and realworld data. To demonstrate the effectiveness and efficiency of our model, we compare with stateoftheart methods for event sequence clustering.
As summarized in Table 1, peer methods include: 1) Gaussian Mixture Model (GMM) includes 3 Twostep pipeline models that firstly extract features from sequential events using Vector AutoRegression (VAR)[9], Ordinary Differential Equation (ODE)[13] or Least Squares (LS)[5], the use the GMM to cluster the event sequences; 2) Dirichlet Mixture Model of Hawkes Processes (DMMHP)[34] which is the most related work to our knowledge; 3) Wasserstein Generative Adversarial Network Mixture Model (WGANMM) that we adapt from [37]. Due to page limit, we present the technical details of our adaption and the other baselines in the supplementary material.
Metrics used to measure clustering performance are 1) Clustering Purity (CP) [26]; 2) Rand Index (RI) [24]; 3) Empirical Intensity Deviation (EID) [32]; 4) Clustering Consistency (CC) [28]. All the metrics except for clustering consistency (as CC itself already involves random trials) are computed by the average of 10 trials on the whole dataset by random initialization for clustering. The details of their definitions are also presented in supplementary material.
Model  VAR+  ODE+  LS+  DMMHP  WGAN  RLPMM  VAR+  ODE+  LS+  DMMHP  WGAN  RLPMM 

Cluster#  GMM  GMM  GMM  MM  GMM  GMM  GMM  MM  
Evaluation  Clustering Purity (CP): the higher the better  
Dataset  noHawkes  Hawkes  
$K=2$  0.5722  0.7167  0.7493  0.9557  0.9801  0.9776  0.5037  0.7525  0.7741  0.9879  0.9710  0.9653 
(0.0387)  (0.0290)  (0.0239)  (0.0164)  (0.0053)  (0.0035)  (0.0548)  (0.0219)  (0.0152)  (0.0110)  (0.0126)  (0.0141)  
$K=3$  0.4117  0.5518  0.6235  0.9047  0.9515  0.9660  0.3788  0.5739  0.6405  0.9558  0.9316  0.9464 
(0.0447)  (0.0388)  (0.0412)  (0.0179)  (0.0141)  (0.0130)  (0.0783)  (0.0255)  (0.0239)  (0.0164)  (0.0205)  (0.0184)  
$K=4$  0.2694  0.4108  0.4556  0.8755  0.9374  0.9528  0.2593  0.4290  0.4634  0.9256  0.9091  0.9182 
(0.0714)  (0.0447)  (0.0436)  (0.0173)  (0.0176)  (0.0170)  (0.0707)  (0.0412)  (0.0311)  (0.0176)  (0.0257)  (0.0155)  
Evaluation  Random Index (RI): the higher the better  
Dataset  noHawkes  Hawkes  
$K=2$  0.3874  0.6114  0.6685  0.9006  0.9547  0.9517  0.3293  0.6283  0.6832  0.9418  0.9274  0.9297 
(0.0917)  (0.0701)  (0.0529)  (0.0184)  (0.0105)  (0.0082)  (0.0503)  (0.0197)  (0.0173)  (0.0095)  (0.0195)  (0.0130)  
$K=3$  0.2940  0.4822  0.5265  0.8618  0.9262  0.9439  0.2411  0.4995  0.5361  0.9251  0.8960  0.9193 
(0.1342)  (0.0755)  (0.0748)  (0.0296)  (0.0192)  (0.0114)  (0.0640)  (0.0268)  (0.0217)  (0.0087)  (0.0243)  (0.0145)  
$K=4$  0.1182  0.3750  0.4127  0.8043  0.8857  0.9128  0.0946  0.3898  0.4189  0.8912  0.8626  0.8779 
(0.0424)  (0.0837)  (0.0954)  (0.0247)  (0.0214)  (0.0130)  (0.0775)  (0.0361)  (0.0212)  (0.0143)  (0.0207)  (0.0170)  
Evaluation  Empirical Intensity Deviation (EID): the lower the better  
Dataset  noHawkes  Hawkes  
$K=2$  —  3.734  2.947  1.512  0.355  0.358  —  2.787  1.963  0.338  0.524  0.417 
—  (0.1612)  (0.1789)  (0.0279)  (0.0158)  (0.0126)  —  (0.0819)  (0.0768)  (0.0187)  (0.0236)  (0.0241)  
$K=3$  —  3.960  3.255  2.173  0.484  0.419  —  3.149  2.918  0.373  0.557  0.446 
—  (0.1549)  (0.2324)  (0.0557)  (0.0181)  (0.0138)  —  (0.1342)  (0.0849)  (0.0235)  (0.0259)  (0.0274)  
$K=4$  —  3.904  3.726  2.219  0.554  0.472  —  3.321  3.135  0.409  0.570  0.491 
—  (0.2449)  (0.2168)  (0.0469)  (0.0202)  (0.0152)  —  (0.1817)  (0.1183)  (0.0247)  (0.0518)  (0.0402) 
4.1 Experiments on Synthetic Data
We experiment on several synthetic datasets with different $K$ (number of clusters). We generate four kinds of synthetic event sequences in a time interval $[0,T]\mathit{\hspace{1em}}(T=100)$ using the simulation method in [21] for TPP. We experiment with different $K$ from $K=2$ to $K=4$, and the results are given by the average of $10$ trails as also shown in Table 2. The ratio of each cluster size is the same. To make a fair comparison with the Hawkes process based models, we experiment on synthetic datasets generated by both nonHawkes processes and Hawkes processes.
Specifically, for nonHawkes dataset, we have sequences generated by a mixture of $K=2$ clusters from $Sine$ intensity and $Negative$$sine$ intensity, then we add sequences generated by $Constant$ intensity for $K=3$, followed by the cluster from $Bimdoal$ intensity for $K=4$. We list the formulas of the intensity and plot the curve of the groundtruth intensity and learned intensity in supplementary material.
For Hawkes processes, we adopt the conventional Hawkes process as: $\lambda (t)={\gamma}_{0}+\alpha {\sum}_{t\in \tau}{e}^{w(t{t}^{\prime})}$. We also experiment with different $K$ from $K=2$ to $K=4$. In line with [34], for each trial, the parameters of the intensity function for cluster $k$: $\{{\gamma}_{0}^{k},{\alpha}^{k}\}$ are sampled randomly from $[0,1]$ by keeping $w=1$.In line with [34], for each trial, the parameters of the intensity function for cluster $k$: $\{{\gamma}_{0}^{k},{\alpha}^{k}\}$ are sampled randomly from $[0,1]$ by keeping $w=1$.
4.2 Experiments on MemeTracker Data
We collect realworld event sequences from public MemeTracker [12] as widely used in TPP works [38, 18, 32]. It tracks meme diffusion over public media, containing more than 172 million news articles or blog posts. The memes are sentences, such as ideas, proverbs, and the time is recorded when it spreads to certain websites. We randomly sample 35,000 cascades from MemeTracker to study the diffusion process of the meme since its creation. The memes are supposed generated from different latent policies for discovery. Note one can only use clustering consistency as metrics as there is no groundtruth cluster labels. The results are shown in Table 4.
Method  WGANMM  RLPMM  

$Cluster\mathrm{\#}$  $K=2$  $K=3$  $K=4$  $K=5$  $second/iter$  $K=2$  $K=3$  $K=4$  $K=5$  $second/iter$  
Dataset  Synthetic data  $1.8\times {10}^{4}$  $3.0\times {10}^{4}$  $5.0\times {10}^{4}$  —  $10.32$ sec/iter  $1.0\times {10}^{4}$  $1.5\times {10}^{4}$  $2.0\times {10}^{4}$  —  $1.125$ sec/iter 
MemeTracker  $0.8\times {10}^{4}$  $1.2\times {10}^{4}$  $2\times {10}^{4}$  $3.0\times {10}^{4}$  $5.70$ sec/iter  $0.5\times {10}^{4}$  $0.8\times {10}^{4}$  $1.5\times {10}^{4}$  $2.0\times {10}^{4}$  $0.720$ sec/iter 
4.3 Findings and Discussion
We present interpretations to the results as follows.
1) Parametric point process vs. neural point process vs. discretized time series The two neural network based TPP models: RLPMM using reinforcement imitation learning and WGANMM using generative adversarial learning, in general outperform the (implicit) parametric intensity models^{2}^{2} 2 ODE and LS based methods are also called nonparametric TPP as they involve an implicit model to learn intensity forms. In this paper, to simplify the naming, we slightly abuse the term of parametric point process to distinguish them from the neural point process models with significantly more network parameters to learn.: ODE+GMM, LS+GMM and DMMHP using explicit intensity functions, on both clustering performance and modeling capability. While the performance of time series based method VAR+GMM is the worst. These results show the superiority of the neural model compared with parametric TPP which assume a predefined form with limited model capacity.
On the other hand, as shown in Table 2, when the data is exactly generated from the predefined point process – Hawkes processes, the model DMMHP which is based on Hawkes processes model can benefit significantly and even (slightly) outperforms the network based methods including RLPMM and WGANMM. This also suggests the parametric model still have their value when the distribution can be (exactly) known to allow for model specification. The standard deviation in Table 2 also shows the higher stability of neural TPP methods against parametric ones.
2) Twostep pipeline vs. joint model On both synthetic and realworld datasets, joint modeling methods DMMHP, WGANMM and RLPMM outperform twostep VAR/ODE/LS+GMM models which run clustering first followed by learning within each cluster. This shows the utility of an elegant joint learning framework.
3) WGANMM vs. RLPMM It is shown that WGANMM and RLPMM perform relatively close to each other (though often RLPMM outperforms WGANMM) regarding with clustering performance on all metrics, for both synthetic data and realworld dataset. However, we find that the proposed RLPMM shows superior efficiency to WGANMM. In fact, WGANMM adopts the adversarial training framework based on Wasserstein divergence, where both the generator and the discriminator are modeled as dynamic RNN of multiple LSTM cells. In contrast, RLPMM only models the policy as a single LSTM cell, so for the discriminator with an extra Logistic regression layer. As such, RLPMM requires less parameters and computations. Moreover, we find the realworld MemeTracker need notably fewer iterations to converge than synthetic data.
4) Learning synthetic data generated by different numbers of policies Clustering purity, RI and EID all degenerate as the number of policies used for generating the testing data (i.e. intensity functions) grows from $K=2$ to $K=4$ (see the protocol in Section 4.1). This is no surprise as it becomes more challenging when the sequence set becomes more mixed. While the impact is lessened on the joint modeling methods DMMHP, WGANMM and RLPMM than the twostep models VAR/ODE/LS+GMM. This also holds for the comparison between RLPMM and WGANMM whereby RLPMM need much less additional iterations to converge than WGANMM when cluster number increases from $K=2$ to $K=4$.
4) Learning latent policies of MemeTracker data Instead of using one less informative TPP model for the whole set of sequences as shown in Fig. 3(a), we set $K=4$ and plot the empirical intensity functions as discovered by the proposed RLPMM method in Fig. 3(b). The memes patterns are marked as ${C}_{1}$ – ${C}_{4}$, and we show the text statistics of these patterns in supplementary material which reveals a potential for joint modeling with topic model.
For reproducibility, the source code for the proposed RLPMM model and adapted WGANMM model, and the synthetic and MemeTracker dataset is available on Github. ^{3}^{3} 3 https://github.com/XXXX/RLPMM
5 Conclusion and Future Work
Clustering of event sequences in continuous time domain is challenging and has vast application for realworld problems. It is also useful for building eventdriven simulators. We study this problem from the reinforcement learning perspective for learning mixture of policies. Our approach involves IRL to learn reward for policy rather than resort to adhoc cost for varyinglength event sequences.
There are possible extensions in future: i) effective handling with multityped event sequences especially noticing the fact that the used generative adversarial imitation learning technique is directly applicable to multidimensional TPP. Note the RL method for multitype TPP in [29] does not involve IRL technique; ii) joint learning for TPP and topic model for text data associated with events, which is useful in practice such as for MemeTracker; iii) exploring the way of model sharing among clusters for more effective policy learning.
References
 [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial nets. In ICML, 2017.
 [2] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. In NIPS Workshop on Advances in Variational Inference, 2014.
 [3] D. Daley and D. VereJones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.
 [4] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. GomezRodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD, 2016.
 [5] M. Eichler, R. Dahlhaus, and J. Dueck. Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242, 2017.
 [6] S. Ertekin, C. Rudin, and T. H. McCormick. Reactive point processes: A new approach to predicting power failures in underground electrical systems. The Annals of Applied Statistics, 9(1):122–144, 2015.
 [7] M. Farajtabar, J. Yang, X. Ye, H. Xu, R. Trivedi, E. Khalil, S. Li, L. Song, and H. Zha. Fake news mitigation via point process based intervention. In ICML, 2017.
 [8] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 [9] F. Han and H. Liu. Transition matrix estimation in high dimensional time series. In ICML, pages 172–180, 2013.
 [10] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 1971.
 [11] J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, 2016.
 [12] J. Leskovec, L. Backstrom, and J. Kleinberg. Memetracking and the dynamics of the news cycle. In KDD, 2009.
 [13] E. Lewis and E. Mohler. A nonparametric em algorithm for multiscale hawkes processes. Journal of Nonparametric Statistics, 2011.
 [14] L. Li and H. Zha. Learning parametric models for social infectivity in multidimensional hawkes processes. In AAAI, pages 101–107, 2014.
 [15] S. Li, S. Xiao, S. Zhu, N. Du, Y. Xie, and L. Song. Learning temporal point processes via reinforcement learning. In NIPS, 2018.
 [16] T. W. Liao. Clustering of time series data  a survey. Pattern Recognition, 38(11):1857–1874, 2005.
 [17] E. A. Maharaj. Cluster of time series. Journal of Classification, 17(2), 2000.
 [18] H. Mei and J. M. Eisner. The neural hawkes process: A neurally selfmodulating multivariate point process. In NIPS, pages 6757–6767, 2017.
 [19] O. Mogren. Crnngan: Continuous recurrent neural networks with adversarial training. arXiv:1611.09904, 2016.
 [20] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.
 [21] Y. Ogata. On lewis’ simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
 [22] T. Ozaki. Maximum likelihood estimation of hawkes’ selfexciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145–155, 1979.
 [23] R. Pemantle. A survey of random processes with reinforcement. Probability Survey, 4(0):1–79, 2007.
 [24] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971.
 [25] C. E. Rasmussen. The infinite gaussian mixture model. In NIPS, 1999.
 [26] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval, volume 39. Cambridge University Press, 2008.
 [27] A. Talibi, B. Achchab, and R. Lasri. Variable selection for clustering with gaussian mixture models: state of the art. CoRR, abs/1701.08946, 2017.
 [28] R. Tibshirani and G. Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511–528, 2005.
 [29] U. Upadhyay, A. De, and M. G. Rodriguez. Deep reinforcement learning of marked temporal point processes. In NIPS, 2018.
 [30] Y. Wang, H. Shen, S. Liu, J. Gao, and X. Cheng. Cascade dynamics modeling with attentionbased recurrent neural network. In AAAI, pages 2985–2991, 2017.
 [31] W. Wu, J. Yan, X. Yang, and H. Zha. Decoupled learning for factorial marked temporal point processes. In KDD, 2018.
 [32] S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. In NIPS, 2017.
 [33] S. Xiao, J. Yan, X. Yang, H. Zha, and S. Chu. Modeling the intensity function of point process via recurrent neural networks. In AAAI, 2017.
 [34] H. Xu and H. Zha. A dirichlet mixture model of hawkes processes for event sequence clustering. In NIPS, 2017.
 [35] J. Yan, X. Liu, L. Shi, C. Li, and H. Zha. Improving maximum likelihood estimation of temporal point process via discriminative and adversarial learning. In IJCAI, 2018.
 [36] J. Yan, Y. Wang, K. Zhou, J. Huang, C. Tian, H. Zha, and W. Dong. Towards effective prioritizing water pipe replacement and rehabilitation. In IJCAI, 2013.
 [37] Y. Yu and W. Zhou. Mixture of gans for clustering. In IJCAI, 2018.
 [38] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse lowrank networks using multidimensional hawkes processes. In AISTATS, 2013.
 [39] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multidimensional hawkes processes. In ICML, pages 1301–1309, 2013.
Appendix
Appendix A Adapting [37] to WGANMM for TPP
For notation clearness, we slightly abuse the notations by assuming the used notations in this subsection is separated to the rest of the paper. Suppose for the $i$th cluster, training data ${D}_{i}=\{{x}_{1},{x}_{2},\mathrm{\dots}\}$ is generated by the oracle TPP $r$ and ${S}_{i}=\{{s}_{1},{s}_{2},\mathrm{\dots}\}$ is generated by the learned TPP $g$, then the Wasserstein distance between the distributions of the two point processes is given by
$$W({\mathbb{P}}_{r},{\mathbb{P}}_{g})=\underset{\varphi \in \mathrm{\Phi}({\mathbb{P}}_{r},{\mathbb{P}}_{g})}{inf}{\mathbb{E}}_{(x,s)\sim \varphi}[{xs}_{\star}],$$  (9) 
where $\mathrm{\Phi}({\mathbb{P}}_{r},{\mathbb{P}}_{g})$ denotes the set of all joint distributions $\varphi (x,s)$ whose marginals are ${\mathbb{P}}_{r}$ and ${\mathbb{P}}_{g}$, $g$ is learned by minimizing $W({\mathbb{P}}_{r},{\mathbb{P}}_{g})$.
As Eq.9 is computationally intractable, hence the dual form is used [1] to compute $W({\mathbb{P}}_{r},{\mathbb{P}}_{g})$ as
$$\underset{w\in \mathcal{W},{{f}_{w}}_{L}\le 1}{\mathrm{max}}{\mathbb{E}}_{x\sim {\mathbb{P}}_{r}}[{f}_{w}(x)]{\mathbb{E}}_{s\sim {\mathbb{P}}_{g}}[{f}_{w}(s)],$$  (10) 
where ${f}_{w}$ is the Lipschitz function with parameter $w\in \mathcal{W}$, that assign a value to a event sequence satisfying $1$Lipschitz constraint ${f}_{w}(x){f}_{w}(s)\le xs{}_{\star}$ for all $x$ and $s$. As we have supposed that event sequence $s$ is generated by ${g}_{\theta}$ with parameter $\theta $ using noisy input $\zeta \sim {\mathbb{P}}_{z}$, therefore the objective is to learn a generative model ${g}_{\theta}$ by minimize $W({\mathbb{P}}_{r},{\mathbb{P}}_{g})$ as
$$\underset{\theta}{min}\underset{w\in \mathcal{W},{{f}_{w}}_{L}\le 1}{\mathrm{max}}{\mathbb{E}}_{x\sim {\mathbb{P}}_{r}}[{f}_{w}(x)]{\mathbb{E}}_{\zeta \sim {\mathbb{P}}_{z}}[{f}_{w}({g}_{\theta}(\zeta ))],$$  (11) 
where ${f}_{w}$ is discriminator and ${g}_{\theta}$ is generator. Similar to [19], ${f}_{w}$ and ${g}_{\theta}$ are fulfilled by RNNs.
The generator ${g}_{\theta}$ transforms a noise input sequence $\zeta =\{{z}_{1},\mathrm{\dots}{z}_{n}\}$ to generated sequence $s=\{{t}_{1},\mathrm{\dots},{t}_{n}\}$ as ${g}_{\theta}(\zeta )=s$ using RNN with $n$ LSTM cells:
$${h}_{i}={\varphi}_{g}^{h}({A}_{g}^{h}{z}_{i}+{B}_{g}^{h}{h}_{i1}+{b}_{g}^{h}),{t}_{i}={\varphi}_{g}^{t}({B}_{g}^{t}{h}_{i}+{b}_{g}^{t}),$$ 
where ${h}_{i}$ is history embedding vector, ${\varphi}_{g}^{h}$ and ${\varphi}_{g}^{t}$ are activation functions, and parameters $\theta =\{{A}_{g}^{h},{B}_{g}^{h},{b}_{g}^{h},{B}_{g}^{t},{b}_{g}^{t}\}$. Similarly, the discriminator assigns a scalar value ${f}_{w}(\rho )={\sum}_{i=1}^{n}{a}_{i}$ to sequence $\rho =\{{t}_{1},\mathrm{\dots},{t}_{n}\}$ ($\rho $ can be real data $x$ or generated one $s$), also by RNN with $n$ cells: ${h}_{i}={\varphi}_{f}^{h}({A}_{f}^{h}{z}_{i}+{B}_{f}^{h}{h}_{i1}+{b}_{f}^{h}),{a}_{i}={\varphi}_{f}^{t}({B}_{f}^{t}{h}_{i}+{b}_{f}^{t})$, where the parameters of the discriminator $w=\{{A}_{f}^{h},{B}_{f}^{h},{b}_{f}^{h},{B}_{f}^{a},{b}_{f}^{a}\}$.
In general, the adapted WGANMM and the proposed RLPMM use a similar EM clustering framework. The major differences for our method to WGANMM lie in that WGANMM learns TPP by using Wasserstein distance between event sequences as cost, which may not be optimum for the clustering task at hand. While RLPMM is more like a metalearning method which adopts IRL – and the adversarial imitation technique to learn the cost function. Also, RL is used for learning policy of each cluster by RLPMM but WGANMM involves no RL. Moreover, instead of using RNN with multiple cells to learn the discriminators and generators in WGANMM for each cluster, we use RL to learn the latent policy for each cluster. We only need to learn a LSTM cell for the each latent policy, with another LSTM cell to learn the corresponding cost function by IRL. The empirical results in the paper show that our method runs one order faster against WGANMM.
Appendix B The proof of EM learning convergence
Due to page limitation, we present the detailed derivation of the EM learning framework and its convergence in this subsection.
Given a temporal event set $X$ with $M$ observed event sequences: $X=\{{x}_{1},{x}_{2},\mathrm{\dots},{x}_{M}\}$ and the discrete latents i.e. cluster labels $Y=\{{y}_{1},{y}_{2},\mathrm{\dots},{y}_{M}\}$ for ${y}_{i}\in \{1,2,\mathrm{\dots},N\}$, we suppose that $X$ are generated by a mixture of $N$ experts with a latent policy for each expert, parameterized by $\theta $ as a whole. The log likelihood is:
$$\mathcal{L}(\theta ;X,Y)=\mathrm{log}p(X,Y\theta ),$$ 
where $p(X,Y\theta )$ is the conditional probability of observing $X,Y$ given parameter $\theta $, and $\theta $ is determined by maximizing the marginal log likelihood of observed $X$:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathcal{L}(\theta ;X)=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathrm{log}p(X\theta ),$$ 
where $p(X\theta )=\int p(X,Y\theta )\mathit{d}Y$ is the marginal probability.
Suppose the latents $Y$ are sampled from an arbitrary valid probability distribution $q(Y)$, then a lower bound $\mathcal{F}(q,\theta )$ of the marginal log likelihood $\mathcal{L}(\theta ;X)$ can be obtained by Jensen’s inequality:
$\mathcal{L}(\theta ;X)=$  $\mathrm{log}{\displaystyle \int p(X,Y\theta )\mathit{d}Y}=\mathrm{log}{\displaystyle \int q(Y)\frac{p(X,Y\theta )}{q(Y)}\mathit{d}Y}$  
$\ge $  $\int q(Y)\mathrm{log}\frac{p(YX,\theta )p(X\theta )}{q(Y)}dY$  
$=$  $\int q(Y)\mathrm{log}p(X\theta )\mathit{d}Y}+{\displaystyle \int q(Y)\mathrm{log}\frac{p(YX,\theta )}{q(Y)}dY$  
$=$  $\mathcal{L}(\theta ;X){D}_{KL}(qp),$  (12) 
where ${D}_{KL}(qp)$ is the KLdivergence between $q(Y)$ and $p(YX,\theta )$, and we have the lower bound:
$$\mathcal{F}(q,\theta )=\mathcal{L}(\theta ;X){D}_{KL}(qp).$$  (13) 
Equation B gives the relation between hidden variable distribution $q(Y)$ and the likelihood function of the observed data $\mathcal{L}(\theta ;X)$: $\forall q(Y):\mathcal{L}(\theta ;X)\ge \mathcal{F}(q,\theta )$ and $\mathcal{L}(\theta ;X)=\mathcal{F}(q,\theta )$ if and only if $q(Y)=p(YX,\theta )$ so that ${D}_{KL}(qp)=0$.
Based on Eq. B, given randomly initialized parameter ${\theta}^{(0)}$ and arbitrary distribution ${q}^{(0)}$, we have the following EM procedure that iteratively updates ${q}^{(k)}$ and parameter ${\theta}^{(k)}$:
Estep: given model parameter ${\theta}^{(k)}$, update ${q}^{(k)}$ to ${q}^{(k+1)}$ by matching $q$ to posterior $p(YX,{\theta}^{(k)})$, so that $\mathcal{L}({\theta}^{(k)})=\mathcal{F}({q}^{(k+1)},{\theta}^{(k)})$
Mstep: given ${q}^{(k+1)}$, update $\theta $: ${\theta}^{(k+1)}\leftarrow {\theta}^{(k)}+\nabla \mathcal{F}({q}^{(k+1)},\theta )$ by maximizing $\mathcal{F}({q}^{(k+1)},\theta )$, so that we have: $\mathcal{F}({q}^{(k+1)},{\theta}^{(k+1)})>\mathcal{F}({q}^{(k+1)},{\theta}^{(k)})$.
By Eq. B, we have: $\mathcal{L}({\theta}^{(k+1)})\ge \mathcal{F}({q}^{(k+1)},{\theta}^{(k+1)})>\mathcal{F}({q}^{(k+1)},{\theta}^{(k)})=\mathcal{L}({\theta}^{(k)})$, which suggests that for each iteration with Estep and Mstep, $\mathcal{L}(\theta ;X)$ converges to its maximum. Alternatively perform Estep and Mstep iteratively, we can get the optimal solution ${\theta}^{*}$ for the model.
Appendix C Implementation Trick: Dealing with imbalanced classification
As we employed the Estep by training a classifier samples generated by the learned policies, the classifier is easy to assign clusters with imbalanced instances, particularly at the beginning of the procedure. The imbalance could get reinforced through the EM procedure. As a result, some policy models receives more and more data and the remaining gets fewer and fewer. Eventually the RLPMM model would converge to an imbalanced solution,
To fix this issue, at the beginning of the EM procedure, we allow a policy model to explore new by augmenting their training data. After the assignment of clusters $D=\{{D}_{1},\mathrm{\dots},{D}_{N}\}$, we augment each cluster data set ${D}_{i}$ by adding an amount of instances from $D{D}_{i}$ with the highest posterior probability of belonging to the $i$th cluster according to the output of the classifier. The amount of the augmented data is reducing along the procedure for the convergence.
Appendix D Details of Baselines
For the convenience of reproducibility, we present the technical details of the baselines used in the paper as follows:
1) Gaussian Mixture Model (GMM) To tackle the sequential data clustering problem, traditional methods usually implement aggregated time series clustering with discrete timelagged variable [16, 17]. These methods use a probabilistic mixture model to perform sequence clustering with two procedures: firstly extracting features from sequential data, then identifying clusters via GMM [27, 25]. We use three methods to extract features from sequential events for the Gaussian Mixture Model (GMM), including:

•
Vector AutoRegression (VAR) The VAR [9] model discretizes event sequences to time series, and learns a transition matrix as features for clustering.

•
Ordinary Differential Equation (ODE) We also use a nonparametric Hawkes process model [13] based on Ordinary Differential Equation (ODE).

•
Least Squares (LS) We also test another nonparametric Hawkes model based on contrast function in [5] relating to the Least Square (LS) problem.
Both ODE and LS learn a Hawkes process for each sequence. In line with the protocol in [34], we use its parameter $\theta =[\mu ,\alpha ]$ as feature for each sequence, and employ GMM for clustering.
2) Dirichlet Mixture Model of Hawkes Processes (DMMHP) Learning mixture of policies for TPP has been rarely considered in literature, the most related work to our best knowledge is DMMHP [34]. It generates event sequences with different clusters from Hawkes processes of different parameters, and uses a Dirichlet distribution as clusters’ prior.
Both our model and DMMHP are modelbased methods that can accomplish temporal processes clustering. The differences lie in that DMMHP use conventional parameterized Hawkes process and cluster with Latent Dirichlet Allocation (LDA), we propose a novel generative adversarial imitation learning point process model.
3) Wasserstein Generative Adversarial Network Mixture Model (WGANMM) Similar to Gaussian mixture model (GMM), a GAN mixture model in [37] has been used for image clustering with fixed sized matrixlike data. In [37] $N$ GAN models are trained to capture the each cluster’s distribution respectively. To adapt its processing domain from image to event sequences in continuous domain (TPP), we modify vanilla WGANMM by replacing CNN with RNN and introduce the Wassertein distance between point processes as proposed by [32] for adversarial learning.
Appendix E Details of Evaluation Metrics
The detailed definitions of the metrics used in paper include:
1) Clustering Purity (CP): Purity is the average of portion of true positive class in each cluster [26]:
$$Purity=\frac{1}{M}\sum _{k=1}^{K}\underset{i\in \{1,\mathrm{\dots},{K}^{\prime}\}}{\mathrm{max}}{\mathcal{W}}_{k}\cap {\mathcal{C}}_{i},$$ 
where ${\mathcal{W}}_{k}$ is the learned index set belonging to cluster $k$, ${\mathcal{C}}_{i}$ is the real index set of sequences belonging to class $i$, $M$ is the total number of sequences. Purity lies in between 0 and 1. Higher purity indicates more concentration in each cluster.
2) Rand Index (RI): By treating the labels as a clustering ground truth, RI can be used as clustering accuracy (the higher the better), measuring the similarity between the learned sequence clustering and real labels, as given by [24]: $RI=\frac{{n}_{11}+{n}_{00}}{n}\in [0,1]$, where ${n}_{11}$ is the number of sequence pairs that are in the same cluster with the same label, and ${n}_{00}$ is the number of pairs that are in different clusters with different labels.
3) Empirical Intensity Deviation (EID): To measure the learned latent policy for each cluster, we follow the protocol in [32] to compute the deviation of empirical intensity function (accumulated absolute error over time for a windowed period) between the real event sequences and sequences generated by the learned policy, for which the lower the better. The empirical intensity is given by: ${\lambda}^{\prime}(t)=\mathbb{E}(N(t+\delta t))/\delta t$, where $N(t)$ is the count process for $\lambda (t)$, and the expectation $\mathbb{E}(N(t+\delta t))$ is computed by sufficient number of generated sequences through counting the average number of events during $[t,t+\delta t]$. Note that it can be only applied to synthetic dataset where the groundtruth cluster label is known.
4) Clustering Consistency (CC): Purity, RI and EID can be used to measure clustering performance when the cluster labels are known as for synthetic sequences. For realworld sequences without labels, we measure the clustering performance by clustering consistency via crossvalidation as in [28].
We test each clustering method with $J=100$ trails. In trial $j$, all sequences are randomly divided into training fold and testing fold. After learning the model from the training fold, we apply the model to the corresponding testing fold. We enumerate all pairs of sequences within a same cluster in the $j$th trial and count the pairs preserved in all other trials. The clustering consistency is the minimum proportion of preserved pairs over all trials computed by
$$Consistency=\underset{j\in \{1,\mathrm{\dots},J\}}{\mathrm{min}}\sum _{{j}^{\prime}\ne j}\sum _{(m,{m}^{\prime})\in {\mathcal{M}}_{j}}\frac{1\{{k}_{m}^{{j}^{\prime}}={k}_{{m}^{\prime}}^{{j}^{\prime}}\}}{(J1){\mathcal{S}}_{j}},$$ 
where ${\mathcal{S}}_{j}=\{(m,{m}^{\prime}){k}_{m}^{j}={k}_{{m}^{\prime}}^{j}\}$ is the sequence pair set within the same cluster in trial $j$, ${k}_{m}^{j}$ is the cluster index of for sequence $m$.
Appendix F Formulas and plots for nonHawkes processes
In the experiments with nonHawkes datasets, the sequences are generated by 4 kinds of intensity functions. The formulas of these functions are:

•
Sinelike: $\lambda (t)=\frac{\mathrm{sin}(\frac{\pi t}{50})}{10}+0.1,t\in (0,T).$

•
Negativesinelike: $\lambda (t)=\frac{\mathrm{sin}(\frac{\pi t}{50})}{10}+0.1,t\in (0,T).$

•
Constant: $\lambda (t)=0.1,t\in (0,T).$

•
Bimodallike: $\lambda (t)=\{\begin{array}{cc}0.15\mathrm{exp}(\frac{t{(\frac{T}{4})}^{2}}{2*(\frac{T}{8})}),\hfill & t\in (0,\frac{T}{2}].\hfill \\ 0.15\mathrm{exp}(\frac{t{(\frac{3T}{4})}^{2}}{2*(\frac{T}{8})}),\hfill & t\in (\frac{T}{2},T).\hfill \end{array}$
We also plot the intensity of the groundtruth, and the estimated empirical intensity of RLPMM model and baseline models as shown in Fig.5.
Appendix G Interpretation to clustering on MemeTracker
For the discovered memes patterns marked as ${C}_{1}$ – ${C}_{4}$, their statistics of average meme diffusion length and word frequency are listed in Table 5. We make the following interpretations which we believe can at least be partially supported by our results:
17.3 {people, life, God, woman, friend, country, right, America, faith, state, peace, party, earth, future, child} 

33.9 {country, people, government, health, economy, campaign, war, market, system, situation, crisis, company, nation, power, market} 
77.5 {people, government, election, McCain, decision, world, challenge, America, policy, security, Europe, law, election, force, Israel} 
84.1 {people, world, state, Muslim, threat, industry, terrorist, money, enemy, citizen, Europe, event, America, safety, Afghanistan} 
i) ${C}_{1}$: memes have a wide spread as soon as it is generated, and quickly disappear in around $30$ days. These memes are mostly catchword, e.g. peace, child, being usually short and clear.
ii) ${C}_{2}$: the diffusion intensity decays in $20$ days and then holds in subsequent days. These memes are mostly hot topics like economy, health care, job opportunity, etc..
iii) ${C}_{3}$: the diffusion intensity gradually decay for around $40$ days. The memes are mostly long and complete statements and opinions.
iv) ${C}_{4}$: the diffusion suspends for short around $20$ days, then begins to diffuse. Though the diffusion processes of ${C}_{3}$ and $4$ are different, the average length and top frequent words of ${C}_{4}$ are similar to ${C}_{3}$ as in Table 5. It suggests that the diffusion of the long statements and opinions contains two patterns that are quite different from each other.
Such results also show the potential and need for comprehensive modeling by combing RLPMM with topic models on meme content.