### Abstract

Temporal point process is an expressive tool for modeling event sequencesover time. In this paper, we take a reinforcement learning view whereby theobserved sequences are assumed to be generated from a mixture of latentpolicies. The purpose is to cluster the sequences with different temporalpatterns into the underlying policies while learning each of the policy model.The flexibility of our model lies in: i) all the components are networksincluding the policy network for modeling the intensity function of temporalpoint process; ii) to handle varying-length event sequences, we resort toinverse reinforcement learning by decomposing the observed sequence into states(RNN hidden embedding of history) and actions (time interval to next event) inorder to learn the reward function, thus achieving better performance orincreasing efficiency compared to existing methods using rewards over theentire sequence such as log-likelihood or Wasserstein distance. We adopt anexpectation-maximization framework with the E-step estimating the clusterlabels for each sequence, and the M-step aiming to learn the respective policy.Extensive experiments show the efficacy of our method againststate-of-the-arts.

### Quick Read (beta)

# Reinforcement Learning with Policy Mixture Model

for Temporal Point Processes Clustering

###### Abstract

Temporal point process is an expressive tool for modeling event sequences over time. In this paper, we take a reinforcement learning view whereby the observed sequences are assumed to be generated from a mixture of latent policies. The purpose is to cluster the sequences with different temporal patterns into the underlying policies while learning each of the policy model. The flexibility of our model lies in: i) all the components are networks including the policy network for modeling the intensity function of temporal point process; ii) to handle varying-length event sequences, we resort to inverse reinforcement learning by decomposing the observed sequence into states (RNN hidden embedding of history) and actions (time interval to next event) in order to learn the reward function, thus achieving better performance or increasing efficiency compared to existing methods using rewards over the entire sequence such as log-likelihood or Wasserstein distance. We adopt an expectation-maximization framework with the E-step estimating the cluster labels for each sequence, and the M-step aiming to learn the respective policy. Extensive experiments show the efficacy of our method against state-of-the-arts.

Reinforcement Learning with Policy Mixture Model

for Temporal Point Processes Clustering

Weichang Wu
Shanghai Jiao Tong University
[email protected]
Junchi Yan ^{†}^{†}thanks: Corresponding author.
Shanghai Jiao Tong University
[email protected]
Xiaokang Yang
Shanghai Jiao Tong University
[email protected]
Hongyuan Zha
Georgia Institute of Technology
[email protected]

noticebox[b]33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\[email protected]

## 1 Introduction

Event sequences with time stamps in continuous domain are ubiquitous across different areas and applications. In e-commerce, online purchase records over time can form event sequences. In health informatics, a series of treatments taken by patient can be tracked as an event sequence. In seismology, a sequence of earthquakes can be recorded. Recognizing and understanding the structure in the event sequences is of vital importance for downstream applications.

Temporal point processes (TPP) [3] are useful tools for modeling event sequences whereby one key concept is to model the event occurrence rate over time using the conditional intensity function. Lots of literatures have been proposed for both data-mining like in [34, 14] and prediction task like in [31].

In this paper, we are devoted to another important but relatively less studied scenario: event sequence clustering in the continuous time domain, which can be more challenging than the traditional (aggregated and discrete) time series clustering. Event sequence clustering can find its utility in many real-world applications. Given a number of event sequences, it is important to discover and learn the underlying clustering structure robustly. For example, the purchase records can help cluster e-commerce users into different groups to benefit a recommender system; clustering patients according to their treatment logs helps hospitals optimize medication resources.

However, despite the extensive existing works on event sequence modeling and prediction as mentioned above, event sequence clustering and especially learning mixture model of event sequences have rarely been addressed, except in [29] to our best knowledge. In [29], a parametric likelihood based latent Dirichlet allocation model is proposed for event sequences clustering using Hawkes processes. In this paper, we propose a deep reinforcement learning (RL) based EM framework for event sequence clustering using likelihood-free temporal point processes learning, and the purpose is to discover and learn these underlying policies of experts which can be shared over similar sequences. he highlights of our work include:

1) We present a network based EM framework for TPP clustering, differing from previous work using parametric clustering models [29]. Under the EM scheme, clustering of the entire dataset and model fitting of each cluster are jointly performed rather than separated in two steps [9].

2) We take an RL view to the TPP clustering problem, whereby each cluster corresponds to one of the latent expert policies that generates the observed sequences in the cluster. Our method can be seen as a meta learning approach by adopting IRL to learn policy’s reward function. In particular, we employ generative adversarial imitation learning [11] as an efficient IRL embodiment for TPP.

3) We empirically show that our method exceeds peer models notably including a very recent one [29]. We also compare with the recent mixture GAN based (image) clustering model [32] and adapt it to temporal event sequences by using the Wasserstein distance between point processes according to [27] (see details in supplementary material), for a more fair comparison. The results show that our model outperforms significantly regarding training efficiency (one order faster), with similar performance.

## 2 Related Work

We review TPP methods in literature in two aspects: i) modeling of the intensity function; ii) learning objectives and algorithms. The relevance lies in that both the intensity function and learning objective relate to latent policy discovering and learning.

### 2.1 Temporal Point Process Intensity Modeling

Traditional TPP models are mostly developed around the design of the intensity function $\lambda (t)$ which measures the instantaneous event occurrence rate at time $t$, like Reinforced Poisson processes [20], Self-exciting process (Hawkes process) [10], Reactive point process [6], etc. An obvious limitation of these traditional models is that they all assume all the samples obey a single parametric form which is too idealistic for real-world data. This also suggests the need for learning clustered behaviors beyond single model based methods.

By contrast, in neural point process [4, 16, 28], recurrent neural network (RNN) and its variants e.g. long-short term memory (LSTM) are used for modeling the conditional intensity function over time. As such, the learning can be generally fulfilled by gradient descent and no restriction on the form of the intensity function is imposed. More recently attention mechanisms are introduced to improve the interpretability of the neural model [25]. However, in all these models, the event sequences are all fed into the model without any discrimination to the groups they may differently belong to. In fact, the behind mechanisms for generating each event sequence can be very different, thus it may be a better idea to learn one model for each cluster of similar samples for more accurate modeling.

### 2.2 Temporal Point Process Learning

There are alternative objectives for TPP learning for both parametric models and neural models. Traditional methods mostly follow the maximum likelihood estimation (MLE) procedure under the probabilistic framework [19]. While the MLE objective may not be the only choice. This is because we are often given only a limited number of sequences which may further contain arbitrary noises. Recent efforts have been made for devising adversarial learning based objective inspired by generative adversarial networks (GAN) [8] and especially Wasserstein GAN [1]. In [30], adversarial objective is developed in addition with MLE loss by approximating the continuous domain predictions using a discrete time series. In [27], Wasserstein distance over temporal event sequences is explicitly defined to learn a deep generative point process model for temporal events generation.

Another line of works consider the challenge for learning high-dimensional TPP models, whereby the so-called infectivity matrix to be learned can be of squared size of the dimensionality. One popular technique is imposing low-rank regularizer [33] or factorization model [26] on the infectivity matrix. However, they do not explicitly deal with the sequence clustering problem. In fact, the observed dimension marker does not correspond to the underlying cluster.

The mostly related work to our method appears in [29] as it deals with a similar problem setting: grouping event sequences into different clusters and learn the respective TPP model parameters for each cluster. However, the technical approaches are completely different. First the parametric model [29] is tailored to Hawkes process while our network based model is more general; Second, the work [29] is under the Bayesian probabilistic framework while our method is likelihood-free and incorporates both adversarial learning and inverse reinforcement learning [17] for more effective objective design beyond MLE; We show in the experiments that our method significantly outperforms [29] on real-world data. Source code will be made public available for reproducible research.

As shown in Fig. 1, this paper takes a reinforcement learning (RL) perspective on the modeling and clustering of temporal point processes for its dynamic sequence nature. Though there exist works [7, 15, 24] on (deep) RL and intervention of TPP, while little effort ([29] does not involve deep model nor RL) has been paid on TPP clustering which calls additional careful treatment on disentangling the mixture of policies. Using the language of RL, suppose there is a number of event sequences generated (with noise) by $N$ underlying expert policies, which can be reflected in the form of $N$ clusters. In this sense, we formulate the event sequence clustering task as a reinforcement learning problem whereby the purpose is to discover the unknown event generation policies, and meanwhile the learning cost function for fitting event sequences is also automatically learned from data using IRL.

## 3 Proposed Model

We present our approach under the expectation-maximization (EM) framework – a natural way for disentangling the clustering and fitting of each cluster of event sequences which are assumed to be generated by a mixture of policies. In expectation step, each sequence is assigned with a cluster label corresponding to its latent policy that generates this sequence. The latent policy of each cluster is learned in maximization step. The model is illustrated in in Fig. 1. The complete learning algorithm is shown in Alg. 1 which calls an IRL based subfunction *GAIL-TPP* described in Alg. 2.

### 3.1 EM Learning for Policy Mixture Model

Given a temporal event set $X$ with $M$ observed event sequences: $X=\{{x}_{1},{x}_{2},\mathrm{\dots},{x}_{M}\}$ and the discrete latents i.e. cluster labels $Y=\{{y}_{1},{y}_{2},\mathrm{\dots},{y}_{M}\}$ for ${y}_{i}\in \{1,2,\mathrm{\dots},N\}$, we suppose that $X$ are generated by a mixture of $N$ experts with a latent policy for each expert, parameterized by $\theta $ as a whole. The log likelihood is:

$$\mathcal{L}(\theta ;X,Y)=\mathrm{log}p(X,Y|\theta ),$$ |

where $p(X,Y|\theta )$ is the conditional probability of observing $X,Y$ given parameter $\theta $, and $\theta $ is determined by maximizing the marginal log likelihood of observed $X$:

$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathcal{L}(\theta ;X)=\underset{\theta}{\mathrm{arg}\mathrm{max}}\mathrm{log}p(X|\theta ),$$ |

where $p(X|\theta )=\int p(X,Y|\theta )\mathit{d}Y$ is the marginal probability.

Suppose the latents $Y$ are sampled from an arbitrary valid probability distribution $q(Y)$, then a lower bound $\mathcal{F}(q,\theta )$ of the marginal log likelihood $\mathcal{L}(\theta ;X)$ can be obtained by Jensen’s inequality as:

$$\mathcal{F}(q,\theta )=\mathcal{L}(\theta ;X)-{D}_{KL}(q||p),$$ | (1) |

which means we have $\forall q(Y):\mathcal{L}(\theta ;X)\ge \mathcal{F}(q,\theta )$^{1}^{1}
1
The detailed derivation is presented in supplementary material in the EM learning convergence proof. Given randomly initialized parameter ${\theta}^{(0)}$ and arbitrary distribution ${q}^{(0)}$, we iteratively update ${q}^{(k)}$ and parameter ${\theta}^{(k)}$ by the following Expectation-Maximization procedure:

E-step: given model parameter ${\theta}^{(k)}$, update ${q}^{(k)}$ to ${q}^{(k+1)}$ by matching $q$ to posterior $p(Y|X,{\theta}^{(k)})$

M-step: given ${q}^{(k+1)}$, update $\theta $: ${\theta}^{(k+1)}\leftarrow {\theta}^{(k)}+\nabla \mathcal{F}({q}^{(k+1)},\theta )$ by maximizing $\mathcal{F}({q}^{(k+1)},\theta )$

We iteratively perform E-step and M-step until ${\theta}^{*}$, and the distribution of the hidden variable $Y$ is: ${q}^{*}(Y)=p(Y|X,{\theta}^{*}).$ A detailed proof for the convergence of the EM procedure is presented in the supplementary material.

Specifically, the computational components in E-step and M-step are all implemented by neural networks as the Reinforcement Learning with Policy Mixture Model (RLPMM), including the *E-step for policy clustering* and *M-step for policy learning*.

### 3.2 Expectation Step for Policy Clustering

As mentioned above, in E-step, we match the hidden variable distribution $q$ to posterior distribution $p(Y|X,{\theta}^{(k)})$, and fill in values of latent variables $Y$ for samples in observed data $X$ according $q(Y)$, so that we can re-recompute the expectation of $X$ given ${\theta}^{(k)}$, i.e., the likelihood function $\mathcal{L}({\theta}^{(k)})$.

We compute the hidden variable distribution $q$ as

$${q}^{(k)}=\underset{q\in {\mathscr{H}}_{k}}{\mathrm{arg}\mathrm{min}}KL(q||{p}^{(k)}),$$ | (2) |

where we restrict the distribution of hidden variable ${q}^{(k)}$ is in a bounded hypothesis space ${\mathscr{H}}_{k}$.

For mixture of policies, given parameter ${\theta}^{(k)}$ and observed data $X$, the posterior distribution ${p}^{(k)}$ is:

$$p({y}_{ij}|{x}_{j},{\theta}^{(k)})=\frac{p({x}_{j},{y}_{ij}|{\theta}^{(k)})}{p({x}_{j}|{\theta}^{(k)})},$$ | (3) |

where ${y}_{ij}=1$ if and only if ${x}_{j}$ is generated by the $i$-th policy.

Inspired by Eq. 3, to find ${q}^{(k)}$ in Eq. 2, we train a classifier to fit the current guess of the discrete hidden variable distribution ${q}^{(k)}$ to ${p}^{(k)}$, i.e., holding the policies parameter ${\theta}^{(k)}$ fixed, train a classifier ${h}_{q}$ by data generated by learned policies. Therefore, the E-step involves line 1, 1, 1 in Alg. 1.

In practical application, ${h}_{q}$ is a $3$-layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 28]. In addition, an implementation trick dealing with imbalanced classification at the beginning of the procedure is used, as presented in the supplementary material.

### 3.3 Maximization Step for Policy Learning

Given the hidden variable $Y$ estimated by the classifier in E-step, each event sequence $x$ in the training dataset $D$ is classified to a specific policy with discrete hidden variable ${y}_{i}$. In M-step, dataset $D$ is divided into $N$ clusters $\{{D}_{1},{D}_{2},\mathrm{\dots},{D}_{N}\}$ to train each policy model.

Now we present an IRL based policy learning method. Differing from previous works imposing a specific form for sequence learning e.g. Wasserstein distance [27], we assume the policy reward (cost) is unknown which need be learned via IRL. The rationale is that it is nontrivial to define the temporal event sequence fitting error with varying length in contrast to the vector-like data. To make the learning scalable to real-world data, the IRL procedure is efficiently fulfilled by a generative adversarial imitation learning scheme [11].

#### 3.3.1 Policy network for sequence generation

The policy function ${\pi}_{\theta}$ should have the capacity to capture the complex sequential dependency pattern and stochastic nature in the point process. We adopt RNN with stochastic neurons [2] as the policy network. Here action refers to the time to next event from current event timestamp and state refers to the hidden embedding of RNN for the history. Note action $a$ is sampled from distribution $\pi (a|\theta ({h}_{i-1}))$ as

$${a}_{i}\sim \pi (a|\theta ({h}_{i-1})),$$ | (4) |

where ${h}_{i}=\psi (V{a}_{i}+W{h}_{i-1})$, ${a}_{i}={t}_{i}-{t}_{i-1}\in {\mathbb{R}}^{+}$ is the $i$-th inter-event time (${t}_{0}=0$), ${h}_{i}\in {\mathbb{R}}^{d}$ is hidden state of RNN encoding history before ${t}_{i}$, $\theta $ is nonlinear mapping from $\mathbb{R}$ to policy’s parameter space, $V\in {\mathbb{R}}^{d}$ and $W\in {\mathbb{R}}^{d\times d}$ are coefficients, $\psi $ is nonlinear activation function, e.g., $\psi (z)=\frac{{e}^{z}-{e}^{-z}}{{e}^{z}+{e}^{-z}}$ as used in this paper.

There are alternatives for parameterizing the policy function, i.e. the probability density function $\pi $, as long as they satisfy the constraint that the random variable sampled from $\pi $ is positive since $a>0$, such as exponential distribution: $\pi (a|\theta ({h}_{i-1}))=\theta (h){e}^{-\theta (h)a}$ and Rayleigh distribution: $\theta (h)a{e}^{-\theta (h){a}^{2}/2}$ as used in this paper.

So far the RNN policy network with stochastic neurons is able to mimic the event generating mechanism of stochastic temporal point process by Eq. 4. Given a sequence of past events $$, the next event time is generated as ${t}_{i+1}={t}_{i}+a$, with the inter-event time $a$ sampled from stochastic policy ${\pi}_{\theta}(a|{s}_{t})$ as the action.

#### 3.3.2 Inverse reinforcement learning for cost modeling

As shown in Fig.2, we use the GAIL framework to learn the cost function for reinforcement learning, in which the IRL procedure is substituted by training a Discriminator ${D}_{w}$, with the gradient of discriminator parameter $w$ given by

$${\mathbb{E}}_{{x}_{\theta}}[{\nabla}_{w}\mathrm{log}({D}_{w}(s,a))]+{\mathbb{E}}_{{x}_{E}}({\nabla}_{w}\mathrm{log}(1-{D}_{w}(s,a)),$$ | (5) |

where ${x}_{\theta}\sim {\pi}_{\theta}$ is the sequences sampled from learned policy ${\pi}_{\theta}$ and ${x}_{E}\sim {\pi}_{E}$ is the sequences sampled from expert’s true policy.

Given the cost function $\mathrm{log}{D}_{w}(s,a)$, the RL procedure is implemented by the policy gradient descent with gradient of policy network parameter $\theta $ given by

$${\mathbb{E}}_{{x}_{\theta}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(a|s)Q(s,a)]-\lambda {\nabla}_{\theta}H({\pi}_{\theta}),$$ | (6) |

where $Q(\overline{s},\overline{a})={\mathbb{E}}_{{x}_{\theta}}[\mathrm{log}({D}_{w}(s,a))|{s}_{0}=\overline{s},{a}_{0}=\overline{a}]$. And the gradient of the causal entropy regularizer is given by

${\nabla}_{\theta}H(\theta )$ | $={\nabla}_{\theta}{\mathbb{E}}_{{\pi}_{\theta}}[-{\mathrm{log}}_{{\pi}_{\theta}}(a|s)]$ | (7) | ||

$={\mathbb{E}}_{{\pi}_{\theta}}[{\nabla}_{\theta}\mathrm{log}{\pi}_{\theta}(a|s){Q}_{\mathrm{log}}(s,a)],$ |

where ${Q}_{\mathrm{log}}(\overline{s},\overline{a})={\mathbb{E}}_{{\pi}_{\theta}}[-{\mathrm{log}}_{{\pi}_{\theta}}(a|s)|{s}_{0}=\overline{s},{a}_{0}=\overline{a}]$

In essence, by iteratively updating $w$ using IRL in Eq. 5, and updating $\theta $ using RL in Eq. 6, the GAIL algorithm find a saddle point $({\pi}_{\theta}^{*},{D}_{w}^{*})$ of the expression

$${\mathbb{E}}_{{\pi}_{\theta}}[\mathrm{log}(D(s,a))]+{\mathbb{E}}_{{\pi}_{E}}[\mathrm{log}(1-D(s,a))]-\lambda H(\pi ),$$ | (8) |

which is equivalent to find the optimal policy ${\pi}_{\theta}^{*}$.

Method | Intensity function | Framework | Running pipeline | Learning model |
---|---|---|---|---|

RLPMM | Neural networks (RNN) | Temporal point process | Joint clustering and learning | Reinforcement learning |

WGANMM [32] | Neural networks (RNN) | Temporal point process | Joint clustering and learning | Adversarial learning |

DMMHP [29] | Parametric Hawkes | Temporal point process | Joint clustering and learning | Maximum likelihood learning |

ODE/LS+GMM [13, 5] | Nonparametric Hawkes | Temporal point process | Separate clustering and learning | Feature based GMM |

VAR+GMM [9] | Not applicable | Discretized time series | Separate clustering and learning | Feature based GMM |

## 4 Experiments

We evaluate the performance of our Reinforcement Learning for Policy Mixture Model (RLPMM) on both synthetic and real-world data. To demonstrate the effectiveness and efficiency of our model, we compare with state-of-the-art methods for event sequence clustering.

As summarized in Table 1, peer methods include: 1) Gaussian Mixture Model (GMM) includes 3 Two-step pipeline models that firstly extract features from sequential events using Vector Auto-Regression (VAR)[9], Ordinary Differential Equation (ODE)[13] or Least Squares (LS)[5], the use the GMM to cluster the event sequences; 2) Dirichlet Mixture Model of Hawkes Processes (DMMHP)[29] which is the most related work to our knowledge; 3) Wasserstein Generative Adversarial Network Mixture Model (WGANMM) that we adapt from [32]. Due to page limit, we present the technical details of our adaption and the other baselines in the supplementary material.

Metrics used to measure clustering performance are 1) Clustering Purity (CP) [22]; 2) Rand Index (RI) [21]; 3) Empirical Intensity Deviation (EID) [27]; 4) Clustering Consistency (CC) [23]. All the metrics except for clustering consistency (as CC itself already involves random trials) are computed by the average of 10 trials on the whole dataset by random initialization for clustering. The details of their definitions are also presented in supplementary material.

Model | VAR+ | ODE+ | LS+ | DMMHP | WGAN | RLPMM | VAR+ | ODE+ | LS+ | DMMHP | WGAN | RLPMM |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Cluster# | GMM | GMM | GMM | -MM | GMM | GMM | GMM | -MM | ||||

Evaluation | Clustering Purity (CP): the higher the better | |||||||||||

Dataset | noHawkes | Hawkes | ||||||||||

$K=2$ | 0.5722 | 0.7167 | 0.7493 | 0.9557 | 0.9801 | 0.9776 | 0.5037 | 0.7525 | 0.7741 | 0.9879 | 0.9710 | 0.9653 |

(0.0387) | (0.0290) | (0.0239) | (0.0164) | (0.0053) | (0.0035) | (0.0548) | (0.0219) | (0.0152) | (0.0110) | (0.0126) | (0.0141) | |

$K=3$ | 0.4117 | 0.5518 | 0.6235 | 0.9047 | 0.9515 | 0.9660 | 0.3788 | 0.5739 | 0.6405 | 0.9558 | 0.9316 | 0.9464 |

(0.0447) | (0.0388) | (0.0412) | (0.0179) | (0.0141) | (0.0130) | (0.0783) | (0.0255) | (0.0239) | (0.0164) | (0.0205) | (0.0184) | |

$K=4$ | 0.2694 | 0.4108 | 0.4556 | 0.8755 | 0.9374 | 0.9528 | 0.2593 | 0.4290 | 0.4634 | 0.9256 | 0.9091 | 0.9182 |

(0.0714) | (0.0447) | (0.0436) | (0.0173) | (0.0176) | (0.0170) | (0.0707) | (0.0412) | (0.0311) | (0.0176) | (0.0257) | (0.0155) | |

Evaluation | Random Index (RI): the higher the better | |||||||||||

Dataset | noHawkes | Hawkes | ||||||||||

$K=2$ | 0.3874 | 0.6114 | 0.6685 | 0.9006 | 0.9547 | 0.9517 | 0.3293 | 0.6283 | 0.6832 | 0.9418 | 0.9274 | 0.9297 |

(0.0917) | (0.0701) | (0.0529) | (0.0184) | (0.0105) | (0.0082) | (0.0503) | (0.0197) | (0.0173) | (0.0095) | (0.0195) | (0.0130) | |

$K=3$ | 0.2940 | 0.4822 | 0.5265 | 0.8618 | 0.9262 | 0.9439 | 0.2411 | 0.4995 | 0.5361 | 0.9251 | 0.8960 | 0.9193 |

(0.1342) | (0.0755) | (0.0748) | (0.0296) | (0.0192) | (0.0114) | (0.0640) | (0.0268) | (0.0217) | (0.0087) | (0.0243) | (0.0145) | |

$K=4$ | 0.1182 | 0.3750 | 0.4127 | 0.8043 | 0.8857 | 0.9128 | 0.0946 | 0.3898 | 0.4189 | 0.8912 | 0.8626 | 0.8779 |

(0.0424) | (0.0837) | (0.0954) | (0.0247) | (0.0214) | (0.0130) | (0.0775) | (0.0361) | (0.0212) | (0.0143) | (0.0207) | (0.0170) | |

Evaluation | Empirical Intensity Deviation (EID): the lower the better | |||||||||||

Dataset | noHawkes | Hawkes | ||||||||||

$K=2$ | — | 3.734 | 2.947 | 1.512 | 0.355 | 0.358 | — | 2.787 | 1.963 | 0.338 | 0.524 | 0.417 |

— | (0.1612) | (0.1789) | (0.0279) | (0.0158) | (0.0126) | — | (0.0819) | (0.0768) | (0.0187) | (0.0236) | (0.0241) | |

$K=3$ | — | 3.960 | 3.255 | 2.173 | 0.484 | 0.419 | — | 3.149 | 2.918 | 0.373 | 0.557 | 0.446 |

— | (0.1549) | (0.2324) | (0.0557) | (0.0181) | (0.0138) | — | (0.1342) | (0.0849) | (0.0235) | (0.0259) | (0.0274) | |

$K=4$ | — | 3.904 | 3.726 | 2.219 | 0.554 | 0.472 | — | 3.321 | 3.135 | 0.409 | 0.570 | 0.491 |

— | (0.2449) | (0.2168) | (0.0469) | (0.0202) | (0.0152) | — | (0.1817) | (0.1183) | (0.0247) | (0.0518) | (0.0402) |

### 4.1 Experiments on Synthetic Data

We experiment on several synthetic datasets with different $K$ (number of clusters). We generate four kinds of synthetic event sequences in a time interval $[0,T]\mathit{\hspace{1em}}(T=100)$ using the simulation method in [18] for TPP. We experiment with different $K$ from $K=2$ to $K=4$, and the results are given by the average of $10$ trails as also shown in Table 2. The ratio of each cluster size is the same. To make a fair comparison with the Hawkes process based models, we experiment on synthetic datasets generated by both non-Hawkes processes and Hawkes processes.

Specifically, for non-Hawkes dataset, we have sequences generated by a mixture of $K=2$ clusters from $Sine$ intensity and $Negative$-$sine$ intensity, then we add sequences generated by $Constant$ intensity for $K=3$, followed by the cluster from $Bimdoal$ intensity for $K=4$. We list the formulas of the intensity and plot the curve of the ground-truth intensity and learned intensity in supplementary material.

For Hawkes processes, we adopt the conventional Hawkes process as: $\lambda (t)={\gamma}_{0}+\alpha {\sum}_{t\in \tau}{e}^{-w(t-{t}^{\prime})}$. We also experiment with different $K$ from $K=2$ to $K=4$. In line with [29], for each trial, the parameters of the intensity function for cluster $k$: $\{{\gamma}_{0}^{k},{\alpha}^{k}\}$ are sampled randomly from $[0,1]$ by keeping $w=1$.In line with [29], for each trial, the parameters of the intensity function for cluster $k$: $\{{\gamma}_{0}^{k},{\alpha}^{k}\}$ are sampled randomly from $[0,1]$ by keeping $w=1$.

### 4.2 Experiments on MemeTracker Data

We collect real-world event sequences from public MemeTracker [12] as widely used in TPP works [33, 16, 27]. It tracks meme diffusion over public media, containing more than 172 million news articles or blog posts. The memes are sentences, such as ideas, proverbs, and the time is recorded when it spreads to certain websites. We randomly sample 35,000 cascades from MemeTracker to study the diffusion process of the meme since its creation. The memes are supposed generated from different latent policies for discovery. Note one can only use clustering consistency as metrics as there is no ground-truth cluster labels. The results are shown in Table 4.

Method | WGANMM | RLPMM | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

$Cluster\mathrm{\#}$ | $K=2$ | $K=3$ | $K=4$ | $K=5$ | $second/iter$ | $K=2$ | $K=3$ | $K=4$ | $K=5$ | $second/iter$ | |

Dataset | Synthetic data | $1.8\times {10}^{4}$ | $3.0\times {10}^{4}$ | $5.0\times {10}^{4}$ | — | $10.32$ sec/iter | $1.0\times {10}^{4}$ | $1.5\times {10}^{4}$ | $2.0\times {10}^{4}$ | — | $1.125$ sec/iter |

MemeTracker | $0.8\times {10}^{4}$ | $1.2\times {10}^{4}$ | $2\times {10}^{4}$ | $3.0\times {10}^{4}$ | $5.70$ sec/iter | $0.5\times {10}^{4}$ | $0.8\times {10}^{4}$ | $1.5\times {10}^{4}$ | $2.0\times {10}^{4}$ | $0.720$ sec/iter |

### 4.3 Findings and Discussion

We present interpretations to the results as follows.

1) Parametric point process vs. neural point process vs. discretized time series The two neural network based TPP models: RLPMM using reinforcement imitation learning and WGANMM using generative adversarial learning, in general outperform the (implicit) parametric intensity models^{2}^{2}
2
ODE and LS based methods are also called nonparametric TPP as they involve an implicit model to learn intensity forms. In this paper, to simplify the naming, we slightly abuse the term of parametric point process to distinguish them from the neural point process models with significantly more network parameters to learn.: ODE+GMM, LS+GMM and DMMHP using explicit intensity functions, on both clustering performance and modeling capability. While the performance of time series based method VAR+GMM is the worst. These results show the superiority of the neural model compared with parametric TPP which assume a predefined form with limited model capacity.

On the other hand, as shown in Table 2, when the data is exactly generated from the predefined point process – Hawkes processes, the model DMMHP which is based on Hawkes processes model can benefit significantly and even (slightly) outperforms the network based methods including RLPMM and WGANMM. This also suggests the parametric model still have their value when the distribution can be (exactly) known to allow for model specification. The standard deviation in Table 2 also shows the higher stability of neural TPP methods against parametric ones.

2) Two-step pipeline vs. joint model On both synthetic and real-world datasets, joint modeling methods DMMHP, WGANMM and RLPMM outperform two-step VAR/ODE/LS+GMM models which run clustering first followed by learning within each cluster. This shows the utility of an elegant joint learning framework.

3) WGANMM vs. RLPMM It is shown that WGANMM and RLPMM perform relatively close to each other (though often RLPMM outperforms WGANMM) regarding with clustering performance on all metrics, for both synthetic data and real-world dataset. However, we find that the proposed RLPMM shows superior efficiency to WGANMM. In fact, WGANMM adopts the adversarial training framework based on Wasserstein divergence, where both the generator and the discriminator are modeled as dynamic RNN of multiple LSTM cells. In contrast, RLPMM only models the policy as a single LSTM cell, so for the discriminator with an extra Logistic regression layer. As such, RLPMM requires less parameters and computations. Moreover, we find the real-world MemeTracker need notably fewer iterations to converge than synthetic data.

4) Learning synthetic data generated by different numbers of policies Clustering purity, RI and EID all degenerate as the number of policies used for generating the testing data (i.e. intensity functions) grows from $K=2$ to $K=4$ (see the protocol in Section 4.1). This is no surprise as it becomes more challenging when the sequence set becomes more mixed. While the impact is lessened on the joint modeling methods DMMHP, WGANMM and RLPMM than the two-step models VAR/ODE/LS+GMM. This also holds for the comparison between RLPMM and WGANMM whereby RLPMM need much less additional iterations to converge than WGANMM when cluster number increases from $K=2$ to $K=4$.

4) Learning latent policies of MemeTracker data Instead of using one less informative TPP model for the whole set of sequences as shown in Fig. 3(a), we set $K=4$ and plot the empirical intensity functions as discovered by the proposed RLPMM method in Fig. 3(b). The memes patterns are marked as ${C}_{1}$ – ${C}_{4}$, and we show the text statistics of these patterns in supplementary material which reveals a potential for joint modeling with topic model.

For reproducibility, the source code for the proposed RLPMM model and adapted WGANMM model, and the synthetic and MemeTracker dataset is available on Github. ^{3}^{3}
3
https://github.com/XXXX/RLPMM

## 5 Conclusion and Future Work

Clustering of event sequences in continuous time domain is challenging and has vast application for real-world problems. It is also useful for building event-driven simulators. We study this problem from the reinforcement learning perspective for learning mixture of policies. Our approach involves IRL to learn reward for policy rather than resort to ad-hoc cost for varying-length event sequences.

There are possible extensions in future: i) effective handling with multi-typed event sequences especially noticing the fact that the used generative adversarial imitation learning technique is directly applicable to multi-dimensional TPP. Note the RL method for multi-type TPP in [24] does not involve IRL technique; ii) joint learning for TPP and topic model for text data associated with events, which is useful in practice such as for MemeTracker; iii) exploring the way of model sharing among clusters for more effective policy learning.

## References

- [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial nets. In ICML, 2017.
- [2] J. Bayer and C. Osendorfer. Learning stochastic recurrent networks. In NIPS Workshop on Advances in Variational Inference, 2014.
- [3] D. Daley and D. Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.
- [4] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD, 2016.
- [5] M. Eichler, R. Dahlhaus, and J. Dueck. Graphical modeling for multivariate hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242, 2017.
- [6] S. Ertekin, C. Rudin, and T. H. McCormick. Reactive point processes: A new approach to predicting power failures in underground electrical systems. The Annals of Applied Statistics, 9(1):122–144, 2015.
- [7] M. Farajtabar, J. Yang, X. Ye, H. Xu, R. Trivedi, E. Khalil, S. Li, L. Song, and H. Zha. Fake news mitigation via point process based intervention. In ICML, 2017.
- [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
- [9] F. Han and H. Liu. Transition matrix estimation in high dimensional time series. In ICML, pages 172–180, 2013.
- [10] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 1971.
- [11] J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, 2016.
- [12] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, 2009.
- [13] E. Lewis and E. Mohler. A nonparametric em algorithm for multiscale hawkes processes. Journal of Nonparametric Statistics, 2011.
- [14] L. Li and H. Zha. Learning parametric models for social infectivity in multi-dimensional hawkes processes. In AAAI, pages 101–107, 2014.
- [15] S. Li, S. Xiao, S. Zhu, N. Du, Y. Xie, and L. Song. Learning temporal point processes via reinforcement learning. In NIPS, 2018.
- [16] H. Mei and J. M. Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. In NIPS, pages 6757–6767, 2017.
- [17] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In ICML, 2000.
- [18] Y. Ogata. On lewis’ simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, 1981.
- [19] T. Ozaki. Maximum likelihood estimation of hawkes’ self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145–155, 1979.
- [20] R. Pemantle. A survey of random processes with reinforcement. Probability Survey, 4(0):1–79, 2007.
- [21] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850, 1971.
- [22] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval, volume 39. Cambridge University Press, 2008.
- [23] R. Tibshirani and G. Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511–528, 2005.
- [24] U. Upadhyay, A. De, and M. G. Rodriguez. Deep reinforcement learning of marked temporal point processes. In NIPS, 2018.
- [25] Y. Wang, H. Shen, S. Liu, J. Gao, and X. Cheng. Cascade dynamics modeling with attention-based recurrent neural network. In AAAI, pages 2985–2991, 2017.
- [26] W. Wu, J. Yan, X. Yang, and H. Zha. Decoupled learning for factorial marked temporal point processes. In KDD, 2018.
- [27] S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. In NIPS, 2017.
- [28] S. Xiao, J. Yan, X. Yang, H. Zha, and S. Chu. Modeling the intensity function of point process via recurrent neural networks. In AAAI, 2017.
- [29] H. Xu and H. Zha. A dirichlet mixture model of hawkes processes for event sequence clustering. In NIPS, 2017.
- [30] J. Yan, X. Liu, L. Shi, C. Li, and H. Zha. Improving maximum likelihood estimation of temporal point process via discriminative and adversarial learning. In IJCAI, 2018.
- [31] J. Yan, Y. Wang, K. Zhou, J. Huang, C. Tian, H. Zha, and W. Dong. Towards effective prioritizing water pipe replacement and rehabilitation. In IJCAI, 2013.
- [32] Y. Yu and W. Zhou. Mixture of gans for clustering. In IJCAI, 2018.
- [33] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In AISTATS, 2013.
- [34] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional hawkes processes. In ICML, pages 1301–1309, 2013.