Mixture-based Multiple Imputation Models for Clinical Data with a Temporal Dimension

  • 2019-08-12 15:47:10
  • Ye Xue, Diego Klabjan, Yuan Luo
  • 4

Abstract

The problem of missing values in multivariable time series is a key challengein many applications such as clinical data mining. Although many imputationmethods show their effectiveness in many applications, few of them are designedto accommodate clinical multivariable time series. In this work, we proposemultiple imputation models that capture both cross-sectional information andtemporal correlations. We integrate Gaussian processes with mixture models andintroduce individualized mixing weights to handle the variance of predictiveconfidence of Gaussian process models. The proposed models are compared withseveral state-of-the-art imputation algorithms on both real-world and syntheticdatasets. Experiments show that our best model can provide more accurateimputation than the benchmarks on all of our datasets.

 

Quick Read (beta)

Mixture-based Multiple Imputation Models for Clinical Data with a Temporal Dimension
thanks: This work is supported in part by NIH grant R21LM012618.

1st Ye Xue Northwester University
Evanston, USA
[email protected]
   2nd Diego Klabjan Northwestern University
Evanston, USA
[email protected]
   3rd Yuan Luo Northwestern University
Chicago, USA
[email protected]
Abstract

The problem of missing values in multivariable time series is a key challenge in many applications such as clinical data mining. Although many imputation methods show their effectiveness in many applications, few of them are designed to accommodate clinical multivariable time series. In this work, we propose multiple imputation models that capture both cross-sectional information and temporal correlations. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to handle the variance of predictive confidence of Gaussian process models. The proposed models are compared with several state-of-the-art imputation algorithms on both real-world and synthetic datasets. Experiments show that our best model can provide more accurate imputation than the benchmarks on all of our datasets.

Imputation, missing data, EHR, Gaussian process, data mining

I Introduction

The computational modeling in clinical applications attracts growing interest with the realization that the quantitative understanding of patient pathophysiological progression is crucial to clinical studies [57]. With a comprehensive and precise modeling, we can have a better understanding of a patient’s state, offer more precise diagnosis and provide better individualized therapies [26]. Researchers are increasingly motivated to build more accurate computational models from multiple types of clinical data. However, missing values in clinical data challenge researchers using analytic techniques for modeling, as many of the techniques are designed for complete data.

Traditional strategies used in clinical studies to handle missing values include deleting records with missing values and imputing missing entries by mean values. However, deleting records with missing values and some other filtering strategies can introduce biases [56] that can impact modeling in many ways, thus limiting its generalizability. Mean imputation is widely used by researchers to handle missing values. However, it is shown to yield less effective estimates than many other modern imputation techniques [5, 58, 1, 55], such as maximum likelihood approaches and multiple imputation methods (e.g. multiple imputation by chained equations (MICE) [6]). The maximum likelihood and multiple imputation methods are based on solid statistical foundations and become standard in the last few decades [17, 41].

Multiple imputation is originally proposed by Rubin [38] and with the idea of replacing each missing value with multiple estimates that reflect variation within each imputation model. The classic multiple imputation method treats the complete variables as predictors and incomplete variables as outcomes, and uses multivariate regression models to estimate missing values [39, 43]. Another imputation approach relies on chained equations where a sequence of regression models is fitted by treating one incomplete variable as the outcome and other variables with possible best estimates for missing values as predictors [37, 54, 53]. Multiple imputation has its success in healthcare applications, where many imputation methods designed for various tasks are based on multiple imputation [18, 36, 47, 22, 31, 52, 10].

In recent years, additional imputation methods are proposed. Although many imputation methods [55, 6, 46, 37, 27, 32, 64, 51] show their effectiveness in many applications, few of them are designed for time series-based clinical data. These clinical data are usually multivariable time series, where patients have measurements of multiple laboratory tests at different times. Many methods are designed for cross-sectional imputation (measurements taken at the same time point) and do not consider temporal information that is useful in making predictions or imputing missing values. Ignoring informative temporal correlations and only capturing cross-sectional information may yield less effective imputation.

In order to address the limitations mentioned above, we present mixture-based multiple imputation models for clinical time series. Our models capture both cross-sectional information and temporal correlations to estimate missing values using mixture models. We model the distribution of measurements using a mixture model. The mixture is composed of linear regression to model cross-sectional correlations and Gaussian processes (GPs) to capture temporal correlations. The problem of integrating GP within a standard mixture model is that GP models in all patient cases get the same mixing weights, while the confidence of predictions by GP models can vary largely across different patient cases. We overcome this problem by introducing individualized mixing weights for each patient cases, instead of assigning a fixed weight. We train our models using the Expectation-Maximization (EM) algorithm. We demonstrate the effectiveness of our models by comparing them with several state-of-the-art imputation algorithms on multiple clinical datasets.

Our main contributions are summarized as follows.

1. To the best of our knowledge, we are the first to build imputation models for time series by integrating GP within mixture models. We address the problem that all GP models in all patient cases get a fixed mixing weight by introducing individualized mixing weights.

2. We test the performance of our models on two real-world clinical datasets and several synthetic datasets. Our best model outperforms all comparison models including several state-of-the-art imputation models. Using synthetic datasets, we also explore and discover the properties of the data that benefit our models and/or comparison models. Experiments show that our best model is robust to the variation of these properties and outperforms comparison models on all synthetic datasets.

The remainder of this paper is structured as follows. Section II discusses related work while in Section III, the proposed method is described. The experimental setup, including dataset collection and evaluation procedure, is described in Section IV. Section V discusses the computational results and underlying analyses. The conclusions are drawn in Section VI.

II Related work

Research in designing imputation methods for multivariable time series attracts growing interest in recent decades. Previous studies generally fall into two categories. One comes from methods using linear or other simple parametric functions to estimate missing values. The other is the methods treating time series as smooth curves and estimating missing values using GP or other nonparametric methods.

In the first category, multivariable time series are modeled based on either linear models [44], linear mixed models [29, 42] or autoregressive models [20, 3]. However, in these methods, the potential trajectories of variables are only limited to linear or other simple parametric functions. Alternatively, many authors choose GPs or other nonparametric functions to model time series. Compared to linear models, GPs only have locality constraints in which close time points in a time series usually have close values. Therefore, GPs bring in more flexibility in capturing temporal trajectories of variables.

The straightforward way of applying GPs to the imputation on multivariable time series is to fit a single GP model on each time series and then make predictions for missing entries separately. However, without taking into account similarities and correlations across multiple time series, only fitting a single GP model on each time series may yield less effective imputation. Many researchers attempt to extend GP-based methods to multivariable settings. Futoma et al. [14] apply the multi-task Gaussian Processes (MTGP) prediction [4], a multiple-outcome modeling approach in the context of GPs, to impute missing values in longitudinal data. However, the quality of estimating missing values relies on the estimation of covariance structure among variables when using MTGP or other multi-task functional approaches [19, 8, 25]. To make a confident estimation of the covariance, a large amount of time points with shared observations of these variables are often required by these multi-task approaches [4, 62]. Due to the fact that many patients only have records with a limited number of time points, time series of inpatient clinical laboratory tests fall short of such a requirement. Hori et al. [21] extend MTGP to imputation on multi-environmental trial data in a third-order tensor. However, this approach is not applicable to clinical data with a large number of patients, since the covariance matrix of observed values needs to be computed together with its inverse, which is intractable for our datasets.

Recently, Luo et al. [33] explore the application of GPs in clinical data and propose an algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that combines the imputation from GP and traditional MICE based on weighting equations. However, the weighting equations are calculated only based on the standard deviations of GP and MICE predictions for missing values. The weighting strategy is static and not optimized. We postulate that calculating weights through an optimization problem can help to improve the imputation quality. In our work, instead of the predictive mean matching used in [33], we choose linear regression as one component of our model. Our model is also based on a statistical model and thus statistically justified which is not the case for [33].

The mixture model proposed in this work differs from the mixtures of Gaussian processes (MGP) [50] in several ways. The mixing components in our model are not limited to Gaussian process, thus bringing in possibilities of capturing correlations with various choices of parametric functions. More importantly, our model is designed for multivariable time series and captures both temporal and cross-sectional correlations. The latter is ignored in [50] when MGP is applied to multivariable time series, which yields less effective predictions since some clinical tests have strong cross-sectional relations. Without a Gaussian process component, the standard Gaussian mixture model (GMM) is well studied and has been extended to imputation tasks [11, 9, 60, 45]. These approaches still suffer from losing informative temporal correlations when imputing missing values in longitudinal data.

In order to effectively model the interaction between different aspects, we represent the data as a tensor with each aspect being one mode. For that reason, our method is also considered as a tensor completion approach. Tensor completion problems are extensively studied. However, the classic tensor completion methods [12, 30, 49] focus on general tensors and usually do not consider temporal aspects. In recent years, many studies explore the application of temporal augmented tensor completion on imputing missing values in time series [2, 63, 15, 48, 7]. These methods discretize time into evenly sampled intervals. However, due to the fact that inpatient clinical laboratory tests are usually measured at varying intervals, assembling clinical data over regularly sampled time periods might have several drawbacks, such as leading to sparse tensors if discretizing time at fine granularity (e.g. every minute) while some laboratory tests are measured less frequently (e.g. daily). Furthermore, extending these methods to the case, where time is not regularly sampled, is not easy and straightforward, requiring changing design details and the objective functions to be optimized. Recently, Yang et al. [61] propose a tensor completion method that can deal with irregularly sampled time. They extend the PACIFIER imputation framework [66] and propose a time-aware matrix decomposition method to estimate missing values in predicting septic shock. However, most components of this approach are tailored to the characteristics of septic patients. In this work, we implement the imputation approach proposed in [61] with only the time-aware mechanism, which is general and applicable to our experimental settings. However, this approach is not so effective in our experiments and thus it is not included as a benchmark in this paper. Lately, Zhe et al. [65] propose a Bayesian nonparametric tensor decomposition model that captures temporal correlations between interacting events. However, this approach is not directly applicable to continuous multivariable time series because it focuses on discrete events and captures temporal correlations between the occurrences of events.

III Methodology

III-A Imputation Framework

In many predictive tasks on temporal clinical data, time series are often aligned into the same-length sequences to derive more robust patient phenotypes through matrix decomposition or discover feature groups by applying sequence/graph mining techniques [61, 59]. We model under this assumption. In this work, we use tensor representation, in which patients have the same number of time points. We represent the data collected from P patients with V laboratory tests and B time points as two 3D tensors 𝒳P×V×B and 𝒯P×V×B, shown in Fig. 1. Each laboratory test measurement xp,v,b is stored in the measurement tensor 𝒳. Each time tp,v,b, when xp,v,b is measured, is stored in the time tensor 𝒯.

Table I lists the main symbols we use throughout the paper. Missing values in the measurement tensor are denoted as xp,v,bmis, showing that the value of test v at time index b for patient p is missing. Correspondingly, xp,v,bobs denotes an observed value. The time tensor 𝒯 is complete, since we only collect patient records at the time when at least one laboratory test measurement is available. We assume we know the “prescribed” time a missing measurement should have been taken. In the matrix x:,:,b at time index b, the measurement time tp,v,b and tq,v,b can be different when pq, whereas for a given patient p, we have tp,v,b=tp,u,b for v,u[1:V]. That is, all tests for a particular patient are taken at the same time.

Fig. 1: Measurement and time tensor. An example of the inputs and output of the mixture model MixV,B is shown as the colorful fiber and matrices. In MixV,B, the target output is x:,V,B shown in purple and the inputs are x:,-V,B and x:,V,-B shown in red and blue matrices, respectively, excluding the purple fiber. We model the output with a mixture model, where we train a linear regression on the red matrix and train GPs or/and another linear regression on the blue matrix.

Disregarding the temporal dimension, the imputation problem is well studied. If one dimension is time, in order to apply imputation methods that are not designed for time series, we need to disregard the temporal aspect or ignore temporal correlations of the data. However, temporal trajectories can reveal patient’s underlying pathophysiological evolution, the modeling of which can help to better estimate missing values. For the reason that both cross-sectional information and temporal aspects can impact the estimation of missing measurements, we explore mixture models, which are composed of several base models through either a cross-sectional or temporal view. We introduce these base models in Section III-B.

TABLE I: Main symbols and definitions
Symbol Definition
𝒳, 𝒯 Measurement and time tensor
xp,v,-b Measurements in fiber xp,v,: excluding xp,v,b
tp,v,-b Times in fiber tp,v,: excluding tp,v,b
Mixv,b Mixture model for v and b.
Vv,b Concatenation of x:,-v,b and x:,v,-b
𝒩(𝝁v,b(k),𝚺v,b(k))

The kth prior multivariate normal distribution in Mixv,b

m(3)(), Σ(3)() Predictive mean and variance of a GP model
γv,b

The set of all trainable parameters of Mixv,b

I The identity matrix

In our imputation framework, a mixture model is trained for each variable and time index. We use Mixv,b to denote the mixture model to impute missing values of variable v at time index b. The missing values x:,v,bmis in the fiber x:,v,b are imputed by the optimized Mixv,b. In each iteration of the algorithm, it is assumed that all other values are known and only x:,v,bmis is imputed. This is wrapped in an outer loop. We call this procedure simple-pass tensor imputation, i.e., one pass through all v,b. Since several simple-pass tensor imputations are conducted, our approach is also considered as an iterative imputation [16], which can also be regarded as a sampling-based approach where a Gibbs sampler is used to approximate convergence. The convergence of iterative imputation methods can be quite fast with a few iterations [6].

In detail, the iterative imputation approaches start by replacing all missing data with values from simple guesses; we fill in all missing values with initial estimates by taking random draws from observed values. This procedure is called an initial imputation. Then we perform iterative tensor imputation on each copy separately. The training procedure and imputation for fibers are introduced in Section III-C and III-D. We also rely on the concept of multiple imputations, where several iterative imputations are performed and the imputed values are averaged at the end. Each iterative imputation starts with a different iterative imputation tensor and/or uses a different order of v,b.

In summary, the algorithm creates M different copies 𝒳1,,𝒳M of 𝒳, each one filled with different random 𝒳mis. For each i=1,,M, we then perform K simple-pass tensor imputations. Each simple-pass has a loop over all v,b, which uses Mixv,b to adjust 𝒳i,mis. At the end of the M imputations, 𝒳i,mis are averaged across all i to yield the final imputed tensor.

The whole imputation process involves M×K×V×B imputation models. We next first focus on the base models behind Mixv,b and then on the actual mixture model.

III-B Base Models

Our mixture models are composed of three components that are derived from two base models, linear regression and Gaussian processes. One component consists of GP models and the other two components are linear models through two different views of the measurement tensor. Through a cross-sectional view, the tensor can be considered as a vector of patient-by-variable matrices at different time indices. Through a temporal view, we can view the tensor as a vector of patient-by-time matrices for different variables.

III-B1 Linear model through cross-sectional view

We can view the measurement tensor 𝒳 as a vector of patient-by-variable matrices. On the slice x:,:,b, we use a linear regression model to fit the target variable v as a function of the other variables except v. The target values x:,v,b are modeled as

x:,v,b=x:,-v,bβv,b(1)+ϵv,b(1),ϵv,b(1)𝒩(0,σv,b(1)2I) (1)

where βv,b(1) is the column vector of coefficients and σv,b(1) is the standard deviation of the error ϵv,b(1), regarding the regression model through cross-sectional view for variable v and time index b.

The likelihood distribution of x:,v,b is then given by

x:,v,b|x:,-v,b,βv,b(1),σv,b(1)2𝒩(x:,-v,bβv,b(1),σv,b(1)2I). (2)

The training data consists of observed target values (xp,v,b)pPv,btr and the input data (xp,-v,b)pPv,btr, where Pv,btr is the training patient set and includes p only if xp,v,b is observed.

III-B2 Linear model through temporal view

In addition to the cross-sectional view, we can also view the measurement tensor 𝒳 as a vector of patient-by-time matrices. On matrix x:,v,:, we use linear regression to model the measurements at time index b against those at other indices.

The target values x:,v,b are modeled as

x:,v,b=x:,v,-bβv,b(2)+ϵv,b(2),ϵv,b(2)𝒩(0,σv,b(2)2I) (3)

where βv,b(2) is the column vector of coefficients and σv,b(2) is the standard deviation of the error ϵv,b(2), regarding the linear regression model though temporal view for variable v and time index b.

The likelihood distribution of x:,v,b is given by

x:,v,b|x:,v,-b,βv,b(2),σv,b(2)2𝒩(x:,v,-bβv,b(2),σv,b(2)2I). (4)

The training data consists of observed target values (xp,v,b)pPv,btr and input data (xp,v,-b)pPv,btr.

III-B3 Gaussian processes through temporal view

Gaussian processes are commonly used to capture trajectories of variables, thus used in our mixture model to capture temporal correlations. Through the same temporal view as introduced above, on matrix x:,v,:, we fit GPs on time series for each patient.

The target value xp,v,b is modeled as

xp,v,b=μp,v,b+f(tp,v,b),f(tp,v,b)𝒢𝒫(0,𝒦(tp,v,b,tp,v,b)) (5)

where μp,v,b is the overall mean of the model, f() is a Gaussian process with mean of 0 and a covariance matrix 𝒦(t,t) of time pairs (t,t). Then the likelihood distribution of xp,v,b is written as

xp,v,b|αp,v,b𝒩(m(3)(αp,v,b),Σ(3)(αp,v,b))αp,v,b=(θv,b,xp,v,-b,tp,v,-b) (6)

where θv,b are the kernel parameters of the GP models, and the predictive mean and variance are given by m(3)() and Σ(3)(); see more details in Appendix B. For a certain v and b, all GP models share the same kernel parameters θv,b.

III-C The Mixture Model

Given the likelihood distribution of all three components, we model the joint mixture distribution, regarding the variable v and time index b, in the following way

p(x:,v,b,Vv,b)=πv,b(1)𝒩(Vv,b|δv,b(1))𝒩(x:,v,b|x:,-v,bβv,b(1),σv,b(1)2I)+πv,b(2)𝒩(Vv,b|δv,b(2))𝒩(x:,v,b|x:,v,-bβv,b(2),σv,b(2)2I)+πv,b(3)𝒩(Vv,b|δv,b(3))𝒩(x:,v,b|m(3)(𝜶v,b),diag(Σ(3)(𝜶v,b))) (7)

where we define Vv,b=[x:,-v,bx:,v,-b], 𝜶v,b=(αp,v,b)pP, δv,b(k)=(𝝁v,b(k),𝚺v,b(k)) and πv,b(k) is the mixing weight for the kth component. This model can be interpreted as the joint distribution between observed data Vv,b and missing values x:,v,b, which consists of a mixture of three distributions. The first one p1(x:,v,b,Vv,b) is modeled as

p1(x:,v,b,Vv,b)=p(x:,v,b|Vv,b)p(Vv,b)=𝒩(x:,v,b|μ1(Vv,b),σ1(Vv,b))𝒩(Vv,b|δv,b(1))=𝒩(x:,v,b|x:,-v,bβv,b(1),σv,b(1)2I)𝒩(Vv,b|δv,b(1)) (8)

the remaining two follow the same logic.

By marginalizing over x:,v,b, the prior probability distribution p(Vv,b) is written as

p(Vv,b)=k=13πv,b(k)𝒩(Vv,b|δv,b(k)) (9)

which is a mixture of Gaussians. It also follows that

p(x:,v,b|Vv,b)=p(x:,v,b,Vv,b)p(Vv,b)=Cv,b(1)𝒩(x:,v,b|x:,-v,bβv,b(1),σv,b(1)2I)+Cv,b(2)𝒩(x:,v,b|x:,v,-bβv,b(2),σv,b(2)2I)+Cv,b(3)𝒩(x:,v,b|m(3)(𝜶v,b),diag(Σ(3)(𝜶v,b))) (10)

where we define Cv,b(k)=πv,b(k)𝒩(Vv,b|δv,b(k))j=13πv,b(j)𝒩(Vv,b|δv,b(j)).

We train our mixture model on observed target values by maximizing the log likelihood of the joint mixture distribution

γ^v,b=argmaxγv,blnp(x:,v,bobs,Vv,btr;γv,b)

where Vv,btr is the training input data and defined as the concatenation of (xp,-v,b)pPv,btr and (xp,v,-b)pPv,btr, and γv,b is the set of all trainable parameters

βv,b(1),σv,b(1)2,βv,b(2),σv,b(2)2,θv,bandπv,b(k),𝝁v,b(k),𝚺v,b(k)fork=1,2,3.

After training, the missing values are imputed using individualized (per-patient) mixing weights that are derived from the conditional distribution. The missing value xp,v,bmis is imputed by

Πp,v,b(1)xp,-v,bβ^v,b(1)+Πp,v,b(2)xp,v,-bβ^v,b(2)+Πp,v,b(3)m(3)(α^p,v,b)

where the individualized mixing weight of the kth component for patient p, variable v and time index b is defined as

Πp,v,b(k)=π^v,b(k)𝒩(Vp,v,b|𝝁^v,b(k),𝚺^v,b(k))j=13π^v,b(j)𝒩(Vp,v,b|𝝁^v,b(j),𝚺^v,b(j)),k=1,2,3 (11)

where Vp,v,b is the observed data and defined as the concatenation of xp,-v,b and xp,v,-b.

III-D Mixture Parameter Estimation

Let (γv,b) be the log likelihood lnp(x:,v,bobs,Vv,btr;γv,b). Explicitly maximizing (γv,b) is hard. Instead, we use the EM algorithm to repeatedly construct a lower-bound on (γv,b) and then optimize that lower-bound (γv,b). We first define a latent indicator variable qv,b{1,2,3} that specifies which mixing component that data points come from. Then we use Jensen’s inequality to get the lower-bound (γv,b), which is given by

(γv,b)=pPv,btrqv,bQp(qv,b)lnp(xp,v,b,Vp,v,b,qv,b;γv,b)Qp(qv,b)pPv,btrlnk=13p(qv,b=k)p(xp,v,b,Vp,v,b;γv,b|qv,b=k)=(γv,b) (12)

where

Qp(qv,b=k)=p(qv,b=k)p(xp,v,b,Vp,v,b|qv,b=k)j=13p(qv,b=j)p(xp,v,b,Vp,v,b|qv,b=j). (13)

In (12) and (13), the marginal distribution p(qv,b=k) over qv,b is specified by the mixing coefficients πv,b(k)=p(qv,b=k). We can view Qp(qv,b=k) as the responsibility that component k of the mixture model Mixv,b takes to “explain” xp,v,b. We use the standard EM algorithm to maximize the lower-bound (γv,b); see more details about the estimation of the parameters in Appendix A.

TABLE II: Overall MASE by dataset and imputation model. The bold numbers are the significantly best values among all imputation models.
Dataset GP MTGP1 GMM MICE 3D-MICE LL En-LLG
Real-world MIMIC (d=0) 0.13072 0.12599 0.11741 0.11763 0.11186 0.09304 0.09285
Synthetic MIMIC (d=0.5) 0.10466 - 0.11455 0.11573 0.09087 0.08612 0.08448
Synthetic MIMIC (d=1) 0.09220 - 0.11436 0.11561 0.07715 0.08427 0.07538
NMEDW 0.18353 0.17370 0.13593 0.13718 0.13624 0.11600 0.11589
1 MTGP cannot have multiple inputs, thus not applicable to the synthetic datasets.
TABLE III: MASE on the real-world MIMIC dataset by variable and imputation model. The bold numbers are the best values among all imputation models.
Variable GP MTGP GMM MICE 3D-MICE LL En-LLG
Chloride 0.12993 0.12549 0.10650 0.10575 0.10836 0.08664 0.08603
Potassium 0.11533 0.11256 0.10963 0.10997 0.10822 0.09453 0.09442
Bicarbonate 0.13196 0.12905 0.12231 0.12275 0.11984 0.10302 0.10254
Sodium 0.12525 0.11939 0.10159 0.10138 0.10727 0.08787 0.08807
Hematocrit 0.11436 0.10780 0.06518 0.06558 0.06726 0.05486 0.05482
Hemoglobin 0.14168 0.12997 0.05774 0.05772 0.06301 0.05117 0.05103
MCV 0.14215 0.13732 0.13391 0.13474 0.13340 0.11634 0.11657
Platelets 0.13855 0.13087 0.14203 0.14236 0.12815 0.10090 0.10070
WBC count 0.13963 0.13614 0.14026 0.14068 0.13060 0.10934 0.10913
RDW 0.14592 0.13668 0.15778 0.15836 0.13897 0.11340 0.11340
Blood urea nitrogen 0.12358 0.12720 0.15160 0.15189 0.11814 0.09479 0.09410
Creatinie 0.13341 0.12803 0.13211 0.13212 0.12217 0.10067 0.10014
Glucose 0.12491 0.12420 0.11771 0.11794 0.11921 0.10493 0.10501

III-E Special Cases and an Ensemble Model

Our imputation model provides flexibility in changing base models. The mixture model mentioned above consists of three base models. Two of them are applied through the same temporal view of the measurement tensor. We can drop one of the models through temporal view to yield a new mixture model. For example, if the GP model through temporal view is removed from the mixture model for variable v and time point b, the new mixture model is now a mixture of two linear models.

Each mixture model can be a mixture of all three base models (the Linear-Linear-GP [LLG] model), a mixture of two linear models (the Linear-Linear [LL] model) or a mixture of linear and GP models. All these mixtures may have an overfitting problem due to the large amount of trainable parameters. To further improve the imputation quality, we build an ensemble model (En-LLG) to reduce variance by allowing each mixture model to be a mixture of either two linear models or all three base models.

In En-LLG, for each variable and time index, we train the mixture model with two linear models and the mixture model with all three base models, and then select the one with less training error as the final mixture model, which is used to do imputation. Although we train our mixture models by maximizing the likelihood, we use the absolute training error as the selection criteria because the likelihood of the mixture model with two linear models is not in the same scale as the mixture model with three base models.

IV Experimental Setup

IV-A Real-world Datasets

We collect two real-world datasets from the Medical Information Mart for Intensive Care (MIMIC-III) database [24] and the Northwestern Medicine Enterprise Data Warehouse (NMEDW). Each dataset contains inpatient test results from 13 laboratory tests. These tests are quantitative and frequently measured on hospital inpatients. They are the same as those used in [33] in their imputation study. We organize the data by unique admissions. We distinguish multiple admissions of the same patient. Each admission consists of time series of the 13 laboratory tests. Although our models are evaluated only on continuous variables, they are also applicable on categorical variables with a simple extension (e.g. through predictive mean matching) [28].

In both MIMIC-III and NMEDW datasets, the length of time series varies across admissions. To apply our imputation models on these datasets, we truncate time series so that they have the same length. The length is the average number of time points across all admissions. Before truncating, the average number of time points in MIMIC-III dataset is 11. We first exclude admissions that have less than 11 time points, and then we truncate time series by removing measurements taken after the 11-th time point. We also exclude admissions that contain time series that have no observed values. Our MIMIC-III dataset includes 26,154 unique admissions and the missing rate is about 28.71%. The same data collection procedure is applied on the NMEDW dataset (approved by Northwestern IRB) where we end up with 13,892 unique admissions that have 7 time points. The missing rate of the NMEDW dataset is 24.22%.

IV-B Synthetic Datasets

We create synthetic datasets to explore and discover the properties of the data that might benefit our models and/or comparison models. In synthetic datasets, we augment the correlation between measurements and times. We do not augment correlations by imposing strong constraints on time series where closer measurements have closer values. Instead, we generate synthetic times by altering real times so that the constraints in synthetic data are “slightly” stronger than real-world data. We also introduce a scaling factor d to control the strength of the constraints in synthetic data.

The synthetic datasets are generated based on the real-world MIMIC-III dataset. We move two consecutive times of a time series closer, if the relative difference Δx~ in two consecutive measurements is smaller than the relative difference Δt~ in two consecutive times. The relative differences Δx~ and Δt~ of a time series are given by

Δx~i=|xi-xi-1|i=2B|xi-xi-1|Δt~i=|ti-ti-1|i=2B|ti-ti-1|.

The scaling factor d(0,1) controls how much we move times. If d=0, we do not move times. In other words, the synthetic dataset at d=0 is the same as the real-world MIMIC-III dataset. As d increases, stronger constraints are introduced to synthetic data. The synthetic time t for a time series is generated as follows:

ti={t1,if i=1ti+j=2i[d(Δx~j-Δt~j)S],otherwiseS=j=2B(|tj-tj-1|).

If a time series has missing values, we first calculate the synthetic times for the observed measurements. Then we perform a linear interpolation between real times and synthetic times for observed measurements to generate synthetic times for missing measurements.

IV-C Evaluation of Imputation Quality

We randomly mask 20% observed measurements in a data set as missing and treat the masked values as the test set. The remaining observed values are used in training. We impute originally missing and masked values together, and compare the imputed values with the ground truth for masked data to evaluate imputation performance.

We use Mean Absolute Scaled Error (MASE) [23] to measure the quality of imputation on the test set. MASE is a scale-free measure of the accuracy of predictions and recommended by Hyndman et al [23, 13] to measure the accuracy of predictions for series. In this work, we calculate MASE for all tests (variables) and take a weighted average, according to the number of masked values of a variable, to get an overall MASE per dataset.

Let maskp,v be the set of cardinality Ip,v of all time indices that have been masked for patient p and variable v. Also let Yp,v=(xp,v,jobs)j be the sequence of length Jp,v of all observed values for patient p and variable v, and let x~p,v,i represent the imputed value. The MASE for variable v is defined as

MASE(v)=1p¯Ip¯,vpimaskp,v|x~p,v,i-xp,v,iobs|Jp,vJp,v-1j=2Jp,v|Yp,v,j-Yp,v,j-1|.

To show the effectiveness of our imputation models, we compare the MASE scores of our models (LL and En-LLG) with other five imputation methods: (a) MICE [6] with 100 imputations, where the average of all imputations are used for evaluation; (b) the pure Gaussian processes, where a GP model is fitted to the observed data of each time series using GPfit [34] in R and missing values are replaced with the predictions from the fitted GP models; (c) 3D-MICE, a state-of-the-art imputation model [33] for which we obtain their code and adapt it to account for our use of the tensor representation; (d) multi-task Gaussian process prediction (MTGP) [4] and (e) a Gaussian mixture model for imputation (GMM) [11, 35]. To tune hyperparameters if any in these models, we mask out 20% observed measurements in the training set as a validation set and tune hyperparameters on the validation set.

We introduce π(1),π(2) and π(3) as hyperparameters of our models. In our experiments, all mixture models start with the same mixing weights, i.e., πv,b(k)=π(k) for k=1,2,3. Other trainable parameters are initialized directly from the data after initial imputation and do not require a manual specification.

We run our En-LLG model for M=3 multiple imputations with K=2 iterations and run the LL model with more multiple imputations (M=5) and more iterations (K=5). We run more iterations of the LL model, because the LL model takes less time than the En-LLG model. For 3D-MICE, we set the number of multiple imputation to 40, instead of 100 that is suggested in [33], to balance the performance and running time.

V Results

V-A Performance Comparison

Table II and Fig. 2 compare all the imputation models on 4 datasets using MASE. Table II shows the overall MASE score of each imputation model on all datasets. Fig. 2 provides a comparison for all imputation models in the MASE score over 3D-MICE by showing the percentage deviation against 3D-MICE. We select 3D-MICE since it is a state-of-the-art benchmark model. We observe that our En-LLG model outperforms all comparison models on all 4 datasets. The LL model is outperformed only by 3D-MICE on the synthetic MIMIC (d=1) dataset. The En-LLG model is significantly better than the second best model (p=.001, permutation test with 1000 replicates) on all 4 datasets.

Fig. 2: Percentage deviation of MASE score against 3D-MICE

Table III shows a variable-wise comparison of the imputation models on the real-world MIMIC (d=0) dataset. Our two imputation models outperform all comparison models on all variables. The En-LLG model is better than the LL model on most variables. All models except GP and MTGP achieve a much lower error on Hematocrit and Hemoglobin than on other variables. The reason is that these two variables are highly correlated. Those methods that capture the correlation between variables can reasonably infer missing values for Hematocrit from observed measurements of Hemoglobin, and vice versa. Compared to MICE, our models achieve even lower errors on these two variables, which indicates that temporal correlations captured by our models help to make better estimation of missing values, even when there is a more dominant cross-sectional correlation.

As shown in Table II, all models except MTGP benefit from the increment of d, the scaling factor when generating synthetic data. The reason is that all models take into account temporal aspects and the measurements in the synthetic time series have stronger temporal correlations as d increases. The reason that MICE and GMM also benefit from the temporal correlations is that we include time as a feature in addition to variable features. We also try to exclude times, however, experiments show that these two models perform better when times are included. As shown in Table II, GP and 3D-MICE benefit the most as d increases from 0 to 1, MICE and GMM benefit the least and our models (LL and En-LLG) are in the middle. The LL model is outperformed by 3D-MICE when d increases to 1. However, En-LLG shows its robustness to the variation of d in our current experimental settings.

(a) Chloride, t1, d=0
(b) Chloride, t6, d=0
(c) Chloride, t1, d=0.5
(d) Chloride, t6, d=0.5
Fig. 3: A comparison between the individualized mixing weights Π and the optimized responsibilities Q that the GP component should take to “explain” observed measurements. The plots are from the real-world MIMIC (d=0) and synthetic MIMIC (d=0.5) dataset, and for the mixture models of Chloride at time point 1 and 6. The distributions of the optimized responsibilities are shown in blue and the distributions of individualized mixing weights are in yellow.

V-B Individualized Weights

By introducing individualized (per patient) mixing weights Π defined in (11), we improve the performance of our En-LLG model in the MASE score from 0.08351 to 0.07538, an improvement of 9.73% compared against the model where each mixture component has a fixed weight for all patient cases. The reason that individualized weights are better than fixed weights in our model might be that they better approximate the responsibilities.

In training, we can optimize the responsibility a component should take to “explain” an observed target value xp,v,b for pPv,btr. These correspond to Q in (13). However, when making inference, we can not calculate the responsibility each component should take to “explain” missing values, because responsibilities depend on observed target values, according to (13). We have to use Π, the individualized mixing weights. In a standard mixture model, we could use πv,b(k), which is the average of responsibilities of the kth component across all training patients, as a fixed weight that the kth component should contribute to impute missing values x:,v,bmis for all test patients. However, patient time series can be very different and the confidence of predictions by the GP component can vary largely across different patient cases. In our mixture model, therefore, a fixed weight can not reflect such variation in prediction confidence.

We shall view an individualized mixing weight as an approximation of how much responsibility a component should take to impute the missing value for a particular patient case. It is tailored for each patient. As defined in (11), the individualized mixing weights only depend on the inputs, therefore, we can calculate them when making inferences on the test set.

In Fig. 3, we plot the distribution of the individualized weights Π of the GP component in the training set and compare it with the distribution of the optimized responsibility values Q. The responsibilities that the GP component should take can vary a lot in different patient cases, especially on the synthetic dataset, which implies that it is more reasonable for patients to get individualized mixing weights than a fixed weight. We also observe that the individualized mixing weights reasonably mimic the distribution of the optimized responsibilities on the training set. The improvement of our model on the test set attests that the individualized weights approximate the responsibilities better than fixed weights.

In addition, 3D-MICE also assigns individualized weights at the same level of granularity as our models, however, weights in 3D-MICE are calculated only based on the deviations of the cross-sectional imputation and the temporal imputation, and are not optimized. As shown in Table III, the improvement of our models over 3D-MICE implies that our models provide a more accurate weighting solution when combining the cross-sectional and temporal imputation.

V-C Time Complexity

(a) EM
(b) Adam
Fig. 4: Log likelihood of the whole mixture model in EM and log likelihood of the GP component in Adam for Chloride at time point 1.

We compare the running times of our proposed models and all comparison models on the real-world MIMIC dataset. All models ran on the same Linux server. The MTGP and GMM models are executed sequentially and other models run in parallel with 20 cores. The LL model (taking 4.2 hours), GP (1.1 hours), MTGP (1.2 hours) and GMM (2.2 hours) are the four fastest models; 3D-MICE (156.1 hours) is the slowest; MICE (77.5 hours) and the En-LLG model (109.5 hours) are in the middle.

The LL model is much faster than the En-LLG model because we can explicitly calculate the estimates of parameters for the mixture model with only linear components. However, it is hard to directly calculate the parameter estimates of the GP component. In En-LLG, we use the Adam optimizer to update the parameter estimates of the GP component in each EM iteration. In our experiments, the EM algorithm and the Adam optimization procedure converge in a few iterations. As shown in Fig. 3(a), the increment speed of optimizing the log likelihood of the mixture model decreases dramatically after a few EM iterations. Fig. 3(b) shows that the log likelihood with respect to the GP component converges within 10 Adam iterations. We observe a similar convergence property in other mixture models.

We notice that our models can have an overfitting problem due to the large amount of trainable parameters. We observe that, in many mixture models, the log likelihood keeps increasing, however, the imputation error stops decreasing after a few EM iterations. We alleviate the overfitting problem by terminating the EM algorithm earlier. We terminate it when the mean absolute error on the training set stops decreasing. In our experiment, this strategy is better than a vanilla ridge regularization for handling the overfitting problem. In addition to the improvement of imputation quality, this strategy also helps to reduce the running time of our model, as the EM algorithm stops earlier with this strategy. Although we alleviate the overfitting problem, we expect further improvements in the performance of our imputation model with better strategies for handling overfitting.

VI Conclusions

We present and demonstrate mixture-based imputation models for multivariable clinical time series. Our models can capture both cross-sectional and temporal correlations in time series. We integrate Gaussian processes with mixture models and introduce individualized mixing weights to further improve imputation accuracy. We show that our best model can provide more accurate imputation than MICE, GMM, GP, MTGP and 3D-MICE, a state-of-the-art imputation model that integrates cross-sectional and longitudinal imputation. Although in this work our models are tested on inpatient clinical data, they can also be applied to other multivariable time series data in healthcare and other domains with necessary adaptation.

References

  • [1] J. L. Arbuckle (1996) Full information estimation in the presence of incomplete data. In Advanced Structural Equation Modeling: Issues and Techniques, G. A. Marcoulides and R. E. Schumacker (Eds.), Vol. 243, pp. 277. Cited by: §I.
  • [2] M. T. Bahadori, Q. R. Yu, and Y. Liu (2014) Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pp. 3491–3499. Cited by: §II.
  • [3] F. Bashir and H. Wei (2017) Handling missing data in multivariate time series using a vector autoregressive model-imputation (var-im) algorithm. Neurocomputing. Cited by: §II.
  • [4] E. V. Bonilla, K. M. Chai, and C. Williams (2008) Multi-task gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153–160. Cited by: §II, §IV-C.
  • [5] R. L. Brown (1994) Efficacy of the indirect approach for estimating structural equation models with missing data: a comparison of five methods. Structural Equation Modeling: A Multidisciplinary Journal 1 (4), pp. 287–316. Cited by: §I.
  • [6] S. Buuren and K. Groothuis-Oudshoorn (2011) Mice: multivariate imputation by chained equations in r. Journal of Statistical Software 45 (3). Cited by: §I, §I, §III-A, §IV-C.
  • [7] Y. Cai, H. Tong, W. Fan, P. Ji, and Q. He (2015) Facets: fast comprehensive mining of coevolving high-order time series. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 79–88. Cited by: §II.
  • [8] J. Chiou, Y. Zhang, W. Chen, and C. Chang (2014) A functional data approach to missing value imputation and outlier detection for traffic flow data. Transportmetrica B: Transport Dynamics 2 (2), pp. 106–129. Cited by: §II.
  • [9] O. Delalleau, A. Courville, and Y. Bengio (2012) Efficient em training of gaussian mixtures with missing data. arXiv preprint arXiv:1209.0521. Cited by: §II.
  • [10] Y. Deng, C. Chang, M. S. Ido, and Q. Long (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports 6, pp. 21689. Cited by: §I.
  • [11] M. Di Zio, U. Guarnera, and O. Luzi (2007) Imputation through finite gaussian mixture models. Computational Statistics & Data Analysis 51 (11), pp. 5305–5316. Cited by: §II, §IV-C.
  • [12] M. Filipović and A. Jukić (2015) Tucker factorization with missing data with application to low-n-rank tensor completion. Multidimensional Systems and Signal Processing 26 (3), pp. 677–692. Cited by: §II.
  • [13] P. H. Franses (2016) A note on the mean absolute scaled error. International Journal of Forecasting 32 (1), pp. 20–22. Cited by: §IV-C.
  • [14] J. Futoma, S. Hariharan, and K. Heller (2017) Learning to detect sepsis with a multitask gaussian process rnn classifier. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1174–1182. Cited by: §II.
  • [15] H. Ge, J. Caverlee, N. Zhang, and A. Squicciarini (2016) Uncovering the spatio-temporal dynamics of memes in the presence of incomplete information. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1493–1502. Cited by: §II.
  • [16] A. Gelman (2004) Parameterization and bayesian modeling. Journal of the American Statistical Association 99 (466), pp. 537–545. Cited by: §III-A.
  • [17] J. W. Graham (2009) Missing data analysis: making it work in the real world. Annual Review of Psychology 60, pp. 549–576. Cited by: §I.
  • [18] O. Harel and X. Zhou (2007) Multiple imputation for the comparison of two screening tests in two-phase alzheimer studies. Statistics in Medicine 26 (11), pp. 2370–2388. Cited by: §I.
  • [19] Y. He, R. Yucel, and T. E. Raghunathan (2011) A functional multiple imputation approach to incomplete longitudinal data. Statistics in Medicine 30 (10), pp. 1137–1156. Cited by: §II.
  • [20] E. E. Holmes, E. J. Ward, and K. Wills (2012) MARSS: multivariate autoregressive state-space models for analyzing time-series data. R Journal 4 (1). Cited by: §II.
  • [21] T. Hori, D. Montcho, C. Agbangla, K. Ebana, K. Futakuchi, and H. Iwata (2016) Multi-task gaussian process for imputing missing data in multi-trait and multi-environment trials. Theoretical and Applied Genetics 129 (11), pp. 2101–2115. Cited by: §II.
  • [22] C. Hsu, J. M. Taylor, S. Murray, and D. Commenges (2006) Survival analysis using auxiliary variables via non-parametric multiple imputation. Statistics in Medicine 25 (20), pp. 3503–3517. Cited by: §I.
  • [23] R. J. Hyndman and A. B. Koehler (2006) Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. Cited by: §IV-C.
  • [24] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific Data 3, pp. 160035. Cited by: §IV-A.
  • [25] S. Kliethermes and J. Oleson (2014) A bayesian approach to functional mixed-effects modeling for longitudinal data with binomial outcomes. Statistics in Medicine 33 (18), pp. 3130–3146. Cited by: §II.
  • [26] I. S. Kohane (2015) Ten things we have to do to achieve precision medicine. Science 349 (6243), pp. 37–38. Cited by: §I.
  • [27] R. Little and H. An (2004) Robust likelihood-based analysis of multivariate data with missing values. Statistica Sinica, pp. 949–968. Cited by: §I.
  • [28] R. J. Little (1988) Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6 (3), pp. 287–296. Cited by: §IV-A.
  • [29] M. Liu, J. M. Taylor, and T. R. Belin (2000) Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 56 (4), pp. 1157–1163. Cited by: §II.
  • [30] Y. Liu, F. Shang, L. Jiao, J. Cheng, and H. Cheng (2015) Trace norm regularized candecomp/parafac decomposition with missing data. IEEE Transactions on Cybernetics 45 (11), pp. 2437–2448. Cited by: §II.
  • [31] Q. Long, C. Hsu, and Y. Li (2012) Doubly robust nonparametric multiple imputation for ignorable missing data. Statistica Sinica 22, pp. 149. Cited by: §I.
  • [32] Y. Luo, P. Szolovits, A. S. Dighe, and J. M. Baron (2016) Using machine learning to predict laboratory test results. American Journal of Clinical Pathology 145 (6), pp. 778–788. Cited by: §I.
  • [33] Y. Luo, P. Szolovits, A. S. Dighe, and J. M. Baron (2017) 3D-mice: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. Journal of the American Medical Informatics Association 25 (6), pp. 645–653. Cited by: §II, §IV-A, §IV-C, §IV-C.
  • [34] B. MacDonald, P. Ranjan, and H. Chipman (2015) GPfit: an r package for fitting a gaussian process model to deterministic simulator outputs. Journal of Statistical Software 64 (1), pp. 1–23. Cited by: §IV-C.
  • [35] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §IV-C.
  • [36] L. Qi, Y. Wang, and Y. He (2010) A comparison of multiple imputation and fully augmented weighted estimators for cox regression with missing covariates. Statistics in Medicine 29 (25), pp. 2592–2604. Cited by: §I.
  • [37] T. E. Raghunathan, J. M. Lepkowski, J. Van Hoewyk, and P. Solenberger (2001) A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27 (1), pp. 85–96. Cited by: §I, §I.
  • [38] D. B. Rubin (1978) Multiple imputations in sample surveys-a phenomenological bayesian approach to nonresponse. In Proceedings of the survey research methods section of the American Statistical Association, Vol. 1, pp. 20–34. Cited by: §I.
  • [39] D. B. Rubin (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons. Cited by: §I.
  • [40] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn (1989) Design and analysis of computer experiments. Statistical Science, pp. 409–423. Cited by: Appendix B.
  • [41] J. L. Schafer and J. W. Graham (2002) Missing data: our view of the state of the art.. Psychological Methods 7 (2), pp. 147. Cited by: §I.
  • [42] J. L. Schafer and R. M. Yucel (2002) Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics 11 (2), pp. 437–457. Cited by: §II.
  • [43] J. L. Schafer (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC. Cited by: §I.
  • [44] A. Shah, N. Laird, and D. Schoenfeld (1997) A random-effects model for multiple characteristics with possibly missing data. Journal of the American Statistical Association 92 (438), pp. 775–779. Cited by: §II.
  • [45] D. S. Silva and C. V. Deutsch (2018) Multivariate data imputation using gaussian mixture models. Spatial statistics 27, pp. 74–90. Cited by: §II.
  • [46] D. J. Stekhoven and P. Bühlmann (2011) MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), pp. 112–118. Cited by: §I.
  • [47] Y. Su, A. Gelman, J. Hill, M. Yajima, et al. (2011) Multiple imputation with diagnostics (mi) in r: opening windows into the black box. Journal of Statistical Software 45 (2), pp. 1–31. Cited by: §I.
  • [48] K. Takeuchi, H. Kashima, and N. Ueda (2017) Autoregressive tensor factorization for spatio-temporal predictions. IEEE International Conference on Data Mining. Cited by: §II.
  • [49] R. Tomioka, K. Hayashi, and H. Kashima (2010)(Website) External Links: Link Cited by: §II.
  • [50] V. Tresp (2001) Mixtures of gaussian processes. In Advances in neural information processing systems, pp. 654–660. Cited by: §II.
  • [51] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §I.
  • [52] S. Van Buuren, H. C. Boshuizen, and D. L. Knook (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18 (6), pp. 681–694. Cited by: §I.
  • [53] S. Van Buuren, J. P. Brand, C. G. Groothuis-Oudshoorn, and D. B. Rubin (2006) Fully conditional specification in multivariate imputation. Journal of statistical computation and simulation 76 (12), pp. 1049–1064. Cited by: §I.
  • [54] S. Van Buuren (2018) Flexible imputation of missing data. Chapman and Hall/CRC. Cited by: §I.
  • [55] A. K. Waljee, A. Mukherjee, A. G. Singal, Y. Zhang, J. Warren, U. Balis, J. Marrero, J. Zhu, and P. D. Higgins (2013) Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3 (8), pp. e002847. Cited by: §I, §I.
  • [56] G. M. Weber, W. G. Adams, E. V. Bernstam, J. P. Bickel, K. P. Fox, K. Marsolo, V. A. Raghavan, A. Turchin, X. Zhou, S. N. Murphy, et al. (2017) Biases introduced by filtering electronic health records for patients with “complete data”. Journal of the American Medical Informatics Association 24 (6), pp. 1134–1141. Cited by: §I.
  • [57] R. L. Winslow, N. Trayanova, D. Geman, and M. I. Miller (2012) Computational medicine: translating models to clinical care. Science Translational Medicine 4 (158), pp. 158rv11–158rv11. Cited by: §I.
  • [58] W. Wothke (2000) Longitudinal and multi-group modeling with missing data. In Modeling longitudinal and multiplegroup data: Practical issues, applied approaches, and specific examples, T. D. Little, K. U. Schnabel, and J. Baumert (Eds.), pp. 219–240. Cited by: §I.
  • [59] Y. Xue, D. Klabjan, and Y. Luo (2018) Predicting icu readmission using grouped physiological and medication trends. Artificial Intelligence in Medicine. Cited by: §III-A.
  • [60] X. Yan, W. Xiong, L. Hu, F. Wang, and K. Zhao (2015) Missing value imputation based on gaussian mixture model for the internet of things. Mathematical Problems in Engineering 2015. Cited by: §II.
  • [61] X. Yang, Y. Zhang, and M. Chi (2018) Time-aware subgroup matrix decomposition: imputing missing data using forecasting events. In 2018 IEEE International Conference on Big Data, pp. 1524–1533. Cited by: §II, §III-A.
  • [62] K. Yu, V. Tresp, and A. Schwaighofer (2005) Learning gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning, pp. 1012–1019. Cited by: §II.
  • [63] R. Yu, D. Cheng, and Y. Liu (2015) Accelerated online low-rank tensor learning for multivariate spatio-temporal streams. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 238–247. Cited by: §II.
  • [64] G. Zhang and R. Little (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics 65 (3), pp. 911–918. Cited by: §I.
  • [65] S. Zhe and Y. Du (2018) Stochastic nonparametric event-tensor decomposition. In Advances in Neural Information Processing Systems, pp. 6855–6865. Cited by: §II.
  • [66] J. Zhou, F. Wang, J. Hu, and J. Ye (2014) From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 135–144. Cited by: §II.

Appendix A Parameter Estimation in EM

In the E (Expectation) step, we calculate the responsibilities wp,v,b(k)=Qp(qv,b=k) for pPv,btr using the current values of the parameters in iteration j:

[Cp,v,b(1)](j)=[πv,b(1)](j)[Dp,v,b(1)](j)𝒩(xp,v,b|xp,-v,b[βv,b(1)](j),[σv,b(1)2](j))[Cp,v,b(2)](j)=[πv,b(2)](j)[Dp,v,b(2)](j)𝒩(xp,v,b|xp,v,-b[βv,b(2)](j),[σv,b(2)2](j))[Cp,v,b(3)](j)=[πv,b(3)](j)[Dp,v,b(3)](j)𝒩(xp,v,b|mG([αp,v,b](j)),ΣG([αp,v,b](j)))[Dp,v,b(k)](j)=𝒩(Vp,v,b|[𝝁v,b(k)](j),[𝚺v,b(k)](j)),k=1,2,3[wp,v,b(k)](j)=[Cp,v,b(k)](j)i=13[Cp,v,b(i)](j),k=1,2,3

Let Zv,b=(xp,-v,b)pPv,btr and Yv,b=(xp,v,-b)pPv,btr. In the M (Maximization) step, we re-estimate the parameters in iteration (j+1) using the jth responsibilities:

[πv,b(k)](j+1)=1|Pv,btr|pPv,btr[wp,v,b(k)](j),k=1,2,3[𝝁v,b(k)](j+1)=pPv,btr[wp,v,b(k)](j)Vp,v,bpPv,btr[wp,v,b(k)](j)[𝚺v,b(k)](j+1)=1pPv,btr[wp,v,b(k)](j)pPv,btr[wp,v,b(k)](j)[Up,v,b(k)](j+1)[Up,v,b(k)](j+1)={Vp,v,b-[𝝁v,b(k)](j+1)}{Vp,v,b-[𝝁v,b(k)](j+1)}[βv,b(1)](j+1)={{Zv,b[𝐰v,b(1)](j)Zv,b}-1Zv,b[𝐰v,b(1)](j)x:,v,bobs}[σv,b(1)2](j+1)=1pPv,btr[wp,v,b(1)](j)[Sv,b(1)](j+1)[Sv,b(1)](j+1)=pPv,btr[wp,v,b(1)](j){xp,v,b-xp,-v,b[βv,b(1)](j+1)}2[βv,b(2)](j+1)={{Yv,b[𝐰v,b(2)](j)Yv,b}-1Yv,b[𝐰v,b(2)](j)x:,v,bobs}[σv,b(2)2](j+1)=1pPv,btr[wp,v,b(2)](j)[Sv,b(2)](j+1)[Sv,b(2)](j+1)=pPv,btr[wp,v,b(2)](j){xp,v,b-xp,v,-b[βv,b(2)](j+1)}2[θv,b](j+1)=G([𝐰v,b(3)](j),[θv,b](j),x:,v,bobs,Yv,b,t:,v,:)

where [𝐰v,b(k)](j) is the vector of [wp,v,b(k)](j) for pPv,btr in iteration j. The kernel parameters θv,b of GP models are evaluated by function G, a gradient descent method that calculates the estimates of [θv,b](j+1) to maximize v,b(γ), using [θv,b](j) as the starting point. The first order derivatives of v,b(γ) with respect to θv,b that are used in G are given in Appendix C.

Appendix B GP Model

We assume the GP model discussed here in a mixture model for a certain variable and time, and thus we exclude the subscripts v and b. We use xp,t to denote a measurement of the time series xp at time t for patient p of a certain variable. We use xp,-t to denote a time series without the measurement at time t. The GP model is given by

xp,t=μp,t+f(t),f(t)𝒢𝒫(0,𝒦(t,t))

where μp,t is the overall mean of the model and f(t) is a Gaussian process with mean of 0 and covariance of 𝒦(t,t). Following the maximum likelihood approach, the best linear unbiased predictor (BLUP) [40] at t and the mean squared error are

m(3)(θ,xp,-t,t¯)=(1-rTR-11n1nTR-11n1nT+rT)R-1xp,-tΣ(3)(θ,xp,-t,t¯)=σf2[1-rTR-1r+(1-1nTR-1r)21nR-11n]

where rt(t)=corr(f(t),f(t)), r is the vector of rt(t) for all possible t, t¯ is a vector of time except for time t, R is the (B-1)×(B-1) correlation matrix and the correlation function is given by

Rt,t=exp(-θ|t-t|2).

The estimator σ2 is given by

σf2=CTR-1Cn,C=xp,-t-1n(1nTR-11n)-1(1nTR-1xp,-t)

where 1n is a vector with length (B-1) of all ones.

Appendix C Partial Derivatives in GP

To simplify the notations, we assume that the likelihood function L under consideration is for a mixture model for a certain variable and time. The partial derivative with respect to Gaussian process parameters θ is

Lθ=p=1|ptr|wpθln𝒩(xp,t;m(3)(θ,xp,-t,t¯),Σ(3)(θ,xp,-t,t¯)).

Letting gp(θ)=m(3)(θ,xp,-t,t¯) and hp(θ)=Σ(3)(θ,xp,-t,t¯), we have

Lθ=p=1|ptr|wpθln𝒩(xp,t;gp(θ),hp(θ))=p=1|ptr|wpθ{ln12πhp(θ)-[xp,t-gp(θ)]22hp(θ)}=p=1|ptr|wp{-12hp(θ)hp(θ)θ-θ[xp,t-gp(θ)]22hp(θ)}=p=1|ptr|wp{-12hp(θ)hp(θ)θ-12hp2(θ){2[xp,t-gp(θ)][-gp(θ)θ]hp(θ)-hp(θ)θ[xp,t-gp(θ)]2}}=p=1|ptr|wp{-12hp(θ)hp(θ)θ+[xp,t-gp(θ)]gp(θ)θhp(θ)+hp(θ)θ[xp,t-gp(θ)]22hp2(θ)}.

Then gp(θ)θ and hp(θ)θ are given by

gp(θ)θ=(H1θR-1+H1R-1θ)xp,-thp(θ)θ=σf2H3θ+σf2θH3

where H1, H1θ, H3 and H3θ are given as follows:

H1=[1-(rR-11n)]1nTR-11n1nT+rH1θ=-(rθR-1+rR-1θ)1n(1nTR-11n)1nTR-11n2-(1nTR-1θ1n)[1-(rR-11n)]1nTR-11n21nT+rθfc=(1-1nTR-1rT)2gc=1nTR-11nfcθ=2(1-1nTR-1rT)[-1nT(R-1θrT+R-1rTθ)]gcθ=1nTR-1θ1nH2=(1-1nTR-1rT)21nTR-11nH2θ=fcθgc-gcθfcgc2
H3=1-(rR-1rT)+H2H3θ=-(rθR-1rT+rR-1θrT+rR-1rTθ)+H2θH4=xp,-t-1n(1nTR-1xp,-t)1nTR-11nH4θ=-1n1(1nTR-11n)2[(1nTR-1θxp,-t)(1nTR-11n)-(1nTR-1θ1n)(1nTR-1xp,-t)]σf2θ=1n[(H4θ)TR-1H4+H4TR-1θH4+H4TR-1H4θ].