Abstract
Multi-output Gaussian processes (MOGPs) leverage the flexibility andinterpretability of GPs while capturing structure across outputs, which isdesirable, for example, in spatio-temporal modelling. The key problem withMOGPs is the cubic computational scaling in the number of both inputs (e.g.,time points or locations), n, and outputs, p. Current methods reduce this toO(n^3 m^3), where m < p is the desired degrees of freedom. This computationalcost, however, is still prohibitive in many applications. To address thislimitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP inwhich exact inference scales linearly in m: O(n^3 m). This advance opens up awide range of real-world tasks and can be combined with existing GPapproximations in a plug-and-play way as demonstrated in the paper.Additionally, the paper organises the existing disparate literature on MOGPmodels into a simple taxonomy called the Mixing Model Hierarchy (MMH).
Quick Read (beta)
Abstract
Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is the cubic computational scaling in the number of both inputs (e.g., time points or locations), , and outputs, . Current methods reduce this to , where is the desired degrees of freedom. This computational cost, however, is still prohibitive in many applications. To address this limitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP in which exact inference scales linearly in : . This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way as demonstrated in the paper. Additionally, the paper organises the existing disparate literature on MOGP models into a simple taxonomy called the Mixing Model Hierarchy (MMH).
remresetThe remreset package \WarningFilter*[parskip]latexCommand \pgfplotssetcompat=newest \usetikzlibrary calc, positioning, fit, tikzmark, arrows.meta, shapes, decorations.pathreplacing \tikzset line/.style = thick, ->, > = Triangle[length=2.0mm, width=2.0mm] , arrow/.style = line , hidden node/.style = circle, minimum size = 0.9cm, draw = white, thick , latent node/.style = hidden node, draw = black, , factor node/.style = hidden node, rectangle, draw = black, , observed node/.style = latent node, fill = gray!15 , plate/.style = draw, label=[anchor=north west]south west:#1, rounded corners=2pt, shape=rectangle, inner sep=10pt, thick \tikzset every picture/.append style = xscale = 2, yscale = 2 + \crefnameassumptionAssumptionAssumptions \CrefnameassumptionAssumptionAssumptions \crefnamecorollaryCorollaryCorollaries \CrefnamecorollaryCorollaryCorollaries \crefnamedefinitionDefinitionDefinitions \CrefnamedefinitionDefinitionDefinitions \crefnameexampleExampleExamples \CrefnameexampleExampleExamples \crefnamefactFactFacts \CrefnamefactFactFacts \crefnamelemmaLemmaLemmas \CrefnamelemmaLemmaLemmas \crefnamemodelModelModels \CrefnamemodelModelModels \crefnamepropositionPropositionPropositions \CrefnamepropositionPropositionPropositions \crefnamequestionQuestionQuestions \CrefnamequestionQuestionQuestions \crefnameremarkRemarkRemarks \CrefnameremarkRemarkRemarks \crefnametheoremTheoremTheorems \CrefnametheoremTheoremTheorems \crefnameasslistiAssumptionAssumptions \CrefnameasslistiAssumptionAssumptions \crefnamecorlistiCorollaryCorollaries \CrefnamecorlistiCorollaryCorollaries \crefnamedeflistiDefinitionDefinitions \CrefnamedeflistiDefinitionDefinitions \crefnameexlistiExampleExamples \CrefnameexlistiExampleExamples \crefnamefactlistiFactFacts \CrefnamefactlistiFactFacts \crefnamelemlistiLemmaLemmas \CrefnamelemlistiLemmaLemmas \crefnamemodlistiModelModels \CrefnamemodlistiModelModels \crefnameproplistiPropositionPropositions \CrefnameproplistiPropositionPropositions \crefnameqlistiQuestionQuestions \CrefnameqlistiQuestionQuestions \crefnameremlistiRemarkRemarks \CrefnameremlistiRemarkRemarks \crefnamethmlistiTheoremTheorems \CrefnamethmlistiTheoremTheorems \crefnamesectionSec.Sects. \crefnamepropositionProp.Props. \crefnamelemmaLem.Lems. \crefnamemodelMod.Mods. \crefnameappendixApp.Apps.
Scalable Exact Inference in Multi-Output Gaussian Processes
Wessel P. Bruinsma Eric Perim Will Tebbutt J. Scott Hosking Arno Solin Richard E. Turner
University of Cambridge, Invenia Labs, British Antarctic Survey, Aalto University, Microsoft Research
1 INTRODUCTION
Application areas such as geostatistics, sensor networks, psychometrics, neuroscience, brain imaging, and finance all utilise multi-output statistical models. A rich, but disparate literature has deployed Gaussian processes (GPs, Rasmussen:2006:Gaussian_Processes) in such modelling problems, the aim being to leverage their interpretability, modularity, and tractability. Indeed, GPs have been successfully applied in a wide variety of single-output contexts: they can automatically discover structure in signals (Duvenaud:2014:Automatic_Construction), achieve state-of-the-art performance in regression tasks (Bui:2016:Deep_Gaussian_Processes_for_Regression), enable data-efficient models in reinforcement learning (Deisenroth:2011:PILCO_A_Model-Based_and_Data-Efficient), and support many applications in probabilistic numerics (Hennig:2015:Probabilistic_Numerics_and_Uncertainty_in), such as in optimisation (Brochu:2010:A_Tutorial_on_Bayesian_Optimization) and quadrature (Minka:2000:Quadrature_GP).
One of the first applications of GPs to multiple outputs was in geostatistics (Matheron:1969:Le_Krigeage_Universel). Today, multi-output GP (MOGP) models can be found in various areas, including geostatistics (Wackernagel:2003:Multivariate_Geostatistics), factor analysis (Teh:2005:Semiparametric_Latent_Factor; Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), dependent or multi-task learning (Boyle:2005:Dependent_Gaussian_Processes; Bonilla:2007:Kernel_Multi-Task_Learning_Using_Task-Specific; Bonilla:2008:Multi-Task_Gaussian_Process; Osborne:2008:Towards_Real-Time_Information_Processing_of), latent force models (Alvarez:2009:Latent_Force_Models; Alvarez:2009:Sparse_Convolved_Gaussian_Processes_for; Alvarez:2010:Efficient_Multioutput_Gaussian_Processes_Through; Alvarez:2011:Computationally_Efficient_Convolved), state space modelling (Sarkka:2013:Spatiotemporal_Learning_via), regression networks (Wilson:2012:GP_Regression_Networks; Nguyen:2014:Collaborative_Multi-Output; Dezfouli:2017:Semi-Parametric_Network_Structure_Discovery_Models), and mixture models (Ulrich:2015:Cross_Spectrum; Bruinsma:2016:GGPCM; Parra:2017:Spectral_Mixture_Kernels_for_Multi-Output; Requeima:2019:The_Gaussian_Process_Autoregressive_Regression). Although many of these models are constructed in a similar spirit, they differ in their details and terminology, which makes them difficult to compare and organise.
A key practical problem with existing MOGPs is their computational complexity. For input points each having outputs, inference and learning in general MOGPs take time and memory, although these may be alleviated by a wide range of approximations (Quinonero:2005:Unifying_View; Titsias:2009:Variational_Learning; Lazaro-Gredilla:2010:Sparse_Spectrum_Gaussian_Process_Regression; Hensman:2013:Gaussian_Processes_for_Big_Data; Wilson:2015:Kernel_Interpolation_for_Scalable_Structured; Bui:2016:A_Unifying_Framework_for_Gaussian; Cheng:2017:Variational_Inference_for_Gaussian_Process; Hensman:2018:Variational_Fourier_Features_for_Gaussian). To mitigate these unfavourable scalings, a particular class of MOGPs, called the Linear Mixing Model (LMM, creftype 2.2), exploits low-rank structure of their covariance via the matrix inversion and determinant lemmas to reduce the complexity of inference and learning to time and memory, where is the desired degrees of freedom of the observations, typically much less than . However, many multi-output regression problems cannot be solved adequately with only a small number of degrees of freedom, making the LMM prohibitively expensive in such cases. For example, consider temperatures measured across the Earth’s surface concatenated into a vector. Due to the complexity of the climate, we anticipate behaviour at many spatial length scales, so a model that supposes that this vector lives in a particularly low-dimensional “pancake” will not accurately reflect reality.
We present a unifying view of MOGPs to simplify the complicated web of models in this class. We call this the Mixing Model Hierarchy (MMH). Building on the Linear Mixing Model (LMM), which is the class of models at the bottom of the MMH, we introduce the Orthogonal Linear Mixing Model (OLMM). The OLMM is a particular LMM in which inference and learning take time and memory, without significantly sacrificing expressivity nor requiring any approximations. The OLMM is simple to implement and trivially compatible with existing single-output GP scaling techniques to scale to large numbers of data points . In particular, for a large class of multi-output problems, e.g. many spatio-temporal problems, the time and memory complexity can be further reduced to , which is linear in both the number of data points and degrees of freedom . We demonstrate the OLMM in experiments on synthetic and real-world data sets.
[
scale=1.5,
labelbg/.style=
fill=white,
inner sep=0pt,
opacity=0.7,
text opacity=1
]
\draw[thick, draw=black!30]
(0, 0, -1) coordinate (backsw)
– ++(0.8, 0, 0) coordinate (backse);
\draw[thick]
(backse)
– ++(0, 0.8, 0) coordinate (backne)
– ++(-0.8, 0, 0) coordinate (backnw);
\draw[thick, draw=black!30]
(backnw) – (backsw);
\draw[thick]
(0, 0, 0) coordinate (frontsw)
– ++(1, 0, 0) coordinate (frontse)
– ++(0, 1, 0) coordinate (frontne)
– ++(-1, 0, 0) coordinate (frontnw)
– cycle;
\draw[thick, draw=black!30]
(frontsw) – (backsw);
\draw[thick]
(frontse) – (backse)
(frontnw) – (backnw)
(frontne) – (backne);
\draw[arrow]
()
– ()
node [pos=0.475,
fill=white,
inner sep=.5pt,
] convolutional;
\draw[arrow]
()
– ()
node [pos=0.475,
rotate=90,
fill=white,
inner sep=.5pt,
] time-varying;
\draw[arrow]
()
– ()
node [pos=0.475,
rotate=22,
align=center,
anchor=south
] prior
on ;
\node[anchor=west]
(labelfrontsw) at ()
;
\node[anchor=west]
(labelfrontnw) at ()
;
\node[anchor=west, labelbg]
(labelfrontne) at ()
;
\node[anchor=west]
(labelfrontse) at ()
;
\draw[dashed] (labelfrontsw.west) – (frontsw);
\draw[dashed] (labelfrontse.west) – (frontse);
\draw[dashed] (labelfrontnw.west) – (frontnw);
\draw[dashed] (labelfrontne.west) – (frontne);
\node[anchor=south west,
align=left,
labelbg,
xshift=3pt,
yshift=3pt] at (frontsw) [1–3, 5, 6,
8, 17–19];
\node[anchor=south east,
yshift=3pt,
xshift=-3pt,
labelbg] at (frontse) [4, 7, 9, 10];
\node[anchor=south west] at (backnw) [11, 15, 16];
\node[anchor=south east, color=black] at (backse) [13];
\node[anchor=south west] at (backsw) [14, 15];
\node[anchor=west, align=center, text width=7cm] (Hs) at (1.0, 0.565, 0)
Form of
Form of
Mixing
[1, 5]
Inst.
[2]
Inst.
[3, 6, 8, 17–19]
Inst.
[4, 9, 10, 13]
Conv.
[7]
Green’s function
Conv.
[11, 16]
Inst.
[12, 14]
Inst.
[15]
Lower triangular
Inst.
;
\node[anchor=north east, align=center, text width=.25]
at ()
Models:
\setstretch
0.9
[1]
Intrinstic Coregionalisation Model (Goovaerts:1997:Geostatistics_for_Natural_Resources_Evaluation)
[2]
Linear Model of Coregionalisation (Goovaerts:1997:Geostatistics_for_Natural_Resources_Evaluation)
[3]
Semiparametric Latent Factor Model (Teh:2005:Semiparametric_Latent_Factor)
[4]
Dependent Gaussian Processes (Boyle:2005:Dependent_Gaussian_Processes)
[5]
Multi-Task Gaussian Processes (Bonilla:2008:Multi-Task_Gaussian_Process)
[6]
Osborne:2008:Towards_Real-Time_Information_Processing_of
[7]
Latent Force Models (Alvarez:2009:Latent_Force_Models)
[8]
Gaussian Process Factor Analysis (Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional)
[9]
Efficient Multioutput Gaussian Processes Through Variational Inducing Kernels (Alvarez:2010:Efficient_Multioutput_Gaussian_Processes_Through)
[10]
Convolved Multiple Output Gaussian Processes (Alvarez:2011:Computationally_Efficient_Convolved)
;
\node[anchor=north west, align=center, text width=.25]
at ()
\setstretch
0.9
[11]
Gaussian Process Regression Network (Wilson:2012:GP_Regression_Networks)
[12]
Collaborative Multi-Output Gaussian Processes (Nguyen:2014:Collaborative_Multi-Output)
[13]
Generalised Gaussian Process Convolution Model (Bruinsma:2016:GGPCM)
[14]
Semi-Parametric Network Structure Discovery Models (Dezfouli:2017:Semi-Parametric_Network_Structure_Discovery_Models)
[15]
The Gaussian Process Autoregressive Regression Model (Requeima:2019:The_Gaussian_Process_Autoregressive_Regression)
[16]
Grouped Gaussian Processes (Dahl:2018:Grouped_Gaussian_Processes_for_Solar)
[17]
Spatio-temporal Bayesian filtering and smoothing (Sarkka:2013:Spatiotemporal_Learning_via)
[18]
Linear Mixing Model (creftype 1)
[19]
Orthogonal Linear Mixing Model (creftype 2)
;
2 MULTI-OUTPUT GP MODELS
For tasks with outputs, multi-output Gaussian processes induce a prior distribution over vector-valued functions by requiring that any finite number of function values with are multivariate Gaussian distributed. Often, the input space is time, or is some feature space. An MOGP is described by a vector-valued mean function
and a matrix-valued covariance function
For observations , inference and learning take time and memory.
2.1 Mixing Model Hierarchy
Our unifying Mixing Model Hierarchy begins with Linear Mixing Models (LMMs), a simple class of MOGP models typically characterised by low-rank covariance structure. An LMM decomposes a signal comprising outputs into a fixed basis whose coefficients are time-varying and modelled independently with Gaussian processes:
The noisy signal is, then, generated by adding -distributed noise to . Intuitively, this means that the -dimensional data lives in a “pancake” around the -dimensional column space of , where typically .
Model 1 (Linear Mixing Model).
Let be an diagonal multi-output kernel with , a matrix, and . Then the LMM is given by the following generative model:
(latent processes) | ||||
(likelihood) | ||||
(noise model) |
We call the latent processes and the mixing matrix or basis. If we marginalise out and , we find , which reveals the low-rank covariance structure of the LMM. It also shows that the LMM is a time-varying generalisation of factor analysis (FA): choosing recovers FA exactly.
The graphical model of the LMM is illustrated in the top-left corner of creftype 6 in creftype A, which highlights two restrictions of the LMM compared to a general MOGP: (i) the instantaneous spatial covariance of , , does not vary with time, because does not vary with time; and (ii) the noise-free observation is generated from for only, meaning that, for example, cannot be with a delay or a smoothed version of . We hence call the LMM a time-invariant (due to (i)) and instantaneous (due to (ii)) MOGP.
The LMM can be generalised in three ways. First, the mixing matrix may vary with time. Then becomes a matrix-valued function , and the likelihood becomes . We call such MOGP models time-varying (see creftype 6, top-right). Second, may depend on for all . Then the mixing matrix becomes a matrix-valued time-invariant filter , and the likelihood becomes . We call such MOGP models convolutional (see creftype 6, bottom-left). Finally, may depend on for all and this relationship may vary with time. Then the mixing matrix becomes a matrix-valued time-varying filter , and the likelihood becomes . We call such MOGP models time-varying and convolutional (see creftype 6, bottom-right). The graphical models corresponding to these generalisations of the LMM are depicted in creftype 6 in the creftype A.
The LMM can be extended in one other way, which is to include a prior distribution on . This extension and the three previously proposed generalisations together form the Mixing Model Hierarchy (MMH), which is depicted in creftype 1. The MMH organises multi-output Gaussian process models according to their distinctive modelling assumptions. creftype 1 shows how sixteen MOGP models from the machine learning and geostatistics literature can be recovered as special cases of the various generalisations of the LMM. Not all multi-output Gaussian process models are covered by the MMH, however. For example, Deep GPs (Damianou:2015:Deep_Gaussian_Processes_and_Variational) and variations thereon (Kaiser:2017:Bayesian_Alignments_of_Warped_Multi-Output) are excluded because they transform the latent processes nonlinearly to generate the observations.
2.2 Inference and Learning in the LMM
In MOGPs, the complexities of inference and learning can often be alleviated by exploiting structure in the kernel. This is also the case for the Linear Mixing Model: the -dimensional observations have only degrees of freedom, where typically , and it turns out that the time and memory complexities can be brought down to and . In particular, in the following we show that the covariance structure of the LMM allows us to summarise the -dimensional observations in an -dimensional space, without loss of information.
We identify a low-dimensional projection of observations that is a sufficient statistic for the model and which can therefore be used to accelerate inference. This projection of is given by the maximum likelihood estimate (MLE) of under the likelihood of the LMM. Denote , and let be the matrix . creftype 3 in creftype B shows that and that is an unbiased estimate of . Hence, is a MLE for , which means that it is a function of a sufficient statistic for , if one exists. The following proposition shows that is actually minimally sufficient itself.
Prop. 1.
The MLE of is a minimal sufficient statistic for . Proof. See creftype C.
For any prior over , sufficiency of gives that ; i.e., conditioning on is equivalent to conditioning on , where can be interpreted as a “summary” or “projection” of . This idea is formalised in the following proposition, which is an immediate consequence of creftypeplural 1\crefpairconjunction1 in creftype C.
Prop. 2.
Let be a model for , not necessarily Gaussian, a matrix, and . Then consider the following generative model:
(latent processes) | ||||
(likelihood) | ||||
(noise model) |
Consider a matrix of observations of . Then . Furthermore, the distribution of the projected observed signal is given by
and the likelihood of is given by
Crucially, are -dimensional observations, are -dimensional summaries, and typically , so conditioning on is often much cheaper. In particular, if we apply creftype 2 to the LMM by letting , we immediately get the claimed reduction in complexities: whereas conditioning on takes time and memory, we may equivalently condition on , which takes time and memory instead. This important observation is depicted in creftype 1(a).
In creftype 2, we call the projected observation noise. The noise term is important, because it couples the latent processes upon observing data. In particular, if the latent processes are independent under the prior and is diagonal, then the latent processes remain independent when data is observed. This observation forms the basis of the computational gains achieved by the Orthogonal Linear Mixing Model, which we present next.
[anchor=center] (Y) ; \node[below=1.5cm of Y] (TY) ; \node[right=2.5cm of Y] (pyY) ; \node[below=1.5cm of pyY] (pTY) ; \draw[arrow] (Y) – (TY); \draw[arrow] (Y) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] (pyY); \draw[arrow] (TY) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] (pTY); \draw[arrow] (pTY) – (pyY); \useasboundingbox(-.5cm,.5cm) rectangle (2.25cm,-1.5cm); |
[anchor=center] (Y) ; \node[below=1.5cm of Y, anchor=center] (TY) ; \node[right=3cm of Y, anchor=center] (pyY) ; \node[right=3cm of TY, yshift=.75cm, anchor=center] (pu1TY) ; \node[right=3cm of TY, yshift=-.75cm, anchor=center] (pumTY) ; \draw[arrow] (Y) – (TY); \draw[arrow] (Y) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] (pyY); \draw[arrow] (TY) – node [pos=0.5, anchor=south, rotate=20] inference node [pos=0.5, anchor=north, rotate=20] (pu1TY.west); \node[right=3cm of TY, anchor=center, yshift=.1cm] ; \draw[arrow] (TY) – node [pos=0.5, anchor=south, rotate=-20] inference node [pos=0.5, anchor=north, rotate=-20] (pumTY.west); \draw[arrow, ->] (pu1TY.east) to[bend right=40] ([yshift=-2pt]pyY.east); \draw[arrow, ->] (pumTY.east) to[out=60, in=-15] ([yshift=2pt]pyY.east); \useasboundingbox(-.25cm,.5cm) rectangle (2.5cm,-1.5cm); |
{tikzpicture}
[ scale=1.5, arrowlabel/.style= fill=white, opacity=0.7, text opacity=1, inner sep=0 , rotate=-30 ] [black] (0, 0) circle (0.02); \nodeat (0, 0) [anchor=north east] ; \draw[dashed, thick] plot [smooth, tension=0.5] coordinates (0, 0.75) (0.125, 0.875) (0.25, 1.1) (0.375, 1.25) (0.5, 1.38) (0.75, 1.58) (1, 1.625) (1.25, 1.55) (1.375, 1.45) (1.45, 1.25) (1.375, 1) (1.25, 0.9) (1, 0.95) (0.75, 0.75) (0.8, 0.5) (1, 0.25) (1.05, 0) ; [black] (1.0, 0.95) circle (0.035) node [arrowlabel, anchor=south west, yshift=0.3em, xshift=0.5em] ; \draw[mplorange, arrow] (0, 0) – node [midway, anchor=north west, arrowlabel, yshift=-0.1em, xshift=-0.1em] (1.175, 0.55); \draw[mplorange, arrow] (1.175, 0.55) – node [pos=.5, anchor=west, arrowlabel, xshift=0.2em] (1.0, 0.95); \draw[thick, mplorange!50] (1.084, 0.508) – (1.044, 0.599) – (1.135, 0.642); \draw[mplblue, arrow] (0, 0) – node [pos=0.75, anchor=east, arrowlabel, xshift=-0.2em, yshift=0.2em] (1.2, 1.9); \draw[mplblue, arrow] (1.2, 1.9) – node [pos=.5, anchor=west, arrowlabel, xshift=0.2em] (1.0, 0.95); |
3 THE ORTHOGONAL LINEAR MIXING MODEL
The Orthogonal Linear Mixing Model (OLMM) is the special case of the LMM where the basis, the columns of , is orthogonal. In particular, where is a truncated orthogonal matrix and a diagonal scaling. We define this model as follows.
Model 2 (Orthogonal Linear Mixing Model).
Let be an diagonal multi-output kernel with , a matrix of the form with a truncated orthogonal matrix and diagonal, and a matrix with diagonal. Then the OLMM is given by the following generative model:
(latent processes) | ||||
(likelihood) | ||||
(noise model) |
The difference between the LMM and the OLMM is illustrated in creftype 1(c). In the OLMM, the number of latent processes must be smaller than or equal to the number of outputs, since the number of -dimensional vectors that can be mutually orthogonal is at most . Also, in can be interpreted as heterogeneous noise deriving from the latent processes. Moreover, although and do not depend on time, our analysis and results also apply to the case where and do vary with time. Finally, for the OLMM, creftype 5 in creftype E shows that and
Whereas the LMM is a time-varying generalisation of FA, the OLMM is a time-varying generalisation of probabilistic principal component analysis (PPCA, Tipping:1999:Probabilistic_Principal_Component_Analysis): and recovers PPCA exactly. The OLMM is also similar to Gaussian Process Factor Analysis (GPFA, Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), with the crucial difference being that in GPFA orthogonalisation of the columns of is done as a post-processing step, whereas in the OLMM orthogonality of the columns of is built into the model.
Generality of the OLMM.
In the separable case, where for a scalar-valued kernel , for every LMM with homogeneous observation noise (), there exists an OLMM with that is equal in distribution of . To see this, note that
Hence, letting be the eigendecomposition of gives an OLMM equal in distribution of . In the nonseparable case, where diagonal elements of are linearly independent, in general only the distribution of at every can be recovered by an OLMM, but the correlation between and for may be different. In terms of the joint distribution over and , which is important for interpretability of the latent processes, creftype G shows that the Kullback–Leibler (KL) divergence between two LMMs with bases and is proportional to , hence symmetric, where denotes the Frobenius norm. As a consequence (creftype G), the KL between an LMM with basis and the OLMM closest in KL is upper bounded by where are the right singular vectors of . This makes sense: implies that is of the form with a truncated orthogonal matrix and diagonal. It also shows that an LMM is close to an OLMM if is close to in the sense of the Frobenius norm.
Choice of .
The basis can be initialised and learned in several ways. (i) can be initialised randomly and learned through gradient-based optimisation of the marginal likelihood. (ii) Using that , the basis can be initialised to where is the eigendecomposition of an estimate of the spatial covariance. (iii) In the case that there is a kernel over the outputs, e.g. in separable spatio-temporal GP models, can be set to where is an eigendecomposition of the kernel matrix over the outputs. The hyperparameters of the kernel over the outputs can then be learned with gradient-based optimisation by differentiating through the eigendecomposition. We discuss this choice in more detail in creftype 4.
Diagonal projected noise.
As alluded to in creftype 2.2, the most important property of the OLMM is that the projected noise from creftype 2 is diagonal: creftype 4 in the creftype D characterises exactly when this is the case. Specifically, creftype 4, says that the projected noise is diagonal if and only if is of the form with a truncated orthogonal matrix and diagonal. This condition is awkward, as it couples and . Fortunately, creftype 4 also shows that we may drop ’s dependency on if and only if every column of is an eigenvector of . In the OLMM, every column of is indeed an eigenvector of . We therefore identify the OLMM as a general parametrisation of the LMM for which the projected noise is diagonal. Compared to the LMM, this brings computational advantages, which we discuss next.
Inference.
Since the projected noise is diagonal, the latent processes remain independent when data is observed. We may hence treat the latent processes independently, conditioning the th latent process on under noise , where is the th column of . Therefore, inference takes time and memory, which are linear in the number of degrees of freedom in the data . This decoupled inference procedure is depicted in creftype 1(b). Remark that the decoupled problems can be treated in parallel to achieve sublinear wall time, and that in the separable case further speedups are possible.
Learning.
For computing the likelihood of observed data, the OLMM also offers computational benefits. As the projected noise is diagonal, by absorbing into , the expression for from creftype 1 further simplifies to the following:
where is the kernel matrix for the th latent process . We conclude that learning also takes time and memory, again linear in the number of latent processes. Furthermore, we observe that is convex in , which significantly aids learning; in comparison, for the LMM, is highly nonconvex in the mixing matrix .
Interpretability.
Besides computational benefits, that the OLMM breaks down into independent problems for the latent processes also promotes interpretability. Namely, the independent problems can be separately inspected to interpret, diagnose, and improve the model. This is much easier than directly working with predictions for the data, which are high dimensional and often strongly correlated between outputs. In particular, if is sufficiently rich and the important independent problems are solved sufficiently well, then the OLMM achieves good mean squared error (MSE) for the data:
where is the orthogonal projection onto the orthogonal complement of . See creftype F for a proof.
Scaling.
For both learning and inference, the problem decouples into independent single-output problems. Therefore, to scale to a large number of data points , off-the-shelf single-output GP scaling techniques can be trivially applied to these independent problems. For example, if the variational inducing point method by Titsias:2009:Variational_Learning is used with inducing points, then inference and learning are further reduced to time and memory. Most importantly, if is Markovian (e.g. of the Matérn class) one can leverage state-space methods to exactly solve the independent problems in closed-form (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations). This brings down the scaling to time and memory, where is the state dimension (typically ). We further discuss this approach in creftype 4.
Missing data and heterogeneous observation noise.
Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of and retain orthogonality of the columns. Possible methods to circumvent this issue are discussed in creftype H, but are not explored further in this work. Furthermore, the specification of the OLMM does not allow for heterogeneous observation noise. This issue is discussed in creftype I, which shows that the OLMM can have automatic whitening of the data built in to allow for heterogeneous observation noise.
4 APPLICATION TO SEPARABLE SPATIO-TEMPORAL MODELS
Separable spatio-temporal GPs, which are of the form , fall neatly within the OLMM framework if observed at a fixed number of locations in space. Specifically, the vector-valued process has distribution where is the matrix with . Letting be the eigendecomposition of , is an OLMM with , which we note is square, and . Note that, in this case, projecting the data takes time. By relaxing to be a general diagonal multi-output kernel with , we obtain a nonseparable relaxation of the original separable model whilst retaining exact, computationally efficient inference. In particular, the orthogonal basis for the OLMM are the eigenvectors of a kernel matrix whose hyperparameters can be optimised.
Combining the OLMM framework with efficient state-space scaling techniques (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations; Solin:2018:Infinite-Horizon_Gaussian_Processes; nickisch2018state), which are either exact or arbitrarily good approximations, the complexities are reduced to time and memory for the entire problem, which are linear in . This compares favourably with the filtering approach by Sarkka:2013:Spatiotemporal_Learning_via, which takes time and memory. The sparse approximation by hartikainen2011sparse provides a reduced-rank approximation with time and memory for some , although it comes at the expense of accuracy and yields no asymptotic savings if must increase linearly in to maintain a reasonable accuracy. It also compares favourably with the Kronecker product decomposition (saatcci2012scalable, Ch. 5) approach, which requires time and memory complexity.
5 EXPERIMENTS
We test the OLMM in experiments on synthetic and real-world data sets. An implementation is available at https://github.com/wesselb/olmm.
Computational scaling.
We demonstrate that exact inference scales favourably in for the OLMM, whereas the LMM quickly becomes computationally infeasible as increases. We use a highly optimised implementation of exact inference for the LMM, kindly made available by Invenia Labs22 2 https://invenialabs.co.uk. creftype 3 shows the runtime and the memory usage of the LMM and OLMM. Observe that the LMM scales in time and in memory, whereas the OLMM scales in both time and memory. For , the LMM takes nearly 10 minutes to compute the evidence, whereas the OLMM only requires a couple seconds.33 3 Measurements were performed using a MacBook Pro with a 2.7 GHz Intel Core i7 processor and 16 GB RAM. Code was implemented in Julia 1.0 (bezanson2017julia) and memory and time were measured using the @allocated and the @elapsed macros, respectively, with the measurements averaged over 10 samples run serially. This means memory reported is the total memory allocated, not peak memory consumption.
Temperature extrapolation.
We consider a simple spatio-temporal temperature prediction problem over Europe. Approximately 30 years worth of the ERA-Interim reanalysis temperature data44 4 All output from CMIP5 and ERA-Interim models was regridded onto the latitude–longitude grid used for the IPSL-CM5A-LR model. (dee2011era) is smoothed in time with a Hamming window of width 31 and sub-sampled once every 31 days to produce a data set comprising outputs and approximately months worth of data. We train the OLMM and independent GPs (both models use Matérn– kernels with a periodic component) on the first 250 months of the data and test on the next 100 months. The basis for the OLMM is given by the eigenvectors of the kernel matrix over the points in space (Matérn– with a different length scale for latitude and longitude). creftype 1 shows the log-marginal likelihood (LML) of the training data, the posterior predictive log-probability (PPLP) and the root-mean-square error (RMSE) of the test data. Although the RMSE is equal to that of the independent GPs—the data is highly periodic and the predictions are highly accurate for both models—the LML and PPLP are much better, which indicates that that the OLMM is able to capture meaningful structure between temperatures at different points in space.
LML | PPLP | RMSE | |
---|---|---|---|
IGP | |||
OLMM |

every picture/.append style = xscale = .5, yscale = .5 \pgfplotsset width=height=0.5xlabel=Number of bins along , ylabel=, grid=major, xmin=5, xmax=15000, major grid style=densely dashed,black!25, legend style = at=(0.05, 0.95),anchor=north west, legend style=legend cell align=left, align=left, draw=none, fill=black!10, fill opacity=0.35, text opacity=1, draw opacity=1, scale only axis, axis on top, axis x line*=bottom, axis y line*=left, every axis plot/.append style=thick, axis line style = thick, xtick align=outside, xtick style=black, thick, ytick align=outside, ytick style=black, thick, clip=false {tikzpicture} {axis}[xmode = log] \addplot+[color=mycolor1, mark options=fill=mycolor1, opacity=1, mark size=1.5pt] coordinates (10000.0, 4.5624386332067655) (2000.0, 3.7446352996685253) (1000.0, 3.1158154627754544) (200.0, 2.0277622456016386) (100.0, 1.6077527986229443) (40.0, 1.0629959708548498) (20.0, 0.5306449960733431) (10.0, 0.18566913734215784) ; \addlegendentryLML \addplot+[color=mycolor2, mark options=fill=mycolor2, opacity=1, mark size=1.5pt] coordinates (10000.0, 4.879464511037345) (2000.0, 3.349257213603934) (1000.0, 2.983019409435243) (200.0, 1.8322661062389873) (100.0, 1.3563551273191006) (40.0, 0.9694708480653853) (20.0, 0.5251749491862987) (10.0, 0.18835279833487512) ; \addlegendentryRNG \addplot[color=black, dashed] coordinates (15000.0, 1) (5.0, 1) ;
![]() |
![]() |
![]() |
Rainforest tree point process modelling.
We consider a subset of the extensive rainforest data set credited to rainforest; Condit:2005; Hubbell:1999. The data features a 1000m 500m rainforest dynamics plot in Barro Colorado Island, Panama. In the survey area, the locations of all Trichilia tuberculata (a tree species of the Mahogany family) have been measured (see creftype 7 in creftype J). We tackle this spatial point pattern with a log-Gaussian Cox process model, which is an inhomogeneous Poisson process model for count data. The unknown intensity function is modelled with a Gaussian process such that . We model locally constant intensity in subregions by discretising the region into bins (Moller+Syversveen+Waagepetersen:1998). This leads to having a Poisson observation model in each bin (see creftype J). This model reaches posterior consistency in the limit of bin width going to zero (Tokdar+Ghosh:2007). The accuracy thus improves with tighter binning. We use a separable Matérn- GP prior over , and discretise the area into a grid with grid bins, and treat the first dimension as time.
We use Elliptical Slice Sampling (Murray:2010:Elliptical_Slice_Sampling; murray2010slice) and Metropolis–Hastings (hastings1970monte) to perform inference in this problem, which require many evaluations of the log marginal likelihood and draws from the prior. The proposed methodology renders these operations much less computationally intensive than the general case. We find approximately an hour is required to perform MCMC iterations on a MacBook Pro55 5 2.5 GHz Intel Core i7 processor and 16 GB RAM . See creftype J for more details.
As previously discussed, the Kronecker product factorisation techniques (saatcci2012scalable, Ch. 5) are applicable in separable spatio-temporal scenarios. While our method in combination with state-space inference offers superior performance asymptotically, we demonstrate that we also obtain benefits on practical problems. creftype 4 shows for different number of bins (with ) the time for log marginal likelihood evaluation and sampling from the GP prior for our method and for the Kronecker product decomposition approach. Observe that the linear scaling of our methodology quickly renders it competitive. While this simple experiment does not provide a comprehensive assessment of the relative merits of these techniques—for example, only a single CPU core was used in this experiment—it clearly establishes the practical relevance of our technique in the separable setting.
Large-scale climate model calibration.
In this final experiment, we consider a large-scale climate modelling task over Europe. In particular, we use the OLMM to find relationships between 28 climate simulators4 (see taylor2012overview, for background) by letting , where are the first eigenvectors of a covariance matrix between the simulators, and are the first eigenvectors of the kernel matrix over the points in space (Matérn– with a different length scale for latitude and longitude). This means the latent processes are indexed by two indices and , one corresponding to the eigenvector of the simulator covariance and one correponding to the eigenvector of the spatial covariance. In the above, the covariance matrix between the simulators is a parameter that we learn. The kernels for the latent processes are Matérn–. We consider time points for all 28 simulators, each with 247 outputs, giving a total of roughly 70 million data points. For the independent problems, we use the variational inducing point method by Titsias:2009:Variational_Learning, where the positions of the inducing points are initialised to one every two months. All hyperparameters and the locations of the inducing points are optimised until convergence using scipy’s implementation of the L-BFGS-B algorithm (Nocedal:2006:Numerical_Optimisation), which takes about 4 hours on a MacBook Pro5. The learned length scales were for latitude and for longitude.
creftype 4(a) shows the empirical correlations and the correlations learned by the OLMM (derived from ). Observe that the empirical correlation is not informative, as it ignores all temporal structure. The OLMM, however, does not, and is consequently able to identify a rich correlation structure. In order to get insight into the learned correlations, we hierarchically cluster the models using farthest point linkage with as the distance. creftype 4(b) shows the resulting dendrogram, in which models are grouped by their similarity. For two models, the further to the right the branch connecting them is, the less similar the models are. In creftypeplural 4(b)\crefpairconjunction4(a), HadGEM2 is clearly singled out: it is one of the simplest models, not including several processes that can be found in others, such as ocean & sea-ice, terrestrial carbon cycle, stratosphere, and ocean biogeochemistry (bellouin2011hadgem2). Furthermore, if we inspect the names of the simulators in the groups in creftype 4(b), we observe that often simulators of the same family are grouped together. We observe some interesting cases. (i) Although IPSL-CM5A-LR and IPSL-CM5A-MR are close, IPSL-CM5B-LR is grouped far apart. It turns out that IPSL-CM5A-LR and IPSL-CM5A-MR are different-resolution versions of the same model, while IPSL-CM5B-LR employs a different atmospheric model.66 6 https://portal.enes.org/models/earthsystem-models/ipsl/ipslesm (ii) ACCESS1.0 and ACCESS1.3 have a similar name, but differ greatly in their implementation: ACCESS1.0 is the basic model, while ACCESS1.3 is much more aspirational, including experimental atmospheric physics models and a particular land surface model (bi2013access). (iii) The distance between BCC_CSM1.1(m) and BCC_CSM1.1 can be explained by the more realistic surface air temperature predictions obtained by the former (wu2014overview), which is exactly the quantity we study. Finally, creftype 4(c) shows predictions for four latent processes ( with ). The first spatial eigenvector () is constant in space; combined with the strongest eigenvector of (), we obtain a strong signal constituting seasonal temperature changes.
6 DISCUSSION AND CONCLUSION
In this paper, we introduced the Orthogonal Linear Mixing Model as a multi-output GP in the Linear Mixing Model class where exact inference and learning scales linearly in the desired number of degrees of freedom of the observations. The OLMM is simple to implement and trivially compatible with off-the-shelf single-output GP scaling techniques to scale to large numbers of observations . We introduced the Mixing Model Hierarchy, which organises MOGPs according to their generative models, and placed the OLMM within this scheme.
The OLMM is exciting for several reasons. It provides a way to do efficient inference in separable spatio-temporal models that is trivially compatible with single-output scaling techniques. Exact inference is linear in number of outputs, and, with state space methods, it is also linear in number of time points. The OLMM even extends separable spatio-temporal models to non-separable models whilst retaining exact, computationally efficient inference. Furthermore, the OLMM is interpretable: the latent processes are independent, which means that they can be inspected separately to diagnose the model. Finally, as demonstrated in the real-world experiments in the paper, the OLMM is applicable to a wide range of modelling tasks. An interesting future direction for the method would be targeting sub-linear time complexity by parallelisation (see, e.g., Sarkka:2019:Temporal_Parallelization_of_Bayesian_Filters).
Bibliography
Supplementary Material:
Scalable Exact Inference in Multi-Output Gaussian Processes
[
xscale=0.85,
yscale=0.85,
latent node/.append style=
align=center,
inner sep=0pt,
text width=0.75cm
]
\node[latent node] (xt) at (0, 0) ;
\node[latent node] (ft) at () ;
\node[latent node] (xt2) at () ;
\node[latent node] (ft2) at () ;
\node[latent node, dashed] (xtpre) at () ;
\node[latent node, dashed] (xtpost) at () ;
\node[latent node, dashed] (ftpre) at () ;
\node[latent node, dashed] (ftpost) at () ;
\draw[arrow, ->] (xt) – (ft) node [pos=.45, fill=white] ;
\draw[arrow, ->] (xt2) – (ft2) node [pos=.45, fill=white] ;
\draw[arrow, ->] (xt) – (xt2);
\draw[arrow, ->, dashed] (xtpre) – (xt);
\draw[arrow, ->, dashed] (xt2) – (xtpost);
\node[anchor=south, rotate=90, draw=black, thick] at (-1.5, 0.9) instantaneous mixing;
\node[anchor=south, draw=black, thick] at (0.5, 2.3) time invariant;
\node[anchor=south] (desc) at (0.5, 1.85) is a fixed matrix;
\node[plate,
fit=(xtpre) (desc) (ftpost),
inner sep=5pt,
inner ysep=3pt,
yshift=-2pt,
label=[anchor=south west]north west:LMM] () ;
{tikzpicture}[
xscale=0.85,
yscale=0.85,
latent node/.append style=
align=center,
inner sep=0pt,
text width=0.75cm
]
\node[latent node] (xt) at (0, 0) ;
\node[latent node] (ft) at () ;
\node[latent node] (xt2) at () ;
\node[latent node] (ft2) at () ;
\node[latent node, dashed] (xtpre) at () ;
\node[latent node, dashed] (xtpost) at () ;
\node[latent node, dashed] (ftpre) at () ;
\node[latent node, dashed] (ftpost) at () ;
\draw[arrow, ->] (xt) – (ft) node [pos=.45, fill=white, fill opacity=.85] ;
\draw[arrow, ->] (xt2) – (ft2) node [pos=.45, fill=white, fill opacity=.85] ;
\draw[arrow, ->] (xt) – (xt2);
\draw[arrow, ->, dashed] (xtpre) – (xt);
\draw[arrow, ->, dashed] (xt2) – (xtpost);
\node[anchor=south, draw=black, thick] at (0.5, 2.3) time-varying;
\node[anchor=south] at (0.5, 1.85) is a time-varying matrix;
\node[plate,
fit=(xtpre) (desc) (ftpost),
inner sep=5pt,
inner ysep=3pt,
yshift=-2pt,
draw=none] () ;
{tikzpicture}[
xscale=0.85,
yscale=0.85,
latent node/.append style=
align=center,
inner sep=0pt,
text width=0.75cm
]
\node[latent node] (xt) at (0, 0) ;
\node[latent node] (ft) at () ;
\node[latent node] (xt2) at () ;
\node[latent node] (ft2) at () ;
\node[latent node, dashed] (xtpre) at () ;
\node[latent node, dashed] (xtpost) at () ;
\node[latent node, dashed] (ftpre) at () ;
\node[latent node, dashed] (ftpost) at () ;
\draw[arrow, ->, dashed] (xtpre) – (xt);
\draw[arrow, ->, dashed] (xt2) – (xtpost);
\draw[arrow, ->, dashed] (xt) – (ftpre);
\draw[arrow, ->, dashed] (xt) – (ftpost);
\draw[arrow, ->, dashed] (xt2) – (ftpre);
\draw[arrow, ->, dashed] (xt2) – (ftpost);
\draw[arrow, ->] (xt) – (ft) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt) – (ft2) node [pos=0.2, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt2) – (ft) node [pos=0.7, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt2) – (ft2) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt) – (xt2);
\node[anchor=south, rotate=90, draw=black, thick] at (-1.5, 0.9) convolutional mixing;
\node[anchor=south] at (0.5, 1.85) is a time-invariant filter;
\node[plate,
fit=(xtpre) (desc) (ftpost),
inner sep=5pt,
inner ysep=3pt,
yshift=-2pt,
draw=none] () ;
{tikzpicture}[
xscale=0.85,
yscale=0.85,
latent node/.append style=
align=center,
inner sep=0pt,
text width=0.75cm
]
\node[latent node] (xt) at (0, 0) ;
\node[latent node] (ft) at () ;
\node[latent node] (xt2) at () ;
\node[latent node] (ft2) at () ;
\node[latent node, dashed] (xtpre) at () ;
\node[latent node, dashed] (xtpost) at () ;
\node[latent node, dashed] (ftpre) at () ;
\node[latent node, dashed] (ftpost) at () ;
\draw[arrow, ->, dashed] (xtpre) – (xt);
\draw[arrow, ->, dashed] (xt2) – (xtpost);
\draw[arrow, ->, dashed] (xt) – (ftpre);
\draw[arrow, ->, dashed] (xt) – (ftpost);
\draw[arrow, ->, dashed] (xt2) – (ftpre);
\draw[arrow, ->, dashed] (xt2) – (ftpost);
\draw[arrow, ->] (xt) – (ft) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt) – (ft2) node [pos=0.2, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt2) – (ft) node [pos=0.7, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt2) – (ft2) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] ;
\draw[arrow, ->] (xt) – (xt2);
\node[anchor=south] at (0.5, 1.85) is a time-varying filter;
\node[plate,
fit=(xtpre) (desc) (ftpost),
inner sep=5pt,
inner ysep=3pt,
yshift=-2pt,
draw=none] () ;
Appendix A Characterisation of MOGPs
creftypecap 6 illustrates the graphical models of time-invariant/time-varying and instantaneous/convolutional multi-output GP models.
Appendix B Maximum Likelihood Estimate
Throughout the appendix, we assume that the columns of are linearly independent and . As a consequence, is invertible and .
Prop. 3.
Denote , and let be the matrix . Then
and is an unbiased estimate of : .
Proof.
Note that
Using invertibility of , an elementary calculation then shows that the maximum with respect to is given by
To show that is an unbiased estimate of , we use that :
Appendix C Proof of Prop. 1
To prove creftype 1, we need the property of that it “preserves the signal-to-noise ratio”. This is characterised in the following lemma.
Lem. 1.
Proof.
It is simple to check the equality by direct verification. We show, however, how the equality may be derived. To begin with, we have
Here
so
Adding and subtracting , we find
Hence, rearranging,
which yields the result.
Proof (creftype 1).
By a general characterisation of minimal sufficient statistics (see, e.g., Th. 6.2.13 in Casella:2001:Statistical_Inference), is a minimal sufficient statistic for if and only if it is true that is constant as a function of if and only if . Indeed, by creftype 1,
which, by invertibility of , does not depend on if and only if .
Appendix D Characterisation of Diagonal Projected Noise
Lem. 2.
Let be a truncated orthogonal matrix and diagonal. Then every solution of is of the form with diagonal and such that is orthogonal.
Proof.
We equivalently show that for every solution of is of the desired form. Denote
If , then shows that . Therefore, if , then . It is clear that . Thus, Let , and let be the spectral decomposition of . Assume that by dropping the appropriate columns from . Then says that and have pairwise orthogonal columns, so the columns of are in . We conclude that every is of the form with diagonal and a truncated orthogonal matrix with columns in . If , then in particular has full rank, so it is clear that must form an orthogonal matrix. Therefore, , which is of the desired form.
Prop. 4.
The projected noise is diagonal if and only if is of the form with a truncated orthogonal matrix and diagonal. Suppose that this is the case, and fix such a . Then is of the form with diagonal if and only if every column of is an eigenvector of .
Proof.
The projected noise is diagonal if and only if for some diagonal. This condition is equivalent to
which, in turn, holds if and only if is a truncated orthogonal matrix. Thus, the projected noise is diagonal if and only if is of the form with a truncated orthogonal matrix and diagonal.
For the second statement, note that the condition that every column of is an eigenvector of is equivalent to the condition that is of the form with diagonal and such that forms an orthogonal matrix. Suppose that is of this form. Then
implies that
which is of the desired form. Conversely, suppose that with diagonal. Then
Therefore, by creftype 2,
with diagonal and such that is orthogonal, which again is of the desired form.
Appendix E Form of Projection and Projected Noise
Prop. 5.
Let and be of the form for the OLMM (creftype 2). Then
Proof.
To begin with, using the matrix inversion lemma once and once in reverse,
so
Furthermore, by a similar calculation,
Appendix F Decomposition of the Mean Squared Error
Prop. 6.
Let with a truncated orthogonal matrix and diagonal. Then
where and is the orthogonal projection onto the orthogonal complement of .
Proof.
By expanding and using orthogonality of ,
where is the th column of . Note that is the orthogonal projection onto , so is the orthogonal projection onto the orthogonal complement of . Therefore,
Appendix G Kullback–Leibler Divergence Between an LMM and OLMM
Prop. 7.
Consider two LMMs with equal , equal , but different bases and . Let and denote and . It then holds that
and
where ranges over matrices of the form with a truncated orthogonal matrix and diagonal, is the orthogonal matrix collecting the right singular vectors of , and denotes the variance of the observations under the first LMM before adding noise.
Proof.
Start out by expanding the Kullback–Leibler divergence and noting that :
Here
where
Therefore,
Let be the SVD of where is a truncated orthogonal matrix with the same shape as , is a square diagonal matrix, and is a square orthogonal matrix. Note that , but . Then, choosing ,
since . We now further simplify:
Hence, by definition of the trace and the fact that is diagonal,
since
Therefore,
where the equality follows from a similar calculation:
Appendix H Missing Data
Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of and retain orthogonality of the columns. Suppose that data are missing for a fixed subset of outputs. Collect that subset of outputs into , and let be the remainder of the outputs. The issue can then be resolved by considering a multi-headed version of the OLMM:
For heads, if for a subset of heads are observed, then
While this new model resolves the missing data problem, the assumption that each collection of outputs (each head) is a linear transformation of the same collection of latent processes will be overly restrictive for many contexts: it is not typically the case that our prior beliefs about the relationships between outputs have anything to do with the availability of data pertaining to them.
In the general case, where data can be missing arbitrarily, let be the observed data. Complement with missing data such that is complete. Then a simple solution is to use variational inference. Assume a Gaussian approximate posterior distribution over , and maximise the evidence lower bound (ELBO) using gradient-based optimisation:
where the expectation can be approximated using the reparametrisation trick (Kingma:2013:Auto-Encoding_VB), can be computed efficiently because is complete, and denotes the entropy of . This approach provides a tractable solution when the number of missing data are not too numerous.
Appendix I Heterogeneous Observation Noise
Although the specification of the observation noise in the OLMM does not allow for heterogeneous observation noise, it is possible to set and use creftype 4 to include in the parametrisation of : . This parametrisation can be interpreted in two ways: {enumerate*}
The model has a whitening transform built in. In the projection , the (noise in the) data will first by whitened by . Hence, this parametrisation can be used as a more principled substitute for the usual data normalisation where the outputs are divided by their empirical standard deviation prior to feeding them to the model.
The basis is orthogonal with respect to a weighted Euclidean inner product: for . Intuitively, this means that the basis is orthogonal in the usual sense after stretching the th dimension by . Although this construction provides additional flexiblity, it does require that to avoid a circular dependency between and .
Appendix J Point Process Experimental Details and Additional Results
We use a total of bins in the model, where each bin is . The likelihood for this experiment is given by
where is the coordinate of the th bin, is the number of data points in the th bin, and is the matrix of counts.
[axis on top, xmin=0, xmax=1000, xlabel= (meters), y dir=reverse, ymin=0, ymax=500, ylabel= (meters), axis background/.style=fill=white, legend style=legend cell align=left,align=left,draw=white!15!black, width=height=] \addplot[forget plot] graphics [xmin=-0.394321766561514,xmax=1000.39432176656,ymin=-0.394944707740916,ymax=500.394944707741] ./figures/trees-1.png; |
[point meta min=-3, point meta max=3, axis on top, xmin=0, xmax=1000, xlabel= (meters), y dir=reverse, ymin=0, ymax=500, ylabel= (meters), axis background/.style=fill=white, legend style=legend cell align=left,align=left,draw=white!15!black, width=height=] \addplot[forget plot] graphics [xmin=-5.05050505050505,xmax=1005.05050505051,ymin=-1.25628140703518,ymax=501.256281407035] ./figures/intensity-1.png; |
We perform iterations of Gibbs sampling, alternating between Elliptical Slice Sampling for the Gaussian process given its hyperparameters, and Metropolis Hastings with proposal distribution for the log of the hyperparameters given the latent GP-distributed function. The kernel is a product of two Matérn- kernels, one over the vertical dimension and the other over the horizontal one, each with their own length scale. A single process variance is utilised, and a nugget term is added. The of each of the four hyperparameters was given a prior. creftypecap 8 shows the log joint of the entire state after each iteration, while creftype 9 shows the progress of each hyperparameter per iteration.


The times in creftype 4 were obtained via BenchmarkTools.jl (BenchmarkTools.jl-2016). The implementation of the standard Kronecker product decomposition trick makes use of Kronecker.jl, and Julia’s (bezanson2017julia) standard linear algebra libraries, which make use of OpenBLAS and LAPACK to efficiently perform matrix-matrix products and compute eigendecompositions. The implementation of the state-space GP additionally makes use of StaticArrays.jl for efficient stack-allocated matrices, and Stheno.jl for GP-related functionality. Timing experiments were conducted on a single CPU core to make the implementations comparable.
The state-space implementation of the GP makes use of the trick introduced by Solin:2018:Infinite-Horizon_Gaussian_Processes when computing the log marginal likelihood. However, the infinite-horizon trick is exploited only once the filtering covariance has converged, which is determined by the point at which the Frobenius norm of the difference between the filtering covariance at the th and th iterations drops below . In practice, we find that this produces log-marginal likelihoods that are essentially exact, as expected.