Scalable Exact Inference in Multi-Output Gaussian Processes

  • 2019-11-14 18:19:22
  • Wessel P. Bruinsma, Eric Perim, Will Tebbutt, J. Scott Hosking, Arno Solin, Richard E. Turner
  • 3

Abstract

Multi-output Gaussian processes (MOGPs) leverage the flexibility andinterpretability of GPs while capturing structure across outputs, which isdesirable, for example, in spatio-temporal modelling. The key problem withMOGPs is the cubic computational scaling in the number of both inputs (e.g.,time points or locations), n, and outputs, p. Current methods reduce this toO(n^3 m^3), where m < p is the desired degrees of freedom. This computationalcost, however, is still prohibitive in many applications. To address thislimitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP inwhich exact inference scales linearly in m: O(n^3 m). This advance opens up awide range of real-world tasks and can be combined with existing GPapproximations in a plug-and-play way as demonstrated in the paper.Additionally, the paper organises the existing disparate literature on MOGPmodels into a simple taxonomy called the Mixing Model Hierarchy (MMH).

 

Quick Read (beta)

Abstract

Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is the cubic computational scaling in the number of both inputs (e.g., time points or locations), n, and outputs, p. Current methods reduce this to O(n3m3), where m<p is the desired degrees of freedom. This computational cost, however, is still prohibitive in many applications. To address this limitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP in which exact inference scales linearly in m: O(n3m). This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way as demonstrated in the paper. Additionally, the paper organises the existing disparate literature on MOGP models into a simple taxonomy called the Mixing Model Hierarchy (MMH).

\WarningFilter

remresetThe remreset package \WarningFilter*[parskip]latexCommand \pgfplotssetcompat=newest \usetikzlibrary calc, positioning, fit, tikzmark, arrows.meta, shapes, decorations.pathreplacing \tikzset line/.style = thick, ->, > = Triangle[length=2.0mm, width=2.0mm] , arrow/.style = line , hidden node/.style = circle, minimum size = 0.9cm, draw = white, thick , latent node/.style = hidden node, draw = black, , factor node/.style = hidden node, rectangle, draw = black, , observed node/.style = latent node, fill = gray!15 , plate/.style = draw, label=[anchor=north west]south west:#1, rounded corners=2pt, shape=rectangle, inner sep=10pt, thick \tikzset every picture/.append style = xscale = 2, yscale = 2 + \crefnameassumptionAssumptionAssumptions \CrefnameassumptionAssumptionAssumptions \crefnamecorollaryCorollaryCorollaries \CrefnamecorollaryCorollaryCorollaries \crefnamedefinitionDefinitionDefinitions \CrefnamedefinitionDefinitionDefinitions \crefnameexampleExampleExamples \CrefnameexampleExampleExamples \crefnamefactFactFacts \CrefnamefactFactFacts \crefnamelemmaLemmaLemmas \CrefnamelemmaLemmaLemmas \crefnamemodelModelModels \CrefnamemodelModelModels \crefnamepropositionPropositionPropositions \CrefnamepropositionPropositionPropositions \crefnamequestionQuestionQuestions \CrefnamequestionQuestionQuestions \crefnameremarkRemarkRemarks \CrefnameremarkRemarkRemarks \crefnametheoremTheoremTheorems \CrefnametheoremTheoremTheorems \crefnameasslistiAssumptionAssumptions \CrefnameasslistiAssumptionAssumptions \crefnamecorlistiCorollaryCorollaries \CrefnamecorlistiCorollaryCorollaries \crefnamedeflistiDefinitionDefinitions \CrefnamedeflistiDefinitionDefinitions \crefnameexlistiExampleExamples \CrefnameexlistiExampleExamples \crefnamefactlistiFactFacts \CrefnamefactlistiFactFacts \crefnamelemlistiLemmaLemmas \CrefnamelemlistiLemmaLemmas \crefnamemodlistiModelModels \CrefnamemodlistiModelModels \crefnameproplistiPropositionPropositions \CrefnameproplistiPropositionPropositions \crefnameqlistiQuestionQuestions \CrefnameqlistiQuestionQuestions \crefnameremlistiRemarkRemarks \CrefnameremlistiRemarkRemarks \crefnamethmlistiTheoremTheorems \CrefnamethmlistiTheoremTheorems \crefnamesectionSec.Sects. \crefnamepropositionProp.Props. \crefnamelemmaLem.Lems. \crefnamemodelMod.Mods. \crefnameappendixApp.Apps.

 

Scalable Exact Inference in Multi-Output Gaussian Processes


 


Wessel P. Bruinsma1,2 Eric Perim2 Will Tebbutt1 J. Scott Hosking3 Arno Solin4 Richard E. Turner1,5

1University of Cambridge, 2Invenia Labs, 3British Antarctic Survey, 4Aalto University, 5Microsoft Research

1 INTRODUCTION

Application areas such as geostatistics, sensor networks, psychometrics, neuroscience, brain imaging, and finance all utilise multi-output statistical models. A rich, but disparate literature has deployed Gaussian processes (GPs, Rasmussen:2006:Gaussian_Processes) in such modelling problems, the aim being to leverage their interpretability, modularity, and tractability. Indeed, GPs have been successfully applied in a wide variety of single-output contexts: they can automatically discover structure in signals (Duvenaud:2014:Automatic_Construction), achieve state-of-the-art performance in regression tasks (Bui:2016:Deep_Gaussian_Processes_for_Regression), enable data-efficient models in reinforcement learning (Deisenroth:2011:PILCO_A_Model-Based_and_Data-Efficient), and support many applications in probabilistic numerics (Hennig:2015:Probabilistic_Numerics_and_Uncertainty_in), such as in optimisation (Brochu:2010:A_Tutorial_on_Bayesian_Optimization) and quadrature (Minka:2000:Quadrature_GP).

One of the first applications of GPs to multiple outputs was in geostatistics (Matheron:1969:Le_Krigeage_Universel). Today, multi-output GP (MOGP) models can be found in various areas, including geostatistics (Wackernagel:2003:Multivariate_Geostatistics), factor analysis (Teh:2005:Semiparametric_Latent_Factor; Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), dependent or multi-task learning (Boyle:2005:Dependent_Gaussian_Processes; Bonilla:2007:Kernel_Multi-Task_Learning_Using_Task-Specific; Bonilla:2008:Multi-Task_Gaussian_Process; Osborne:2008:Towards_Real-Time_Information_Processing_of), latent force models (Alvarez:2009:Latent_Force_Models; Alvarez:2009:Sparse_Convolved_Gaussian_Processes_for; Alvarez:2010:Efficient_Multioutput_Gaussian_Processes_Through; Alvarez:2011:Computationally_Efficient_Convolved), state space modelling (Sarkka:2013:Spatiotemporal_Learning_via), regression networks (Wilson:2012:GP_Regression_Networks; Nguyen:2014:Collaborative_Multi-Output; Dezfouli:2017:Semi-Parametric_Network_Structure_Discovery_Models), and mixture models (Ulrich:2015:Cross_Spectrum; Bruinsma:2016:GGPCM; Parra:2017:Spectral_Mixture_Kernels_for_Multi-Output; Requeima:2019:The_Gaussian_Process_Autoregressive_Regression). Although many of these models are constructed in a similar spirit, they differ in their details and terminology, which makes them difficult to compare and organise.

A key practical problem with existing MOGPs is their computational complexity. For n input points each having p outputs, inference and learning in general MOGPs take O(n3p3) time and O(n2p2) memory, although these may be alleviated by a wide range of approximations (Quinonero:2005:Unifying_View; Titsias:2009:Variational_Learning; Lazaro-Gredilla:2010:Sparse_Spectrum_Gaussian_Process_Regression; Hensman:2013:Gaussian_Processes_for_Big_Data; Wilson:2015:Kernel_Interpolation_for_Scalable_Structured; Bui:2016:A_Unifying_Framework_for_Gaussian; Cheng:2017:Variational_Inference_for_Gaussian_Process; Hensman:2018:Variational_Fourier_Features_for_Gaussian). To mitigate these unfavourable scalings, a particular class of MOGPs, called the Linear Mixing Model (LMM, creftype 2.2), exploits low-rank structure of their covariance via the matrix inversion and determinant lemmas to reduce the complexity of inference and learning to O(n3m3) time and O(n2m2) memory, where m is the desired degrees of freedom of the observations, typically much less than p. However, many multi-output regression problems cannot be solved adequately with only a small number of degrees of freedom, making the LMM prohibitively expensive in such cases. For example, consider temperatures measured across the Earth’s surface concatenated into a vector. Due to the complexity of the climate, we anticipate behaviour at many spatial length scales, so a model that supposes that this vector lives in a particularly low-dimensional “pancake”1 will not accurately reflect reality.

1The distribution of Ax+ε for A tall, x Gaussian, and ε noise (Roweis:1999:A_Unifying_Review_of_Linear; MacKay:2002:Information_Theory_Learning).

We present a unifying view of MOGPs to simplify the complicated web of models in this class. We call this the Mixing Model Hierarchy (MMH). Building on the Linear Mixing Model (LMM), which is the class of models at the bottom of the MMH, we introduce the Orthogonal Linear Mixing Model (OLMM). The OLMM is a particular LMM in which inference and learning take O(n3m) time and O(n2m) memory, without significantly sacrificing expressivity nor requiring any approximations. The OLMM is simple to implement and trivially compatible with existing single-output GP scaling techniques to scale to large numbers of data points n. In particular, for a large class of multi-output problems, e.g. many spatio-temporal problems, the time and memory complexity can be further reduced to O(nm), which is linear in both the number of data points n and degrees of freedom m. We demonstrate the OLMM in experiments on synthetic and real-world data sets.

{tikzpicture}

[ scale=1.5, labelbg/.style= fill=white, inner sep=0pt, opacity=0.7, text opacity=1 ] \draw[thick, draw=black!30] (0, 0, -1) coordinate (backsw) – ++(0.8, 0, 0) coordinate (backse); \draw[thick] (backse) – ++(0, 0.8, 0) coordinate (backne) – ++(-0.8, 0, 0) coordinate (backnw); \draw[thick, draw=black!30] (backnw) – (backsw); \draw[thick] (0, 0, 0) coordinate (frontsw) – ++(1, 0, 0) coordinate (frontse) – ++(0, 1, 0) coordinate (frontne) – ++(-1, 0, 0) coordinate (frontnw) – cycle; \draw[thick, draw=black!30] (frontsw) – (backsw); \draw[thick] (frontse) – (backse) (frontnw) – (backnw) (frontne) – (backne); \draw[arrow] ((frontsw)+(0,-0.07,0)) – ((frontse)+(0,-0.07,0)) node [pos=0.475, fill=white, inner sep=.5pt, ] convolutional; \draw[arrow] ((frontsw)+(-0.07,0.07,0)) – ((frontnw)+(-0.07,-0.07,0)) node [pos=0.475, rotate=90, fill=white, inner sep=.5pt, ] time-varying; \draw[arrow] ((frontnw)+(-0.1,0,-0.17)) – ((backnw)+(-0.1,0.025,0)) node [pos=0.475, rotate=22, align=center, anchor=south ] prior
on H; \node[anchor=west] (labelfrontsw) at ((frontsw)+(-110:0.25)) f(t)=Hx(t); \node[anchor=west] (labelfrontnw) at ((frontnw)+(110:0.45)) f(t)=H(t)x(t); \node[anchor=west, labelbg] (labelfrontne) at ((frontne)+(25:0.35)) f(t)=H(t,τ)x(τ)dτ; \node[anchor=west] (labelfrontse) at ((frontse)+(-25:0.1)) f(t)=H(t-τ)x(τ)dτ; \draw[dashed] (labelfrontsw.west) – (frontsw); \draw[dashed] (labelfrontse.west) – (frontse); \draw[dashed] (labelfrontnw.west) – (frontnw); \draw[dashed] (labelfrontne.west) – (frontne); \node[anchor=south west, align=left, labelbg, xshift=3pt, yshift=3pt] at (frontsw) [1–3, 5, 6,
[8, 17–19]; \node[anchor=south east, yshift=3pt, xshift=-3pt, labelbg] at (frontse) [4, 7, 9, 10]; \node[anchor=south west] at (backnw) [11, 15, 16]; \node[anchor=south east, color=black] at (backse) [13]; \node[anchor=south west] at (backsw) [14, 15]; \node[anchor=west, align=center, text width=7cm] (Hs) at (1.0, 0.565, 0) Form of H Form of K Mixing [1, 5] H kI Inst. [2] [H1HQ] diag(k1I,,kQI) Inst. [3, 6, 8, 17–19] H diag(k1,,kQ) Inst. [4, 9, 10, 13] H(t-t) diag(δ,,δ) Conv. [7] Green’s function diag(k1,,kQ) Conv. [11, 16] H(t) diag(k1,,kQ) Inst. [12, 14] [HI] diag(k1,,kQ+P) Inst. [15] Lower triangular diag(k1,,kQ) Inst. ; \node[anchor=north east, align=center, text width=.25] at ((4.5,1.35,0)) Models:
\setstretch 0.9 [1] Intrinstic Coregionalisation Model (Goovaerts:1997:Geostatistics_for_Natural_Resources_Evaluation) [2] Linear Model of Coregionalisation (Goovaerts:1997:Geostatistics_for_Natural_Resources_Evaluation) [3] Semiparametric Latent Factor Model (Teh:2005:Semiparametric_Latent_Factor) [4] Dependent Gaussian Processes (Boyle:2005:Dependent_Gaussian_Processes) [5] Multi-Task Gaussian Processes (Bonilla:2008:Multi-Task_Gaussian_Process) [6] Osborne:2008:Towards_Real-Time_Information_Processing_of [7] Latent Force Models (Alvarez:2009:Latent_Force_Models) [8] Gaussian Process Factor Analysis (Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional) [9] Efficient Multioutput Gaussian Processes Through Variational Inducing Kernels (Alvarez:2010:Efficient_Multioutput_Gaussian_Processes_Through) [10] Convolved Multiple Output Gaussian Processes (Alvarez:2011:Computationally_Efficient_Convolved)

; \node[anchor=north west, align=center, text width=.25] at ((4.2,1.35,0)) Models:
\setstretch 0.9 [11] Gaussian Process Regression Network (Wilson:2012:GP_Regression_Networks) [12] Collaborative Multi-Output Gaussian Processes (Nguyen:2014:Collaborative_Multi-Output) [13] Generalised Gaussian Process Convolution Model (Bruinsma:2016:GGPCM) [14] Semi-Parametric Network Structure Discovery Models (Dezfouli:2017:Semi-Parametric_Network_Structure_Discovery_Models) [15] The Gaussian Process Autoregressive Regression Model (Requeima:2019:The_Gaussian_Process_Autoregressive_Regression) [16] Grouped Gaussian Processes (Dahl:2018:Grouped_Gaussian_Processes_for_Solar) [17] Spatio-temporal Bayesian filtering and smoothing (Sarkka:2013:Spatiotemporal_Learning_via) [18] Linear Mixing Model (creftype 1) [19] Orthogonal Linear Mixing Model (creftype 2)

;

Figure 1: The Mixing Model Hierarchy, which organises multi-output GPs from the machine learning and geostatistics literature according to their modelling assumptions. In the table, ki=ki(t,t) denotes a kernel, δ=δ(t-t) is the Dirac delta function, “inst.” abbreviates “instantaneous”, and “conv.” abbreviates “convolutional”.

2 MULTI-OUTPUT GP MODELS

For tasks with p outputs, multi-output Gaussian processes induce a prior distribution over vector-valued functions f:𝒯p by requiring that any finite number of function values fp1(t1),,fpn(tn) with (pi)i=1n{1,,p} are multivariate Gaussian distributed. Often, the input space 𝒯= is time, or 𝒯=d is some feature space. An MOGP f𝒢𝒫(m,K) is described by a vector-valued mean function

m(t)=𝔼[f(t)]=[𝔼[f1(t)]𝔼[fp(t)]]𝖳

and a matrix-valued covariance function

K(t,t)=𝔼[f(t)f𝖳(t)]-𝔼[f(t)]𝔼[f𝖳(t)]
=[cov(f1(t),f1(t))cov(f1(t),fp(t))cov(fp(t),f1(t))cov(fp(t),fp(t))].

For n observations y(t1),,y(tn)p, inference and learning take O(n3p3) time and O(n2p2) memory.

2.1 Mixing Model Hierarchy

Our unifying Mixing Model Hierarchy begins with Linear Mixing Models (LMMs), a simple class of MOGP models typically characterised by low-rank covariance structure. An LMM decomposes a signal f(t) comprising p outputs into a fixed basis h1,,hmp whose coefficients x1(t),,xm(t) are time-varying and modelled independently with Gaussian processes:

f(t) =h1x1(t)++hmxm(t)
=[h1hm]H[x1(t)xm(t)]𝖳x(t)=Hx(t).

The noisy signal y(t) is, then, generated by adding 𝒩(0,Λ)-distributed noise to f(t). Intuitively, this means that the p-dimensional data lives in a “pancake” around the m-dimensional column space of H, where typically mp.

Model 1 (Linear Mixing Model).

Let K be an m×m diagonal multi-output kernel with K(t,t)=Im, H a p×m matrix, and Λ=diag(σ12,,σp2). Then the LMM is given by the following generative model:

x 𝒢𝒫(0,K(t,t)), (latent processes)
f(t)|H,x =Hx(t), (likelihood)
y|f 𝒢𝒫(f(t),δ[t-t]Λ). (noise model)

We call x the latent processes and H the mixing matrix or basis. If we marginalise out f and x, we find y𝒢𝒫(0,HK(t,t)H𝖳+Λ), which reveals the low-rank covariance structure of the LMM. It also shows that the LMM is a time-varying generalisation of factor analysis (FA): choosing K(t,t)=δ[t-t]Im recovers FA exactly.

The graphical model of the LMM is illustrated in the top-left corner of creftype 6 in creftype A, which highlights two restrictions of the LMM compared to a general MOGP: (i) the instantaneous spatial covariance of f, 𝔼[f(t)f𝖳(t)]=HH𝖳, does not vary with time, because H does not vary with time; and (ii) the noise-free observation f(t) is generated from x(t) for t=t only, meaning that, for example, f cannot be x with a delay or a smoothed version of x. We hence call the LMM a time-invariant (due to (i)) and instantaneous (due to (ii)) MOGP.

The LMM can be generalised in three ways. First, the mixing matrix H may vary with time. Then Hp×m becomes a matrix-valued function H:𝒯p×m, and the likelihood becomes f(t)|H,x=H(t)x(t). We call such MOGP models time-varying (see creftype 6, top-right). Second, f(t) may depend on x(t) for all t𝒯. Then the mixing matrix Hp×m becomes a matrix-valued time-invariant filter H:𝒯p×m, and the likelihood becomes f(t)|H,x=H(t-τ)x(τ)dτ. We call such MOGP models convolutional (see creftype 6, bottom-left). Finally, f(t) may depend on x(t) for all t𝒯 and this relationship may vary with time. Then the mixing matrix Hp×m becomes a matrix-valued time-varying filter H:𝒯×𝒯p×m, and the likelihood becomes f(t)|H,x=H(t,τ)x(τ)dτ. We call such MOGP models time-varying and convolutional (see creftype 6, bottom-right). The graphical models corresponding to these generalisations of the LMM are depicted in creftype 6 in the creftype A.

The LMM can be extended in one other way, which is to include a prior distribution on H. This extension and the three previously proposed generalisations together form the Mixing Model Hierarchy (MMH), which is depicted in creftype 1. The MMH organises multi-output Gaussian process models according to their distinctive modelling assumptions. creftype 1 shows how sixteen MOGP models from the machine learning and geostatistics literature can be recovered as special cases of the various generalisations of the LMM. Not all multi-output Gaussian process models are covered by the MMH, however. For example, Deep GPs (Damianou:2015:Deep_Gaussian_Processes_and_Variational) and variations thereon (Kaiser:2017:Bayesian_Alignments_of_Warped_Multi-Output) are excluded because they transform the latent processes nonlinearly to generate the observations.

2.2 Inference and Learning in the LMM

In MOGPs, the complexities of inference and learning can often be alleviated by exploiting structure in the kernel. This is also the case for the Linear Mixing Model: the p-dimensional observations have only m degrees of freedom, where typically mp, and it turns out that the O(n3p3) time and O(n2p2) memory complexities can be brought down to O(n3m3) and O(n2m2). In particular, in the following we show that the covariance structure of the LMM allows us to summarise the p-dimensional observations in an m-dimensional space, without loss of information.

We identify a low-dimensional projection of observations y that is a sufficient statistic for the model and which can therefore be used to accelerate inference. This projection of y is given by the maximum likelihood estimate (MLE) of x under the likelihood of the LMM. Denote p(y|x)=𝒩(y;Hx,Λ), and let T be the m×p matrix (H𝖳Λ-1H)-1H𝖳Λ-1. creftype 3 in creftype B shows that Ty=argmaxxp(y|x) and that Ty is an unbiased estimate of x. Hence, Ty is a MLE for x, which means that it is a function of a sufficient statistic for x, if one exists. The following proposition shows that Ty is actually minimally sufficient itself.

Prop. 1.

The MLE Ty of x is a minimal sufficient statistic for x. Proof. See creftype C.

For any prior p(x) over x, sufficiency of T gives that p(x|y)=p(x|Ty); i.e., conditioning on y is equivalent to conditioning on Ty, where Ty can be interpreted as a “summary” or “projection” of y. This idea is formalised in the following proposition, which is an immediate consequence of creftypeplural 1\crefpairconjunction1 in creftype C.

Prop. 2.

Let p(x) be a model for x:𝒯m, not necessarily Gaussian, H a p×m matrix, and Λ=diag(σ12,,σp2). Then consider the following generative model:

x p(x), (latent processes)
f(t)|H,x =Hx(t), (likelihood)
y|f 𝒢𝒫(f(t),δ[t-t]Λ). (noise model)

Consider a p×n matrix Y of observations of y. Then p(f|Y)=p(f|TY). Furthermore, the distribution of the projected observed signal Ty is given by

Ty|x𝒢𝒫(x(t),δ[t-t]ΛT) with ΛT=(H𝖳Λ-1H)-1,

and the likelihood of Y=(yi)i=1np is given by

p(Y)=[i=1n𝒩(yi;0,Λ)𝒩(yi;0,ΛT)]p(x)i=1n𝒩(Tyi;xi,ΛT)dx.

Crucially, Y are p-dimensional observations, TY are m-dimensional summaries, and typically mp, so conditioning on TY is often much cheaper. In particular, if we apply creftype 2 to the LMM by letting x𝒢𝒫(0,K(t,t)), we immediately get the claimed reduction in complexities: whereas conditioning on Y takes O(n3p3) time and O(n2p2) memory, we may equivalently condition on TY, which takes O(n3m3) time and O(n2m2) memory instead. This important observation is depicted in creftype 1(a).

In creftype 2, we call ΛT=TΛT𝖳=(H𝖳Λ-1H)-1 the projected observation noise. The noise term ΛT is important, because it couples the latent processes upon observing data. In particular, if the latent processes are independent under the prior and ΛT is diagonal, then the latent processes remain independent when data is observed. This observation forms the basis of the computational gains achieved by the Orthogonal Linear Mixing Model, which we present next.

{tikzpicture}\node

[anchor=center] (Y) Y; \node[below=1.5cm of Y] (TY) TY; \node[right=2.5cm of Y] (pyY) p(f|Y); \node[below=1.5cm of pyY] (pTY) p(f|TY); \draw[arrow] (Y) – (TY); \draw[arrow] (Y) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] O(n3p3) (pyY); \draw[arrow] (TY) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] O(n3m3) (pTY); \draw[arrow] (pTY) – (pyY);

\useasboundingbox

(-.5cm,.5cm) rectangle (2.25cm,-1.5cm);

(a) Conditioning on Y in the LMM
{tikzpicture}\node

[anchor=center] (Y) Y; \node[below=1.5cm of Y, anchor=center] (TY) TY; \node[right=3cm of Y, anchor=center] (pyY) p(f|Y); \node[right=3cm of TY, yshift=.75cm, anchor=center] (pu1TY) p(x1|(TY)1:); \node[right=3cm of TY, yshift=-.75cm, anchor=center] (pumTY) p(xm|(TY)m:); \draw[arrow] (Y) – (TY); \draw[arrow] (Y) – node [pos=0.5, anchor=south] inference node [pos=0.5, anchor=north] O(n3p3) (pyY); \draw[arrow] (TY) – node [pos=0.5, anchor=south, rotate=20] inference node [pos=0.5, anchor=north, rotate=20] O(n3) (pu1TY.west); \node[right=3cm of TY, anchor=center, yshift=.1cm] ; \draw[arrow] (TY) – node [pos=0.5, anchor=south, rotate=-20] inference node [pos=0.5, anchor=north, rotate=-20] O(n3) (pumTY.west); \draw[arrow, ->] (pu1TY.east) to[bend right=40] ([yshift=-2pt]pyY.east); \draw[arrow, ->] (pumTY.east) to[out=60, in=-15] ([yshift=2pt]pyY.east);

\useasboundingbox

(-.25cm,.5cm) rectangle (2.5cm,-1.5cm);

(b) Conditioning on Y in the OLMM
{tikzpicture}

[ scale=1.5, arrowlabel/.style= fill=white, opacity=0.7, text opacity=1, inner sep=0 , rotate=-30 ] [black] (0, 0) circle (0.02); \nodeat (0, 0) [anchor=north east] 0; \draw[dashed, thick] plot [smooth, tension=0.5] coordinates (0, 0.75) (0.125, 0.875) (0.25, 1.1) (0.375, 1.25) (0.5, 1.38) (0.75, 1.58) (1, 1.625) (1.25, 1.55) (1.375, 1.45) (1.45, 1.25) (1.375, 1) (1.25, 0.9) (1, 0.95) (0.75, 0.75) (0.8, 0.5) (1, 0.25) (1.05, 0) ; [black] (1.0, 0.95) circle (0.035) node [arrowlabel, anchor=south west, yshift=0.3em, xshift=0.5em] f(t)=(f1(t),f2(t)); \draw[mplorange, arrow] (0, 0) – node [midway, anchor=north west, arrowlabel, yshift=-0.1em, xshift=-0.1em] h1x1(t) (1.175, 0.55); \draw[mplorange, arrow] (1.175, 0.55) – node [pos=.5, anchor=west, arrowlabel, xshift=0.2em] h2x2(t) (1.0, 0.95); \draw[thick, mplorange!50] (1.084, 0.508) – (1.044, 0.599) – (1.135, 0.642); \draw[mplblue, arrow] (0, 0) – node [pos=0.75, anchor=east, arrowlabel, xshift=-0.2em, yshift=0.2em] h1x1(t) (1.2, 1.9); \draw[mplblue, arrow] (1.2, 1.9) – node [pos=.5, anchor=west, arrowlabel, xshift=0.2em] h2x2(t) (1.0, 0.95);

(c) Difference between LMM and OLMM
Figure 2: (a–b) Commutative diagram depicting that conditioning on Y in the LMM and in the OLMM is equivalent to independently conditioning every xi on (TY)i:, but yields different computational complexities. (c) Illustration of the difference between the LMM and OLMM. The trajectory of a particle in two dimensions is modelled by the LMM ( blue) and OLMM ( orange). The noise-free position f(t) is modelled as a linear combination of basis vectors h1 and h2 with coefficients x1(t) and x2(t) (two independent GPs). In the OLMM, the basis vectors h1 and h2 are constrained to be orthogonal, whereas in the LMM, h1 and h2 are unconstrained.

3 THE ORTHOGONAL LINEAR MIXING MODEL

The Orthogonal Linear Mixing Model (OLMM) is the special case of the LMM where the basis, the columns of H, is orthogonal. In particular, H=US12 where U is a truncated orthogonal matrix and S>0 a diagonal scaling. We define this model as follows.

Model 2 (Orthogonal Linear Mixing Model).

Let K be an m×m diagonal multi-output kernel with K(t,t)=Im, H a p×m matrix of the form H=US12 with U a truncated orthogonal matrix and S>0 diagonal, and Λ=σ2Ip+HDH𝖳 a p×p matrix with D>0 diagonal. Then the OLMM is given by the following generative model:

x 𝒢𝒫(0,K(t,t)), (latent processes)
f(t)|H,x =Hx(t), (likelihood)
y|f 𝒢𝒫(f(t),δ[t-t]Λ). (noise model)

The difference between the LMM and the OLMM is illustrated in creftype 1(c). In the OLMM, the number of latent processes must be smaller than or equal to the number of outputs, since the number of p-dimensional vectors that can be mutually orthogonal is at most p. Also, D in Λ can be interpreted as heterogeneous noise deriving from the latent processes. Moreover, although H and Λ do not depend on time, our analysis and results also apply to the case where Ht and Λt do vary with time. Finally, for the OLMM, creftype 5 in creftype E shows that T=S-12U𝖳 and ΛT=σ2S-1+D.

Whereas the LMM is a time-varying generalisation of FA, the OLMM is a time-varying generalisation of probabilistic principal component analysis (PPCA, Tipping:1999:Probabilistic_Principal_Component_Analysis): D=0 and K(t,t)=δ[t-t]Im recovers PPCA exactly. The OLMM is also similar to Gaussian Process Factor Analysis (GPFA, Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), with the crucial difference being that in GPFA orthogonalisation of the columns of H is done as a post-processing step, whereas in the OLMM orthogonality of the columns of H is built into the model.

Generality of the OLMM.

In the separable case, where K(t,t)=k(t,t)Im for a scalar-valued kernel k, for every LMM with homogeneous observation noise (Λ=σ2Ip), there exists an OLMM with D=0 that is equal in distribution of y. To see this, note that

y(LMM) 𝒢𝒫(0,k(t,t)HH𝖳+σ2δ[t-t]Ip),
y(OLMM) 𝒢𝒫(0,k(t,t)USU𝖳+σ2δ[t-t]Ip).

Hence, letting USU𝖳 be the eigendecomposition of HH𝖳 gives an OLMM equal in distribution of y. In the nonseparable case, where diagonal elements of K are linearly independent, in general only the distribution of y(t) at every t can be recovered by an OLMM, but the correlation between y(t) and y(t) for tt may be different. In terms of the joint distribution over x and y, which is important for interpretability of the latent processes, creftype G shows that the Kullback–Leibler (KL) divergence between two LMMs with bases H and H^ is proportional to H-H^F2, hence symmetric, where F denotes the Frobenius norm. As a consequence (creftype G), the KL between an LMM with basis H and the OLMM closest in KL is upper bounded by I-VF2 where V are the right singular vectors of H. This makes sense: V=I implies that H is of the form US12 with U a truncated orthogonal matrix and S>0 diagonal. It also shows that an LMM is close to an OLMM if V is close to I in the sense of the Frobenius norm.

Choice of H.

The basis H can be initialised and learned in several ways. (i) H can be initialised randomly and learned through gradient-based optimisation of the marginal likelihood. (ii) Using that 𝔼[f(t)f𝖳(t)]=HH𝖳, the basis H can be initialised to U^S^12 where Σ^=U^S^U^𝖳 is the eigendecomposition of an estimate Σ^ of the spatial covariance. (iii) In the case that there is a kernel over the outputs, e.g. in separable spatio-temporal GP models, H can be set to US12 where USU𝖳 is an eigendecomposition of the kernel matrix over the outputs. The hyperparameters of the kernel over the outputs can then be learned with gradient-based optimisation by differentiating through the eigendecomposition. We discuss this choice in more detail in creftype 4.

Diagonal projected noise.

As alluded to in creftype 2.2, the most important property of the OLMM is that the projected noise ΛT from creftype 2 is diagonal: ΛT=σ2S-1+D. creftype 4 in the creftype D characterises exactly when this is the case. Specifically, creftype 4, says that the projected noise is diagonal if and only if H is of the form H=Λ12US12 with U a truncated orthogonal matrix and S>0 diagonal. This condition is awkward, as it couples H and Λ. Fortunately, creftype 4 also shows that we may drop H’s dependency on Λ if and only if every column of U is an eigenvector of Λ. In the OLMM, every column of U is indeed an eigenvector of Λ. We therefore identify the OLMM as a general parametrisation of the LMM for which the projected noise is diagonal. Compared to the LMM, this brings computational advantages, which we discuss next.

Inference.

Since the projected noise is diagonal, the latent processes remain independent when data is observed. We may hence treat the latent processes independently, conditioning the ith latent process xi on (TY)i:=Y𝖳ui/Sii under noise (ΛT)ii=σ2/Sii+Dii, where ui is the ith column of U. Therefore, inference takes O(n3m) time and O(n2m) memory, which are linear in the number of degrees of freedom in the data m. This decoupled inference procedure is depicted in creftype 1(b). Remark that the decoupled problems can be treated in parallel to achieve sublinear wall time, and that in the separable case further speedups are possible.

Learning.

For computing the likelihood of observed data, the OLMM also offers computational benefits. As the projected noise is diagonal, by absorbing D into K, the expression for p(Y) from creftype 1 further simplifies to the following:

p(Y) =𝒩(vecY;0,σ2Inp)
×i=1m𝒩((TY)i:;0,Ki+(σ2/Sii+Dii)In)𝒩((TY)i:;0,(σ2/Sii)In),

where Ki is the n×n kernel matrix for the ith latent process xi. We conclude that learning also takes O(n3m) time and O(n2m) memory, again linear in the number of latent processes. Furthermore, we observe that p(Y) is convex in U, which significantly aids learning; in comparison, for the LMM, p(Y) is highly nonconvex in the mixing matrix H.

Interpretability.

Besides computational benefits, that the OLMM breaks down into independent problems for the latent processes also promotes interpretability. Namely, the independent problems can be separately inspected to interpret, diagnose, and improve the model. This is much easier than directly working with predictions for the data, which are high dimensional and often strongly correlated between outputs. In particular, if H is sufficiently rich and the important independent problems are solved sufficiently well, then the OLMM achieves good mean squared error (MSE) for the data:

y-Hx2MSE=PUy2variance notcaptured by basis+i=1mSii((Ty)i-xi)2MSE ofith latent process

where PU is the orthogonal projection onto the orthogonal complement of U. See creftype F for a proof.

Scaling.

For both learning and inference, the problem decouples into m independent single-output problems. Therefore, to scale to a large number of data points n, off-the-shelf single-output GP scaling techniques can be trivially applied to these independent problems. For example, if the variational inducing point method by Titsias:2009:Variational_Learning is used with rn inducing points, then inference and learning are further reduced to O(nr2m) time and O(nrm) memory. Most importantly, if k(t,t) is Markovian (e.g. of the Matérn class) one can leverage state-space methods to exactly solve the m independent problems in closed-form (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations). This brings down the scaling to O(nmd3) time and O(nmd2) memory, where d is the state dimension (typically dm,n). We further discuss this approach in creftype 4.

Missing data and heterogeneous observation noise.

Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of H and retain orthogonality of the columns. Possible methods to circumvent this issue are discussed in creftype H, but are not explored further in this work. Furthermore, the specification of the OLMM does not allow for heterogeneous observation noise. This issue is discussed in creftype I, which shows that the OLMM can have automatic whitening of the data built in to allow for heterogeneous observation noise.

4 APPLICATION TO SEPARABLE SPATIO-TEMPORAL MODELS

Separable spatio-temporal GPs, which are of the form f𝒢𝒫(0,kt(t,t)kr(r,r)), fall neatly within the OLMM framework if observed at a fixed number of locations in space. Specifically, the vector-valued process f(t)=(f(t,ri))r=1p has distribution 𝒢𝒫(0,kt(t,t)Kr) where Kr is the p×p matrix with (Kr)ij=kr(ri,rj). Letting USU𝖳 be the eigendecomposition of Kr, f(t) is an OLMM with H=US12, which we note is square, and K(t,t)=kt(t,t)Ip. Note that, in this case, projecting the data takes O(np2) time. By relaxing K to be a general diagonal multi-output kernel with K(t,t)=Ip, we obtain a nonseparable relaxation of the original separable model whilst retaining exact, computationally efficient inference. In particular, the orthogonal basis for the OLMM are the eigenvectors of a kernel matrix whose hyperparameters can be optimised.

Combining the OLMM framework with efficient state-space scaling techniques (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations; Solin:2018:Infinite-Horizon_Gaussian_Processes; nickisch2018state), which are either exact or arbitrarily good approximations, the complexities are reduced to O(np+p3) time and O(np+p2) memory for the entire problem, which are linear in n. This compares favourably with the filtering approach by Sarkka:2013:Spatiotemporal_Learning_via, which takes O(np3) time and O(np2) memory. The sparse approximation by hartikainen2011sparse provides a reduced-rank approximation with O(nq3) time and O(nq2) memory for some qp, although it comes at the expense of accuracy and yields no asymptotic savings if q must increase linearly in p to maintain a reasonable accuracy. It also compares favourably with the Kronecker product decomposition (saatcci2012scalable, Ch. 5) approach, which requires O(p3+n3) time and O(p2+n2) memory complexity.

5 EXPERIMENTS

We test the OLMM in experiments on synthetic and real-world data sets. An implementation is available at https://github.com/wesselb/olmm.

Computational scaling.

We demonstrate that exact inference scales favourably in m for the OLMM, whereas the LMM quickly becomes computationally infeasible as m increases. We use a highly optimised implementation of exact inference for the LMM, kindly made available by Invenia Labs22 2 https://invenialabs.co.uk. creftype 3 shows the runtime and the memory usage of the LMM and OLMM. Observe that the LMM scales O(m3) in time and O(m2) in memory, whereas the OLMM scales O(m) in both time and memory. For m=25, the LMM takes nearly 10 minutes to compute the evidence, whereas the OLMM only requires a couple seconds.33 3 Measurements were performed using a MacBook Pro with a 2.7 GHz Intel Core i7 processor and 16 GB RAM. Code was implemented in Julia 1.0 (bezanson2017julia) and memory and time were measured using the @allocated and the @elapsed macros, respectively, with the measurements averaged over 10 samples run serially. This means memory reported is the total memory allocated, not peak memory consumption.

Temperature extrapolation.

We consider a simple spatio-temporal temperature prediction problem over Europe. Approximately 30 years worth of the ERA-Interim reanalysis temperature data44 4 All output from CMIP5 and ERA-Interim models was regridded onto the latitude–longitude grid used for the IPSL-CM5A-LR model. (dee2011era) is smoothed in time with a Hamming window of width 31 and sub-sampled once every 31 days to produce a data set comprising 13×19=247 outputs and approximately 350 months worth of data. We train the OLMM and independent GPs (both models use Matérn–5/2 kernels with a periodic component) on the first 250 months of the data and test on the next 100 months. The basis for the OLMM is given by the eigenvectors of the kernel matrix over the points in space (Matérn–5/2 with a different length scale for latitude and longitude). creftype 1 shows the log-marginal likelihood (LML) of the training data, the posterior predictive log-probability (PPLP) and the root-mean-square error (RMSE) of the test data. Although the RMSE is equal to that of the independent GPs—the data is highly periodic and the predictions are highly accurate for both models—the LML and PPLP are much better, which indicates that that the OLMM is able to capture meaningful structure between temperatures at different points in space.

LML PPLP RMSE
IGP -0.247 -0.219 1.99
OLMM 1.22 1.19 1.99
Table 1: Log-marginal likelihood (LML) of the training data, posterior predictive log-probability (PPLP) and root-mean-square error (RMSE) of held-out test data of independent GPs (IGP) and the OLMM for the temperature extrapolation experiments.
Figure 3: Runtime (left) and memory usage (right) of the LMM and OLMM for computing the evidence of n=1500 observations for p=200 outputs
\tikzset

every picture/.append style = xscale = .5, yscale = .5 \pgfplotsset width=height=0.5xlabel=Number of bins n along x1, ylabel=Time Kron.Time OLMM, grid=major, xmin=5, xmax=15000, major grid style=densely dashed,black!25, legend style = at=(0.05, 0.95),anchor=north west, legend style=legend cell align=left, align=left, draw=none, fill=black!10, fill opacity=0.35, text opacity=1, draw opacity=1, scale only axis, axis on top, axis x line*=bottom, axis y line*=left, every axis plot/.append style=thick, axis line style = thick, xtick align=outside, xtick style=black, thick, ytick align=outside, ytick style=black, thick, clip=false {tikzpicture} {axis}[xmode = log] \addplot+[color=mycolor1, mark options=fill=mycolor1, opacity=1, mark size=1.5pt] coordinates (10000.0, 4.5624386332067655) (2000.0, 3.7446352996685253) (1000.0, 3.1158154627754544) (200.0, 2.0277622456016386) (100.0, 1.6077527986229443) (40.0, 1.0629959708548498) (20.0, 0.5306449960733431) (10.0, 0.18566913734215784) ; \addlegendentryLML \addplot+[color=mycolor2, mark options=fill=mycolor2, opacity=1, mark size=1.5pt] coordinates (10000.0, 4.879464511037345) (2000.0, 3.349257213603934) (1000.0, 2.983019409435243) (200.0, 1.8322661062389873) (100.0, 1.3563551273191006) (40.0, 0.9694708480653853) (20.0, 0.5251749491862987) (10.0, 0.18835279833487512) ; \addlegendentryRNG \addplot[color=black, dashed] coordinates (15000.0, 1) (5.0, 1) ;

Figure 4: Ratio of timings of the Kronecker approach (saatcci2012scalable, Ch. 5) and the OLMM to compute the log marginal likelihood (LML) and generate a single prior sample (RNG)
(a) Empirical and learned correlations
(b) Clustering
(c) Latent processes
Figure 5: Results of the large-scale climate simulator experiment, showing (a) the empirical and learned correlation between the simulators, (b) a dendrogram deriving from hierarhically clustering the simulators based on the learned correlations, and (c) predictions for the latent processes for the first two eigenvectors of the covariance matrix between simulators is=1,2 and the first two eigenvectors of the spatial covariance ir=1,2, for the last 1000 days.

Rainforest tree point process modelling.

We consider a subset of the extensive rainforest data set credited to rainforest; Condit:2005; Hubbell:1999. The data features a 1000m × 500m rainforest dynamics plot in Barro Colorado Island, Panama. In the survey area, the locations of all Trichilia tuberculata (a tree species of the Mahogany family) have been measured (see creftype 7 in creftype J). We tackle this spatial point pattern with a log-Gaussian Cox process model, which is an inhomogeneous Poisson process model for count data. The unknown intensity function λ(x) is modelled with a Gaussian process such that f(x)=logλ(x). We model locally constant intensity in subregions by discretising the region into np bins (Moller+Syversveen+Waagepetersen:1998). This leads to having a Poisson observation model in each bin (see creftype J). This model reaches posterior consistency in the limit of bin width going to zero (Tokdar+Ghosh:2007). The accuracy thus improves with tighter binning. We use a separable Matérn-5/2 GP prior over f(x1,x2), and discretise the area into a n×p=200×100 grid with np=20000 grid bins, and treat the first dimension as time.

We use Elliptical Slice Sampling (Murray:2010:Elliptical_Slice_Sampling; murray2010slice) and Metropolis–Hastings (hastings1970monte) to perform inference in this problem, which require many evaluations of the log marginal likelihood and draws from the prior. The proposed methodology renders these operations much less computationally intensive than the general case. We find approximately an hour is required to perform 105 MCMC iterations on a MacBook Pro55 5 2.5 GHz Intel Core i7 processor and 16 GB RAM . See creftype J for more details.

As previously discussed, the Kronecker product factorisation techniques (saatcci2012scalable, Ch. 5) are applicable in separable spatio-temporal scenarios. While our method in combination with state-space inference offers superior performance asymptotically, we demonstrate that we also obtain benefits on practical problems. creftype 4 shows for different number of bins n (with p=n/2) the time for log marginal likelihood evaluation and sampling from the GP prior for our method and for the Kronecker product decomposition approach. Observe that the linear scaling of our methodology quickly renders it competitive. While this simple experiment does not provide a comprehensive assessment of the relative merits of these techniques—for example, only a single CPU core was used in this experiment—it clearly establishes the practical relevance of our technique in the separable setting.

Large-scale climate model calibration.

In this final experiment, we consider a large-scale climate modelling task over Europe. In particular, we use the OLMM to find relationships between 28 climate simulators4 (see taylor2012overview, for background) by letting H=HsHr, where Hs are the first ms=5 eigenvectors of a 28×28 covariance matrix Ks between the simulators, and Hr are the first mr=10 eigenvectors of the kernel matrix over the points in space (Matérn–5/2 with a different length scale for latitude and longitude). This means the mrms=50 latent processes are indexed by two indices is and im, one corresponding to the eigenvector of the simulator covariance and one correponding to the eigenvector of the spatial covariance. In the above, the covariance matrix Ks between the simulators is a parameter that we learn. The kernels for the latent processes are Matérn–5/2. We consider n=10000 time points for all 28 simulators, each with 247 outputs, giving a total of roughly 70 million data points. For the independent problems, we use the variational inducing point method by Titsias:2009:Variational_Learning, where the positions of the inducing points are initialised to one every two months. All hyperparameters and the locations of the inducing points are optimised until convergence using scipy’s implementation of the L-BFGS-B algorithm (Nocedal:2006:Numerical_Optimisation), which takes about 4 hours on a MacBook Pro5. The learned length scales were 23.3 for latitude and 43.6 for longitude.

creftype 4(a) shows the empirical correlations and the correlations learned by the OLMM (derived from Ks). Observe that the empirical correlation is not informative, as it ignores all temporal structure. The OLMM, however, does not, and is consequently able to identify a rich correlation structure. In order to get insight into the learned correlations, we hierarchically cluster the models using farthest point linkage with 1-|corr.| as the distance. creftype 4(b) shows the resulting dendrogram, in which models are grouped by their similarity. For two models, the further to the right the branch connecting them is, the less similar the models are. In creftypeplural 4(b)\crefpairconjunction4(a), HadGEM2 is clearly singled out: it is one of the simplest models, not including several processes that can be found in others, such as ocean & sea-ice, terrestrial carbon cycle, stratosphere, and ocean biogeochemistry (bellouin2011hadgem2). Furthermore, if we inspect the names of the simulators in the groups in creftype 4(b), we observe that often simulators of the same family are grouped together. We observe some interesting cases. (i) Although IPSL-CM5A-LR and IPSL-CM5A-MR are close, IPSL-CM5B-LR is grouped far apart. It turns out that IPSL-CM5A-LR and IPSL-CM5A-MR are different-resolution versions of the same model, while IPSL-CM5B-LR employs a different atmospheric model.66 6 https://portal.enes.org/models/earthsystem-models/ipsl/ipslesm (ii) ACCESS1.0 and ACCESS1.3 have a similar name, but differ greatly in their implementation: ACCESS1.0 is the basic model, while ACCESS1.3 is much more aspirational, including experimental atmospheric physics models and a particular land surface model (bi2013access). (iii) The distance between BCC_CSM1.1(m) and BCC_CSM1.1 can be explained by the more realistic surface air temperature predictions obtained by the former (wu2014overview), which is exactly the quantity we study. Finally, creftype 4(c) shows predictions for four latent processes (is=1,2 with ir=1,2). The first spatial eigenvector (ir=1) is constant in space; combined with the strongest eigenvector of Ks (is=1), we obtain a strong signal constituting seasonal temperature changes.

6 DISCUSSION AND CONCLUSION

In this paper, we introduced the Orthogonal Linear Mixing Model as a multi-output GP in the Linear Mixing Model class where exact inference and learning scales linearly in the desired number m of degrees of freedom of the observations. The OLMM is simple to implement and trivially compatible with off-the-shelf single-output GP scaling techniques to scale to large numbers of observations n. We introduced the Mixing Model Hierarchy, which organises MOGPs according to their generative models, and placed the OLMM within this scheme.

The OLMM is exciting for several reasons. It provides a way to do efficient inference in separable spatio-temporal models that is trivially compatible with single-output scaling techniques. Exact inference is linear in number of outputs, and, with state space methods, it is also linear in number of time points. The OLMM even extends separable spatio-temporal models to non-separable models whilst retaining exact, computationally efficient inference. Furthermore, the OLMM is interpretable: the latent processes are independent, which means that they can be inspected separately to diagnose the model. Finally, as demonstrated in the real-world experiments in the paper, the OLMM is applicable to a wide range of modelling tasks. An interesting future direction for the method would be targeting sub-linear time complexity by parallelisation (see, e.g., Sarkka:2019:Temporal_Parallelization_of_Bayesian_Filters).

Bibliography

 

Supplementary Material:

Scalable Exact Inference in Multi-Output Gaussian Processes


 

{tikzpicture}

[ xscale=0.85, yscale=0.85, latent node/.append style= align=center, inner sep=0pt, text width=0.75cm ] \node[latent node] (xt) at (0, 0) xn; \node[latent node] (ft) at ((xt)+(0,1.5)) fn; \node[latent node] (xt2) at ((xt)+(1,0)) xn+1; \node[latent node] (ft2) at ((xt2)+(0,1.5)) fn+1; \node[latent node, dashed] (xtpre) at ((xt)-(1,0)) ; \node[latent node, dashed] (xtpost) at ((xt2)+(1,0)) ; \node[latent node, dashed] (ftpre) at ((xtpre)+(0,1.5)) ; \node[latent node, dashed] (ftpost) at ((xtpost)+(0,1.5)) ; \draw[arrow, ->] (xt) – (ft) node [pos=.45, fill=white] H; \draw[arrow, ->] (xt2) – (ft2) node [pos=.45, fill=white] H; \draw[arrow, ->] (xt) – (xt2); \draw[arrow, ->, dashed] (xtpre) – (xt); \draw[arrow, ->, dashed] (xt2) – (xtpost); \node[anchor=south, rotate=90, draw=black, thick] at (-1.5, 0.9) instantaneous mixing; \node[anchor=south, draw=black, thick] at (0.5, 2.3) time invariant; \node[anchor=south] (desc) at (0.5, 1.85) Hp×m is a fixed matrix; \node[plate, fit=(xtpre) (desc) (ftpost), inner sep=5pt, inner ysep=3pt, yshift=-2pt, label=[anchor=south west]north west:LMM] () ; {tikzpicture}[ xscale=0.85, yscale=0.85, latent node/.append style= align=center, inner sep=0pt, text width=0.75cm ] \node[latent node] (xt) at (0, 0) xn; \node[latent node] (ft) at ((xt)+(0,1.5)) fn; \node[latent node] (xt2) at ((xt)+(1,0)) xn+1; \node[latent node] (ft2) at ((xt2)+(0,1.5)) fn+1; \node[latent node, dashed] (xtpre) at ((xt)-(1,0)) ; \node[latent node, dashed] (xtpost) at ((xt2)+(1,0)) ; \node[latent node, dashed] (ftpre) at ((xtpre)+(0,1.5)) ; \node[latent node, dashed] (ftpost) at ((xtpost)+(0,1.5)) ; \draw[arrow, ->] (xt) – (ft) node [pos=.45, fill=white, fill opacity=.85] Hn; \draw[arrow, ->] (xt2) – (ft2) node [pos=.45, fill=white, fill opacity=.85] Hn+1; \draw[arrow, ->] (xt) – (xt2); \draw[arrow, ->, dashed] (xtpre) – (xt); \draw[arrow, ->, dashed] (xt2) – (xtpost); \node[anchor=south, draw=black, thick] at (0.5, 2.3) time-varying; \node[anchor=south] at (0.5, 1.85) H:𝒯p×m is a time-varying matrix; \node[plate, fit=(xtpre) (desc) (ftpost), inner sep=5pt, inner ysep=3pt, yshift=-2pt, draw=none] () ;
{tikzpicture}[ xscale=0.85, yscale=0.85, latent node/.append style= align=center, inner sep=0pt, text width=0.75cm ] \node[latent node] (xt) at (0, 0) xn; \node[latent node] (ft) at ((xt)+(0,1.5)) fn; \node[latent node] (xt2) at ((xt)+(1,0)) xn+1; \node[latent node] (ft2) at ((xt2)+(0,1.5)) fn+1; \node[latent node, dashed] (xtpre) at ((xt)-(1,0)) ; \node[latent node, dashed] (xtpost) at ((xt2)+(1,0)) ; \node[latent node, dashed] (ftpre) at ((xtpre)+(0,1.5)) ; \node[latent node, dashed] (ftpost) at ((xtpost)+(0,1.5)) ; \draw[arrow, ->, dashed] (xtpre) – (xt); \draw[arrow, ->, dashed] (xt2) – (xtpost); \draw[arrow, ->, dashed] (xt) – (ftpre); \draw[arrow, ->, dashed] (xt) – (ftpost); \draw[arrow, ->, dashed] (xt2) – (ftpre); \draw[arrow, ->, dashed] (xt2) – (ftpost); \draw[arrow, ->] (xt) – (ft) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] H0; \draw[arrow, ->] (xt) – (ft2) node [pos=0.2, fill=white, fill opacity=.85, inner sep=1pt] H-1; \draw[arrow, ->] (xt2) – (ft) node [pos=0.7, fill=white, fill opacity=.85, inner sep=1pt] H1; \draw[arrow, ->] (xt2) – (ft2) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] H0; \draw[arrow, ->] (xt) – (xt2); \node[anchor=south, rotate=90, draw=black, thick] at (-1.5, 0.9) convolutional mixing; \node[anchor=south] at (0.5, 1.85) H:𝒯p×m is a time-invariant filter; \node[plate, fit=(xtpre) (desc) (ftpost), inner sep=5pt, inner ysep=3pt, yshift=-2pt, draw=none] () ; {tikzpicture}[ xscale=0.85, yscale=0.85, latent node/.append style= align=center, inner sep=0pt, text width=0.75cm ] \node[latent node] (xt) at (0, 0) xn; \node[latent node] (ft) at ((xt)+(0,1.5)) fn; \node[latent node] (xt2) at ((xt)+(1,0)) xn+1; \node[latent node] (ft2) at ((xt2)+(0,1.5)) fn+1; \node[latent node, dashed] (xtpre) at ((xt)-(1,0)) ; \node[latent node, dashed] (xtpost) at ((xt2)+(1,0)) ; \node[latent node, dashed] (ftpre) at ((xtpre)+(0,1.5)) ; \node[latent node, dashed] (ftpost) at ((xtpost)+(0,1.5)) ; \draw[arrow, ->, dashed] (xtpre) – (xt); \draw[arrow, ->, dashed] (xt2) – (xtpost); \draw[arrow, ->, dashed] (xt) – (ftpre); \draw[arrow, ->, dashed] (xt) – (ftpost); \draw[arrow, ->, dashed] (xt2) – (ftpre); \draw[arrow, ->, dashed] (xt2) – (ftpost); \draw[arrow, ->] (xt) – (ft) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] Hnn; \draw[arrow, ->] (xt) – (ft2) node [pos=0.2, fill=white, fill opacity=.85, inner sep=1pt] Hnn+1; \draw[arrow, ->] (xt2) – (ft) node [pos=0.7, fill=white, fill opacity=.85, inner sep=1pt] Hn+1n; \draw[arrow, ->] (xt2) – (ft2) node [pos=0.45, fill=white, fill opacity=.85, inner sep=1pt] Hn+1n+1; \draw[arrow, ->] (xt) – (xt2); \node[anchor=south] at (0.5, 1.85) H:𝒯×𝒯p×m is a time-varying filter; \node[plate, fit=(xtpre) (desc) (ftpost), inner sep=5pt, inner ysep=3pt, yshift=-2pt, draw=none] () ;

Figure 6: Graphical models illustrating the difference between time-invariant/time-varying and instantaneous/convolutional multi-output GP models, for data sampled at real-valued times t1,t2, (sampling period Δt). Abbreviations used: xn=x(tn), fn=f(tn), Hn=H(nΔt), and Hnm=H(tm,tn). For simplicity, the dynamics of x are depicted as a Markov chain; since x is modelled with a GP, xn actually depends on xn for all nn.

Appendix A Characterisation of MOGPs

creftypecap 6 illustrates the graphical models of time-invariant/time-varying and instantaneous/convolutional multi-output GP models.

Appendix B Maximum Likelihood Estimate

Throughout the appendix, we assume that the columns of H are linearly independent and Λ>0. As a consequence, H𝖳Λ-1H is invertible and H𝖳Λ-1H>0.

Prop. 3.

Denote p(y|x)=𝒩(y;Hx,Λ), and let T be the m×p matrix (H𝖳Λ-1H)-1H𝖳Λ-1. Then

Ty=argmaxxp(y|x)

and Ty is an unbiased estimate of x: 𝔼[Ty|x]=x.

Proof.

Note that

logp(y|x)-12(y-Hx)𝖳Λ-1(y-Hx)

Using invertibility of H𝖳Λ-1H, an elementary calculation then shows that the maximum with respect to x is given by

x=(H𝖳Λ-1H)-1H𝖳Λ-1y=Ty.

To show that Ty is an unbiased estimate of x, we use that 𝔼[y|x]=Hx:

𝔼[Ty|x]=THx=(H𝖳Λ-1H)-1(H𝖳Λ-1H)x=x.

\scaleobj0.75

Appendix C Proof of Prop. 1

To prove creftype 1, we need the property of T that it “preserves the signal-to-noise ratio”. This is characterised in the following lemma.

Lem. 1.
𝒩(y;Hx,Λ)𝒩(y;0,Λ)=𝒩(Ty;x,(H𝖳Λ-1H)-1)𝒩(Ty;0,(H𝖳Λ-1H)-1).
Proof.

It is simple to check the equality by direct verification. We show, however, how the equality may be derived. To begin with, we have

(y-Hx)𝖳Λ-1(y-Hx)=y𝖳Λ-1y-2x𝖳H𝖳Λ-1y+x𝖳H𝖳Λ-1Hx.

Here

H𝖳Λ-1y=(H𝖳Λ-1H)(H𝖳Λ-1H)-1H𝖳Λ-1y=(H𝖳Λ-1H)Ty,

so

(y-Hx)𝖳Λ-1(y-Hx)=y𝖳Λ-1y-2x𝖳(H𝖳Λ-1H)Ty+x𝖳(H𝖳Λ-1H)x.

Adding and subtracting yT𝖳(H𝖳Λ-1H)Ty, we find

(y-Hx)𝖳Λ-1(y-Hx)=y𝖳Λ-1y-yT𝖳(H𝖳Λ-1H)Ty+(x-Ty)𝖳(H𝖳Λ-1H)(x-Ty).

Hence, rearranging,

(y-Hx)𝖳Λ-1(y-Hx)-y𝖳Λ-1y=(x-Ty)𝖳(H𝖳Λ-1H)(x-Ty)-yT𝖳(H𝖳Λ-1H)Ty,

which yields the result. \scaleobj0.75

Proof (creftype 1).

By a general characterisation of minimal sufficient statistics (see, e.g., Th. 6.2.13 in Casella:2001:Statistical_Inference), Ty is a minimal sufficient statistic for x if and only if it is true that p(y1|x)/p(y2|x) is constant as a function of x if and only if Ty1=Ty2. Indeed, by creftype 1,

logp(y1|x)p(y2|x)=(Ty1-Ty2)𝖳(H𝖳Λ-1H)-1x+const.

which, by invertibility of H𝖳Λ-1H, does not depend on x if and only if Ty1=Ty2. \scaleobj0.75

Appendix D Characterisation of Diagonal Projected Noise

Lem. 2.

Let U be a truncated orthogonal matrix and D>0 diagonal. Then every solution M>0 of U𝖳M-1U=D-1 is of the form M=UDU𝖳+VEV𝖳 with E>0 diagonal and V such that [UV] is orthogonal.

Proof.

We equivalently show that M-1 for every solution M>0 of U𝖳MU=D-1 is of the desired form. Denote

𝒟={M0:U𝖳MU=D-1},𝒩={M0:U𝖳MU=0}.

If M1,M2𝒟, then U𝖳(M1-M2)U=0 shows that M1-M2𝒩. Therefore, if M𝒟, then 𝒟=M+𝒩. It is clear that UD-1U𝖳𝒟. Thus, 𝒟=UD-1U𝖳+𝒩. Let M𝒩, and let M=VEV𝖳 be the spectral decomposition of M. Assume that E>0 by dropping the appropriate columns from V. Then U𝖳MU=0 says that U and V have pairwise orthogonal columns, so the columns of V are in U. We conclude that every M𝒟 is of the form M=UD-1U𝖳+VEV𝖳 with E>0 diagonal and V a truncated orthogonal matrix with columns in U. If M>0, then in particular M has full rank, so it is clear that [UV] must form an orthogonal matrix. Therefore, M-1=UDU𝖳+VE-1V𝖳, which is of the desired form. \scaleobj0.75

Prop. 4.

The projected noise is diagonal if and only if H is of the form H=Λ12US12 with U a truncated orthogonal matrix and S>0 diagonal. Suppose that this is the case, and fix such a U. Then H is of the form H=UD12 with D>0 diagonal if and only if every column of U is an eigenvector of Λ.

Proof.

The projected noise is diagonal if and only if H𝖳Λ-1H=S for some S>0 diagonal. This condition is equivalent to

S-12H𝖳Λ-12Λ-12HS-12=I,

which, in turn, holds if and only if Λ-12HS-12=U is a truncated orthogonal matrix. Thus, the projected noise is diagonal if and only if H is of the form H=Λ12US12 with U a truncated orthogonal matrix and S>0 diagonal.

For the second statement, note that the condition that every column of U is an eigenvector of Λ is equivalent to the condition that Λ is of the form Λ=UDU𝖳+VEV𝖳 with D,E>0 diagonal and V such that [UV] forms an orthogonal matrix. Suppose that Λ is of this form. Then

Λ12=UD12U𝖳+VE12V𝖳

implies that

H=UD12U𝖳US12+VE12V𝖳US12=UD12S12,

which is of the desired form. Conversely, suppose that H=UD12 with D>0 diagonal. Then

S-12H𝖳Λ-1HS-12=I=S-12D12U𝖳Λ-1UD12S-12U𝖳Λ-1U=SD-1.

Therefore, by creftype 2,

Λ=UD-1SU𝖳+VEV𝖳

with E>0 diagonal and V such that [UV] is orthogonal, which again is of the desired form. \scaleobj0.75

Appendix E Form of Projection and Projected Noise

Prop. 5.

Let H and Λ be of the form for the OLMM (creftype 2). Then

T=S-12U𝖳andΛT=σ2S-1+D.
Proof.

To begin with, using the matrix inversion lemma once and once in reverse,

H𝖳Λ-1H =S12U𝖳(σ2I+US12DS12U𝖳)-1US12
=S12U𝖳(σ-2I-σ-4US12(D-1+S12σ-2U𝖳US12)-1S12U𝖳)US12
=σ-2S-σ-4S(D-1+σ-2S)-1S
=(σ2S-1+D)-1,

so

ΛT=TΛT𝖳=(H𝖳Λ-1H)-1=σ2S-1+D.

Furthermore, by a similar calculation,

T =(H𝖳Λ-1H)-1H𝖳Λ-1
=(σ2S-1+D)S12U𝖳(σ-2I-σ-4US12(D-1+σ-2S)-1S12U𝖳)
=(σ2S-1+D)(σ-2S-σ-4S(D-1+σ-2S)-1S)S-12U𝖳
=(σ2S-1+D)(σ2S-1+D)-1S-12U𝖳
=S-12U𝖳.

\scaleobj0.75

Appendix F Decomposition of the Mean Squared Error

Prop. 6.

Let H=US12 with U a truncated orthogonal matrix and S12>0 diagonal. Then

y-Hx2MSE=PUy2variance notcaptured by basis+i=1mSiivariance ofith latent process((Ty)i-xi)2MSE ofith latent process

where T=S-12U𝖳 and PU is the orthogonal projection onto the orthogonal complement of U.

Proof.

By expanding and using orthogonality of U,

y-Hx2 =y2-2y,US12x+US12x2
=y2-U𝖳y2+U𝖳y2-2U𝖳y,S12x+S12x2
=y,(I-UU𝖳)y+i=1m(ui,y2-2ui,ySii12xi+(Sii12xi)2)
=y,(I-UU𝖳)y+i=1mSii(Sii-1ui,y2-2Sii-12ui,yxi+xi2)
=y,(I-UU𝖳)y+i=1mSii((Ty)i-xi)2,

where ui is the ith column of U. Note that PU=UU𝖳 is the orthogonal projection onto U, so I-UU𝖳=PU is the orthogonal projection onto the orthogonal complement of U. Therefore,

y,(I-UU𝖳)y=y,PUy=y,PU2y=PU𝖳y,PUy=PUy,PUy=PUy2.

\scaleobj0.75

Appendix G Kullback–Leibler Divergence Between an LMM and OLMM

Prop. 7.

Consider two LMMs with equal Λ=σ2Ip, equal K(t,t), but different bases H and H^. Let t1,,tn𝒯 and denote xi=x(ti) and yi=y(ti). It then holds that

DKL(p(y1:n,x1:n)p^(y1:n,x1:n))=DKL(p^(y1:n,x1:n)p(y1:n,x1:n))=n12σ2H-H^F2

and

infH^:OLMMDKL(p(y1:n,x1:n)p^(y1:n,x1:n))n𝔼[f(t)2]σ2maxi(1-Vii)n𝔼[f(t)2]2σ2I-VF2

where H^ ranges over matrices of the form US12 with U a truncated orthogonal matrix and S12>0 diagonal, V is the orthogonal matrix collecting the right singular vectors of H, and 𝔼[f(t)2] denotes the variance of the observations under the first LMM before adding noise.

Proof.

Start out by expanding the Kullback–Leibler divergence and noting that p(x1:n)=p^(x1:n):

DKL(p(y1:n,x1:n)p^(y1:n,x1:n)) =-𝔼p(y1:n,x1:n)logp^(y1:n|x1:n)p^(x1:n)p(y1:n|x1:n)p(x1:n)
=-i=1n𝔼p(yi,xi)[logp^(yi|xi)-logp(yi|xi)]
=-i=1n𝔼p(yi,xi)[log𝒩(yi;H^xi,σ2Ip)-log𝒩(yi;Hxi,σ2Ip)].

Here

𝔼p(yi,xi)[log𝒩(yi;H^xi,σ2Ip)]=-p2log2πσ2-12σ2𝔼p(yi,xi)[yi-H^xi2]

where

𝔼p(yi,xi)[yi-H^xi2] =𝔼p(yi,xi)tr[yiyi𝖳-2yixi𝖳H^𝖳+xixi𝖳H^H^𝖳]
=tr[HH𝖳+σ2I-2HH^𝖳+H^H^𝖳]
=pσ2+tr[(H-H^)(H-H^)𝖳]
=pσ2+H-H^F2.

Therefore,

DKL(p(y1:n,x1:n)p^(y1:n,x1:n))=n12σ2H-H^F2.

Let H=USV𝖳 be the SVD of H where U is a truncated orthogonal matrix with the same shape as H, S>0 is a square diagonal matrix, and V is a square orthogonal matrix. Note that U𝖳U=I, but UU𝖳I. Then, choosing H^=US,

infH^:OLMMDKL(p(y1:n,x1:n)p^(y1:n,x1:n))n12σ2U(SV𝖳-S)F2=n12σ2SV𝖳-SF2

since UAF2=tr[A𝖳U𝖳UA]=tr[A𝖳A]=AF2. We now further simplify:

SV𝖳-SF2=tr[(SV𝖳-S)(SV𝖳-S)𝖳]=tr[SV𝖳VS-SV𝖳S-SVS+SS]=2tr[SS-SVS].

Hence, by definition of the trace and the fact that S is diagonal,

SV𝖳-SF2=2i=1mSii2(1-Vii)2(i=1mSii2)maxi(1-Vii)=2𝔼[f2]maxi(1-Vii),

since

𝔼[f(t)2]=𝔼tr[f(t)f𝖳(t)]=tr[HH𝖳]=tr[S2].

Therefore,

SV𝖳-SF22𝔼[f2]maxi(1-Vii)2𝔼[f2]i=1m(1-Vii)=𝔼[f2]I-VF2,

where the equality follows from a similar calculation:

I-VF2=tr[I-V𝖳-V+V𝖳V]=2tr[I-V].

\scaleobj0.75

Appendix H Missing Data

Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of H and retain orthogonality of the columns. Suppose that data are missing for a fixed subset of outputs. Collect that subset of outputs into y1, and let y2 be the remainder of the outputs. The issue can then be resolved by considering a multi-headed version of the OLMM:

y1|H1,x 𝒢𝒫(H1x(t),δ[t-t]Λ1),
y2|H2,x 𝒢𝒫(H2x(t),δ[t-t]Λ2).

For k heads, if yI=(yi)iI for a subset of heads I{1,,k} are observed, then

TIyI=ΛTIiIΛTi-1Tiyi,ΛTI-1=iIΛTi-1.

While this new model resolves the missing data problem, the assumption that each collection of outputs (each head) is a linear transformation of the same collection of latent processes will be overly restrictive for many contexts: it is not typically the case that our prior beliefs about the relationships between outputs have anything to do with the availability of data pertaining to them.

In the general case, where data can be missing arbitrarily, let Yo be the observed data. Complement Yo with missing data Ym such that Y=YoYm is complete. Then a simple solution is to use variational inference. Assume a Gaussian approximate posterior distribution q(Ym) over Ym, and maximise the evidence lower bound (ELBO) using gradient-based optimisation:

logp(Yo)𝔼q(Ym)[logp(Y)]+H[q(Ym)]=[q(Ym)],

where the expectation can be approximated using the reparametrisation trick (Kingma:2013:Auto-Encoding_VB), logp(Y) can be computed efficiently because Y is complete, and H[q(Ym)] denotes the entropy of q(Ym). This approach provides a tractable solution when the number of missing data are not too numerous.

Appendix I Heterogeneous Observation Noise

Although the specification of the observation noise Λ=σ2Ip+HDH𝖳 in the OLMM does not allow for heterogeneous observation noise, it is possible to set Λ=diag(σ12,,σp2) and use creftype 4 to include Λ in the parametrisation of H: H=Λ12US12. This parametrisation can be interpreted in two ways: {enumerate*}

The model has a whitening transform built in. In the projection T, the (noise in the) data will first by whitened by Λ-12. Hence, this parametrisation can be used as a more principled substitute for the usual data normalisation where the outputs are divided by their empirical standard deviation prior to feeding them to the model.

The basis is orthogonal with respect to a weighted Euclidean inner product: hi,hjΛ-1=k=1phikhjk/σk2=0 for ij. Intuitively, this means that the basis is orthogonal in the usual sense after stretching the ith dimension by σi-1. Although this construction provides additional flexiblity, it does require that D=0 to avoid a circular dependency between Λ and H.

Appendix J Point Process Experimental Details and Additional Results

We use a total of np=200×100 bins in the model, where each bin is 5m × 5m. The likelihood for this experiment is given by

p(Yf)i=1nj=1pPoisson(Yijexp(f(rij)))

where rij is the coordinate of the ijth bin, Yij is the number of data points in the ijth bin, and Y is the n×p matrix of counts.

{tikzpicture}{axis}

[axis on top, xmin=0, xmax=1000, xlabel=x1 (meters), y dir=reverse, ymin=0, ymax=500, ylabel=x2 (meters), axis background/.style=fill=white, legend style=legend cell align=left,align=left,draw=white!15!black, width=height=] \addplot[forget plot] graphics [xmin=-0.394321766561514,xmax=1000.39432176656,ymin=-0.394944707740916,ymax=500.394944707741] ./figures/trees-1.png;

{tikzpicture}{axis}

[point meta min=-3, point meta max=3, axis on top, xmin=0, xmax=1000, xlabel=x1 (meters), y dir=reverse, ymin=0, ymax=500, ylabel=x2 (meters), axis background/.style=fill=white, legend style=legend cell align=left,align=left,draw=white!15!black, width=height=] \addplot[forget plot] graphics [xmin=-5.05050505050505,xmax=1005.05050505051,ymin=-1.25628140703518,ymax=501.256281407035] ./figures/intensity-1.png;

Figure 7: Observations of the rainforest tree locations (left), and posterior mean log-intensity for the log-Gaussian Cox process model (right) with a grid of np=20000 observation bins.

We perform 105 iterations of Gibbs sampling, alternating between Elliptical Slice Sampling for the Gaussian process given its hyperparameters, and Metropolis Hastings with proposal distribution 𝒩(θ,0.052) for the log of the hyperparameters given the latent GP-distributed function. The kernel is a product of two Matérn-5/2 kernels, one over the vertical dimension and the other over the horizontal one, each with their own length scale. A single process variance is utilised, and a nugget term is added. The log of each of the four hyperparameters was given a 𝒩(0,1) prior. creftypecap 8 shows the log joint of the entire state after each iteration, while creftype 9 shows the progress of each hyperparameter per iteration.

Figure 8: Log–joint probability per iteration
Figure 9: Hyperparameters per iteration. Shows the horizontal length scale (top-left), vertical length scale (top-right), nugget variance (bottom-right), and process variance (bottom-left).

The times in creftype 4 were obtained via BenchmarkTools.jl (BenchmarkTools.jl-2016). The implementation of the standard Kronecker product decomposition trick makes use of Kronecker.jl, and Julia’s (bezanson2017julia) standard linear algebra libraries, which make use of OpenBLAS and LAPACK to efficiently perform matrix-matrix products and compute eigendecompositions. The implementation of the state-space GP additionally makes use of StaticArrays.jl for efficient stack-allocated matrices, and Stheno.jl for GP-related functionality. Timing experiments were conducted on a single CPU core to make the implementations comparable.

The state-space implementation of the GP makes use of the trick introduced by Solin:2018:Infinite-Horizon_Gaussian_Processes when computing the log marginal likelihood. However, the infinite-horizon trick is exploited only once the filtering covariance has converged, which is determined by the point at which the Frobenius norm of the difference between the filtering covariance at the tth and (t-1)th iterations drops below 10-12. In practice, we find that this produces log-marginal likelihoods that are essentially exact, as expected.