### Abstract

Multi-output Gaussian processes (MOGPs) leverage the flexibility andinterpretability of GPs while capturing structure across outputs, which isdesirable, for example, in spatio-temporal modelling. The key problem withMOGPs is the cubic computational scaling in the number of both inputs (e.g.,time points or locations), n, and outputs, p. Current methods reduce this toO(n^3 m^3), where m < p is the desired degrees of freedom. This computationalcost, however, is still prohibitive in many applications. To address thislimitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP inwhich exact inference scales linearly in m: O(n^3 m). This advance opens up awide range of real-world tasks and can be combined with existing GPapproximations in a plug-and-play way as demonstrated in the paper.Additionally, the paper organises the existing disparate literature on MOGPmodels into a simple taxonomy called the Mixing Model Hierarchy (MMH).

### Quick Read (beta)

###### Abstract

Multi-output Gaussian processes (MOGPs) leverage the flexibility and interpretability of GPs while capturing structure across outputs, which is desirable, for example, in spatio-temporal modelling. The key problem with MOGPs is the cubic computational scaling in the number of both inputs (e.g., time points or locations), $n$, and outputs, $p$. Current methods reduce this to $O({n}^{3}{m}^{3})$, where $$ is the desired degrees of freedom. This computational cost, however, is still prohibitive in many applications. To address this limitation, we present the Orthogonal Linear Mixing Model (OLMM), an MOGP in which exact inference scales linearly in $m$: $O({n}^{3}m)$. This advance opens up a wide range of real-world tasks and can be combined with existing GP approximations in a plug-and-play way as demonstrated in the paper. Additionally, the paper organises the existing disparate literature on MOGP models into a simple taxonomy called the Mixing Model Hierarchy (MMH).

remresetThe remreset package \WarningFilter*[parskip]latexCommand \pgfplotssetcompat=newest \usetikzlibrary calc, positioning, fit, tikzmark, arrows.meta, shapes, decorations.pathreplacing \tikzset line/.style = thick, ->, > = Triangle[length=2.0mm, width=2.0mm] , arrow/.style = line , hidden node/.style = circle, minimum size = 0.9cm, draw = white, thick , latent node/.style = hidden node, draw = black, , factor node/.style = hidden node, rectangle, draw = black, , observed node/.style = latent node, fill = gray!15 , plate/.style = draw, label=[anchor=north west]south west:#1, rounded corners=2pt, shape=rectangle, inner sep=10pt, thick \tikzset every picture/.append style = xscale = 2, yscale = 2 + \crefnameassumptionAssumptionAssumptions \CrefnameassumptionAssumptionAssumptions \crefnamecorollaryCorollaryCorollaries \CrefnamecorollaryCorollaryCorollaries \crefnamedefinitionDefinitionDefinitions \CrefnamedefinitionDefinitionDefinitions \crefnameexampleExampleExamples \CrefnameexampleExampleExamples \crefnamefactFactFacts \CrefnamefactFactFacts \crefnamelemmaLemmaLemmas \CrefnamelemmaLemmaLemmas \crefnamemodelModelModels \CrefnamemodelModelModels \crefnamepropositionPropositionPropositions \CrefnamepropositionPropositionPropositions \crefnamequestionQuestionQuestions \CrefnamequestionQuestionQuestions \crefnameremarkRemarkRemarks \CrefnameremarkRemarkRemarks \crefnametheoremTheoremTheorems \CrefnametheoremTheoremTheorems \crefnameasslistiAssumptionAssumptions \CrefnameasslistiAssumptionAssumptions \crefnamecorlistiCorollaryCorollaries \CrefnamecorlistiCorollaryCorollaries \crefnamedeflistiDefinitionDefinitions \CrefnamedeflistiDefinitionDefinitions \crefnameexlistiExampleExamples \CrefnameexlistiExampleExamples \crefnamefactlistiFactFacts \CrefnamefactlistiFactFacts \crefnamelemlistiLemmaLemmas \CrefnamelemlistiLemmaLemmas \crefnamemodlistiModelModels \CrefnamemodlistiModelModels \crefnameproplistiPropositionPropositions \CrefnameproplistiPropositionPropositions \crefnameqlistiQuestionQuestions \CrefnameqlistiQuestionQuestions \crefnameremlistiRemarkRemarks \CrefnameremlistiRemarkRemarks \crefnamethmlistiTheoremTheorems \CrefnamethmlistiTheoremTheorems \crefnamesectionSec.Sects. \crefnamepropositionProp.Props. \crefnamelemmaLem.Lems. \crefnamemodelMod.Mods. \crefnameappendixApp.Apps.

Scalable Exact Inference in Multi-Output Gaussian Processes

Wessel P. Bruinsma${}^{\mathrm{1}\mathrm{,}\mathrm{2}}$ Eric Perim${}^{\mathrm{2}}$ Will Tebbutt${}^{\mathrm{1}}$ J. Scott Hosking${}^{\mathrm{3}}$ Arno Solin${}^{4}$ Richard E. Turner${}^{1,5}$

${}^{1}$University of Cambridge, ${}^{2}$Invenia Labs, ${}^{3}$British Antarctic Survey, ${}^{4}$Aalto University, ${}^{5}$Microsoft Research

## 1 INTRODUCTION

Application areas such as geostatistics, sensor networks, psychometrics, neuroscience, brain imaging, and finance all utilise multi-output statistical models. A rich, but disparate literature has deployed Gaussian processes (GPs, Rasmussen:2006:Gaussian_Processes) in such modelling problems, the aim being to leverage their interpretability, modularity, and tractability. Indeed, GPs have been successfully applied in a wide variety of single-output contexts: they can automatically discover structure in signals (Duvenaud:2014:Automatic_Construction), achieve state-of-the-art performance in regression tasks (Bui:2016:Deep_Gaussian_Processes_for_Regression), enable data-efficient models in reinforcement learning (Deisenroth:2011:PILCO_A_Model-Based_and_Data-Efficient), and support many applications in probabilistic numerics (Hennig:2015:Probabilistic_Numerics_and_Uncertainty_in), such as in optimisation (Brochu:2010:A_Tutorial_on_Bayesian_Optimization) and quadrature (Minka:2000:Quadrature_GP).

One of the first applications of GPs to multiple outputs was in geostatistics (Matheron:1969:Le_Krigeage_Universel). Today, multi-output GP (MOGP) models can be found in various areas, including geostatistics (Wackernagel:2003:Multivariate_Geostatistics), factor analysis (Teh:2005:Semiparametric_Latent_Factor; Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), dependent or multi-task learning (Boyle:2005:Dependent_Gaussian_Processes; Bonilla:2007:Kernel_Multi-Task_Learning_Using_Task-Specific; Bonilla:2008:Multi-Task_Gaussian_Process; Osborne:2008:Towards_Real-Time_Information_Processing_of), latent force models (Alvarez:2009:Latent_Force_Models; Alvarez:2009:Sparse_Convolved_Gaussian_Processes_for; Alvarez:2010:Efficient_Multioutput_Gaussian_Processes_Through; Alvarez:2011:Computationally_Efficient_Convolved), state space modelling (Sarkka:2013:Spatiotemporal_Learning_via), regression networks (Wilson:2012:GP_Regression_Networks; Nguyen:2014:Collaborative_Multi-Output; Dezfouli:2017:Semi-Parametric_Network_Structure_Discovery_Models), and mixture models (Ulrich:2015:Cross_Spectrum; Bruinsma:2016:GGPCM; Parra:2017:Spectral_Mixture_Kernels_for_Multi-Output; Requeima:2019:The_Gaussian_Process_Autoregressive_Regression). Although many of these models are constructed in a similar spirit, they differ in their details and terminology, which makes them difficult to compare and organise.

A key practical problem with existing MOGPs is their computational complexity. For $n$ input points each having $p$ outputs, inference and learning in general MOGPs take $O({n}^{3}{p}^{3})$ time and $O({n}^{2}{p}^{2})$ memory, although these may be alleviated by a wide range of approximations (Quinonero:2005:Unifying_View; Titsias:2009:Variational_Learning; Lazaro-Gredilla:2010:Sparse_Spectrum_Gaussian_Process_Regression; Hensman:2013:Gaussian_Processes_for_Big_Data; Wilson:2015:Kernel_Interpolation_for_Scalable_Structured; Bui:2016:A_Unifying_Framework_for_Gaussian; Cheng:2017:Variational_Inference_for_Gaussian_Process; Hensman:2018:Variational_Fourier_Features_for_Gaussian). To mitigate these unfavourable scalings, a particular class of MOGPs, called the Linear Mixing Model (LMM, creftype 2.2), exploits low-rank structure of their covariance via the matrix inversion and determinant lemmas to reduce the complexity of inference and learning to $O({n}^{3}{m}^{3})$ time and $O({n}^{2}{m}^{2})$ memory, where $m$ is the desired degrees of freedom of the observations, typically much less than $p$. However, many multi-output regression problems cannot be solved adequately with only a small number of degrees of freedom, making the LMM prohibitively expensive in such cases. For example, consider temperatures measured across the Earth’s surface concatenated into a vector. Due to the complexity of the climate, we anticipate behaviour at many spatial length scales, so a model that supposes that this vector lives in a particularly low-dimensional “pancake”${}^{\text{1}}$ will not accurately reflect reality.

^{†}

^{†}${}^{\text{1}}$The distribution of $Ax+\epsilon $ for $A$ tall, $x$ Gaussian, and $\epsilon $ noise (Roweis:1999:A_Unifying_Review_of_Linear; MacKay:2002:Information_Theory_Learning).

We present a unifying view of MOGPs to simplify the complicated web of models in this class.
We call this the Mixing Model Hierarchy (MMH).
Building on the Linear Mixing Model (LMM), which is the class of models at the bottom of the MMH, we introduce the Orthogonal Linear Mixing Model (OLMM).
The OLMM is a particular LMM in which inference and learning take $O({n}^{3}m)$ time and $O({n}^{2}m)$ memory, without significantly sacrificing expressivity nor requiring any approximations.
The OLMM is simple to implement and trivially compatible with existing single-output GP scaling techniques to scale to large numbers of data points $n$.
In particular, for a large class of multi-output problems, e.g. many spatio-temporal problems, the time and memory complexity can be further reduced to $O(nm)$, which is linear in *both* the number of data points $n$ and degrees of freedom $m$.
We demonstrate the OLMM in experiments on synthetic and real-world data sets.

## 2 MULTI-OUTPUT GP MODELS

For tasks with $p$ outputs, multi-output Gaussian processes induce a prior distribution over *vector-valued* functions $f:\mathcal{T}\to {\mathbb{R}}^{p}$ by requiring that any finite number of function values ${f}_{{p}_{1}}({t}_{1}),\mathrm{\dots},{f}_{{p}_{n}}({t}_{n})$ with ${({p}_{i})}_{i=1}^{n}\subseteq \{1,\mathrm{\dots},p\}$ are multivariate Gaussian distributed.
Often, the input space $\mathcal{T}=\mathbb{R}$ is time, or $\mathcal{T}={\mathbb{R}}^{d}$ is some feature space.
An MOGP $f\sim \mathcal{G}\mathcal{P}(m,K)$ is described by a *vector-valued* mean function

$$m(t)=\mathbb{E}[f(t)]={\left[\begin{array}{ccc}\hfill \mathbb{E}[{f}_{1}(t)]\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \mathbb{E}[{f}_{p}(t)]\hfill \end{array}\right]}^{\text{\U0001d5b3}}$$ |

and a *matrix-valued* covariance function

$K(t,{t}^{\prime})=\mathbb{E}[f(t){f}^{\text{\U0001d5b3}}({t}^{\prime})]-\mathbb{E}[f(t)]\mathbb{E}[{f}^{\text{\U0001d5b3}}({t}^{\prime})]$ | |||

$=\left[\begin{array}{ccc}\hfill \mathrm{cov}({f}_{1}(t),{f}_{1}({t}^{\prime}))\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \mathrm{cov}({f}_{1}(t),{f}_{p}({t}^{\prime}))\hfill \\ \hfill \mathrm{\vdots}\hfill & \hfill \mathrm{\ddots}\hfill & \hfill \mathrm{\vdots}\hfill \\ \hfill \mathrm{cov}({f}_{p}(t),{f}_{1}({t}^{\prime}))\hfill & \hfill \mathrm{\cdots}\hfill & \hfill \mathrm{cov}({f}_{p}(t),{f}_{p}({t}^{\prime}))\hfill \end{array}\right].$ |

For $n$ observations $y({t}_{1}),\mathrm{\dots},y({t}_{n})\in {\mathbb{R}}^{p}$, inference and learning take $O({n}^{3}{p}^{3})$ time and $O({n}^{2}{p}^{2})$ memory.

### 2.1 Mixing Model Hierarchy

Our unifying Mixing Model Hierarchy begins with Linear Mixing Models (LMMs), a simple class of MOGP models typically characterised by low-rank covariance structure. An LMM decomposes a signal $f(t)$ comprising $p$ outputs into a fixed basis ${h}_{1},\mathrm{\dots},{h}_{m}\in {\mathbb{R}}^{p}$ whose coefficients ${x}_{1}(t),\mathrm{\dots},{x}_{m}(t)$ are time-varying and modelled independently with Gaussian processes:

$f(t)$ | $={h}_{1}{x}_{1}(t)+\mathrm{\dots}+{h}_{m}{x}_{m}(t)$ | ||

$=\underset{H}{\underset{\ufe47}{\left[\begin{array}{ccc}\hfill {h}_{1}\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {h}_{m}\hfill \end{array}\right]}}\underset{x(t)}{\underset{\ufe47}{{\left[\begin{array}{ccc}\hfill {x}_{1}(t)\hfill & \hfill \mathrm{\cdots}\hfill & \hfill {x}_{m}(t)\hfill \end{array}\right]}^{\text{\U0001d5b3}}}}=Hx(t).$ |

The noisy signal $y(t)$ is, then, generated by adding $\mathcal{N}(0,\mathrm{\Lambda})$-distributed noise to $f(t)$. Intuitively, this means that the $p$-dimensional data lives in a “pancake” around the $m$-dimensional column space of $H$, where typically $m\ll p$.

###### Model 1 (Linear Mixing Model).

Let $K$ be an $m\times m$ diagonal multi-output kernel with $K(t,t)={I}_{m}$, $H$ a $p\times m$ matrix, and $\mathrm{\Lambda}=\mathrm{diag}({\sigma}_{1}^{2},\mathrm{\dots},{\sigma}_{p}^{2})$. Then the LMM is given by the following generative model:

$x$ | $\sim \mathcal{G}\mathcal{P}(0,K(t,{t}^{\prime})),$ | (latent processes) | ||

$f(t)|H,x$ | $=Hx(t),$ | (likelihood) | ||

$y|f$ | $\sim \mathcal{G}\mathcal{P}(f(t),\delta [t-{t}^{\prime}]\mathrm{\Lambda}).$ | (noise model) |

We call $x$ the latent processes and $H$ the mixing matrix or basis. If we marginalise out $f$ and $x$, we find $y\sim \mathcal{G}\mathcal{P}(0,HK(t,{t}^{\prime}){H}^{\text{\U0001d5b3}}+\mathrm{\Lambda})$, which reveals the low-rank covariance structure of the LMM. It also shows that the LMM is a time-varying generalisation of factor analysis (FA): choosing $K(t,{t}^{\prime})=\delta [t-{t}^{\prime}]{I}_{m}$ recovers FA exactly.

The graphical model of the LMM is illustrated in the top-left corner of creftype 6 in creftype A, which highlights two restrictions of the LMM compared to a general MOGP:
(i) the instantaneous spatial covariance of $f$, $\mathbb{E}[f(t){f}^{\text{\U0001d5b3}}(t)]=H{H}^{\text{\U0001d5b3}}$, does not vary with time, because $H$ does not vary with time; and
(ii) the noise-free observation $f(t)$ is generated from $x({t}^{\prime})$ for ${t}^{\prime}=t$ only, meaning that, for example, $f$ cannot be $x$ with a delay or a smoothed version of $x$.
We hence call the LMM a time-invariant (due to *(i)*) and instantaneous (due to *(ii)*) MOGP.

The LMM can be generalised in three ways. First, the mixing matrix $H$ may vary with time. Then $H\in {\mathbb{R}}^{p\times m}$ becomes a matrix-valued function $H:\mathcal{T}\to {\mathbb{R}}^{p\times m}$, and the likelihood becomes $f(t)|H,x=H(t)x(t)$. We call such MOGP models time-varying (see creftype 6, top-right). Second, $f(t)$ may depend on $x({t}^{\prime})$ for all ${t}^{\prime}\in \mathcal{T}$. Then the mixing matrix $H\in {\mathbb{R}}^{p\times m}$ becomes a matrix-valued time-invariant filter $H:\mathcal{T}\to {\mathbb{R}}^{p\times m}$, and the likelihood becomes $f(t)|H,x=\int H(t-\tau )x(\tau )d\tau $. We call such MOGP models convolutional (see creftype 6, bottom-left). Finally, $f(t)$ may depend on $x({t}^{\prime})$ for all ${t}^{\prime}\in \mathcal{T}$ and this relationship may vary with time. Then the mixing matrix $H\in {\mathbb{R}}^{p\times m}$ becomes a matrix-valued time-varying filter $H:\mathcal{T}\times \mathcal{T}\to {\mathbb{R}}^{p\times m}$, and the likelihood becomes $f(t)|H,x=\int H(t,\tau )x(\tau )d\tau $. We call such MOGP models time-varying and convolutional (see creftype 6, bottom-right). The graphical models corresponding to these generalisations of the LMM are depicted in creftype 6 in the creftype A.

The LMM can be extended in one other way, which is to include a prior distribution on $H$. This extension and the three previously proposed generalisations together form the Mixing Model Hierarchy (MMH), which is depicted in creftype 1. The MMH organises multi-output Gaussian process models according to their distinctive modelling assumptions. creftype 1 shows how sixteen MOGP models from the machine learning and geostatistics literature can be recovered as special cases of the various generalisations of the LMM. Not all multi-output Gaussian process models are covered by the MMH, however. For example, Deep GPs (Damianou:2015:Deep_Gaussian_Processes_and_Variational) and variations thereon (Kaiser:2017:Bayesian_Alignments_of_Warped_Multi-Output) are excluded because they transform the latent processes nonlinearly to generate the observations.

### 2.2 Inference and Learning in the LMM

In MOGPs, the complexities of inference and learning can often be alleviated by exploiting structure in the kernel. This is also the case for the Linear Mixing Model: the $p$-dimensional observations have only $m$ degrees of freedom, where typically $m\ll p$, and it turns out that the $O({n}^{3}{p}^{3})$ time and $O({n}^{2}{p}^{2})$ memory complexities can be brought down to $O({n}^{3}{m}^{3})$ and $O({n}^{2}{m}^{2})$. In particular, in the following we show that the covariance structure of the LMM allows us to summarise the $p$-dimensional observations in an $m$-dimensional space, without loss of information.

We identify a low-dimensional projection of observations $y$ that is a sufficient statistic for the model and which can therefore be used to accelerate inference. This projection of $y$ is given by the maximum likelihood estimate (MLE) of $x$ under the likelihood of the LMM. Denote $p(y|x)=\mathcal{N}(y;Hx,\mathrm{\Lambda})$, and let $T$ be the $m\times p$ matrix ${({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}$. creftype 3 in creftype B shows that $Ty={\mathrm{arg}\mathrm{max}}_{x}p(y|x)$ and that $Ty$ is an unbiased estimate of $x$. Hence, $Ty$ is a MLE for $x$, which means that it is a function of a sufficient statistic for $x$, if one exists. The following proposition shows that $Ty$ is actually minimally sufficient itself.

###### Prop. 1.

The MLE $Ty$ of $x$ is a minimal sufficient statistic for $x$. Proof. See creftype C.

For any prior $p(x)$ over $x$, sufficiency of $T$ gives that $p(x|y)=p(x|Ty)$; i.e., conditioning on $y$ is equivalent to conditioning on $Ty$, where $Ty$ can be interpreted as a “summary” or “projection” of $y$. This idea is formalised in the following proposition, which is an immediate consequence of creftypeplural 1\crefpairconjunction1 in creftype C.

###### Prop. 2.

Let $p(x)$ be a model for $x:\mathcal{T}\to {\mathbb{R}}^{m}$, not necessarily Gaussian, $H$ a $p\times m$ matrix, and $\mathrm{\Lambda}=\mathrm{diag}({\sigma}_{1}^{2},\mathrm{\dots},{\sigma}_{p}^{2})$. Then consider the following generative model:

$x$ | $\sim p(x),$ | (latent processes) | ||

$f(t)|H,x$ | $=Hx(t),$ | (likelihood) | ||

$y|f$ | $\sim \mathcal{G}\mathcal{P}(f(t),\delta [t-{t}^{\prime}]\mathrm{\Lambda}).$ | (noise model) |

Consider a $p\times n$ matrix $Y$ of observations of $y$. Then $p(f|Y)=p(f|TY)$. Furthermore, the distribution of the projected observed signal $Ty$ is given by

$$Ty|x\sim \mathcal{G}\mathcal{P}(x(t),\delta [t-{t}^{\prime}]{\mathrm{\Lambda}}_{T})\text{with}{\mathrm{\Lambda}}_{T}={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1},$$ |

and the likelihood of $Y={({y}_{i})}_{i=1}^{n}\subseteq {\mathbb{R}}^{p}$ is given by

$$p(Y)=\left[\prod _{i=1}^{n}\frac{\mathcal{N}({y}_{i};0,\mathrm{\Lambda})}{\mathcal{N}({y}_{i};0,{\mathrm{\Lambda}}_{T})}\right]\int p(x)\prod _{i=1}^{n}\mathcal{N}(T{y}_{i};{x}_{i},{\mathrm{\Lambda}}_{T})\mathrm{d}x.$$ |

Crucially, $Y$ are $p$-dimensional observations, $TY$ are $m$-dimensional summaries, and typically $m\ll p$, so conditioning on $TY$ is often much cheaper. In particular, if we apply creftype 2 to the LMM by letting $x\sim \mathcal{G}\mathcal{P}(0,K(t,{t}^{\prime}))$, we immediately get the claimed reduction in complexities: whereas conditioning on $Y$ takes $O({n}^{3}{p}^{3})$ time and $O({n}^{2}{p}^{2})$ memory, we may equivalently condition on $TY$, which takes $O({n}^{3}{m}^{3})$ time and $O({n}^{2}{m}^{2})$ memory instead. This important observation is depicted in creftype 1(a).

In creftype 2, we call
${\mathrm{\Lambda}}_{T}=T\mathrm{\Lambda}{T}^{\text{\U0001d5b3}}={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}$
the *projected observation noise*.
The noise term ${\mathrm{\Lambda}}_{T}$ is important, because it couples the latent processes upon observing data.
In particular, if the latent processes are independent under the prior and ${\mathrm{\Lambda}}_{T}$ is diagonal, then the latent processes remain independent when data is observed.
This observation forms the basis of the computational gains achieved by the Orthogonal Linear Mixing Model, which we present next.

## 3 THE ORTHOGONAL LINEAR MIXING MODEL

The Orthogonal Linear Mixing Model (OLMM) is the special case of the LMM where the basis, the columns of $H$, is *orthogonal*.
In particular, $H=U{S}^{\frac{1}{2}}$ where $U$ is a truncated orthogonal matrix and $S>0$ a diagonal scaling. We define this model as follows.

###### Model 2 (Orthogonal Linear Mixing Model).

Let $K$ be an $m\times m$ diagonal multi-output kernel with $K(t,t)={I}_{m}$, $H$ a $p\times m$ matrix of the form $H=U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and $S>0$ diagonal, and $\mathrm{\Lambda}={\sigma}^{2}{I}_{p}+HD{H}^{\text{\U0001d5b3}}$ a $p\times p$ matrix with $D>0$ diagonal. Then the OLMM is given by the following generative model:

$x$ | $\sim \mathcal{G}\mathcal{P}(0,K(t,{t}^{\prime})),$ | (latent processes) | ||

$f(t)|H,x$ | $=Hx(t),$ | (likelihood) | ||

$y|f$ | $\sim \mathcal{G}\mathcal{P}(f(t),\delta [t-{t}^{\prime}]\mathrm{\Lambda}).$ | (noise model) |

The difference between the LMM and the OLMM is illustrated in creftype 1(c). In the OLMM, the number of latent processes must be smaller than or equal to the number of outputs, since the number of $p$-dimensional vectors that can be mutually orthogonal is at most $p$. Also, $D$ in $\mathrm{\Lambda}$ can be interpreted as heterogeneous noise deriving from the latent processes. Moreover, although $H$ and $\mathrm{\Lambda}$ do not depend on time, our analysis and results also apply to the case where ${H}_{t}$ and ${\mathrm{\Lambda}}_{t}$ do vary with time. Finally, for the OLMM, creftype 5 in creftype E shows that $T={S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}$ and ${\mathrm{\Lambda}}_{T}={\sigma}^{2}{S}^{-1}+D.$

Whereas the LMM is a time-varying generalisation of FA, the OLMM is a time-varying generalisation of probabilistic principal component analysis (PPCA, Tipping:1999:Probabilistic_Principal_Component_Analysis): $D=0$ and $K(t,{t}^{\prime})=\delta [t-{t}^{\prime}]{I}_{m}$ recovers PPCA exactly. The OLMM is also similar to Gaussian Process Factor Analysis (GPFA, Yu:2009:Gaussian-Process_Factor_Analysis_for_Low-Dimensional), with the crucial difference being that in GPFA orthogonalisation of the columns of $H$ is done as a post-processing step, whereas in the OLMM orthogonality of the columns of $H$ is built into the model.

#### Generality of the OLMM.

In the separable case, where $K(t,{t}^{\prime})=k(t,{t}^{\prime}){I}_{m}$ for a scalar-valued kernel $k$, for every LMM with homogeneous observation noise ($\mathrm{\Lambda}={\sigma}^{2}{I}_{p}$), there exists an OLMM with $D=0$ that is equal in distribution of $y$. To see this, note that

${y}^{\text{(LMM)}}$ | $\sim \mathcal{G}\mathcal{P}(0,k(t,{t}^{\prime})H{H}^{\text{\U0001d5b3}}+{\sigma}^{2}\delta [t-{t}^{\prime}]{I}_{p}),$ | ||

${y}^{\text{(OLMM)}}$ | $\sim \mathcal{G}\mathcal{P}(0,k(t,{t}^{\prime})US{U}^{\text{\U0001d5b3}}+{\sigma}^{2}\delta [t-{t}^{\prime}]{I}_{p}).$ |

Hence, letting $US{U}^{\text{\U0001d5b3}}$ be the eigendecomposition of $H{H}^{\text{\U0001d5b3}}$ gives an OLMM equal in distribution of $y$. In the nonseparable case, where diagonal elements of $K$ are linearly independent, in general only the distribution of $y(t)$ at every $t$ can be recovered by an OLMM, but the correlation between $y(t)$ and $y({t}^{\prime})$ for ${t}^{\prime}\ne t$ may be different. In terms of the joint distribution over $x$ and $y$, which is important for interpretability of the latent processes, creftype G shows that the Kullback–Leibler (KL) divergence between two LMMs with bases $H$ and $\widehat{H}$ is proportional to ${\parallel H-\widehat{H}\parallel}_{F}^{2}$, hence symmetric, where $\parallel \bullet {\parallel}_{F}$ denotes the Frobenius norm. As a consequence (creftype G), the KL between an LMM with basis $H$ and the OLMM closest in KL is upper bounded by ${\parallel I-V\parallel}_{F}^{2}$ where $V$ are the right singular vectors of $H$. This makes sense: $V=I$ implies that $H$ is of the form $U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and $S>0$ diagonal. It also shows that an LMM is close to an OLMM if $V$ is close to $I$ in the sense of the Frobenius norm.

#### Choice of $H$.

The basis $H$ can be initialised and learned in several ways. (i) $H$ can be initialised randomly and learned through gradient-based optimisation of the marginal likelihood. (ii) Using that $\mathbb{E}[f(t){f}^{\text{\U0001d5b3}}(t)]=H{H}^{\text{\U0001d5b3}}$, the basis $H$ can be initialised to $\widehat{U}{\widehat{S}}^{\frac{1}{2}}$ where $\widehat{\mathrm{\Sigma}}=\widehat{U}\widehat{S}{\widehat{U}}^{\text{\U0001d5b3}}$ is the eigendecomposition of an estimate $\widehat{\mathrm{\Sigma}}$ of the spatial covariance. (iii) In the case that there is a kernel over the outputs, e.g. in separable spatio-temporal GP models, $H$ can be set to $U{S}^{\frac{1}{2}}$ where $US{U}^{\text{\U0001d5b3}}$ is an eigendecomposition of the kernel matrix over the outputs. The hyperparameters of the kernel over the outputs can then be learned with gradient-based optimisation by differentiating through the eigendecomposition. We discuss this choice in more detail in creftype 4.

#### Diagonal projected noise.

As alluded to in creftype 2.2, the most important property of the OLMM is that the projected noise ${\mathrm{\Lambda}}_{T}$ from creftype 2 is diagonal: ${\mathrm{\Lambda}}_{T}={\sigma}^{2}{S}^{-1}+D.$ creftype 4 in the creftype D characterises exactly when this is the case. Specifically, creftype 4, says that the projected noise is diagonal if and only if $H$ is of the form $H={\mathrm{\Lambda}}^{\frac{1}{2}}U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and $S>0$ diagonal. This condition is awkward, as it couples $H$ and $\mathrm{\Lambda}$. Fortunately, creftype 4 also shows that we may drop $H$’s dependency on $\mathrm{\Lambda}$ if and only if every column of $U$ is an eigenvector of $\mathrm{\Lambda}$. In the OLMM, every column of $U$ is indeed an eigenvector of $\mathrm{\Lambda}$. We therefore identify the OLMM as a general parametrisation of the LMM for which the projected noise is diagonal. Compared to the LMM, this brings computational advantages, which we discuss next.

#### Inference.

Since the projected noise is diagonal, the latent processes remain independent when data is observed.
We may hence treat the latent processes independently, conditioning the $i$^{th} latent process ${x}_{i}$ on ${(TY)}_{i:}={Y}^{\text{\U0001d5b3}}{u}_{i}/\sqrt{{S}_{ii}}$ under noise ${({\mathrm{\Lambda}}_{T})}_{ii}={\sigma}^{2}/{S}_{ii}+{D}_{ii}$, where ${u}_{i}$ is the $i$^{th} column of $U$.
Therefore, inference takes $O({n}^{3}m)$ time and $O({n}^{2}m)$ memory, which are *linear* in the number of degrees of freedom in the data $m$.
This decoupled inference procedure is depicted in creftype 1(b).
Remark that the decoupled problems can be treated in parallel to achieve sublinear wall time, and that in the separable case further speedups are possible.

#### Learning.

For computing the likelihood of observed data, the OLMM also offers computational benefits. As the projected noise is diagonal, by absorbing $D$ into $K$, the expression for $p(Y)$ from creftype 1 further simplifies to the following:

$p(Y)$ | $=\mathcal{N}(\mathrm{vec}Y;0,{\sigma}^{2}{I}_{np})$ | ||

$\times {\displaystyle \prod _{i=1}^{m}}{\displaystyle \frac{\mathcal{N}({(TY)}_{i:};0,{K}_{i}+({\sigma}^{2}/{S}_{ii}+{D}_{ii}){I}_{n})}{\mathcal{N}({(TY)}_{i:};0,({\sigma}^{2}/{S}_{ii}){I}_{n})}},$ |

where ${K}_{i}$ is the $n\times n$ kernel matrix for the $i$^{th} latent process ${x}_{i}$.
We conclude that learning also takes $O({n}^{3}m)$ time and $O({n}^{2}m)$ memory, again *linear* in the number of latent processes.
Furthermore, we observe that $p(Y)$ is *convex* in $U$, which significantly aids learning;
in comparison, for the LMM, $p(Y)$ is highly nonconvex in the mixing matrix $H$.

#### Interpretability.

Besides computational benefits, that the OLMM breaks down into independent problems for the latent processes also promotes interpretability.
Namely, the independent problems can be separately inspected to interpret, diagnose, and improve the model.
This is *much* easier than directly working with predictions for the data, which are high dimensional and often strongly correlated between outputs.
In particular, if $H$ is sufficiently rich and the important independent problems are solved sufficiently well, then the OLMM achieves good mean squared error (MSE) for the data:

$$\underset{\text{MSE}}{\underset{\ufe47}{{\parallel y-Hx\parallel}^{2}}}=\underset{\begin{array}{c}\text{variance not}\\ \text{captured by basis}\end{array}}{\underset{\ufe47}{{\parallel {P}_{{U}^{\u27c2}}y\parallel}^{2}}}+\sum _{i=1}^{m}{S}_{ii}\underset{\begin{array}{c}\text{MSE of}\\ i\text{th}\text{latent process}\end{array}}{\underset{\ufe47}{{({(Ty)}_{i}-{x}_{i})}^{2}}}$$ |

where ${P}_{{U}^{\u27c2}}$ is the orthogonal projection onto the orthogonal complement of $U$. See creftype F for a proof.

#### Scaling.

For both learning and inference, the problem decouples into $m$ independent single-output problems. Therefore, to scale to a large number of data points $n$, off-the-shelf single-output GP scaling techniques can be trivially applied to these independent problems. For example, if the variational inducing point method by Titsias:2009:Variational_Learning is used with $r\ll n$ inducing points, then inference and learning are further reduced to $O(n{r}^{2}m)$ time and $O(nrm)$ memory. Most importantly, if $k(t,{t}^{\prime})$ is Markovian (e.g. of the Matérn class) one can leverage state-space methods to exactly solve the $m$ independent problems in closed-form (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations). This brings down the scaling to $O(nm{d}^{3})$ time and $O(nm{d}^{2})$ memory, where $d$ is the state dimension (typically $d\ll m,n$). We further discuss this approach in creftype 4.

#### Missing data and heterogeneous observation noise.

Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of $H$ and retain orthogonality of the columns. Possible methods to circumvent this issue are discussed in creftype H, but are not explored further in this work. Furthermore, the specification of the OLMM does not allow for heterogeneous observation noise. This issue is discussed in creftype I, which shows that the OLMM can have automatic whitening of the data built in to allow for heterogeneous observation noise.

## 4 APPLICATION TO SEPARABLE SPATIO-TEMPORAL MODELS

Separable spatio-temporal GPs, which are of the form $f\sim \mathcal{G}\mathcal{P}(0,{k}_{t}(t,{t}^{\prime}){k}_{r}(r,{r}^{\prime}))$, fall neatly within the OLMM framework if observed at a fixed number of locations in space. Specifically, the vector-valued process $f(t)={(f(t,{r}_{i}))}_{r=1}^{p}$ has distribution $\mathcal{G}\mathcal{P}(0,{k}_{t}(t,{t}^{\prime}){K}_{r})$ where ${K}_{r}$ is the $p\times p$ matrix with ${({K}_{r})}_{ij}={k}_{r}({r}_{i},{r}_{j})$. Letting $US{U}^{\text{\U0001d5b3}}$ be the eigendecomposition of ${K}_{r}$, $f(t)$ is an OLMM with $H=U{S}^{\frac{1}{2}}$, which we note is square, and $K(t,{t}^{\prime})={k}_{t}(t,{t}^{\prime}){I}_{p}$. Note that, in this case, projecting the data takes $O(n{p}^{2})$ time. By relaxing $K$ to be a general diagonal multi-output kernel with $K(t,t)={I}_{p}$, we obtain a nonseparable relaxation of the original separable model whilst retaining exact, computationally efficient inference. In particular, the orthogonal basis for the OLMM are the eigenvectors of a kernel matrix whose hyperparameters can be optimised.

Combining the OLMM framework with efficient state-space scaling techniques (Hartikainen:2010:Kalman_filtering_and_smoothing_solutions; Sarkka:2019:Applied_Stochastic_Differential_Equations; Solin:2018:Infinite-Horizon_Gaussian_Processes; nickisch2018state), which are either exact or arbitrarily good approximations, the complexities are reduced to $O(np+{p}^{3})$ time and $O(np+{p}^{2})$ memory for the entire problem, which are *linear* in $n$.
This compares favourably with the filtering approach by Sarkka:2013:Spatiotemporal_Learning_via, which takes $O(n{p}^{3})$ time and $O(n{p}^{2})$ memory.
The sparse approximation by hartikainen2011sparse provides a reduced-rank approximation with $O(n{q}^{3})$ time and $O(n{q}^{2})$ memory for some $q\ll p$, although it comes at the expense of accuracy and yields no asymptotic savings if $q$ must increase linearly in $p$ to maintain a reasonable accuracy.
It also compares favourably with the Kronecker product decomposition (saatcci2012scalable, Ch. 5) approach, which requires $O({p}^{3}+{n}^{3})$ time and $O({p}^{2}+{n}^{2})$ memory complexity.

## 5 EXPERIMENTS

We test the OLMM in experiments on synthetic and real-world data sets. An implementation is available at https://github.com/wesselb/olmm.

#### Computational scaling.

We demonstrate that exact inference scales favourably in $m$ for the OLMM, whereas the LMM quickly becomes computationally infeasible as $m$ increases.
We use a highly optimised implementation of exact inference for the LMM, kindly made available by Invenia Labs^{2}^{2}
2
https://invenialabs.co.uk.
creftype 3 shows the runtime and the memory usage of the LMM and OLMM.
Observe that the LMM scales $O({m}^{3})$ in time and $O({m}^{2})$ in memory, whereas the OLMM scales $O(m)$ in both time and memory.
For $m=25$, the LMM takes nearly 10 minutes to compute the evidence, whereas the OLMM only requires a couple seconds.^{3}^{3}
3
Measurements were performed using a MacBook Pro with a 2.7 GHz Intel Core i7 processor and 16 GB RAM. Code was implemented in Julia 1.0 (bezanson2017julia) and memory and time were measured using the @allocated and the @elapsed macros, respectively, with the measurements averaged over 10 samples run serially. This means memory reported is the total memory allocated, not peak memory consumption.

#### Temperature extrapolation.

We consider a simple spatio-temporal temperature prediction problem over Europe.
Approximately 30 years worth of the ERA-Interim reanalysis temperature data^{4}^{4}
4
All output from CMIP5 and ERA-Interim models was regridded onto the latitude–longitude grid used for the IPSL-CM5A-LR model.
(dee2011era) is smoothed in time with a Hamming window of width 31 and sub-sampled once every 31 days to produce a data set comprising $13\times 19=247$ outputs and approximately $350$ months worth of data.
We train the OLMM and independent GPs (both models use Matérn–$5/2$ kernels with a periodic component) on the first 250 months of the data and test on the next 100 months.
The basis for the OLMM is given by the eigenvectors of the kernel matrix over the points in space (Matérn–$5/2$ with a different length scale for latitude and longitude).
creftype 1 shows the log-marginal likelihood (LML) of the training data, the posterior predictive log-probability (PPLP) and the root-mean-square error (RMSE) of the test data.
Although the RMSE is equal to that of the independent GPs—the data is highly periodic and the predictions are highly accurate for both models—the LML and PPLP are much better, which indicates that that the OLMM is able to capture meaningful structure between temperatures at different points in space.

LML | PPLP | RMSE | |
---|---|---|---|

IGP | $-0.247$ | $-0.219$ | $1.99$ |

OLMM | $1.22$ | $1.19$ | $1.99$ |

#### Rainforest tree point process modelling.

We consider a subset of the extensive rainforest data set credited to rainforest; Condit:2005; Hubbell:1999. The data features a 1000$$m $\times $ 500$$m rainforest dynamics plot in Barro Colorado Island, Panama. In the survey area, the locations of all *Trichilia tuberculata* (a tree species of the Mahogany family) have been measured (see creftype 7 in creftype J).
We tackle this spatial point pattern with a log-Gaussian Cox process model, which is an inhomogeneous Poisson process model for count data. The unknown intensity function $\lambda (x)$ is modelled with a Gaussian process such that $f(x)=\mathrm{log}\lambda (x)$. We model locally constant intensity in subregions by discretising the region into $np$ bins (Moller+Syversveen+Waagepetersen:1998). This leads to having a Poisson observation model in each bin (see creftype J). This model reaches posterior consistency in the limit of bin width going to zero (Tokdar+Ghosh:2007). The accuracy thus improves with tighter binning. We use a separable Matérn-$5/2$ GP prior over $f({x}_{1},{x}_{2})$, and discretise the area into a $n\times p=200\times 100$ grid with $np=20000$ grid bins, and treat the first dimension as time.

We use Elliptical Slice Sampling (Murray:2010:Elliptical_Slice_Sampling; murray2010slice) and Metropolis–Hastings (hastings1970monte) to perform inference in this problem, which require many evaluations of the log marginal likelihood and draws from the prior. The proposed methodology renders these operations much less computationally intensive than the general case.
We find approximately an hour is required to perform ${10}^{5}$ MCMC iterations on a MacBook Pro^{5}^{5}
5
2.5 GHz Intel Core i7 processor and 16 GB RAM
.
See creftype J for more details.

As previously discussed, the Kronecker product factorisation techniques (saatcci2012scalable, Ch. 5) are applicable in separable spatio-temporal scenarios. While our method in combination with state-space inference offers superior performance asymptotically, we demonstrate that we also obtain benefits on practical problems. creftype 4 shows for different number of bins $n$ (with $p=n/2$) the time for log marginal likelihood evaluation and sampling from the GP prior for our method and for the Kronecker product decomposition approach. Observe that the linear scaling of our methodology quickly renders it competitive. While this simple experiment does not provide a comprehensive assessment of the relative merits of these techniques—for example, only a single CPU core was used in this experiment—it clearly establishes the practical relevance of our technique in the separable setting.

#### Large-scale climate model calibration.

In this final experiment, we consider a large-scale climate modelling task over Europe. In particular, we use the OLMM to find relationships between 28 climate simulators4 (see taylor2012overview, for background) by letting $H={H}_{s}\otimes {H}_{r}$, where ${H}_{s}$ are the first ${m}_{s}=5$ eigenvectors of a $28\times 28$ covariance matrix ${K}_{s}$ between the simulators, and ${H}_{r}$ are the first ${m}_{r}=10$ eigenvectors of the kernel matrix over the points in space (Matérn–$5/2$ with a different length scale for latitude and longitude). This means the ${m}_{r}{m}_{s}=50$ latent processes are indexed by two indices ${i}_{s}$ and ${i}_{m}$, one corresponding to the eigenvector of the simulator covariance and one correponding to the eigenvector of the spatial covariance. In the above, the covariance matrix ${K}_{s}$ between the simulators is a parameter that we learn. The kernels for the latent processes are Matérn–$5/2$. We consider $n=10000$ time points for all 28 simulators, each with 247 outputs, giving a total of roughly 70 million data points. For the independent problems, we use the variational inducing point method by Titsias:2009:Variational_Learning, where the positions of the inducing points are initialised to one every two months. All hyperparameters and the locations of the inducing points are optimised until convergence using scipy’s implementation of the L-BFGS-B algorithm (Nocedal:2006:Numerical_Optimisation), which takes about 4 hours on a MacBook Pro5. The learned length scales were ${23.3}^{\circ}$ for latitude and ${43.6}^{\circ}$ for longitude.

creftype 4(a) shows the empirical correlations and the correlations learned by the OLMM (derived from ${K}_{s}$).
Observe that the empirical correlation is not informative, as it ignores all temporal structure.
The OLMM, however, does not, and is consequently able to identify a rich correlation structure.
In order to get insight into the learned correlations, we hierarchically cluster the models using farthest point linkage with $1-|\text{corr.}|$ as the distance.
creftype 4(b) shows the resulting dendrogram, in which models are grouped by their similarity.
For two models, the further to the right the branch connecting them is, the less similar the models are.
In creftypeplural 4(b)\crefpairconjunction4(a), HadGEM2 is clearly singled out:
it is one of the simplest models, not including several processes that can be found in others, such as ocean & sea-ice, terrestrial carbon cycle, stratosphere, and ocean biogeochemistry (bellouin2011hadgem2).
Furthermore, if we inspect the names of the simulators in the groups in creftype 4(b), we observe that often simulators of the same family are grouped together.
We observe some interesting cases.
(i) Although IPSL-CM5A-LR and IPSL-CM5A-MR are close, IPSL-CM5B-LR is grouped far apart.
It turns out that IPSL-CM5A-LR and IPSL-CM5A-MR are different-resolution versions of the same model, while IPSL-CM5B-LR employs a different atmospheric model.^{6}^{6}
6
https://portal.enes.org/models/earthsystem-models/ipsl/ipslesm
(ii) ACCESS1.0 and ACCESS1.3 have a similar name, but differ greatly in their implementation: ACCESS1.0 is the basic model, while ACCESS1.3 is much more aspirational, including experimental atmospheric physics models and a particular land surface model (bi2013access).
(iii) The distance between BCC_CSM1.1(m) and BCC_CSM1.1 can be explained by the more realistic surface air temperature predictions obtained by the former (wu2014overview), which is exactly the quantity we study.
Finally, creftype 4(c) shows predictions for four latent processes (${i}_{s}=1,2$ with ${i}_{r}=1,2$).
The first spatial eigenvector (${i}_{r}=1$) is constant in space; combined with the strongest eigenvector of ${K}_{s}$ (${i}_{s}=1$), we obtain a strong signal constituting seasonal temperature changes.

## 6 DISCUSSION AND CONCLUSION

In this paper, we introduced the Orthogonal Linear Mixing Model as a multi-output GP in the Linear Mixing Model class where exact inference and learning scales linearly in the desired number $m$ of degrees of freedom of the observations. The OLMM is simple to implement and trivially compatible with off-the-shelf single-output GP scaling techniques to scale to large numbers of observations $n$. We introduced the Mixing Model Hierarchy, which organises MOGPs according to their generative models, and placed the OLMM within this scheme.

The OLMM is exciting for several reasons. It provides a way to do efficient inference in separable spatio-temporal models that is trivially compatible with single-output scaling techniques. Exact inference is linear in number of outputs, and, with state space methods, it is also linear in number of time points. The OLMM even extends separable spatio-temporal models to non-separable models whilst retaining exact, computationally efficient inference. Furthermore, the OLMM is interpretable: the latent processes are independent, which means that they can be inspected separately to diagnose the model. Finally, as demonstrated in the real-world experiments in the paper, the OLMM is applicable to a wide range of modelling tasks. An interesting future direction for the method would be targeting sub-linear time complexity by parallelisation (see, e.g., Sarkka:2019:Temporal_Parallelization_of_Bayesian_Filters).

## Bibliography

Supplementary Material:

Scalable Exact Inference in Multi-Output Gaussian Processes

## Appendix A Characterisation of MOGPs

creftypecap 6 illustrates the graphical models of time-invariant/time-varying and instantaneous/convolutional multi-output GP models.

## Appendix B Maximum Likelihood Estimate

Throughout the appendix, we assume that the columns of $H$ are linearly independent and $\mathrm{\Lambda}>0$. As a consequence, ${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H$ is invertible and ${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H>0$.

###### Prop. 3.

Denote $p(y|x)=\mathcal{N}(y;Hx,\mathrm{\Lambda})$, and let $T$ be the $m\times p$ matrix ${({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}$. Then

$$Ty=\underset{x}{\mathrm{arg}\mathrm{max}}p(y|x)$$ |

and $Ty$ is an unbiased estimate of $x$: $\mathbb{E}[Ty|x]=x$.

###### Proof.

Note that

$$\mathrm{log}p(y|x)\simeq -\frac{1}{2}{(y-Hx)}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}(y-Hx)$$ |

Using invertibility of ${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H$, an elementary calculation then shows that the maximum with respect to $x$ is given by

$$x={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y=Ty.$$ |

To show that $Ty$ is an unbiased estimate of $x$, we use that $\mathbb{E}[y|x]=Hx$:

$$\mathbb{E}[Ty|x]=THx={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)x=x.$$ |

$\text{scaleobj}0.75\mathrm{\square}$

## Appendix C Proof of Prop. 1

To prove creftype 1, we need the property of $T$ that it “preserves the signal-to-noise ratio”. This is characterised in the following lemma.

###### Lem. 1.

$$\frac{\mathcal{N}(y;Hx,\mathrm{\Lambda})}{\mathcal{N}(y;0,\mathrm{\Lambda})}=\frac{\mathcal{N}(Ty;x,{({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1})}{\mathcal{N}(Ty;0,{({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1})}.$$ |

###### Proof.

It is simple to check the equality by direct verification. We show, however, how the equality may be derived. To begin with, we have

$${(y-Hx)}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}(y-Hx)={y}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y-2{x}^{\text{\U0001d5b3}}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y+{x}^{\text{\U0001d5b3}}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}Hx.$$ |

Here

$${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y=({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H){({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y=({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)Ty,$$ |

so

$${(y-Hx)}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}(y-Hx)={y}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y-2{x}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)Ty+{x}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)x.$$ |

Adding and subtracting $y{T}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)Ty$, we find

$${(y-Hx)}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}(y-Hx)={y}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y-y{T}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)Ty+{(x-Ty)}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)(x-Ty).$$ |

Hence, rearranging,

$${(y-Hx)}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}(y-Hx)-{y}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}y={(x-Ty)}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)(x-Ty)-y{T}^{\text{\U0001d5b3}}({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)Ty,$$ |

which yields the result. $\text{scaleobj}0.75\mathrm{\square}$

###### Proof (creftype 1).

By a general characterisation of minimal sufficient statistics (see, e.g., Th. 6.2.13 in Casella:2001:Statistical_Inference), $Ty$ is a minimal sufficient statistic for $x$ if and only if it is true that $p({y}_{1}|x)/p({y}_{2}|x)$ is constant as a function of $x$ if and only if $T{y}_{1}=T{y}_{2}$. Indeed, by creftype 1,

$$\mathrm{log}\frac{p({y}_{1}|x)}{p({y}_{2}|x)}={(T{y}_{1}-T{y}_{2})}^{\text{\U0001d5b3}}{({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}x+\text{const.}$$ |

which, by invertibility of ${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H$, does not depend on $x$ if and only if $T{y}_{1}=T{y}_{2}$. $\text{scaleobj}0.75\mathrm{\square}$

## Appendix D Characterisation of Diagonal Projected Noise

###### Lem. 2.

Let $U$ be a truncated orthogonal matrix and $D>0$ diagonal. Then every solution $M>0$ of ${U}^{\text{\U0001d5b3}}{M}^{-1}U={D}^{-1}$ is of the form $M=UD{U}^{\text{\U0001d5b3}}+VE{V}^{\text{\U0001d5b3}}$ with $E>0$ diagonal and $V$ such that $\left[UV\right]$ is orthogonal.

###### Proof.

We equivalently show that ${M}^{-1}$ for every solution $M>0$ of ${U}^{\text{\U0001d5b3}}MU={D}^{-1}$ is of the desired form. Denote

$$\mathcal{D}=\{M\ge 0:{U}^{\text{\U0001d5b3}}MU={D}^{-1}\},\mathcal{N}=\{M\ge 0:{U}^{\text{\U0001d5b3}}MU=0\}.$$ |

If ${M}_{1},{M}_{2}\in \mathcal{D}$, then ${U}^{\text{\U0001d5b3}}({M}_{1}-{M}_{2})U=0$ shows that ${M}_{1}-{M}_{2}\in \mathcal{N}$. Therefore, if $M\in \mathcal{D}$, then $\mathcal{D}=M+\mathcal{N}$. It is clear that $U{D}^{-1}{U}^{\text{\U0001d5b3}}\in \mathcal{D}$. Thus, $\mathcal{D}=U{D}^{-1}{U}^{\text{\U0001d5b3}}+\mathcal{N}.$ Let $M\in \mathcal{N}$, and let $M=VE{V}^{\text{\U0001d5b3}}$ be the spectral decomposition of $M$. Assume that $E>0$ by dropping the appropriate columns from $V$. Then ${U}^{\text{\U0001d5b3}}MU=0$ says that $U$ and $V$ have pairwise orthogonal columns, so the columns of $V$ are in ${U}^{\u27c2}$. We conclude that every $M\in \mathcal{D}$ is of the form $M=U{D}^{-1}{U}^{\text{\U0001d5b3}}+VE{V}^{\text{\U0001d5b3}}$ with $E>0$ diagonal and $V$ a truncated orthogonal matrix with columns in ${U}^{\u27c2}$. If $M>0$, then in particular $M$ has full rank, so it is clear that $\left[UV\right]$ must form an orthogonal matrix. Therefore, ${M}^{-1}=UD{U}^{\text{\U0001d5b3}}+V{E}^{-1}{V}^{\text{\U0001d5b3}}$, which is of the desired form. $\text{scaleobj}0.75\mathrm{\square}$

###### Prop. 4.

The projected noise is diagonal if and only if $H$ is of the form $H={\mathrm{\Lambda}}^{\frac{1}{2}}U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and $S>0$ diagonal. Suppose that this is the case, and fix such a $U$. Then $H$ is of the form $H=U{D}^{\frac{1}{2}}$ with $D>0$ diagonal if and only if every column of $U$ is an eigenvector of $\mathrm{\Lambda}$.

###### Proof.

The projected noise is diagonal if and only if ${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H=S$ for some $S>0$ diagonal. This condition is equivalent to

$${S}^{-\frac{1}{2}}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-\frac{1}{2}}{\mathrm{\Lambda}}^{-\frac{1}{2}}H{S}^{-\frac{1}{2}}=I,$$ |

which, in turn, holds if and only if ${\mathrm{\Lambda}}^{-\frac{1}{2}}H{S}^{-\frac{1}{2}}=U$ is a truncated orthogonal matrix. Thus, the projected noise is diagonal if and only if $H$ is of the form $H={\mathrm{\Lambda}}^{\frac{1}{2}}U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and $S>0$ diagonal.

For the second statement, note that the condition that every column of $U$ is an eigenvector of $\mathrm{\Lambda}$ is equivalent to the condition that $\mathrm{\Lambda}$ is of the form $\mathrm{\Lambda}=UD{U}^{\text{\U0001d5b3}}+VE{V}^{\text{\U0001d5b3}}$ with $D,E>0$ diagonal and $V$ such that $\left[UV\right]$ forms an orthogonal matrix. Suppose that $\mathrm{\Lambda}$ is of this form. Then

$${\mathrm{\Lambda}}^{\frac{1}{2}}=U{D}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}+V{E}^{\frac{1}{2}}{V}^{\text{\U0001d5b3}}$$ |

implies that

$$H=U{D}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}U{S}^{\frac{1}{2}}+V{E}^{\frac{1}{2}}{V}^{\text{\U0001d5b3}}U{S}^{\frac{1}{2}}=U{D}^{\frac{1}{2}}{S}^{\frac{1}{2}},$$ |

which is of the desired form. Conversely, suppose that $H=U{D}^{\frac{1}{2}}$ with $D>0$ diagonal. Then

$${S}^{-\frac{1}{2}}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H{S}^{-\frac{1}{2}}=I={S}^{-\frac{1}{2}}{D}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}U{D}^{\frac{1}{2}}{S}^{-\frac{1}{2}}\u27f9{U}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}U=S{D}^{-1}.$$ |

Therefore, by creftype 2,

$$\mathrm{\Lambda}=U{D}^{-1}S{U}^{\text{\U0001d5b3}}+VE{V}^{\text{\U0001d5b3}}$$ |

with $E>0$ diagonal and $V$ such that $\left[UV\right]$ is orthogonal, which again is of the desired form. $\text{scaleobj}0.75\mathrm{\square}$

## Appendix E Form of Projection and Projected Noise

###### Prop. 5.

Let $H$ and $\mathrm{\Lambda}$ be of the form for the OLMM (creftype 2). Then

$$T={S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}\mathit{\hspace{1em}}\text{and}\mathit{\hspace{1em}}{\mathrm{\Lambda}}_{T}={\sigma}^{2}{S}^{-1}+D.$$ |

###### Proof.

To begin with, using the matrix inversion lemma once and once in reverse,

${H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H$ | $={S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}{({\sigma}^{2}I+U{S}^{\frac{1}{2}}D{S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}})}^{-1}U{S}^{\frac{1}{2}}$ | ||

$={S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}({\sigma}^{-2}I-{\sigma}^{-4}U{S}^{\frac{1}{2}}{({D}^{-1}+{S}^{\frac{1}{2}}{\sigma}^{-2}{U}^{\text{\U0001d5b3}}U{S}^{\frac{1}{2}})}^{-1}{S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}})U{S}^{\frac{1}{2}}$ | |||

$={\sigma}^{-2}S-{\sigma}^{-4}S{({D}^{-1}+{\sigma}^{-2}S)}^{-1}S$ | |||

$={({\sigma}^{2}{S}^{-1}+D)}^{-1},$ |

so

$${\mathrm{\Lambda}}_{T}=T\mathrm{\Lambda}{T}^{\text{\U0001d5b3}}={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}={\sigma}^{2}{S}^{-1}+D.$$ |

Furthermore, by a similar calculation,

$T$ | $={({H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}H)}^{-1}{H}^{\text{\U0001d5b3}}{\mathrm{\Lambda}}^{-1}$ | ||

$=({\sigma}^{2}{S}^{-1}+D){S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}}({\sigma}^{-2}I-{\sigma}^{-4}U{S}^{\frac{1}{2}}{({D}^{-1}+{\sigma}^{-2}S)}^{-1}{S}^{\frac{1}{2}}{U}^{\text{\U0001d5b3}})$ | |||

$=({\sigma}^{2}{S}^{-1}+D)({\sigma}^{-2}S-{\sigma}^{-4}S{({D}^{-1}+{\sigma}^{-2}S)}^{-1}S){S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}$ | |||

$=({\sigma}^{2}{S}^{-1}+D){({\sigma}^{2}{S}^{-1}+D)}^{-1}{S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}$ | |||

$={S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}.$ |

$\text{scaleobj}0.75\mathrm{\square}$

## Appendix F Decomposition of the Mean Squared Error

###### Prop. 6.

Let $H=U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and ${S}^{\frac{1}{2}}>0$ diagonal. Then

$$\underset{\text{MSE}}{\underset{\ufe47}{{\parallel y-Hx\parallel}^{2}}}=\underset{\begin{array}{c}\text{variance not}\\ \text{captured by basis}\end{array}}{\underset{\ufe47}{{\parallel {P}_{{U}^{\u27c2}}y\parallel}^{2}}}+\sum _{i=1}^{m}\stackrel{\begin{array}{c}\text{variance of}\\ i\text{th}\text{latent process}\end{array}}{\stackrel{\ufe48}{{S}_{ii}}}\underset{\begin{array}{c}\text{MSE of}\\ i\text{th}\text{latent process}\end{array}}{\underset{\ufe47}{{({(Ty)}_{i}-{x}_{i})}^{2}}}$$ |

where $T={S}^{-\frac{1}{2}}{U}^{\text{\U0001d5b3}}$ and ${P}_{{U}^{\u27c2}}$ is the orthogonal projection onto the orthogonal complement of $U$.

###### Proof.

By expanding and using orthogonality of $U$,

${\parallel y-Hx\parallel}^{2}$ | $={\parallel y\parallel}^{2}-2\u27e8y,U{S}^{\frac{1}{2}}x\u27e9+{\parallel U{S}^{\frac{1}{2}}x\parallel}^{2}$ | ||

$={\parallel y\parallel}^{2}-{\parallel {U}^{\text{\U0001d5b3}}y\parallel}^{2}+{\parallel {U}^{\text{\U0001d5b3}}y\parallel}^{2}-2\u27e8{U}^{\text{\U0001d5b3}}y,{S}^{\frac{1}{2}}x\u27e9+{\parallel {S}^{\frac{1}{2}}x\parallel}^{2}$ | |||

$=\u27e8y,(I-U{U}^{\text{\U0001d5b3}})y\u27e9+{\displaystyle \sum _{i=1}^{m}}({\u27e8{u}_{i},y\u27e9}^{2}-2\u27e8{u}_{i},y\u27e9{S}_{ii}^{\frac{1}{2}}{x}_{i}+{({S}_{ii}^{\frac{1}{2}}{x}_{i})}^{2})$ | |||

$=\u27e8y,(I-U{U}^{\text{\U0001d5b3}})y\u27e9+{\displaystyle \sum _{i=1}^{m}}{S}_{ii}({S}_{ii}^{-1}{\u27e8{u}_{i},y\u27e9}^{2}-2{S}_{ii}^{-\frac{1}{2}}\u27e8{u}_{i},y\u27e9{x}_{i}+{x}_{i}^{2})$ | |||

$=\u27e8y,(I-U{U}^{\text{\U0001d5b3}})y\u27e9+{\displaystyle \sum _{i=1}^{m}}{S}_{ii}{({(Ty)}_{i}-{x}_{i})}^{2},$ |

where ${u}_{i}$ is the $i$^{th} column of $U$.
Note that ${P}_{U}=U{U}^{\text{\U0001d5b3}}$ is the orthogonal projection onto $U$, so $I-U{U}^{\text{\U0001d5b3}}={P}_{{U}^{\u27c2}}$ is the orthogonal projection onto the orthogonal complement of $U$.
Therefore,

$\u27e8y,(I-U{U}^{\text{\U0001d5b3}})y\u27e9=\u27e8y,{P}_{{U}^{\u27c2}}y\u27e9=\u27e8y,{P}_{{U}^{\u27c2}}^{2}y\u27e9=\u27e8{P}_{{U}^{\u27c2}}^{\text{\U0001d5b3}}y,{P}_{{U}^{\u27c2}}y\u27e9=\u27e8{P}_{{U}^{\u27c2}}y,{P}_{{U}^{\u27c2}}y\u27e9={\parallel {P}_{{U}^{\u27c2}}y\parallel}^{2}.$ |

$\text{scaleobj}0.75\mathrm{\square}$

## Appendix G Kullback–Leibler Divergence Between an LMM and OLMM

###### Prop. 7.

Consider two LMMs with equal $\mathrm{\Lambda}={\sigma}^{2}{I}_{p}$, equal $K(t,{t}^{\prime})$, but different bases $H$ and $\widehat{H}$. Let ${t}_{1},\mathrm{\dots},{t}_{n}\in \mathcal{T}$ and denote ${x}_{i}=x({t}_{i})$ and ${y}_{i}=y({t}_{i})$. It then holds that

$${\mathrm{D}}_{\text{KL}}(p({y}_{1:n},{x}_{1:n})\parallel \widehat{p}({y}_{1:n},{x}_{1:n}))={\mathrm{D}}_{\text{KL}}(\widehat{p}({y}_{1:n},{x}_{1:n})\parallel p({y}_{1:n},{x}_{1:n}))=n\frac{1}{2{\sigma}^{2}}{\parallel H-\widehat{H}\parallel}_{F}^{2}$$ |

and

$$\underset{\widehat{H}:\text{OLMM}}{inf}{\mathrm{D}}_{\text{KL}}(p({y}_{1:n},{x}_{1:n})\parallel \widehat{p}({y}_{1:n},{x}_{1:n}))\le n\frac{\mathbb{E}[{\parallel f(t)\parallel}^{2}]}{{\sigma}^{2}}\underset{i}{max}(1-{V}_{ii})\le n\frac{\mathbb{E}[{\parallel f(t)\parallel}^{2}]}{2{\sigma}^{2}}{\parallel I-V\parallel}_{F}^{2}$$ |

where $\widehat{H}$ ranges over matrices of the form $U{S}^{\frac{1}{2}}$ with $U$ a truncated orthogonal matrix and ${S}^{\frac{1}{2}}>0$ diagonal, $V$ is the orthogonal matrix collecting the right singular vectors of $H$, and $\mathbb{E}[{\parallel f(t)\parallel}^{2}]$ denotes the variance of the observations under the first LMM before adding noise.

###### Proof.

Start out by expanding the Kullback–Leibler divergence and noting that $p({x}_{1:n})=\widehat{p}({x}_{1:n})$:

${\mathrm{D}}_{\text{KL}}(p({y}_{1:n},{x}_{1:n})\parallel \widehat{p}({y}_{1:n},{x}_{1:n}))$ | $=-{\mathbb{E}}_{p({y}_{1:n},{x}_{1:n})}\mathrm{log}{\displaystyle \frac{\widehat{p}({y}_{1:n}|{x}_{1:n})\overline{)\widehat{p}({x}_{1:n})}}{p({y}_{1:n}|{x}_{1:n})\overline{)p({x}_{1:n})}}}$ | ||

$=-{\displaystyle \sum _{i=1}^{n}}{\mathbb{E}}_{p({y}_{i},{x}_{i})}[\mathrm{log}\widehat{p}({y}_{i}|{x}_{i})-\mathrm{log}p({y}_{i}|{x}_{i})]$ | |||

$=-{\displaystyle \sum _{i=1}^{n}}{\mathbb{E}}_{p({y}_{i},{x}_{i})}[\mathrm{log}\mathcal{N}({y}_{i};\widehat{H}{x}_{i},{\sigma}^{2}{I}_{p})-\mathrm{log}\mathcal{N}({y}_{i};H{x}_{i},{\sigma}^{2}{I}_{p})].$ |

Here

$${\mathbb{E}}_{p({y}_{i},{x}_{i})}[\mathrm{log}\mathcal{N}({y}_{i};\widehat{H}{x}_{i},{\sigma}^{2}{I}_{p})]=-\frac{p}{2}\mathrm{log}2\pi {\sigma}^{2}-\frac{1}{2{\sigma}^{2}}{\mathbb{E}}_{p({y}_{i},{x}_{i})}[{\parallel {y}_{i}-\widehat{H}{x}_{i}\parallel}^{2}]$$ |

where

${\mathbb{E}}_{p({y}_{i},{x}_{i})}[{\parallel {y}_{i}-\widehat{H}{x}_{i}\parallel}^{2}]$ | $={\mathbb{E}}_{p({y}_{i},{x}_{i})}\mathrm{tr}[{y}_{i}{y}_{i}^{\text{\U0001d5b3}}-2{y}_{i}{x}_{i}^{\text{\U0001d5b3}}{\widehat{H}}^{\text{\U0001d5b3}}+{x}_{i}{x}_{i}^{\text{\U0001d5b3}}\widehat{H}{\widehat{H}}^{\text{\U0001d5b3}}]$ | ||

$=\mathrm{tr}[H{H}^{\text{\U0001d5b3}}+{\sigma}^{2}I-2H{\widehat{H}}^{\text{\U0001d5b3}}+\widehat{H}{\widehat{H}}^{\text{\U0001d5b3}}]$ | |||

$=p{\sigma}^{2}+\mathrm{tr}[(H-\widehat{H}){(H-\widehat{H})}^{\text{\U0001d5b3}}]$ | |||

$=p{\sigma}^{2}+{\parallel H-\widehat{H}\parallel}_{F}^{2}.$ |

Therefore,

$${\mathrm{D}}_{\text{KL}}(p({y}_{1:n},{x}_{1:n})\parallel \widehat{p}({y}_{1:n},{x}_{1:n}))=n\frac{1}{2{\sigma}^{2}}{\parallel H-\widehat{H}\parallel}_{F}^{2}.$$ |

Let $H=US{V}^{\text{\U0001d5b3}}$ be the SVD of $H$ where $U$ is a truncated orthogonal matrix with the same shape as $H$, $S>0$ is a square diagonal matrix, and $V$ is a square orthogonal matrix. Note that ${U}^{\text{\U0001d5b3}}U=I$, but $U{U}^{\text{\U0001d5b3}}\ne I$. Then, choosing $\widehat{H}=US$,

$$\underset{\widehat{H}:\text{OLMM}}{inf}{\mathrm{D}}_{\text{KL}}(p({y}_{1:n},{x}_{1:n})\parallel \widehat{p}({y}_{1:n},{x}_{1:n}))\le n\frac{1}{2{\sigma}^{2}}{\parallel U(S{V}^{\text{\U0001d5b3}}-S)\parallel}_{F}^{2}=n\frac{1}{2{\sigma}^{2}}{\parallel S{V}^{\text{\U0001d5b3}}-S\parallel}_{F}^{2}$$ |

since ${\parallel UA\parallel}_{F}^{2}=\mathrm{tr}[{A}^{\text{\U0001d5b3}}{U}^{\text{\U0001d5b3}}UA]=\mathrm{tr}[{A}^{\text{\U0001d5b3}}A]={\parallel A\parallel}_{F}^{2}$. We now further simplify:

$${\parallel S{V}^{\text{\U0001d5b3}}-S\parallel}_{F}^{2}=\mathrm{tr}[(S{V}^{\text{\U0001d5b3}}-S){(S{V}^{\text{\U0001d5b3}}-S)}^{\text{\U0001d5b3}}]=\mathrm{tr}[S{V}^{\text{\U0001d5b3}}VS-S{V}^{\text{\U0001d5b3}}S-SVS+SS]=2\mathrm{tr}[SS-SVS].$$ |

Hence, by definition of the trace and the fact that $S$ is diagonal,

$${\parallel S{V}^{\text{\U0001d5b3}}-S\parallel}_{F}^{2}=2\sum _{i=1}^{m}{S}_{ii}^{2}(1-{V}_{ii})\le 2\left(\sum _{i=1}^{m}{S}_{ii}^{2}\right)\underset{i}{max}(1-{V}_{ii})=2\mathbb{E}[{\parallel f\parallel}^{2}]\underset{i}{max}(1-{V}_{ii}),$$ |

since

$$\mathbb{E}[{\parallel f(t)\parallel}^{2}]=\mathbb{E}\mathrm{tr}[f(t){f}^{\text{\U0001d5b3}}(t)]=\mathrm{tr}[H{H}^{\text{\U0001d5b3}}]=\mathrm{tr}[{S}^{2}].$$ |

Therefore,

$${\parallel S{V}^{\text{\U0001d5b3}}-S\parallel}_{F}^{2}\le 2\mathbb{E}[{\parallel f\parallel}^{2}]\underset{i}{max}(1-{V}_{ii})\le 2\mathbb{E}[{\parallel f\parallel}^{2}]\sum _{i=1}^{m}(1-{V}_{ii})=\mathbb{E}[{\parallel f\parallel}^{2}]{\parallel I-V\parallel}_{F}^{2},$$ |

where the equality follows from a similar calculation:

$${\parallel I-V\parallel}_{F}^{2}=\mathrm{tr}[I-{V}^{\text{\U0001d5b3}}-V+{V}^{\text{\U0001d5b3}}V]=2\mathrm{tr}[I-V].$$ |

$\text{scaleobj}0.75\mathrm{\square}$

## Appendix H Missing Data

Missing data is troublesome for the OLMM, because it is not possible to take away a subset of the rows of $H$ and retain orthogonality of the columns. Suppose that data are missing for a fixed subset of outputs. Collect that subset of outputs into ${y}_{1}$, and let ${y}_{2}$ be the remainder of the outputs. The issue can then be resolved by considering a multi-headed version of the OLMM:

${y}_{1}|{H}_{1},x$ | $\sim \mathcal{G}\mathcal{P}({H}_{1}x(t),\delta [t-{t}^{\prime}]{\mathrm{\Lambda}}_{1}),$ | ||

${y}_{2}|{H}_{2},x$ | $\sim \mathcal{G}\mathcal{P}({H}_{2}x(t),\delta [t-{t}^{\prime}]{\mathrm{\Lambda}}_{2}).$ |

For $k$ heads, if ${y}_{I}={({y}_{i})}_{i\in I}$ for a subset of heads $I\subseteq \{1,\mathrm{\dots},k\}$ are observed, then

$${T}_{I}{y}_{I}={\mathrm{\Lambda}}_{{T}_{I}}{\sum}_{i\in I}{\mathrm{\Lambda}}_{{T}_{i}}^{-1}{T}_{i}{y}_{i},{\mathrm{\Lambda}}_{{T}_{I}}^{-1}={\sum}_{i\in I}{\mathrm{\Lambda}}_{{T}_{i}}^{-1}.$$ |

While this new model resolves the missing data problem, the assumption that each collection of outputs (each head) is a linear transformation of the same collection of latent processes will be overly restrictive for many contexts: it is not typically the case that our prior beliefs about the relationships between outputs have anything to do with the availability of data pertaining to them.

In the general case, where data can be missing arbitrarily, let ${Y}_{\text{o}}$ be the observed data. Complement ${Y}_{\text{o}}$ with missing data ${Y}_{\text{m}}$ such that $Y={Y}_{\text{o}}\cup {Y}_{\text{m}}$ is complete. Then a simple solution is to use variational inference. Assume a Gaussian approximate posterior distribution $q({Y}_{\text{m}})$ over ${Y}_{\text{m}}$, and maximise the evidence lower bound (ELBO) $\mathcal{L}$ using gradient-based optimisation:

$$\mathrm{log}p({Y}_{\text{o}})\ge {\mathbb{E}}_{q({Y}_{\text{m}})}[\mathrm{log}p(Y)]+H[q({Y}_{\text{m}})]=\mathcal{L}[q({Y}_{\text{m}})],$$ |

where the expectation can be approximated using the reparametrisation trick (Kingma:2013:Auto-Encoding_VB), $\mathrm{log}p(Y)$ can be computed efficiently because $Y$ is complete, and $H[q({Y}_{\text{m}})]$ denotes the entropy of $q({Y}_{\text{m}})$. This approach provides a tractable solution when the number of missing data are not too numerous.

## Appendix I Heterogeneous Observation Noise

Although the specification of the observation noise $\mathrm{\Lambda}={\sigma}^{2}{I}_{p}+HD{H}^{\text{\U0001d5b3}}$ in the OLMM does not allow for heterogeneous observation noise, it is possible to set $\mathrm{\Lambda}=\mathrm{diag}({\sigma}_{1}^{2},\mathrm{\dots},{\sigma}_{p}^{2})$ and use creftype 4 to include $\mathrm{\Lambda}$ in the parametrisation of $H$: $H={\mathrm{\Lambda}}^{\frac{1}{2}}U{S}^{\frac{1}{2}}$. This parametrisation can be interpreted in two ways: {enumerate*}

The model has a whitening transform built in. In the projection $T$, the (noise in the) data will first by whitened by ${\mathrm{\Lambda}}^{-\frac{1}{2}}$. Hence, this parametrisation can be used as a more principled substitute for the usual data normalisation where the outputs are divided by their empirical standard deviation prior to feeding them to the model.

The basis is orthogonal with respect to a weighted Euclidean inner product:
${\u27e8{h}_{i},{h}_{j}\u27e9}_{{\mathrm{\Lambda}}^{-1}}={\sum}_{k=1}^{p}{h}_{ik}{h}_{jk}/{\sigma}_{k}^{2}=0$
for $i\ne j$.
Intuitively, this means that the basis is orthogonal in the usual sense after stretching the $i$^{th} dimension by ${\sigma}_{i}^{-1}$.
Although this construction provides additional flexiblity, it does require that $D=0$ to avoid a circular dependency between $\mathrm{\Lambda}$ and $H$.

## Appendix J Point Process Experimental Details and Additional Results

We use a total of $np=200\times 100$ bins in the model, where each bin is $5\mathrm{m}$ $\times $ $5\mathrm{m}$. The likelihood for this experiment is given by

$$p(Y\mid f)\approx \prod _{i=1}^{n}\prod _{j=1}^{p}\mathrm{Poisson}({Y}_{ij}\mid \mathrm{exp}(f({r}_{ij})))$$ |

where ${r}_{ij}$ is the coordinate of the $ij$^{th} bin, ${Y}_{ij}$ is the number of data points in the $ij$^{th} bin, and $Y$ is the $n\times p$ matrix of counts.

We perform ${10}^{5}$ iterations of Gibbs sampling, alternating between Elliptical Slice Sampling for the Gaussian process given its hyperparameters, and Metropolis Hastings with proposal distribution $\mathcal{N}(\theta ,{0.05}^{2})$ for the log of the hyperparameters given the latent GP-distributed function. The kernel is a product of two Matérn-$5/2$ kernels, one over the vertical dimension and the other over the horizontal one, each with their own length scale. A single process variance is utilised, and a nugget term is added. The $\mathrm{log}$ of each of the four hyperparameters was given a $\mathcal{N}(0,1)$ prior. creftypecap 8 shows the log joint of the entire state after each iteration, while creftype 9 shows the progress of each hyperparameter per iteration.

The times in creftype 4 were obtained via BenchmarkTools.jl (BenchmarkTools.jl-2016). The implementation of the standard Kronecker product decomposition trick makes use of Kronecker.jl, and Julia’s (bezanson2017julia) standard linear algebra libraries, which make use of OpenBLAS and LAPACK to efficiently perform matrix-matrix products and compute eigendecompositions. The implementation of the state-space GP additionally makes use of StaticArrays.jl for efficient stack-allocated matrices, and Stheno.jl for GP-related functionality. Timing experiments were conducted on a single CPU core to make the implementations comparable.

The state-space implementation of the GP makes use of the trick introduced by Solin:2018:Infinite-Horizon_Gaussian_Processes when computing the log marginal likelihood. However, the infinite-horizon trick is exploited only once the filtering covariance has converged, which is determined by the point at which the Frobenius norm of the difference between the filtering covariance at the $t$^{th} and $(t-1)$^{th} iterations drops below ${10}^{-12}$. In practice, we find that this produces log-marginal likelihoods that are essentially exact, as expected.