### Abstract

We introduce a framework for continual learning based on Bayesian inferenceover the function space rather than the parameters of a deep neural network.This method, referred to as functional regularisation for continual learning,avoids forgetting a previous task by constructing and memorising an approximateposterior belief over the underlying task-specific function. To achieve this werely on a Gaussian process obtained by treating the weights of the last layerof a neural network as random and Gaussian distributed. Then, the trainingalgorithm sequentially encounters tasks and constructs posterior beliefs overthe task-specific functions by using inducing point sparse Gaussian processmethods. At each step a new task is first learnt and then a summary isconstructed consisting of (i) inducing inputs and (ii) a posterior distributionover the function values at these inputs. This summary then regulariseslearning of future tasks, through Kullback-Leibler regularisation terms, sothat catastrophic forgetting of earlier tasks is avoided. We demonstrate ouralgorithm in classification datasets, such as Split-MNIST, Permuted-MNIST andOmniglot.

### Quick Read (beta)

# Functional Regularisation for Continual Learning

###### Abstract

We introduce a framework for continual learning
based on Bayesian inference over the function space rather than the parameters of a deep neural network.
This method, referred to as functional regularisation for continual learning, avoids forgetting
a previous task by constructing and memorising an approximate posterior belief over the underlying task-specific function.
To achieve this we rely on a Gaussian process obtained by treating the weights of the last layer of a neural network as random and Gaussian distributed.
Then, the training algorithm sequentially encounters tasks and constructs
posterior beliefs over the task-specific functions by using *inducing point sparse Gaussian process* methods.
At each step a new task is first learnt and then a summary is constructed consisting of (i) inducing inputs
and (ii) a posterior distribution over the function values at these inputs. This summary then regularises learning of future tasks, through Kullback-Leibler regularisation terms,
so that catastrophic forgetting of earlier tasks is avoided. We demonstrate our algorithm
in classification datasets, such as Split-MNIST, Permuted-MNIST and Omniglot.

Functional Regularisation for Continual Learning

Michalis K. Titsias DeepMind Jonathan Schwarz DeepMind Alexander G. de G. Matthews DeepMind Razvan Pascanu DeepMind Yee Whye Teh DeepMind

noticebox[b]Preprint. Under review.\[email protected]

## 1 Introduction

Recent years have seen a resurgence of interest in continual learning, also called life-long learning, that refers to systems that learn in an online fashion from data associated with possibly an ever-increasing number of tasks (Ring, 1994; Robins, 1995; Schmidhuber, 2013; Goodfellow et al., 2013). A continual learning system must adapt to perform well on all earlier tasks without requiring extensive re-training on previous data. Two main challenges for continual learning are (i) avoiding catastrophic forgetting, i.e. remembering how to solve earlier tasks, and (ii) scalability over the number of tasks. Other possible desiderata may include forward and backward transfer, i.e. learning later tasks faster and retrospectively improving on earlier tasks. It is worth noting that continual learning is rather different from what is referred to as meta-learning or multi-task learning. In these latter methods all tasks are simultaneously learnt, e.g. training is carried out by subsampling minibatches of tasks (see e.g. Finn et al., 2017; Santoro et al., 2016), which means that there is no risk of forgetting.

Similarly to many recent works on continual learning (Kirkpatrick et al., 2017; Nguyen et al., 2017; Rusu et al., 2016; Li and Hoiem, 2017; Farquhar and Gal, 2018), we focus on the idealised situation where a sequence of supervised learning tasks, with known task boundaries, are presented to a continual learning system which is a deep neural network. One main challenge is then to efficiently regularise learning so that the deep neural network avoids catastrophic forgetting, i.e. it avoids configurations of the network parameters that lead to poor predictive performance on earlier tasks. Among the different techniques, we consider two different families of approaches with regards to managing catastrophic forgetting. On one hand, are methods which constrain or regularise the parameters of the network to not deviate significantly from those learnt on previous tasks. This includes methods that frame continual learning as sequential approximate Bayesian inference, including EWC (Kirkpatrick et al., 2017) and VCL (Nguyen et al., 2017). Such approaches suffer from brittleness due to representation drift. That is, as parameters adapt to new tasks the values that other parameters are constrained/regularised towards become obsolete (see Section 2.5 for further discussion on this). On the other hand, we have rehearsal/replay buffer methods, which use a memory store of past observations to remember previous tasks (Robins, 1995; Robins and McCallum, 1998; Lopez-Paz et al., 2017; Rebuffi et al., 2017). These do not suffer from brittleness, but they do not express uncertainty about the unknown functions (they only store input-output pairs), and can be less scalable if tasks are complex and require many observations to represent properly. Optimising for the best observations to store in the replay buffer is also an unsolved problem.

In this paper, we develop a new approach to continual learning which addresses the shortcomings of both categories. It is based on approximate Bayesian inference, but on the space of functions instead of neural network parameters, so does not suffer from the aforementioned brittleness. This approach avoids forgetting an earlier task by memorising an approximate posterior belief over the underlying task-specific function. To implement this, we consider Gaussian processes (GPs) (Rasmussen and Williams, 2005), and make use of inducing point sparse GP methods (Csato and Opper, 2002; Lawrence et al., 2002; Seeger et al., 2003; Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013; Bui et al., 2017b), which summarise posterior distributions over functions using small numbers of inducing points. These inducing points and their posterior distributions are used to regularise continual learning of future tasks, through Kullback-Leibler regularisation terms within a variational inference framework, so that catastrophic forgetting of earlier tasks is avoided. As such our approach bears similarities to replay based approaches, but has two important advantages. First the approximate posterior distribution at the inducing points captures the uncertainty of the unknown function and it summarizes the full posterior given all observations. Second, inducing points can be optimized using specialised criteria from the GP literature, achieving better performance than a random selection of observations.

To enable our functional regularisation approach to deal with high-dimensional and complex datasets, we use a linear kernel with features parameterised by neural networks (Wilson et al., 2016). Such GPs can be understood as Bayesian neural networks, where only the weights of the last layer are treated in a Bayesian fashion, while those in earlier layers are optimised (Riquelme et al., 2018). This view allows for a more computationally efficient and accurate training procedure to be carried out in weight space, before the approximation is translated into function space where the inducing points are constructed and then used for regularising learning of future tasks. We demonstrate our method in classification and show that it leads to state-of-the-art performance on Permuted-MNIST, Split-MNIST and Omniglot.

## 2 Functional Regularisation for Continual Learning

We consider supervised learning of multiple tasks, with known task boundaries, that are processed sequentially one at a time. At each step we receive a set of examples $({X}_{i},{\mathbf{y}}_{i})$ where ${X}_{i}={\{{x}_{i,j}\}}_{j=1}^{{N}_{i}}$ are input vectors and ${\mathbf{y}}_{i}={\{{y}_{i,j}\}}_{j=1}^{{N}_{i}}$ are output targets so that each ${y}_{i,j}$ is assigned to the input ${x}_{i,j}\in {\mathbb{R}}^{D}$. We assume the most extreme case (and challenging in terms of avoiding forgetting) where each dataset $({X}_{i},{\mathbf{y}}_{i})$ introduces a new task, while less extreme cases can be treated similarly. We wish to sequentially train a shared model or representation from all tasks so that catastrophic forgetting is avoided, i.e. when the model is trained on the $i$-th task it should still provide accurate predictions for all tasks $$ seen in the past. As a model we consider a deep neural network with its final hidden layer providing the feature vector $\varphi (x;\theta )\in {\mathbb{R}}^{K}$ where $x$ is the input vector and $\theta $ are the model parameters. This representation is shared across tasks and $\theta $ is a task-shared parameter. To solve a specific task $i$ we additionally construct an output layer

$${f}_{i}(x;{w}_{i})\equiv {f}_{i}(x;{w}_{i},\theta )={w}_{i}^{\top}\varphi (x;\theta ),$$ | (1) |

where for simplicity we assume that ${f}_{i}(x;{w}_{i})$ is a scalar function and ${w}_{i}$ is the vector of task-specific weights. Dealing with vector-valued functions is straightforward and is discussed in the Supplement. By placing a Gaussian prior on the output weights, ${w}_{i}\sim \mathcal{N}({w}_{i}|0,{\sigma}_{w}^{2}I)$, we obtain a distribution over functions. While each task has its own independent/private weight vector ${w}_{i}$ the whole distribution refers to the full infinite set of tasks that can be tackled by the same feature vector $\varphi (x;\theta )$. We can marginalise out ${w}_{i}$ and obtain the equivalent function space view of the model, where each task-specific function is an independent draw from a GP (Rasmussen and Williams, 2005), i.e.

$${f}_{i}(x)\sim \mathcal{G}\mathcal{P}(0,k(x,{x}^{\prime})),k(x,{x}^{\prime})={\sigma}_{w}^{2}\varphi {(x;\theta )}^{\top}\varphi ({x}^{\prime};\theta ),$$ |

where the kernel function is defined by the dot product of the neural network feature vector. By assuming for now that all possible tasks are simultaneously present similarly to multi-task GPs (Bonilla et al., 2008; Álvarez et al., 2012), the joint distribution over function values and output data for all tasks is written as ${\prod}_{i}p({\mathbf{y}}_{i}|{\mathbf{f}}_{i})p({\mathbf{f}}_{i})={\prod}_{i}p({\mathbf{y}}_{i}|{\mathbf{f}}_{i})\mathcal{N}({\mathbf{f}}_{i}|\mathrm{\U0001d7ce},{K}_{{X}_{i}}),$ where the vector ${\mathbf{f}}_{i}$ stores all function values for the input dataset ${X}_{i}$, i.e. ${f}_{i,j}=f({x}_{i,j}),j=1,\mathrm{\dots},{N}_{i}$. Also the kernel matrix ${K}_{{X}_{i}}$ is obtained by evaluating the kernel function on ${X}_{i}$, i.e. each element ${[{K}_{{X}_{i}}]}_{j,k}={\sigma}_{w}^{2}\varphi {({x}_{i,j};\theta )}^{\top}\varphi ({x}_{i,k};\theta )$ where ${x}_{i,j},{x}_{i,k}\in {X}_{i}$. The form of each likelihood function $p({\mathbf{y}}_{i}|{\mathbf{f}}_{i})$ depends on the task, for example if the $i$-th task involves binary classification then $p({\mathbf{y}}_{i}|{\mathbf{f}}_{i})={\prod}_{j=1}^{{N}_{i}}p({y}_{i,j}|{f}_{i,j})={\prod}_{j=1}^{{N}_{i}}\frac{1}{1+{e}^{-{y}_{i,j}{f}_{i,j}}}$ where ${y}_{i,j}\in \{-1,1\}$ indicates the binary class label.

Inference in this model requires estimating each posterior distribution $p({\mathbf{f}}_{i}|{\mathbf{y}}_{i},{X}_{i})$, which can be approximated by a multivariate Gaussian $\mathcal{N}({\mathbf{f}}_{i}|{\mu}_{i},{\mathrm{\Sigma}}_{i})$. Given this Gaussian we can express our posterior belief over any function value ${f}_{i,*}$ at some test input ${x}_{i,*}$ using the posterior GP (Rasmussen and Williams, 2005),

$$p({f}_{i,*}|{X}_{i},{y}_{i})=\int {p}_{\theta}({f}_{i,*}|{\mathbf{f}}_{i})\mathcal{N}({\mathbf{f}}_{i}|{\mu}_{i},{\mathrm{\Sigma}}_{i})\mathit{d}{\mathbf{f}}_{i}.$$ |

Given that the tasks arrive one at a time, the above suggests that one way to avoid forgetting the $i$-th task is to memorise the corresponding posterior belief $\mathcal{N}({\mathbf{f}}_{i}|{\mu}_{i},{\mathrm{\Sigma}}_{i})$. While this can regularise continual learning of subsequent tasks (similarly to the more general variational framework in the next section), it can be prohibitively expensive since the non-parametric nature of the model means that for each $\mathcal{N}({\mathbf{f}}_{i}|{\mu}_{i},{\mathrm{\Sigma}}_{i})$ we need to store $O({N}_{i}^{2})$ parameters and additionally we need to keep in memory the full set of input vectors ${X}_{i}$.

Therefore, in order to reduce the time and memory requirements we would like to apply data distillation and approximate each full posterior by applying sparse GP methods. As shown next, by applying variational sparse GP inference (Titsias, 2009) in a sequential fashion we obtain a new algorithm for function space regularisation in continual learning.

### 2.1 Learning the first task

Suppose we encounter the first task with data $({X}_{1},{\mathbf{y}}_{1})$. We introduce a small set ${Z}_{1}={\{{z}_{1,j}\}}_{j=1}^{{M}_{1}}$ of inducing inputs where each ${z}_{1,j}$ lives in the same space as each training input ${x}_{1,j}$. The inducing set ${Z}_{1}$ can be a subset of ${X}_{1}$ or it can contain pseudo inputs (Snelson and Ghahramani, 2006), i.e. points lying between the training inputs. For simplicity next we consider ${Z}_{1}$ as pseudo points, although in practice for continual learning it can be more suitable to select them from the training inputs (see Section 2.4). By evaluating the function output at each ${z}_{1,j}$ we obtain a vector of auxiliary function values ${\mathbf{u}}_{1}={\{{u}_{1,j}\}}_{j=1}^{{M}_{1}}$, where each ${u}_{1,j}=f({z}_{1,j})$. Hence, we obtain the joint distribution

$$p({\mathbf{y}}_{1},{\mathbf{f}}_{1},{\mathbf{u}}_{1})=p({\mathbf{y}}_{1}|{\mathbf{f}}_{1}){p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1}){p}_{\theta}({\mathbf{u}}_{1}).$$ | (2) |

The exact posterior distribution ${p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1},{\mathbf{y}}_{1}){p}_{\theta}({\mathbf{u}}_{1}|{\mathbf{y}}_{1})$ is approximated by a distribution of the form, $q({\mathbf{f}}_{1},{\mathbf{u}}_{1})={p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1})q({\mathbf{u}}_{1}),$ where $q({\mathbf{u}}_{i})$ is a variational distribution and ${p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1})$ is the GP prior conditional, ${p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1})=\mathcal{N}({\mathbf{f}}_{1}|{K}_{{X}_{1}{Z}_{1}}{K}_{{Z}_{1}}^{-1}{\mathbf{u}}_{1},{K}_{{X}_{1}}-{K}_{{X}_{1}{Z}_{1}}{K}_{{Z}_{1}}^{-1}{K}_{{Z}_{1}{X}_{1}}).$ Here, ${K}_{{X}_{1}{Z}_{1}}$ is the cross kernel matrix between the sets ${X}_{1}$ and ${Z}_{1}$, ${K}_{{Z}_{1}{X}_{1}}={K}_{{X}_{1}{Z}_{1}}^{\top}$ and ${K}_{{Z}_{1}}$ is the kernel matrix on ${Z}_{1}$. The method learns $(q({\mathbf{u}}_{1}),{Z}_{1})$ by minimising the KL divergence $\text{KL}({p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1})q({\mathbf{u}}_{1})||{p}_{\theta}({\mathbf{f}}_{1}|{\mathbf{u}}_{1},{\mathbf{y}}_{1}){p}_{\theta}({\mathbf{u}}_{1}|{\mathbf{y}}_{1}))$. The ELBO is also maximised over the neural network feature vector parameters $\theta $ that determine the kernel matrices. This ELBO is generally written in the form (Hensman et al., 2013; Lloyd et al., 2015; Dezfouli and Bonilla, 2015; Hensman et al., 2015; Sheth et al., 2015),

$$\mathcal{F}(\theta ,q({\mathbf{u}}_{1}))=\sum _{j=1}^{{N}_{1}}{\mathbb{E}}_{q({f}_{1,j})}[\mathrm{log}p({y}_{1,j}|{f}_{1,j})]-\text{KL}(q({\mathbf{u}}_{1})||{p}_{\theta}({\mathbf{u}}_{1})),$$ | (3) |

where $q({f}_{1,j})=\int p({f}_{1,j}|{\mathbf{u}}_{1})q({\mathbf{u}}_{1})\mathit{d}{\mathbf{u}}_{1}$ is an univariate Gaussian distribution with analytic mean and variance that depend on $(\theta ,{Z}_{1},q({\mathbf{u}}_{1}),{x}_{1,j})$. Each expectation ${\mathbb{E}}_{q({f}_{1,j})}[\mathrm{log}p({y}_{1,j}|{f}_{1,j})]$ is a one-dimensional integral and can be estimated by Gaussian quadrature. The variational distribution $q({\mathbf{u}}_{1})$ is chosen to be a Gaussian, parametrised as $q({\mathbf{u}}_{1})=\mathcal{N}({\mathbf{u}}_{1}|{\mu}_{{u}_{1}},{L}_{{u}_{1}}{L}_{{u}_{1}}^{\top})$, where ${L}_{{u}_{1}}$ is a square root matrix such as a lower triangular Cholesky factor. Then, based on the above we can jointly apply stochastic variational inference (Hensman et al., 2013) to maximise the ELBO over $(\theta ,{\mu}_{{u}_{1}},{L}_{{u}_{1}})$ and optionally over the inducing inputs ${Z}_{1}$.

### 2.2 Learning the second and subsequent tasks

The functional regularisation framework for continual learning arises from the variational sparse GP inference method as we encounter the second and subsequent tasks.

Once we have learned the first task we throw away the dataset $({X}_{1},{\mathbf{y}}_{1})$ and we keep in memory only a task summary consisting of the inducing inputs ${Z}_{1}$ and the variational Gaussian distribution $q({\mathbf{u}}_{1})$ (i.e. its parameters ${\mu}_{{u}_{1}}$ and ${L}_{{u}_{1}}$). Note also that $\theta $ (that determines the neural network feature vector $\varphi (x;\theta )$) has a current value obtained by learning the first task. When the dataset $({X}_{2},{\mathbf{y}}_{2})$ for the second task arrives, a suitable ELBO to continue learning $\theta $ and also estimate the second task summary $({Z}_{2},q({\mathbf{u}}_{2}))$ is

$\sum _{j=1}^{{N}_{1}}}{\mathbb{E}}_{q({f}_{1,j})}[\mathrm{log}p({y}_{1j}|{f}_{1,j})]+{\displaystyle \sum _{j=1}^{{N}_{2}}}{\mathbb{E}}_{q({f}_{2,j})}[\mathrm{log}p({y}_{2,j}|{f}_{2,j})]-{\displaystyle \sum _{i=1,2}}\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i})),$ |

which is just the sum of the corresponding ELBOs for the two tasks. We need to approximate this ideal objective by making use of the fixed summary $({Z}_{1},q({\mathbf{u}}_{1}))$ that we have kept in memory for the first task. By considering ${Z}_{1}$ as our “pseudo-replay buffer” inputs with pseudo-outputs ${\stackrel{~}{\mathbf{y}}}_{1}$, the above can be approximated by

$\frac{{N}_{1}}{{M}_{1}}}{\displaystyle \sum _{j=1}^{{M}_{1}}}{\mathbb{E}}_{q({u}_{1,j})}[\mathrm{log}p({\stackrel{~}{y}}_{1,j}|{u}_{1,j})]+{\displaystyle \sum _{j=1}^{{N}_{2}}}{\mathbb{E}}_{q({f}_{2,j})}[\mathrm{log}p({y}_{2,j}|{f}_{2,j})-{\displaystyle \sum _{i=1,2}}\text{KL}(q({\mathbf{u}}_{i})|{p}_{\theta}({\mathbf{u}}_{i})),$ |

where each $q({u}_{1,j})$ is a univariate marginal of $q({\mathbf{u}}_{1})$. However, since $q({\mathbf{u}}_{1})$ is kept fixed the whole expected log-likelihood term $\frac{{N}_{1}}{{M}_{1}}{\sum}_{j=1}^{{M}_{1}}{E}_{q({u}_{1,j})}[\mathrm{log}p({\stackrel{~}{y}}_{1,j}|{u}_{1,j})]$ is just a constant that does not depend on the parameters $\theta $ any more. Thus, the objective function when learning the second task reduces to maximising,

$$\mathcal{F}(\theta ,q({\mathbf{u}}_{2}))=\sum _{j=1}^{{N}_{2}}{\mathbb{E}}_{q({f}_{2,j})}[\mathrm{log}p({y}_{2,j}|{f}_{2,j})]-\sum _{i=1,2}\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i})).$$ |

The only term associated with the first task is $\text{KL}(q({\mathbf{u}}_{1})||{p}_{\theta}({\mathbf{u}}_{1}))$. While $q({\mathbf{u}}_{1})$ is fixed (i.e. its parameters are constant), the GP prior ${p}_{\theta}({\mathbf{u}}_{1})=\mathcal{N}({\mathbf{u}}_{1}|\mathrm{\U0001d7ce},{K}_{{Z}_{1}})$ is still a function of the feature vector parameters $\theta $, since ${K}_{{Z}_{1}}$ depends on $\theta $. Thus, this KL term regularises the parameters $\theta $ so that, while learning the second task, the feature vector still needs to explain the posterior distribution over the function values ${\mathbf{u}}_{1}$ at input locations ${Z}_{1}$. Notice that $-\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i})$ is further simplified as $\int q({\mathbf{u}}_{1})\mathrm{log}{p}_{\theta}({\mathbf{u}}_{1})\mathit{d}{\mathbf{u}}_{1}+const$, which shows that the regularisation is such that ${p}_{\theta}({\mathbf{u}}_{1})$ needs to be consistent with all infinite draws from $q({\mathbf{u}}_{1})$ in a moment-matching or maximum likelihood sense.

Similarly for the subsequent tasks we can conclude that for any new task $k$ the objective will be

$$\mathcal{F}(\theta ,q({\mathbf{u}}_{k}))=\underset{\text{objective for the current task}}{\underset{\u23df}{\sum _{j=1}^{{N}_{k}}{\mathbb{E}}_{q({f}_{k,j})}\mathrm{log}p({y}_{k,j}|{f}_{k,j})-\text{KL}(q({\mathbf{u}}_{k})||{p}_{\theta}({\mathbf{u}}_{k}))}}-\underset{\text{regularisation from previous tasks}}{\underset{\u23df}{\sum _{i=1}^{k-1}\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i}))}}.$$ | (4) |

Thus, functional regularisation when learning a new task is achieved through the sum of the KL divergences ${\sum}_{i=1}^{k-1}\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i}))$ of all previous tasks, where each $q({\mathbf{u}}_{i})$ is the fixed posterior distribution which encodes our previously obtained knowledge about task $$. Furthermore, in order to keep the optimisation scalable over tasks, we can form unbiased approximations of this latter sum by sub-sampling the KL terms, i.e. by performing minibatch-based stochastic approximation over the regularisation terms associated with these previous tasks.

### 2.3 Accurate posterior distributions by weight space inference for the current task

While the above framework arises by applying sparse GP inference, it can still be limited. When the budget of inducing variables is small, the sparse GP approximation may lead to inaccurate estimates of the posterior belief $q({\mathbf{u}}_{k})$, which will degrade the quality of regularisation when learning new tasks. This is worrisome as in continual learning it is desirable to keep the size of the inducing set as small as possible.

One way to deal with this issue is to use a much larger set of inducing points for the current task or even maximise the full GP ELBO ${\sum}_{j=1}^{{N}_{k}}{\mathbb{E}}_{q({f}_{k,j})}\mathrm{log}p({y}_{k,j}|{f}_{k,j})-\text{KL}(q({\mathbf{f}}_{k})||{p}_{\theta}({\mathbf{f}}_{k}))$ (i.e. by using as many inducing points as training examples), and once training is completed to distill the small subset ${Z}_{k},{\mathbf{u}}_{k}\subset {X}_{k},{\mathbf{f}}_{k}$, and the corresponding marginal distribution $q({\mathbf{u}}_{k})$ from $q({\mathbf{f}}_{k})$, for subsequently regularising continual learning. However, carrying out this maximisation in the function space can be extremely slow since it scales as $O({N}_{k}^{3})$ per optimisation step. To our rescue, there is an alternative computationally efficient way to achieve this, by relying on the linear form of the kernel, that performs inference over the current task in the weight space. While this inference does not immediately provides us with the summary (induced points) for building the functional regularisation term, we can distill this term afterwards as discussed next. This allows us to address the continual learning aspect of the problem. Given that the current $k$-th task is represented in the weight space as ${f}_{k}(x;{w}_{k})={w}_{k}^{\top}\varphi (x;\theta ),{w}_{k}\sim \mathcal{N}(0,{\sigma}_{w}^{2}I)$, we introduce a full Gaussian variational approximation $q({w}_{k})=\mathcal{N}({w}_{k}|{\mu}_{{w}_{k}},{\mathrm{\Sigma}}_{{w}_{k}})$, where ${\mu}_{k}$ is a $K$ dimensional mean vector and ${\mathrm{\Sigma}}_{{w}_{k}}$ is the corresponding $K\times K$ full covariance matrix parametrised as ${\mathrm{\Sigma}}_{{w}_{k}}={L}_{{w}_{k}}{L}_{{w}_{k}}^{\top}$. Learning the $k$-th task is carried out by maximising the objective in (4), with the only difference that the ELBO for the current task is now in the weight space. The objective becomes

$\mathcal{F}(\theta ,q({w}_{k}))={\displaystyle \sum _{j=1}^{{N}_{k}}}{\mathbb{E}}_{q({w}_{k})}[\mathrm{log}p({y}_{k,j}|{w}_{k}^{\top}\varphi ({x}_{k,j};\theta ))]-\text{KL}(q({w}_{k})||p({w}_{k}))-{\displaystyle \sum _{i=1}^{k-1}}\text{KL}(q({\mathbf{u}}_{i})||{p}_{\theta}({\mathbf{u}}_{i})),$ |

where ${\mathbb{E}}_{q({w}_{k})}[\mathrm{log}p({y}_{k,j}|{w}_{k}^{\top}\varphi ({x}_{k,j},\theta ))]$ can be re-written as one-dimensional integral and estimated using Gaussian quadrature. Once the variational distribution $q({w}_{k})$ has been optimised, together with the constantly updated feature parameters $\theta $, we can rely on this solution to select inducing points ${Z}_{k}$. See Section 2.4 for more detail. We also compute the posterior distribution over their function values ${\mathbf{u}}_{k}$ according to $q({\mathbf{u}}_{k})=\mathcal{N}({\mathbf{u}}_{k}|{\mu}_{{u}_{k}},{L}_{{u}_{k}}{L}_{{u}_{k}}^{\top})$, where

$${\mu}_{{u}_{k}}={\mathrm{\Phi}}_{{Z}_{k}}{\mu}_{{w}_{k}},{L}_{{u}_{k}}={\mathrm{\Phi}}_{{Z}_{k}}{L}_{{w}_{k}}$$ | (5) |

and the matrix ${\mathrm{\Phi}}_{{Z}_{k}}$ stores as rows the feature vectors evaluated at ${Z}_{k}$. Subsequently, we store the $k$-th task summary $({Z}_{k},{\mu}_{{u}_{k}},{L}_{{u}_{k}})$ and use it for regularising continual learning of subsequent tasks, by always maximising the objective $\mathcal{F}(\theta ,q({w}_{k}))$. Pseudo-code of the procedure is given in Algorithm 1.

[tb] \[email protected]@algorithmic \STATEInput: Feature vector $\varphi (x;\theta )$ and initial value of $\theta $. \FOR$k=1,2,\mathrm{\dots},$ \STATEReceive data $({X}_{k},{\mathbf{y}}_{k})$ of the task $k$ and construct suitable output weights ${w}_{k}$. \STATEInitialise variational parameters ${\mu}_{{w}_{k}}$ (e.g. around zero) and ${L}_{{w}_{k}}$ (e.g. to identity matrix $I$). \STATEOptimise $(\theta ,{\mu}_{{w}_{k}},{L}_{{w}_{k}})$ by maximising $\mathcal{F}(\theta ,q({w}_{k}))$. \STATESelect inducing inputs ${Z}_{k}$: either as random subset of ${X}_{k}$ or by using some other criterion (see Section 2.4). \STATECompute the parameters of $q({\mathbf{u}}_{k})$ from (5) and store them. \ENDFOR

### 2.4 Selection of the inducing points

After having seen the $k$-th task, and given that it is straightforward to compute the posterior distribution $q({\mathbf{u}}_{k})$ for any set of function values, the only issue remaining is to select the inducing inputs ${Z}_{k}$. A simple choice that works well in practice is to select ${Z}_{k}$ as a random subset of the training inputs ${X}_{k}$. The question is whether we can do better with some more structured criterion.

In our experiments we will investigate an unsupervised criterion that only depends on the training inputs and the neural network feature vector. Such a criterion quantifies how well we reconstruct the full kernel matrix ${K}_{{X}_{k}}$ from the inducing set ${Z}_{k}$ and it can be expressed as the trace of the covariance matrix of the prior GP conditional $p({\mathbf{f}}_{k}|{\mathbf{u}}_{k})$, i.e.

$$\mathcal{T}({Z}_{k})=\text{tr}\left({K}_{{X}_{k}}-{K}_{{X}_{k}{Z}_{K}}{K}_{{Z}_{k}}^{-1}{K}_{{Z}_{k}{X}_{k}}\right)=\sum _{j=1}^{{N}_{k}}\left[k({x}_{k,j},{x}_{k,j})-{k}_{{Z}_{K},{x}_{k,j}}^{\top}{K}_{{Z}_{k}}^{-1}{k}_{{Z}_{k},{x}_{k,j}}\right],$$ | (6) |

where each $k({x}_{k,j},{x}_{k,j})-{k}_{{Z}_{K},{x}_{k,j}}^{\top}{K}_{{Z}_{k}}^{-1}{k}_{{Z}_{k},{x}_{k,j}}\ge 0$ is a reconstruction error for an individual training point. The above quantity appears in the ELBO in (Titsias, 2009), is also used in (Csato and Opper, 2002) and it has deep connections with the Nystróm approximation (Williams and Seeger, 2001) and principal component analysis. The criterion in (6) promotes finding inducing points ${Z}_{k}$ that are repulsive with one another and are spread evenly in the input space under a similarity/distance implied by the dot product of the feature vector $\varphi (x;{\theta}_{k})$ (with ${\theta}_{k}$ being the current parameter values after having trained with the $k$-th task). An illustration of this repulsive property is given in Section 3.

To select ${Z}_{k}$, we minimise $\mathcal{T}({Z}_{k})$ by applying discrete optimisation where we select points from the training inputs ${X}_{k}$. The specific optimisation strategy we use in the experiments is to start with an initial random set ${Z}_{k}\subset {X}_{k}$ and then further refine it by doing local moves where random points in ${Z}_{k}$ are proposed to be changed with random points of ${X}_{k}$.

### 2.5 Prediction and differences with weight space methods

Prediction at any $i$-th task that has been encountered in the past follows the standard sparse GP predictive equations. Given a test data point ${x}_{i,*}$ the predictive density of its output value ${y}_{i,*}$ takes the form $p({y}_{i,*})=\int p({y}_{i,*}|{f}_{i,*}){p}_{\theta}({f}_{i,*}|{\mathbf{u}}_{i})q({\mathbf{u}}_{i})\mathit{d}{\mathbf{u}}_{i}\mathit{d}{f}_{i,*}=\int p({y}_{i,*}|{f}_{i,*}){q}_{\theta}({f}_{i,*})\mathit{d}{f}_{i,*}$ where ${q}_{\theta}({f}_{i,*})=\mathcal{N}({f}_{i,*}|{\mu}_{i,*},{\sigma}_{i,*}^{2})$ is an univariate posterior Gaussian with mean and variance,

$${\mu}_{i,*}={\mu}_{{u}_{i}}^{\top}{K}_{{Z}_{i}}^{-1}{\mathrm{\Phi}}_{{Z}_{1}}\varphi ({x}_{i,*};\theta ),{\sigma}_{i,*}^{2}=k({x}_{i,*},{x}_{i,*})+{k}_{{Z}_{i}{x}_{i,*}}^{\top}{K}_{{Z}_{i}}^{-1}\left[{L}_{{u}_{i}}{L}_{{u}_{i}}^{\top}-{K}_{{Z}_{i}}\right]{K}_{{Z}_{i}}^{-1}{k}_{{Z}_{i}{x}_{i,*}},$$ |

where in ${\mu}_{i,*}$ we have explicitly written the cross kernel vector ${k}_{{Z}_{k}{x}_{i,*}}={\mathrm{\Phi}}_{{Z}_{i}}\varphi ({x}_{i,*};\theta )$ to reveal a crucial property of this prediction. Specifically, given that ${f}_{i,*}={w}_{i}^{\top}\varphi ({x}_{i,*};\theta )$ the vector ${\mu}_{{u}_{i}}^{\top}{K}_{{Z}_{i}}^{-1}{\mathrm{\Phi}}_{{Z}_{i}}$ acts as a mean prediction for the task-specific parameter row vector ${w}_{i}^{\top}$. As we learn future tasks and the parameter $\theta $ changes, this mean parameter prediction automatically adapts (recall that ${K}_{{Z}_{i}}={\mathrm{\Phi}}_{{Z}_{i}}{\mathrm{\Phi}}_{{Z}_{i}}^{\top}$ and ${\mathrm{\Phi}}_{{Z}_{i}}$ vary with $\theta $ and only ${\mu}_{{u}_{i}}$ is constant) in order to counteract changes in the feature vector $\varphi ({x}_{i,*};\theta )$, so that the overall prediction for the function value, i.e. ${\mu}_{i,*}=\mathbb{E}[{f}_{i,*}]$, does not become obsolete. For instance, the prediction of the function values at the inducing inputs ${Z}_{i}$ always remains constant to our fixed/stored mean belief ${\mu}_{{u}_{i}}$ since by setting ${x}_{i,*}={Z}_{i}$ the formula gives ${\mu}_{{u}_{i}}^{\top}{K}_{{Z}_{i}}^{-1}{\mathrm{\Phi}}_{{Z}_{i}}{\mathrm{\Phi}}_{{Z}_{i}}^{\top}={\mu}_{{u}_{i}}^{\top}$. Similar observations can be made for the predictive variances.

The above analysis reveals an important difference between continual learning in function space and in weight space, where in the latter framework task-specific parameters such as ${w}_{i}$ might not automatically adapt to counteract changes in the feature vector $\varphi (x;\theta )$, as we learn new tasks and $\theta $ changes. For instance, if as a summary of the task, instead of the function space posterior distribution $q({\mathbf{u}}_{i})$, we had kept in memory the weight space posterior $q({w}_{i})$ (see Section (2.3)), then the corresponding mean prediction on the function value, $\mathbb{E}[{f}_{i,*}]={\mu}_{{w}_{i}}^{\top}\varphi ({x}_{i,*};\theta )$, can become obsolete as $\varphi ({x}_{i,*};\theta )$ changes and ${\mu}_{{w}_{i}}$ remains constant.

## 3 Experiments

We consider experiments in three continual learning classification problems: Split-MNIST, Permuted-MNIST and sequential Omniglot. We compare two variants of our method, referred to as Functional Regularised Continual Learning (frcl); see Algorithm 1. The first variant (frcl-rnd) selects randomly the inducing set for each task from the training set, while the second (frcl-tr) further optimises the inducing set by mininising the trace objective in (6). We compare our method with other approaches in the literature, by quoting published results, and with an additional baseline (baseline) corresponding to a simple replay-buffer method for continual learning (explained in the Supplement). For all implemented methods, i.e. frcl-rnd, frcl-tr and baseline, we do not place any extra regulariser on the shared feature vector parameter $\theta $ (such as an ${\mathrm{\ell}}_{2}$ penalty or batch-normalisation etc). Given that Permuted-MNIST and Omniglot are multi-class classification problems, where each $k$-th task involves classification over ${C}_{k}$ classes, we need to generalise the model and the variational method to deal with multiple GP functions per task. It is straightforward to do so as we detail in the Supplement (Section 1). The frcl methods have been implemented using GPflow (Matthews et al., 2017).

Algorithm | Permuted-MNIST | Split-MNIST |
---|---|---|

DLP (Eskin et al., 2004) | 82% | 61.2% |

EWC (Kirkpatrick et al., 2017) | 84% | 63.1% |

SI (Zenke et al., 2017) | 86% | 98.9% |

VCL (Nguyen et al., 2017) | 90% | 97.0% |

+ random Coreset | 93% (200 points/task) | |

+ k-center Coreset | 93% (200 points/task) | |

+ unspecified Coreset selection | 98.4% (40 points/task) | |

Methods evaluated in this paper | ||

baseline | 48.6% $\pm $ 1.73 (10 points/task) | |

frcl-rnd | 48.2% $\pm $ 4.0 (10 points/task) | |

frcl-tr | 61.7% $\pm $ 1.8 (10 points/task) | |

baseline | 83.1% $\pm $ 0.3 (200 points/task) | 95.8% $\pm $ 1.1 (40 points/task) |

frcl-rnd | 94.2% $\pm $ 0.1 (200 points/task) | 96.8% $\pm $ 1.0 (40 points/task) |

frcl-tr | 94.3% $\pm $ 0.1 (200 points/task) | 97.4% $\pm $ 0.6 (40 points/task) |

Split-MNIST and Permuted-MNIST. Among a large number of diverse experiments in continual learning publications, two versions of the popular MNIST dataset have recently started to become increasingly popular benchmarks: Permuted- and Split-MNIST. In Permuted-MNIST (e.g. Goodfellow et al., 2013; Kirkpatrick et al., 2017; Zenke et al., 2017), each task is a variant of the initial $10$-class MNIST classification task where all input pixels have undergone a fixed (random) permutation. The Split-MNIST experiment was introduced by Zenke et al. (2017), who propose a sequential version as follows: Five binary classification tasks are constructed from the classes in the following order: 0/1, 2/3, 4/5, 6/7, and 8/9.

In both cases, we obtain the feature vector $\varphi (x;\theta )$ using a fully connected MLP network with two hidden layers and ReLU activations. To ensure consistency with the results reported in Nguyen et al. (2017), we use 256 units each for Split-MNIST and 100 for Permuted-MNIST. Results for both benchmarks are reported in Table 1, which shows the averaged final test set accuracies (after all tasks) together with standard deviations for methods implemented by us.

The results in Table 1 show that both frcl methods perform well, outperforming all other methods on Permuted-MNIST when $200$ inducing points per task are used. We also obtain competitive results on Split-MNIST. The improvement on the baseline shows that approximate posterior distributions over functions values can lead to more effective regularisation for continual learning compared to just having a replay buffer of input-output pairs.

In order to better assess the effect of inducing points optimisation, we also report results for Permuted-MNIST when the number of inducing points is fixed to the number of classes, i.e. $10$ points. We observe a significantly larger improvement upon frcl-rnd and baseline (from about 49% to 62%), which can be attributed to sub-optimal random selection for the baselines. Indeed, Figure 1 shows that optimising the inducing points remarkably results in a consistent allocation of one example per class, despite using an unsupervised criterion. Also the optimised inducing points are spread across the input space as shown by the tsne visualisation. Note that for MNIST, because of the lack of diversity within a label class, a single example can provide quite a bit of information. Thus, sampling a small random subset of examples per label is sufficient to cover the distribution of that particular label relatively well. However the inducing mechanism might become crucial for richer distributions, where one can not cover the space efficiently by uniformly sampling points.

Algorithm | Test Accuracy | ||
---|---|---|---|

Single model per Task | 88.34 | ||

Progressive Nets | 86.50 $\pm $ 0.9 | ||

Finetuning | 26.20 $\pm $ 4.6 | ||

Learning Without Forgetting | 62.06 $\pm $ 2.0 | ||

Elastic Weight Consolidation (EWC) | 67.32 $\pm $ 4.7 | ||

Online EWC | 69.99 $\pm $ 3.2 | ||

Progress & Compress | 70.32 $\pm $ 3.3 | ||

Methods evaluated in this paper | 1 point/char | 2 points/char | 3 points/char |

baseline | 42.73 $\pm $ 1.2 | 57.17 $\pm $ 1.2 | 65.32 $\pm $ 1.1 |

frcl-rnd | 69.74 $\pm $ 1.1 | 80.32 $\pm $ 2.5 | 81.42 $\pm $ 1.2 |

frcl-tr | 72.02 $\pm $ 1.3 | 81.47 $\pm $ 1.6 | 81.01 $\pm $ 1.1 |

Omniglot. To assess our method under more challenging conditions, we consider the sequential Omniglot task proposed for continual learning in Schwarz et al. (2018). Omniglot Lake et al. (2011) is a dataset of 50 alphabets, each with a varying number of classes/characters which we treat of as a sequence of distinct classification problems. Note that the varying number of characters have the interesting characteristic of making the straightforward application of single-head networks difficult. In most cases the use of task-specific parameters is used as the most evident alternative. This has however been criticised in recent discussion around continual learning Farquhar and Gal (2018). This is something our method does not require.

As suggested in Schwarz et al. (2018), we apply data-augmentation and use the same train/validation/test split. Following the same experimental setup proposed, we used an identical convolutional network to construct the feature vector $\varphi (x;\theta )$. Results reported are obtained by training on the union of training and validation set after choosing any hyper-parameters based on the validation set. Note that all experiments were run with data processing and neural network construction code kindly provided by the authors in Schwarz et al. (2018), ensuring directly comparable results. For replay methods, we reserve either 1, 2, or 3 examples per class/character for the inducing set (or replay buffer for the baseline), resulting in a modest overall buffer size of $\approx 966,1450\text{or}\mathrm{\hspace{0.25em}2900}$ examples respectively.

Several observations can be made: (i) We obtain state-of-the-art results in comparison to various competitive baselines. (ii) Despite its simplicity, the simple baseline strategy performs competitively. In line with other results, we conclude that rehearsal-based ideas continue to provide a strong baseline. Nevertheless, other methods may be more suitable when storing data is not feasible. (iii) We observe only a modest improvement when inducing points are optimised.

## 4 Discussion

We introduced a functional regularisation approach for supervised continual learning that combines inducing point GP inference with deep neural networks. This method constructs task-specific posterior beliefs or summaries consisting of inducing inputs and distributions over function values that capture the uncertainly about the unknown function associated with the task. Subsequently, the task-specific summaries allow us to regularise continual learning and avoid catastrophic forgetting.

Regarding related work on online learning using GPs, notice that previous algorithms Bui et al. (2017a); Csato and Opper (2002); Csato (2002) learn in an online fashion a single task where data from this task arrive sequentially. In contrast in this paper we developed a continual learning method for dealing with a sequence of different tasks.

A direction for future research is to enforce a fixed memory buffer (or a buffer that grows sub-linearly w.r.t. the number of tasks), in which case one would need to compress the summaries of all previous seen tasks into a single summary. Finally, while in this paper we applied the method to supervised classification tasks with known task boundaries, it will be interesting to extend it to deal with unknown task boundaries and to consider also applications in other domains such as reinforcement learning.

## Acknowledgements

We would like to thank Hyunjik Kim and Raia Hadsell for useful discussions and feedback.

## References

- Álvarez et al. (2012) Álvarez, M. A., Rosasco, L., and Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Found. Trends Mach. Learn., 4(3):195–266.
- Bonilla et al. (2008) Bonilla, E. V., Chai, K. M., and Williams, C. (2008). Multi-task gaussian process prediction. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc.
- Bui et al. (2017a) Bui, T. D., Nguyen, C., and Turner, R. E. (2017a). Streaming Sparse Gaussian Process Approximations. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 3299–3307. Curran Associates, Inc.
- Bui et al. (2017b) Bui, T. D., Yan, J., and Turner, R. E. (2017b). A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation. Journal of Machine Learning Research, 18(104):1–72.
- Csato (2002) Csato (2002). Gaussian processes: iterative sparse approximations. PhD thesis, Aston University.
- Csato and Opper (2002) Csato, L. and Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 14:641–668.
- Dezfouli and Bonilla (2015) Dezfouli, A. and Bonilla, E. V. (2015). Scalable Inference for Gaussian Process Models with Black-Box Likelihoods. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1414–1422.
- Eskin et al. (2004) Eskin, E., Smola, A. J., and Vishwanathan, S. (2004). Laplace propagation. In Advances in neural information processing systems, pages 441–448.
- Farquhar and Gal (2018) Farquhar, S. and Gal, Y. (2018). Towards Robust Evaluations of Continual Learning. arXiv preprint arXiv:1805.09733.
- Finn et al. (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135, International Convention Centre, Sydney, Australia. PMLR.
- Goodfellow et al. (2013) Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
- Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for Big Data. In Conference on Uncertainty in Artificial Intellegence, pages 282–290. auai.org.
- Hensman et al. (2015) Hensman, J., Matthews, A. G. d. G., and Ghahramani, Z. (2015). Scalable Variational Gaussian Process Classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.
- Kirkpatrick et al. (2017) Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, page 201611835.
- Lake et al. (2011) Lake, B., Salakhutdinov, R., Gross, J., and Tenenbaum, J. (2011). One shot learning of simple visual concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 33.
- Lawrence et al. (2002) Lawrence, N. D., Seeger, M., and Herbrich, R. (2002). Fast sparse Gaussian process methods: the informative vector machine. In Neural Information Processing Systems, 13. MIT Press.
- Li and Hoiem (2017) Li, Z. and Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Lloyd et al. (2015) Lloyd, C., Gunter, T., Osborne, M. A., and Roberts, S. J. (2015). Variational inference for gaussian process modulated poisson processes. In Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pages 1814–1822.
- Lopez-Paz et al. (2017) Lopez-Paz, D. et al. (2017). Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6470–6479.
- Matthews et al. (2017) Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., Le‘on-Villagr‘a, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6.
- Nguyen et al. (2017) Nguyen, C. V., Li, Y., Bui, T. D., and Turner, R. E. (2017). Variational continual learning. arXiv preprint arXiv:1710.10628.
- Quiñonero-Candela and Rasmussen (2005) Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.
- Rasmussen and Williams (2005) Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Rebuffi et al. (2017) Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. (2017). icarl: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542. IEEE.
- Ring (1994) Ring, M. B. (1994). Continual learning in reinforcement environments. PhD thesis, University of Texas at Austin Austin, Texas 78712.
- Riquelme et al. (2018) Riquelme, C., Tucker, G., and Snoek, J. (2018). Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. CoRR, abs/1802.09127.
- Robins (1995) Robins, A. (1995). Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146.
- Robins and McCallum (1998) Robins, A. and McCallum, S. (1998). Catastrophic forgetting and the pseudorehearsal solution in hopfield-type networks. Connection Science, 10(2):121–135.
- Rusu et al. (2016) Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). Progressive neural networks. arXiv preprint arXiv:1606.04671.
- Santoro et al. (2016) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA. PMLR.
- Schmidhuber (2013) Schmidhuber, J. (2013). Powerplay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem. Frontiers in psychology, 4:313.
- Schwarz et al. (2018) Schwarz, J., Luketina, J., Czarnecki, W. M., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., and Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. arXiv preprint arXiv:1805.06370.
- Seeger et al. (2003) Seeger, M., Williams, C. K. I., and Lawrence, N. D. (2003). Fast forward selection to speed up sparse Gaussian process regression. In Ninth International Workshop on Artificial Intelligence. MIT Press.
- Sheth et al. (2015) Sheth, R., Wang, Y., and Khardon, R. (2015). Sparse variational inference for generalized gp models. In Bach, F. and Blei, D., editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1302–1311, Lille, France. PMLR.
- Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudo-inputs. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 1257–1264.
- Titsias (2009) Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574.
- Williams and Seeger (2001) Williams, C. K. I. and Seeger, M. (2001). Using the nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press.
- Wilson et al. (2016) Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). Deep kernel learning. In Gretton, A. and Robert, C. C., editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain. PMLR.
- Zenke et al. (2017) Zenke, F., Poole, B., and Ganguli, S. (2017). Continual learning through synaptic intelligence. arXiv preprint arXiv:1703.04200.

## Appendix A Extension to multi-class (or multiple-outputs) tasks

For multi-class classification problems, such as permuted MNIST and Omniglot considered in our experiments, where in general each $k$-th task involves classification over varying number of classes, we need to extend the method to deal with multiple functions per task. For instance, assume that the $k$-th task is a multi-class classification problem that inolves ${C}_{k}$ classes. To model this we need ${C}_{k}$ independent draws from the GP, such that ${f}_{k}^{c}(x)\sim \mathcal{G}\mathcal{P}(0,k(x,{x}^{\prime}))$ with $c=1,\mathrm{\dots},{C}_{k}$, which are combined based on a multi-class likelihood such as softmax. While in the main text we presented the method assuming a single GP function per task, the generalisation to multiple functions is straightforward by assuming that all variational approximations factorise across different functions/classes. For example, the variational distribution over all inducing variables ${U}_{k}={\{{\mathbf{u}}_{k}^{c}\}}_{c=1}^{{C}_{k}}$ takes the form $q({U}_{k})={\prod}_{c=1}^{{C}_{k}}q({\mathbf{u}}_{k}^{c})$ and similarly the variational approximation over the task weights ${W}_{k}=\{{w}_{k}^{c}\}$, needed in the ELBO in Section 3.2, also factorises across classes. Notice also that all inducing variables ${U}_{k}$ are evaluated on the same inputs ${Z}_{k}$. Furthermroe, the KL regularization term for each task takes the form of a sum over the different functions, i.e. $\text{KL}(q({U}_{k})||{p}_{\theta}({U}_{k}))={\sum}_{c=1}^{{C}_{k}}\text{KL}(q({\mathbf{u}}_{k}^{c})||{p}_{\theta}({\mathbf{u}}_{k}^{c}))$.

## Appendix B Baseline model

The baseline model (see main text) is based on storing an explicit replay buffer $({\stackrel{~}{\mathbf{y}}}_{i},{\stackrel{~}{X}}_{i})$, i.e. a subset of the training data where ${\stackrel{~}{\mathbf{y}}}_{i}\subset {\mathbf{y}}_{i}$ and ${\stackrel{~}{X}}_{i}\subset {X}_{i}$, for each past task. Then, at each step when we encounter the $k$-th task training is performed by optimising an unbiased estimate of the full loss (i.e. if we had all $k$ tasks at once), given by

$$L(\theta ,{w}_{1:k})={\mathrm{\ell}}_{k}({\mathbf{y}}_{k},{X}_{k};{w}_{k},\theta )+\sum _{i=1}^{k-1}\frac{{N}_{i}}{{M}_{i}}{\mathrm{\ell}}_{i}({\stackrel{~}{\mathbf{y}}}_{i},{\stackrel{~}{X}}_{i};{w}_{i},\theta ),$$ |

where each ${\mathrm{\ell}}_{i}(\cdot )$ is a task-specific loss function, such as cross entropy for classification, and each scalar $\frac{{N}_{i}}{{M}_{i}}$ corrects for the bias on the loss value caused by approximating the initial full loss by a random replay buffer of size ${M}_{i}$. All output weights ${w}_{i:k}$ of the current and old tasks, in the multi-head architecture, are constantly updated together with the feature parameter vector $\theta $. Also at each step a fresh set of output weights is constructed in order to deal with the current task.

## Appendix C Analysis of the number of inducing points

Figure 2 shows how the accuracy on the Permuted-MNIST benchmark changes as a function of the number of inducing points. This Figure reveals that while performance increases as we add more inducing points, when we add too many the performance for the frcl methods can deteriorate. This is because the KL regularization terms when the number of inducing points is too large can become too dominant and over-regularize the learning of new tasks. Possible solutions of this effect can be either to have an upper limit in the total number of inducing points (by further compressing the inducing points of all past tasks into a smaller subset), or add a regularization parameter in front of the KL penalties of the past tasks. We leave the investigation of this for future work.