Incremental multidomain learning with network latent tensor factorization
Abstract
The prominence of deep learning, large amounts of annotated data and increasingly powerful hardware have made it possible to reach remarkable performance on supervised classification tasks, in many cases saturating the training sets. However, adapting the learned classification to new domains remains a hard problem for at least three reasons: (1) the domains and the tasks might be drastically different; (2) there might be a very limited amount of annotated data in the new domain; and (3) fully training a new model for each new task is prohibitive in terms of memory, due to the sheer number of parameters of deep networks. Instead, new tasks should be learned incrementally, building on prior knowledge from already learned tasks, and without catastrophic forgetting, i.e. without hurting performance on prior tasks. To our knowledge, this paper presents the first method for multidomain/task learning without catastrophic forgetting using a fully tensorized architecture. Our main contribution is a method for multidomain learning which models groups of identically structured blocks within a CNN as a high-order tensor. We show that this joint modelling naturally leverages correlations across different layers and results in more compact representations for each new task/domain than previous methods, which have focused on adapting each layer separately. We apply the proposed method to the 10 datasets of the Visual Decathlon Challenge and show that our method offers on average about a $7.5\times $ reduction in the number of parameters and superior performance in terms of both classification accuracy and Decathlon score. In particular, our method outperforms all prior work on the Visual Decathlon Challenge.
1 Introduction
It is now commonly accepted that supervised learning with deep neural networks can provide satisfactory solutions for a wide range of problems. If the aim is to focus on a single task only, then a deep neural network can be trained to obtain satisfactory performance given a sufficient amount of labelled training data and computational resources. This is the setting under which Convolutional Neural Networks (CNNs) have been employed to provide state-of-the-art solutions for a wide range of Computer Vision problems such as recognition [16, 36, 9], detection [31], semantic segmentation [20, 8] and human pose estimation [25], to name a few.
However, visual perception is not just concerned with being able to learn a single task at a time, assuming an abundance of labelled data, memory and computing capacity. A more desirable property is to be able to learn a set of tasks, possibly over multiple different domains, under limited memory and finite computing power. This setting is a very general one and many instances of it have been studied in Computer Vision and Machine Learning under various names. The main difference comes from whether we vary the task to be performed (classification or regression), or the domain, which broadly speaking refers to the distribution of the data or the labels for the considered task. These can be classified in $5$ main categories:
Multitask learning:
most commonly this refers to learning different classification (or regression) tasks (typically) jointly from a single domain. For example, given a facial image one may want to train a CNN to estimate the bounding box, facial landmarks, facial attributes, facial expressions and identity [28].
Transfer learning:
this refers to transferring knowledge from one learned task to another (possibly very different) one, typically via fine-tuning [10]. For example, a model pretrained on Imagenet can be fine-tuned on another dataset for face detection. Transfer learning results in a different model for the new task.
Domain adaptation:
this setting most commonly refers to learning the same task over a different domain for which training data is available but typically there is little labelled data for the new domain (e.g. [34, 40]). For example, one may learn a model for semantic segmentation using synthetic data (where pixel labels are readily available) and try to convert this model into a new one that works well for the domain of real images [35].
Multidomain learning:
this refers to learning a single model to perform different tasks over different domains (e.g. [13, 3]). For example, one might want to learn a single model, in which most of the parameters are shared, to classify both facial expressions and MNIST digits. Note that this setting is much more challenging than that of transfer learning, which yields a different model for each task.
Multidomain incremental learning:
this is the same as above but training data are not initially available for all tasks (e.g. [29, 30, 32]). For example, initially a model can be trained on Imagenet, and then new training data become available for facial expressions. In this case, one wants to learn a single model to handle Imagenet classification and facial expressions.
Our paper is concerned with this last problem: multidomain incremental learning. A key aspect of this setting is that the new task should be learned without harming the classification accuracy and representational power of the original model. This is called learning without catastrophic forgetting [7, 19]. Another important aspect is to keep newly introduced memory requirements low: a newly learned model should reuse as much as possible the knowledge acquired from already learned tasks, i.e. from a practical perspective, it should reuse or adapt the weights of a network already trained on a different task.
The aforementioned setting has only recently attracted the attention of the neural network community. Notably, the authors of [29] introduced the Visual Decathlon Challenge, which is concerned with incrementally converting an Imagenet classification model to new ones for another 9 different domains/tasks.
To our knowledge, only a few methods have been proposed recently to solve it [29, 30, 32, 22]. These works all have in common that incremental learning is achieved with layer-specific adapting modules (simply called adapters) applied to each CNN layer separately. Although the adapters have only a small number of parameters, because they are layer specific, the total number of parameters introduced by the adaptation process scales linearly with the number of layers, and in practice an adaptation network requires about 10% extra parameters (see also [30]). Our main contribution is to propose a tensor method for multidomain incremental learning that requires significantly fewer new parameters for each new task. In summary, our contributions are:

• We propose the first fully tensorized method for multidomain learning without catastrophic forgetting. Our method differs from previously proposed layerwise adaptation methods (and their straightforward layerwise extensions) by grouping all identically structured blocks of a CNN within a single high-order tensor.

• Our proposed method outperforms all previous works on the Visual Decathlon Challenge, both in terms of average accuracy and challenge score.

• We perform a thorough evaluation of our model on the 10 datasets of the Visual Decathlon Challenge and show that our method offers on average about a $7.5\times $ reduction in model parameters compared with training a new network from scratch, and superior performance over the state of the art in terms of compression rate, classification accuracy and Decathlon points.

• We show both theoretically and empirically that this joint modelling naturally leverages correlations across different layers and results in learning more compact representations for each new task/domain.
Intuitively, our method first learns, on the source domain, a task-agnostic core tensor. This represents a shared, domain-agnostic, latent subspace. For each new domain, this core is specialized by learning a set of task-specific factors defining the multilinear mapping from the shared subspace to the parameter space of that domain.
2 Closely Related Work
In this section, we review the related work on incremental multidomain learning and tensor methods.
Incremental Multi-Domain Learning is the focus of only a few methods, at least for vision-related classification problems. The works of [32] and [29] introduce the concept of layer adapters. These convert each layer of a pretrained CNN (typically on Imagenet) to adapt to a new classification task, for which new training data becomes available (the last layer typically requires retraining because the number of classes will in general be different). Because the layers of the pretrained CNN remain fixed, such approaches avoid the problem of catastrophic forgetting [7, 19], so that performance on the original task is preserved. The method of [32] achieves this by computing new weights for each layer as a linear combination of old weights, where the combination is learned in an end-to-end manner for all layers via backpropagation on the new task. The work in [29] achieves the same goal by introducing small residual blocks composed of batch-norm followed by $1\times 1$ convolutional layers after each $3\times 3$ convolution of the original pretrained network. Similarly, the newly introduced parameters are learned via backpropagation. The same work introduced the Visual Decathlon Challenge, which is concerned with incrementally adapting an Imagenet classification model to $9$ new and completely different domains and tasks. More recently, [30] extends [29] by making the adapters work in parallel with the $3\times 3$ convolutional layers.
Although the adapters have only a small number of parameters each, they are layer specific, and hence the total number of parameters introduced by the adaptation process grows linearly with the number of layers. In practice, an adaptation network requires about $10\%$ extra parameters (see also [30]). Finally, the work of [22] learns to adapt to a new task by learning how to mask individual weights of a pretrained network.
Our method differs significantly from these works in that it models groups of identically structured blocks within a CNN with a single high-order tensor. This results in a much more compact representation for each new task/domain, with a latent subspace shared between domains. Only a set of factors, representing a very small fraction of this subspace, needs to be learnt for each new task or domain.
Tensor methods. A detailed review of tensor methods falls outside the scope of this section. Herein, we focus on methods which have been used to reparametrize existing individual convolutional layers. This is done mainly to speed up computation or to reduce the number of parameters. The authors of [18] propose to decompose each of the 4D tensors representing the convolutional layers of a pretrained network into a sum of rank-$1$ tensors using CP decomposition. [12] propose a similar approach but use Tucker decomposition instead of CP. [1] also use CP decomposition, but optimize it using the tensor power method. The authors of [5] propose to share parameters within a ResNeXt-like block [41] by applying a Generalized Block Decomposition to a 4th-order tensor. As we show, a straightforward extension of existing multidomain adaptation methods (e.g. [32]) using tensors results in an adaptation model with a large number of parameters. To improve on this, we propose to model groups of identically structured blocks within a CNN with a single high-order tensor.
3 Method
In this section, we introduce our method (depicted in Figure 1) for incremental multidomain learning, starting with the notation in Section 3.1. Considering a source domain ${X}^{s}$ and output space ${Y}^{s}$, we aim to learn a function $h$ (here, a ResNet-based architecture) parametrized by a tensor ${\theta}^{s}$, $h({\theta}^{s}):{X}^{s}\to {Y}^{s}$. The model and its tensor parametrization are introduced in detail in Section 3.2. The main idea is to then learn a task-agnostic latent manifold $\mathcal{K}$ on the source domain. The parameter tensor ${\theta}^{s}$ is obtained from $\mathcal{K}$ with task-specific factors ${\mathbf{F}}_{s}^{(0)},\cdots ,{\mathbf{F}}_{s}^{(N)}$. Given a new target task, we then adapt $h$ and learn a new parameter tensor ${\theta}^{t}$ by specialising $\mathcal{K}$ with a new set of task-specific factors $({\mathbf{F}}_{t}^{(0)},\cdots ,{\mathbf{F}}_{t}^{(N)})$. This learning process is detailed in Section 3.3. In practice, most of the parameters are shared in $\mathcal{K}$, while the factors contain only a fraction of the parameters, which leads to large savings in the number of parameters. We offer an in-depth analysis of these space savings in Section 3.4.
3.1 Notation
In this paper, we denote vectors (1${}^{\text{st}}$-order tensors) as $\mathbf{v}$, matrices (2${}^{\text{nd}}$-order tensors) as $\mathbf{M}$ and tensors, which generalize the concept of matrices to orders (numbers of dimensions) higher than 2, as $\mathcal{X}$. $\mathrm{\mathbf{I}\mathbf{d}}$ is the identity matrix. Tensor contraction with a matrix, also called the n-mode product, is defined, for a tensor $\mathcal{X}\in {\mathbb{R}}^{{D}_{0}\times {D}_{1}\times \cdots \times {D}_{N}}$ and a matrix $\mathbf{M}\in {\mathbb{R}}^{R\times {D}_{n}}$, as the tensor $\mathcal{T}=\mathcal{X}{\times}_{n}\mathbf{M}\in {\mathbb{R}}^{{D}_{0}\times \cdots \times {D}_{n-1}\times R\times {D}_{n+1}\times \cdots \times {D}_{N}}$, with: ${\mathcal{T}}_{{i}_{0},{i}_{1},\cdots ,{i}_{N}}={\sum}_{k=0}^{{D}_{n}-1}{\mathbf{M}}_{{i}_{n},k}{\mathcal{X}}_{{i}_{0},\cdots ,{i}_{n-1},k,{i}_{n+1},\cdots ,{i}_{N}}.$
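The n-mode product defined above can be illustrated with a minimal NumPy sketch. The function name `mode_dot` and the toy shapes are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import numpy as np

def mode_dot(tensor, matrix, mode):
    """n-mode product: contract axis `mode` of `tensor` with the columns of `matrix`."""
    # tensordot sums over matrix axis 1 and tensor axis `mode`; the new axis
    # of size R comes out first, so it is moved back to position `mode`.
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

# Toy example (hypothetical sizes): D_0=2, D_1=3, D_2=4 and R=5, contracted along mode 1.
X = np.arange(24, dtype=float).reshape(2, 3, 4)
M = np.ones((5, 3))
T = mode_dot(X, M, mode=1)
print(T.shape)  # (2, 5, 4)
```

Because `M` is all ones here, each slice `T[i, r, :]` is simply the sum of `X[i, :, :]` over the contracted mode, which gives a quick sanity check of the index convention.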
3.2 Latent network parametrization
We propose to group all the parameters of a neural network into a set of high-order parameter tensors. We do so by collecting all the weights of the neural network into $3$ parameter tensors $({\theta}^{(0)},\cdots ,{\theta}^{(2)})$ of order $6$. While the proposed method is not architecture specific, to allow for a fair comparison in terms of overall representation power, we follow [29, 30, 32] and use a modified ResNet-26 [9]. The network consists of 3 macromodules, each consisting of 4 basic residual blocks [9] (see Fig. 2 for an overview). Each of these blocks contains two convolutional layers with $3\times 3$ filters. Following [29], the macromodules output 64, 128, and 256 channels respectively. Throughout the network the resolution is reduced multiple times: first at the beginning of each macromodule, using a convolutional layer with a stride of 2, and finally at the end of the network, before the classification layer, using an adaptive average pooling layer that reduces the spatial dimensions to a resolution of $1\times 1$ px.
In order to facilitate the proposed grouped tensorization process, we moved the feature projection layer (a convolutional layer with $1\times 1$ filters), required each time the number of features changes between blocks, outside of the macromodules (i.e. we place a convolutional layer with a $1\times 1$ kernel before the ${2}^{\text{nd}}$ and ${3}^{\text{rd}}$ macromodules). The overall architecture is depicted in Fig. 2.
We closely align our tensor reparametrization to the network structure by grouping together all the convolutional layers within the same macromodule. For each macromodule $b\in \{0,1,2\}$, we construct a ${6}^{\text{th}}$-order tensor collecting the weights in that group:
$${\theta}^{(b)}\in {\mathbb{R}}^{{D}_{0}\times {D}_{1}\times \mathrm{\cdots}\times {D}_{5}}$$  (1) 
where ${\theta}^{(b)}$ is the parameter tensor for the $b$${}^{\text{th}}$ macromodule. The 6 dimensions of the tensor are obtained as follows: ${D}_{0}\times {D}_{1}\times {D}_{2}\times {D}_{3}$ corresponds to the shape of the weights of a particular convolutional layer and represents the number of output channels, number of input channels, kernel width and kernel height respectively. The mode of size ${D}_{4}$ corresponds to the number of convolutional layers per basic residual block (2 in this case) and, finally, ${D}_{5}$ corresponds to the number of residual blocks present in each macromodule (4 for the specific architecture used).
Our model should be compared with previous methods for incremental multidomain adaptation like [32] (the method of [29] can be expressed in a similar way), which learn a linear transformation per layer. In particular, [32] learns a 2D adaptation matrix $\mathbf{F}\in {\mathbb{R}}^{{D}_{0}\times ({D}_{1}\times {D}_{2}\times {D}_{3})}$ per convolutional layer. Moreover, prior work on tensors (e.g. [12]) has focused on standard layerwise modelling with a ${4}^{\text{th}}$-order tensor of shape ${D}_{0}\times {D}_{1}\times {D}_{2}\times {D}_{3}$. In contrast, our model has two additional dimensions and, in general, can accommodate an arbitrary number of dimensions depending on the architecture used.
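As a rough illustration of this grouping, the sketch below stacks the convolution kernels of one macromodule into a single 6th-order array. The helper `group_macromodule`, the assumed ordering of the input list (block by block, layer by layer) and the small channel sizes are hypothetical choices made here, not details confirmed by the paper.

```python
import numpy as np

def group_macromodule(weights, layers_per_block=2, blocks=4):
    """Stack conv kernels of shape (D0, D1, D2, D3) into a (D0, D1, D2, D3, D4, D5) tensor.

    Assumes `weights` is ordered block by block: [b0_l0, b0_l1, b1_l0, b1_l1, ...].
    """
    assert len(weights) == layers_per_block * blocks
    stacked = np.stack(weights, axis=-1)                  # (D0, D1, D2, D3, L)
    d = stacked.shape[:4]
    # Split L into (blocks, layers), then swap so that D4 = layers, D5 = blocks.
    return stacked.reshape(*d, blocks, layers_per_block).swapaxes(-1, -2)

# Toy kernels: 8 layers with 4 output and 4 input channels, 3x3 filters,
# each filled with its own layer index so the layout can be checked.
weights = [np.full((4, 4, 3, 3), float(i)) for i in range(8)]
theta = group_macromodule(weights)
print(theta.shape)  # (4, 4, 3, 3, 2, 4)
```

With this layout, `theta[..., l, b]` holds the kernel of layer `l` in block `b`, so the last two modes index position within a block and position within the macromodule.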
3.3 MultiDomain Tensorized Learning
We now consider $T$ tasks, from potentially very different domains. The traditional approach would consist in learning as many models, one for each task. In our framework, this would be equivalent to learning one parameter tensor ${\theta}_{d}^{(b)}$ independently for each task $d$. Instead, we propose that each of the parameter tensors is obtained from a latent subspace, modelled by a task-agnostic tensor $\mathcal{K}$. The (multilinear) mapping between this task-agnostic core and the parameter tensor is then given by a set of task-specific factors $({\mathbf{F}}_{s}^{(0)},\cdots ,{\mathbf{F}}_{s}^{(5)})$ that specialize the task-agnostic subspace for the source domain $s$. Since the reasoning is the same for each of the macromodules, for clarity, and without loss of generality, we omit the $b$ in the notation.
Specifically, we write, for the source domain $s$:
$${\theta}_{s}=\mathcal{K}{\times}_{0}{\mathbf{F}}_{s}^{(0)}{\times}_{1}{\mathbf{F}}_{s}^{(1)}\times \cdots {\times}_{5}{\mathbf{F}}_{s}^{(5)},$$  (2) 
where $\mathcal{K}\in {\mathbb{R}}^{{D}_{0}\times \cdots \times {D}_{5}}$ is a task-agnostic full-rank core shared between all domains and $({\mathbf{F}}_{s}^{(0)},{\mathbf{F}}_{s}^{(1)},\cdots ,{\mathbf{F}}_{s}^{(5)})$ is a set of task-specific (for domain $s$) projection factors. We assume here that the task used to train the shared core is a general one with many classes and a large amount of training data (e.g. Imagenet classification). Moreover, a key observation to make at this point is that the number of parameters of the factors is orders of magnitude smaller than the number of parameters of the core.
For each new target domain $t$, we form a new parameter tensor ${\theta}_{t}$ obtained from the same latent subspace $\mathcal{K}$. This is done by learning a new set of factors $({\mathbf{F}}_{t}^{(0)},\cdots ,{\mathbf{F}}_{t}^{(5)})$ to specialize $\mathcal{K}$ for the new task:
$${\theta}_{t}=\mathcal{K}{\times}_{0}{\mathbf{F}}_{t}^{(0)}{\times}_{1}{\mathbf{F}}_{t}^{(1)}\times \cdots {\times}_{5}{\mathbf{F}}_{t}^{(5)}$$  (3) 
Note that the new factors represent only a small fraction of the total number of parameters, the majority of which are contained within the shared latent subspace. By expressing the new weight tensor ${\theta}_{t}$ as a function of the factors ${\mathbf{F}}_{t}^{(k)}$, one can learn them on the new task in an end-to-end manner via backpropagation, provided that labelled data are available. This allows the domain-agnostic subspace to be adapted efficiently to the new domains while retaining the performance on the original task and training only a small number of additional parameters. Fig. 1 shows a graphical representation of our method, where the weight tensors have been simplified to 3D for clarity.
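The relationship between the shared core and the per-domain factors in Eq. (2) and Eq. (3) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the helper `multi_mode_dot`, the toy mode sizes, the identity source factors and the randomly perturbed target factors are all hypothetical, and no training is performed.

```python
import numpy as np

def multi_mode_dot(core, factors):
    """Apply one factor per mode: core x_0 F^(0) x_1 F^(1) ... x_5 F^(5)."""
    out = core
    for mode, f in enumerate(factors):
        out = np.moveaxis(np.tensordot(f, out, axes=(1, mode)), 0, mode)
    return out

rng = np.random.default_rng(0)
shape = (8, 8, 3, 3, 2, 4)                  # toy (D_0, ..., D_5), full-rank case
core = rng.standard_normal(shape)           # shared core K, kept frozen for new domains

# Source-domain factors (identity here, purely for illustration).
factors_s = [np.eye(d) for d in shape]
# Target-domain factors: a separate small set of matrices per new domain.
factors_t = [np.eye(d) + 0.01 * rng.standard_normal((d, d)) for d in shape]

theta_s = multi_mode_dot(core, factors_s)   # Eq. (2)
theta_t = multi_mode_dot(core, factors_t)   # Eq. (3)

n_core = core.size
n_factors = sum(f.size for f in factors_t)
print(n_core, n_factors)
```

Even at these toy sizes the core holds 4608 values against 166 for one set of factors; the gap widens quickly as the channel dimensions grow.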
Auxiliary loss function:
To prevent degenerate solutions and facilitate learning, we additionally explore orthogonality constraints on the task-specific factors. This type of constraint has been shown to have a regularizing effect, improving the overall convergence stability and final accuracy [4, 2]. In addition, by adding such a constraint, we aim to enforce the factors of the decomposition to be of full column rank, which would ensure that the core of the decomposition preserves essential properties of the full weight tensor such as the Kruskal rank [11]. In practice, rather than a hard constraint, we add a loss to the objective function:
$$\mathcal{L}=\lambda \sum _{k=0}^{5}{\parallel {\left({\mathbf{F}}_{t}^{(k)}\right)}^{\top}{\mathbf{F}}_{t}^{(k)}-\mathrm{\mathbf{I}\mathbf{d}}\parallel}_{F}^{2}.$$  (4) 
The regularization parameter $\lambda $ was validated on a small validation set.
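A soft orthogonality penalty of this kind (squared Frobenius distance between each factor's Gram matrix and the identity, summed over factors and scaled by $\lambda$) might be computed as in the sketch below. The function name and the example factors are assumptions made for illustration; in actual training the penalty would be evaluated in an automatic differentiation framework.

```python
import numpy as np

def orthogonality_penalty(factors, lam):
    """lam * sum_k || F_k^T F_k - Id ||_F^2 over a list of factor matrices."""
    total = 0.0
    for f in factors:
        gram = f.T @ f
        total += np.sum((gram - np.eye(gram.shape[0])) ** 2)
    return lam * total

# Orthogonal factors incur no penalty; scaled ones do.
orthogonal = [np.eye(3), np.eye(4)]
scaled = [2.0 * np.eye(3)]
print(orthogonality_penalty(orthogonal, lam=0.1))
print(orthogonality_penalty(scaled, lam=0.1))
```

For the scaled example, the Gram matrix is $4\,\mathrm{Id}$, so the squared Frobenius norm of the difference is $3 \times 3^{2} = 27$ before multiplication by $\lambda$.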
3.4 Complexity Analysis
In terms of unique, task-specific parameters learned, our grouping strategy is significantly more efficient than a layerwise parametrization. For a given group of convolutional layers, in this work defined by the macromodule structure present in a ResNet architecture, we can express the total number of parameters for a layerwise Tucker case (this is not proposed in this work but mentioned here for comparison purposes) as follows: ${N}_{\text{layerwise}}=({D}_{4}\times {D}_{5})\times ({\sum}_{k=0}^{3}{D}_{k}{R}_{k}).$
In particular, in the case of a full-rank decomposition (${R}_{k}={D}_{k}$), we get:
$${N}_{\text{layerwise}}=\underbrace{({D}_{4}\times {D}_{5})}_{L}\times ({D}_{0}^{2}+{D}_{1}^{2}+{D}_{2}^{2}+{D}_{3}^{2}),$$  (5) 
where $L={D}_{4}\times {D}_{5}$ is the number of reparametrized convolutional layers in a given group.
For the linear case [32], we have that ${D}_{0}={D}_{1}={D}_{c}$, and the number of parameters simplifies to: ${N}_{\text{linear}}=\underbrace{({D}_{4}\times {D}_{5})}_{L}\times {D}_{c}^{2}$
As opposed to this, for our proposed method, by grouping the parameters together into a single highorder tensor, the total number of parameters is:
$${N}_{\text{TNet}}=\sum _{k=0}^{5}{D}_{k}{R}_{k}$$  (6) 
For the full-rank case $({R}_{k}={D}_{k})$, this simplifies to:
$${N}_{\text{TNet}}={D}_{0}^{2}+{D}_{1}^{2}+{D}_{2}^{2}+{D}_{3}^{2}+\underbrace{{D}_{4}^{2}+{D}_{5}^{2}}_{{(\frac{L}{{D}_{5}})}^{2}+{(\frac{L}{{D}_{4}})}^{2}}$$  (7) 
Note that here, ${D}_{4}=2$ and ${D}_{5}=4$, so ${D}_{4}^{2}+{D}_{5}^{2}=20\le {L}^{2}=64$.
Because in practice $({D}_{0}^{2}+{D}_{1}^{2}+{D}_{2}^{2}+{D}_{3}^{2})\gg {L}^{2}$, by using the proposed method, we achieve $\frac{{N}_{\text{layerwise}}}{{N}_{\text{TNet}}}\approx L$ times fewer task-specific parameters.
Substituting the variables in Eq. (5) and Eq. (7) with the numerical values specific to the architecture used in this work, for each of the 3 groups, we obtain in total for the layerwise case: ${N}_{\text{layerwise}}=8\times ({64}^{2}+{64}^{2}+{3}^{2}+{3}^{2})+8\times ({128}^{2}+{128}^{2}+{3}^{2}+{3}^{2})+8\times ({256}^{2}+{256}^{2}+{3}^{2}+{3}^{2})=1,376,688$ parameters. By contrast, using the same setting for our proposed method and neglecting the small ${D}_{4}^{2}+{D}_{5}^{2}$ term of each group, we get ${N}_{\text{TNet}}=8210+32786+131090=172,086$, thus verifying $\frac{{N}_{\text{layerwise}}}{{N}_{\text{TNet}}}\approx 8=L$.
Making the same assumptions as for the linear case, given that we use square convolutional kernels (i.e. ${D}_{2}={D}_{3}={D}_{n}$), and ${D}_{c}\gg {D}_{n}$, Eq. (7) becomes ${N}_{\text{TNet}}\approx 2{D}_{c}^{2}$, since the remaining terms are negligible. This results in $\approx \frac{L}{2}$ times fewer parameters than in the linear case ($\frac{L}{2}=4$ for the model used).
Conclusion:
Our proposed approach uses $L$ times fewer parameters per group than the layerwise Tucker decomposition and $\frac{L}{2}$ times fewer parameters than the layerwise linear decomposition. For the ResNet-26 architecture used in this work, $L=8$.
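The full-rank counts of Eq. (5) and Eq. (7) can be checked with a few lines of Python. The function names are illustrative, and the grouped count below keeps all six squared terms of Eq. (7), including the small ${D}_{4}^{2}+{D}_{5}^{2}$ contribution of each group.

```python
def layerwise_params(dims):
    """Eq. (5): one full-rank Tucker factor set per convolutional layer."""
    d0, d1, d2, d3, d4, d5 = dims
    return (d4 * d5) * (d0**2 + d1**2 + d2**2 + d3**2)

def grouped_params(dims):
    """Eq. (7): one full-rank factor per mode of the grouped 6th-order tensor."""
    return sum(d**2 for d in dims)

# (D_0, ..., D_5) for the three macromodules of the ResNet-26 described in the text.
groups = [(64, 64, 3, 3, 2, 4), (128, 128, 3, 3, 2, 4), (256, 256, 3, 3, 2, 4)]
n_layerwise = sum(layerwise_params(g) for g in groups)
n_grouped = sum(grouped_params(g) for g in groups)
ratio = n_layerwise / n_grouped
print(n_layerwise, n_grouped, ratio)
```

With all six modes included, the grouped count is 172,146 and the ratio stays within a fraction of a percent of $L=8$.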
Table 1: Relative increase in the number of parameters, top-1 accuracy on each of the 10 datasets, mean accuracy and Decathlon score.
Model  #param  ImNet  Airc.  C100  DPed  DTD  GTSR  Flwr  OGlt  SVHN  UCF  mean  Score
#images  -  1.3M  7k  50k  30k  4k  40k  2k  26k  70k  9k  -  -
Rebuffi et al. [29]  $2\times $  59.23  63.73  81.31  93.30  57.02  97.47  83.43  89.82  96.17  50.28  77.17  2643
Rosenfeld et al. [32]  $2\times $  57.74  64.11  80.07  91.29  56.54  98.46  86.05  89.67  96.77  49.38  77.01  2851
Mallya et al. [22]  $1.28\times $  57.69  65.29  79.87  96.99  57.45  97.27  79.09  87.63  97.24  47.48  76.60  2838
Series Adap. [30]  $2\times $  60.32  61.87  81.22  93.88  57.13  99.27  81.67  89.62  96.57  50.12  77.17  3159
Parallel Adap. [30]  $2\times $  60.32  64.21  81.91  94.73  58.83  99.38  84.68  89.21  96.54  50.94  78.07  3412
Parallel SVD [30]  $1.5\times $  60.32  66.04  81.86  94.23  57.82  99.24  85.74  89.25  96.62  52.50  78.36  3398
Ours  $1.35\times $  61.48  67.36  80.84  93.22  59.10  99.64  88.99  88.91  96.95  47.90  78.43  3585
4 Experimental setting
In this section, we detail the experimental setting, metrics used and implementation details.
Datasets:
We evaluate our method on the $10$ datasets, from very different visual domains, that compose the Decathlon challenge [29]. This challenge explicitly assesses methods designed to solve the last setting defined in Section 1, i.e. incremental multidomain learning without catastrophic forgetting. Imagenet [33] contains $1.2$ million images distributed across $1000$ classes. Following [29, 30, 32], this was used as the source domain to train the shared low-rank manifold for our model as detailed in Eq. (2). The FGVC-Aircraft Benchmark (Airc.) [21] contains 10,000 aircraft images across 100 different classes; CIFAR-100 (C100) [15] is composed of $60000$ small $(32\times 32)$ images in $100$ classes; Daimler Mono Pedestrian Classification Benchmark (DPed) [23] is a dataset for pedestrian detection (binary classification) composed of 50,000 images; Describable Texture Dataset (DTD) [6] contains $5640$ images for $47$ texture categories; the German Traffic Sign Recognition (GTSR) Benchmark [38] is a dataset of $50,000$ images of $43$ traffic sign categories; Flowers102 (Flwr) [26] contains $102$ flower categories with between $40$ and $258$ images per class; Omniglot (OGlt) [17] is a dataset of $32000$ images representing $1623$ handwritten characters from $50$ different alphabets; the Street View House Numbers (SVHN) [24] is a digit recognition dataset containing $70000$ images in $10$ classes. Finally, UCF101 (UCF) [37] is an action recognition dataset composed of 13,320 images representing 101 action classes.
Metrics:
We follow the evaluation protocol of the Decathlon Challenge and report results in terms of mean accuracy and decathlon score S, computed as follows:
$$S=\sum _{t=1}^{N}{\beta}_{t}\mathrm{max}{\{0,{E}_{t}^{reference}-{E}_{t}\}}^{{\lambda}_{t}},$$  (8) 
where ${E}_{t}^{reference}$ is considered to be the upper limit allowed for a given task $t$ in order to receive points, ${\lambda}_{t}$ is an exponent that controls the reward proportionality, and ${\beta}_{t}$ is a scalar that enforces the limit of 1000 points per task. ${E}_{t}^{reference}=2{E}_{t}^{baseline}$, where ${E}_{t}^{baseline}$ is given by the strong baseline from [29].
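A minimal sketch of how the score in Eq. (8) might be computed is given below. Two choices are assumptions rather than statements from the text: the exponent ${\lambda}_{t}=2$ for every task, and ${\beta}_{t}=1000/{({E}_{t}^{reference})}^{{\lambda}_{t}}$, which is the scaling that yields exactly 1000 points at zero error. The function name and the example error values are hypothetical.

```python
def decathlon_score(errors, baseline_errors, exponent=2.0, max_points=1000.0):
    """Sum over tasks of beta_t * max(0, E_ref_t - E_t) ** exponent.

    Assumes E_ref_t = 2 * E_baseline_t and beta_t = max_points / E_ref_t ** exponent.
    """
    total = 0.0
    for e, e_base in zip(errors, baseline_errors):
        e_ref = 2.0 * e_base
        beta = max_points / e_ref ** exponent
        total += beta * max(0.0, e_ref - e) ** exponent
    return total

# Hypothetical error rates for a single task with a baseline error of 0.2.
perfect = decathlon_score([0.0], [0.2])      # zero error: full points for the task
at_baseline = decathlon_score([0.2], [0.2])  # matching the baseline: a quarter of the points
too_weak = decathlon_score([0.5], [0.2])     # error above the reference: no points
print(perfect, at_baseline, too_weak)
```

Under these assumptions, matching the baseline error earns 250 points per task, and any error at or above twice the baseline earns none.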
Implementation details:
We first train our adapted ResNet-26 model (Fig. 2) on ImageNet for 90 epochs using SGD with momentum ($0.9$) and a learning rate of $0.1$ that is decreased by a factor of $10$ every 30 epochs. To avoid overfitting, we use a weight decay equal to ${10}^{-5}$. During training, we follow best practices and randomly apply scale jittering, random cropping (to $224\times 224$ px) and flipping. We initialize our weights from a normal distribution $\mathcal{N}(0,0.002)$, before decomposing them using Tucker decomposition (Section 3). Finally, we train the obtained core and factors (via backpropagation) by reconstructing the weights on the fly.
For the remaining $9$ domains, we load the task-independent core and the factors trained on Imagenet, freeze the core weights and fine-tune only the factors, batch-norm layers and the two $1\times 1$ projection layers, which together account for $\approx 3.5\%$ of the total number of parameters. The linear layer at the end of the network is trained from scratch for each task and is initialized from a uniform distribution. Depending on the size of the dataset, we adjust the weight decay to avoid overfitting, from ${10}^{-5}$ for the larger datasets up to $0.005$ for the smaller ones (e.g. Flowers102).
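The step schedule described above (initial rate $0.1$, divided by $10$ every 30 epochs over 90 epochs) can be written as a small function. The name `step_lr` and its parameters are illustrative and not tied to any particular framework.

```python
def step_lr(epoch, base_lr=0.1, drop=10.0, step=30):
    """Learning rate at a given epoch when dividing by `drop` every `step` epochs."""
    return base_lr / drop ** (epoch // step)

# Rates at the start of each of the three 30-epoch phases.
schedule = [step_lr(e) for e in (0, 30, 60)]
print(schedule)
```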
5 Results
Here, we assess the performance of the proposed approach by i) comparing to existing state-of-the-art methods on the challenging Visual Decathlon Challenge [29] (Section 5.1) and ii) conducting a thorough study of the method, including constraints imposed on the core and factors of the model.
5.1 Comparison with stateoftheart
Herein, we compare against the current state-of-the-art methods on multidomain transfer learning [29, 30, 32, 22] on the Decathlon dataset. We train our core subspace on ImageNet and incrementally adapt to all $9$ other domains. In Table 1 we report, for all methods, the relative increase in number of parameters (per domain), the top-1 accuracy on each of the $10$ domains, as well as the average accuracy and overall challenge score.
Our approach outperforms all the other methods in terms of both Decathlon score ($3585$ vs. $3412$, a gain of $173$ points) and mean accuracy, despite requiring significantly fewer task-dependent parameters. Furthermore, in terms of efficiency our approach outperforms even the joint compression method of [30] (denoted as “Parallel SVD”), which takes advantage of the data redundancy between tasks.
5.2 Interclass transfer learning
Most recent work on multidomain incremental learning attempts to transfer the knowledge from a model pretrained on a large-scale dataset such as ImageNet to other, easier datasets and/or tasks. In this work, we go one step further and explore the efficiency of our transfer learning approach when such a source dataset (or the required computational resources) is not available, by starting from a model pretrained on a much smaller dataset. Table 3 shows the results for a network pretrained on Cifar100. Notice that on some datasets (e.g. GTSR, OGlt) such a model can match and even marginally surpass the performance of its Imagenet counterpart. On the other hand, on some of the more challenging datasets (e.g. DTD, Airc.) there is still a large gap. This suggests that the features learned by the Cifar-trained model are less generic and diverse, which we attribute to both the small number of available samples and the ease of overfitting on the original dataset. A potential solution may be to enforce a diversity loss; however, we leave the exploration of this area for future work.
Table 2: Accuracy on DTD and vggflowers for different values of $\lambda$ (see Section 5.5).
Dataset     $\lambda = 0.1$  $\lambda = 0.01$  $\lambda = 0.001$
DTD         52.2%            51.3%             51.0%
vggflowers  80.9%            83.8%             82.2%
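Table 3: Accuracy (%) of our method when pretrained on ImageNet and on Cifar100 (see Section 5.2).
Model  Pretrained on  Airc.  C100  DPed   DTD    GTSR   Flwr  OGlt   SVHN   UCF
Ours   ImageNet       55.6   80.7  99.67  52.2   99.96  83.8  88.18  95.66  78.6
Ours   Cifar100       41.7   74.5  99.82  37.55  99.98  70.9  88.35  95.43  72.1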
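The gap discussed in Section 5.2 can be read directly off the Table 3 values. The short sketch below only recomputes per-domain differences from the numbers reported there; the variable names are ours and purely illustrative.

```python
# Accuracies (%) copied from Table 3; column order follows the table header.
domains = ["Airc.", "C100", "DPed", "DTD", "GTSR", "Flwr", "OGlt", "SVHN", "UCF"]
imagenet = [55.6, 80.7, 99.67, 52.2, 99.96, 83.8, 88.18, 95.66, 78.6]
cifar100 = [41.7, 74.5, 99.82, 37.55, 99.98, 70.9, 88.35, 95.43, 72.1]

# A positive gap means the Cifar100-pretrained model is better on that domain.
gaps = {d: round(c - i, 2) for d, i, c in zip(domains, imagenet, cifar100)}

# Domains where the Cifar100-pretrained model matches or surpasses its ImageNet counterpart.
matched = [d for d, g in gaps.items() if g >= 0]

# The three domains with the largest accuracy drop.
largest_drops = sorted(gaps, key=gaps.get)[:3]

for d in domains:
    print(f"{d:6s} {gaps[d]:+.2f}")
```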
5.3 Varying the amount of training data
An interesting aspect of incremental multi-domain learning not addressed thus far is the performance on new domains/tasks in situations where only a limited amount of labelled data is available for the new domains. Although not all 9 remaining tasks of the Decathlon assume an abundance of training data, in this section we systematically assess this by varying the amount of training data for 4 tasks, namely DPed, DTD, GTSRB and UCF. Fig. 4 shows the classification accuracy on these datasets as a function of the amount of training data. In the same figure, we also report the performance of a network for which both the cores and the factors are fine-tuned on these datasets, trained with the same amount of data. In general, we observe that our method is at least as good as the fine-tuned network, which should be considered a very strong baseline, as it requires as many parameters as the original ImageNet-trained model. This validates the robustness of our model when training with a limited amount of data.
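One way to build such reduced training sets is sketched below. The text does not specify how the subsets were drawn, so the class-stratified sampling used here is an assumption, and `subsample_per_class` is a hypothetical helper rather than part of our released code.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices keeping roughly `fraction` of the samples of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        n_keep = max(1, round(fraction * len(indices)))  # keep at least one sample per class
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Toy label list with three imbalanced classes (illustrative only).
labels = [0] * 50 + [1] * 30 + [2] * 20
subset = subsample_per_class(labels, fraction=0.1)
print(len(subset), "of", len(labels), "samples kept")
```

Sampling per class keeps every class represented even at small fractions, which matters for datasets with few images per class.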
5.4 Rank regularization
It is well-known that low-rank structures act as regularization mechanisms [39]. By jointly modelling the parameters of our model as a high-order tensor, our model allows such a constraint, effectively regularizing the whole network and thus preventing overfitting. This also allows for more efficient representations, by leveraging the redundancy in the multi-linear structure of the network, allowing for large compression ratios without a decrease in performance.
We therefore investigated this possibility. To this end, we first attempted to train our ImageNet model while imposing a low-rank constraint on the weight tensor. However, as Fig. 3 shows, doing so already causes performance on the base task of ImageNet to drop significantly; hence, we did not pursue rank regularization further. We attribute this to the very small number of parameters in the ResNet model.
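To make the trade-off behind a low-rank constraint concrete, the sketch below counts parameters for a Tucker-style factorization, i.e. a core tensor with one factor matrix per mode, which is how we read the "core" and "factors" referred to above. The tensor shape and ranks are hypothetical and do not correspond to the actual ResNet configuration.

```python
from math import prod


def tucker_param_count(shape, ranks):
    """Return (core, factors) parameter counts for a Tucker-style factorization."""
    core = prod(ranks)                                            # core tensor of size R_1 x ... x R_N
    factors = sum(dim * rank for dim, rank in zip(shape, ranks))  # one I_k x R_k matrix per mode
    return core, factors


# Hypothetical 5th-order weight tensor, for illustration only.
shape = (64, 64, 3, 3, 8)
dense = prod(shape)

full_rank = tucker_param_count(shape, shape)           # no rank constraint
low_rank = tucker_param_count(shape, (32, 32, 3, 3, 4))  # reduced ranks on three modes

compression = dense / sum(low_rank)
print(f"dense: {dense}, low-rank core + factors: {sum(low_rank)}, ratio: {compression:.2f}x")
```

Note that the factors are far smaller than the core in both settings, which is why adapting only the factors per domain is cheap regardless of whether the core itself is rank-constrained.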
5.5 Effect of orthogonality regularization
To prevent degenerate solutions and facilitate learning, we added orthogonality constraints on the task-specific factors. This type of constraint has been shown to act as a regularizer, improving the overall convergence stability and final accuracy [4, 2]. In addition, by adding such constraints, we aim to enforce the factors of the decomposition to be full column rank, which would ensure that the core of the decomposition preserves essential properties of the full weight tensor, such as the Kruskal rank [11]. This orthogonality constraint was enforced using a regularization term, rather than via a hard constraint. See Table 2 for results on two selected small datasets, namely DTD and vggflowers.
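A soft orthogonality constraint of this kind can be sketched as follows. The exact form of the regularizer is not given in this section, so the squared Frobenius penalty $\lambda \, \| U^\top U - I \|_F^2$ used here is an assumption (it is a common choice), and we further assume that `weight` plays the role of $\lambda$ in Table 2. The function and the toy matrices are illustrative only.

```python
def orthogonality_penalty(factor, weight):
    """Compute weight * ||U^T U - I||_F^2 for a factor U given as a list of rows."""
    n_rows, n_cols = len(factor), len(factor[0])
    penalty = 0.0
    for a in range(n_cols):
        for b in range(n_cols):
            # Entry (a, b) of the Gram matrix U^T U.
            gram = sum(factor[r][a] * factor[r][b] for r in range(n_rows))
            target = 1.0 if a == b else 0.0  # identity matrix entry
            penalty += (gram - target) ** 2
    return weight * penalty


# A 3x2 factor with orthonormal columns incurs no penalty.
orthonormal = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
# A factor with correlated, non-unit columns is penalized.
skewed = [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]

print(orthogonality_penalty(orthonormal, weight=0.01))
print(orthogonality_penalty(skewed, weight=0.01))
```

In training, such a term would be added to the classification loss for each task-specific factor, so that orthogonality is encouraged rather than strictly enforced.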
6 Conclusions
We proposed a novel method for incremental multi-domain learning using tensors. By modelling groups of identically structured blocks within a CNN as a high-order tensor, we are able to express the parameter space of a deep neural network as a (multilinear) function of a task-agnostic subspace. This task-agnostic core is then specialized by learning a set of small, task-specific factors for each new domain. We show that this joint modelling naturally leverages correlations across different layers and results in more compact representations for each new task/domain than previous methods, which have focused on adapting each layer separately. We test the proposed method on the $10$ datasets of the Visual Decathlon Challenge and show that our method offers on average about $7.5\times $ reduction in model parameters and outperforms existing work, both in terms of classification accuracy and Decathlon points.
References
 [1] M. Astrid and S. Lee. CP-decomposition with tensor power method for convolutional neural networks compression. CoRR, abs/1701.07148, 2017.
 [2] N. Bansal, X. Chen, and Z. Wang. Can we gain more from orthogonality regularizations in training deep cnns? arXiv preprint arXiv:1810.09102, 2018.
 [3] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
 [4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
 [5] Y. Chen, X. Jin, B. Kang, J. Feng, and S. Yan. Sharing residual units through collective tensor factorization in deep neural networks. 2017.
 [6] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
 [7] R. M. French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [10] M. Huh, P. Agrawal, and A. A. Efros. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
 [11] B. Jiang, F. Yang, and S. Zhang. Tensor and its tucker core: The invariance relationships. Numerical Linear Algebra with Applications, 24(3):e2086.
 [12] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR, 05 2016.
 [13] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
 [14] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic. Tensorly: Tensor learning in python. CoRR, abs/1610.09555, 2018.
 [15] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 [18] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1412.6553, 2014.
 [19] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Learning to generalize: Meta-learning for domain generalization. arXiv preprint arXiv:1710.03463, 2017.
 [20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [21] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
 [22] A. Mallya, D. Davis, and S. Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In ECCV, 2018.
 [23] S. Munder and D. M. Gavrila. An experimental study on pedestrian classification. IEEE TPAMI, 28(11):1863–1868, 2006.
 [24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshops, volume 2011, 2011.
 [25] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
 [26] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
 [27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [28] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multitask learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE TPAMI, 2017.
 [29] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Learning multiple visual domains with residual adapters. In NIPS, 2017.
 [30] S.-A. Rebuffi, H. Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. In CVPR, 2018.
 [31] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [32] A. Rosenfeld and J. K. Tsotsos. Incremental learning through deep adaptation. arXiv preprint arXiv:1705.04228, 2017.
 [33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [34] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
 [35] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In CVPR, 2018.
 [36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv, 2014.
 [37] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
 [38] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32, 2012.
 [39] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
 [40] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
 [41] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.