In few-shot learning, typically, the loss function which is applied at test time is the one we are ultimately interested in minimising, such as the mean-squared-error loss for a regression problem. However, given that we have few samples at test time, we argue that the loss function that we are interested in minimising is not necessarily the loss function most suitable for computing gradients in a few-shot setting. We propose VIABLE, a generic meta-learning extension that builds on existing meta-gradient-based methods by learning a differentiable loss function, replacing the pre-defined inner-loop loss function in performing task-specific updates. We show that learning a loss function capable of leveraging relational information between samples reduces underfitting, and significantly improves performance and sample efficiency on a simple regression task. Furthermore, we show VIABLE is scalable by evaluating on the Mini-Imagenet dataset.
VIABLE: Fast Adaptation via Backpropagating Learned Loss
Leo Feng (University of Oxford; corresponding author), Luisa Zintgraf (University of Oxford), Bei Peng (University of Oxford), Shimon Whiteson (University of Oxford, Latent Logic)
3rd Workshop on Meta-Learning at NeurIPS 2019, Vancouver, Canada.
Meta-learning is a popular and general way to tackle few-shot learning problems, i.e., learning how to solve unseen tasks given only a little data. Many meta-learning methods can be characterised as meta-gradient-based [7, 13, 17, 30]. Briefly, these methods work as follows. During training, at each iteration, they perform a gradient-based task-specific update (often referred to as the "inner loop"). Then, for the meta-update, so-called meta-gradients are computed by backpropagating through these inner-loop updates (which therefore involves taking higher-order gradients). At test time, on a new task, only the inner-loop update is performed, using a few gradient steps.
In few-shot learning, typically, the loss function applied at test time is the one we are ultimately interested in minimising, such as the mean-squared-error loss for a regression problem. However, given that we have few samples at test time, we argue that the loss function we want to minimise is not necessarily the loss function most suitable for computing gradients in a few-shot setting. Such a loss function is naive in the sense that it treats each datapoint independently, disregarding any relationships between them. This can be particularly problematic when only a few datapoints are given and these include, e.g., outliers or correlated points. Furthermore, it can be prone to over- or underfitting, depending on the step size and the number of gradient steps. Therefore, we propose to instead learn the test-time loss function for meta-gradient-based methods for few-shot adaptation.
In this work, we introduce fast adaptation via backpropagating learned loss (VIABLE), a generic meta-learning extension which builds on existing meta-gradient-based methods by learning a differentiable loss function using meta-gradients. This loss function replaces the pre-defined inner-loop loss function and is meta-learned such that it maximises performance (i.e., minimises the pre-defined loss) within a few gradient steps and with little data. We show that learning a loss function capable of leveraging relational information between samples reduces underfitting, and significantly improves performance and sample efficiency on a simple regression task. In addition, we show VIABLE is scalable by evaluating on the Mini-Imagenet dataset. Since we typically use neural networks as function approximators, we refer to the network making predictions as the prediction network and to the learned loss function as the loss network.
Learning a loss function has been explored in a variety of ways in machine learning [1, 5, 6, 10, 19, 22, 25, 27, 28], including in reinforcement learning and semi-supervised learning. In this paper, we are concerned with the few-shot supervised learning setting. Most closely related to our method is recent work by Chebotar et al. [5] on meta-learning via learned loss, in which a loss function is learned in a similar fashion to VIABLE. In contrast to our work, their method is not designed for few-shot learning and instead uses the learned loss function to learn a prediction network from scratch per task. VIABLE, on the other hand, can be applied on top of any meta-gradient-based meta-learning technique designed for few-shot learning. Also closely related is work by Sung et al. [22], who propose meta-critics. In addition to also learning from scratch per task, during meta-training the meta-critic (loss network) is updated after each batch of task-specific actor (prediction network) updates, whereas in VIABLE the loss network is frozen during task-specific updates and thus requires far fewer updates in total. Most importantly, compared to the above methods, we propose to learn a loss function that is designed to operate on the entire dataset at once, thus leveraging relational information between datapoints. We achieve this by using a relation network that looks at pairwise combinations of datapoints. As we show in this paper, this leads to a significant improvement in performance.
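We consider the problem setting of meta-learning for supervised learning problems. In supervised learning, we learn a model $f$ that maps data points $x \in \mathcal{X}$ that have a true label $y \in \mathcal{Y}$ to predictions $\hat{y} \in \mathcal{Y}$. In few-shot learning problems, during each meta-training iteration, a batch of tasks $\mathbf{T} = \{\mathcal{T}_i\}_{i=1}^{N}$ is sampled from a task distribution $p(\mathcal{T})$. A task $\mathcal{T}_i$ is a tuple $(\mathcal{X}, \mathcal{Y}, \mathcal{L}_{\mathcal{T}_i}, q_{\mathcal{T}_i})$, where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, $\mathcal{L}_{\mathcal{T}_i}(\hat{y}, y)$ is the task-specific loss function, and $q_{\mathcal{T}_i}(x, y)$ is a distribution over data points. During each meta-training iteration, for each $\mathcal{T}_i$, we sample from $q_{\mathcal{T}_i}$: $\mathcal{D}_i^{\text{train}} = \{(x_m, y_m)\}_{m=1}^{M^{\text{train}}}$ and $\mathcal{D}_i^{\text{test}} = \{(x_m, y_m)\}_{m=1}^{M^{\text{test}}}$, where $M^{\text{train}}$ and $M^{\text{test}}$ are the fixed numbers of training and test datapoints respectively. The training data is used to perform updates on the model $f$. Afterwards, the updated model is evaluated on the test data, and the model and/or the update rule are adjusted.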
2.1 Context Adaptation via Meta-Learning: CAVIA
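In theory, VIABLE can be generically applied to meta-gradient-based methods. In this paper, we evaluate on CAVIA [30] because it applies the inner-loop update only to a small set of so-called context parameters instead of the entire network, making it easier to optimise. CAVIA aims to learn two distinct sets of parameters: task-specific context parameters $\phi$ and task-agnostic parameters $\theta$. At every meta-training iteration, in the inner loop, CAVIA starts from a fixed value $\phi_0$, typically $\phi_0 = 0$, and updates its context parameters for each task $\mathcal{T}_i$ in the current batch of tasks as follows (we outline CAVIA for one gradient update step, but it can be extended to several gradient steps):
$$\phi_i = \phi_0 - \alpha \nabla_{\phi} \frac{1}{M^{\text{train}}} \sum_{(x, y) \in \mathcal{D}_i^{\text{train}}} \mathcal{L}_{\mathcal{T}_i}\big(f_{\phi_0, \theta}(x), y\big), \qquad (1)$$
where $\alpha$ is the inner-loop learning rate, $\mathcal{D}_i^{\text{train}}$ is the training data of task $\mathcal{T}_i$, and $M^{\text{train}}$ is its size.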
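In the meta-update step (outer loop), the model parameters $\theta$ are updated with respect to the performance after the inner-loop update:
$$\theta \leftarrow \theta - \beta \nabla_{\theta} \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M^{\text{test}}} \sum_{(x, y) \in \mathcal{D}_i^{\text{test}}} \mathcal{L}_{\mathcal{T}_i}\big(f_{\phi_i, \theta}(x), y\big), \qquad (2)$$
where $\beta$ is the meta-learning rate, $N$ is the number of tasks in the batch $\mathbf{T}$, and $\phi_i$ are the context parameters of task $\mathcal{T}_i$ after the inner-loop update.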
At test time, model parameters are frozen and only the task-specific parameters are updated.
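As a hedged illustration of the inner-loop update described above, the following minimal sketch uses a toy one-dimensional model `f(x) = theta * x + phi`. The model, the function names (`predict`, `mse`, `inner_update`), and the step size are assumptions made for brevity; they are not the architecture or settings used in the paper. The gradient with respect to the context parameter is written out analytically for the MSE loss.

```python
def predict(theta, phi, x):
    """Toy prediction model: task-agnostic slope theta, task-specific context phi."""
    return theta * x + phi


def mse(theta, phi, data):
    """Mean-squared error over a list of (x, y) pairs."""
    return sum((predict(theta, phi, x) - y) ** 2 for x, y in data) / len(data)


def inner_update(theta, data, phi0=0.0, alpha=0.5):
    """One CAVIA-style inner-loop step: only the context parameter phi moves."""
    # d/dphi of mean((theta*x + phi - y)^2) equals mean(2 * (theta*x + phi - y)).
    grad_phi = sum(2.0 * (predict(theta, phi0, x) - y) for x, y in data) / len(data)
    return phi0 - alpha * grad_phi


# A task whose targets are y = 2x + 3; theta is shared, phi must absorb the offset.
train = [(0.0, 3.0), (1.0, 5.0), (2.0, 7.0)]
theta = 2.0
phi = inner_update(theta, train)
```

In this toy case the loss is quadratic in `phi`, so a single step with `alpha=0.5` lands on the minimiser; with a neural network and a handful of samples, one step generally leaves a residual error, which is the underfitting that the learned loss is meant to reduce.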
3 Fast Adaptation via Backpropagating Learned Loss: VIABLE
We introduce VIABLE, a generic meta-learning extension that aims to adapt a loss function applicable to meta-gradient-based methods. During training, at each iteration, VIABLE trains an existing meta-gradient-based method (referred to as the prediction network) by performing gradient updates using the output of a differentiable learned loss function (referred to as the loss network). During the meta-update step, the meta-gradients are calculated and used to update the loss network. In this section, we consider two variants of loss networks: a simple loss network and an extension inspired by relation networks, which leverages relationships between datapoints.
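Simple Loss Network. First, we consider a simple loss network $\mathcal{L}_{\psi}$, with parameters $\psi$, which takes as input the target $y$, the prediction $\hat{y} = f_{\phi_0, \theta}(x)$, and the pre-defined task-specific loss $\mathcal{L}_{\mathcal{T}_i}(\hat{y}, y)$, and outputs a loss value. In the inner loop of the meta-gradient-based method, we replace the pre-defined task-specific loss with the output of our loss network. In this case, we replace CAVIA's inner-loop update (see (1)) with:
$$\phi_i = \phi_0 - \alpha \nabla_{\phi} \frac{1}{M^{\text{train}}} \sum_{(x, y) \in \mathcal{D}_i^{\text{train}}} \mathcal{L}_{\psi}\big(y, \hat{y}, \mathcal{L}_{\mathcal{T}_i}(\hat{y}, y)\big). \qquad (3)$$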
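The task-specific parameters $\phi$ are updated by backpropagating the learned loss through the original loss and the outputs of the prediction network. In the outer loop, we update the parameters $\psi$ of the loss network along with the task-agnostic parameters $\theta$ of the prediction network (see (2)):
$$(\theta, \psi) \leftarrow (\theta, \psi) - \beta \nabla_{(\theta, \psi)} \frac{1}{N} \sum_{\mathcal{T}_i \in \mathbf{T}} \frac{1}{M^{\text{test}}} \sum_{(x, y) \in \mathcal{D}_i^{\text{test}}} \mathcal{L}_{\mathcal{T}_i}\big(f_{\phi_i, \theta}(x), y\big), \qquad (4)$$
where the dependence on $\psi$ enters through the adapted context parameters $\phi_i$.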
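To make the two loops concrete, here is a minimal, dependency-free sketch under stated assumptions: the prediction network is a toy model `theta * x + phi`, the loss network is a single tanh hidden layer stored in a flat list `psi`, and central finite differences stand in for backpropagation purely to keep the example self-contained. All function names and numeric values are hypothetical and are not taken from the paper's implementation.

```python
import math


def predict(theta, phi, x):
    """Toy prediction network: shared slope theta plus task-specific context phi."""
    return theta * x + phi


def loss_net(psi, y, y_hat):
    """Tiny learned loss with one tanh hidden layer.

    psi is a flat list holding (w_y, w_pred, w_loss, bias, out_weight) per hidden unit.
    The inputs are the target, the prediction, and the pre-defined loss (squared error).
    """
    base = (y_hat - y) ** 2
    out = 0.0
    for h in range(0, len(psi), 5):
        w_y, w_pred, w_loss, bias, out_weight = psi[h:h + 5]
        out += out_weight * math.tanh(w_y * y + w_pred * y_hat + w_loss * base + bias)
    return out


def learned_objective(psi, theta, phi, data):
    """Learned loss averaged over the task's training samples."""
    return sum(loss_net(psi, y, predict(theta, phi, x)) for x, y in data) / len(data)


def inner_update(psi, theta, data, phi0=0.0, alpha=1.0, eps=1e-5):
    """Inner loop: one gradient step on phi, with the gradient taken on the learned loss."""
    up = learned_objective(psi, theta, phi0 + eps, data)
    down = learned_objective(psi, theta, phi0 - eps, data)
    return phi0 - alpha * (up - down) / (2.0 * eps)


def mse(theta, phi, data):
    """Pre-defined task loss, used only in the outer loop."""
    return sum((predict(theta, phi, x) - y) ** 2 for x, y in data) / len(data)


def outer_objective(params, train, test):
    """Pre-defined loss on held-out data after adapting phi with the learned loss."""
    theta, psi = params[0], params[1:]
    return mse(theta, inner_update(psi, theta, train), test)


def meta_gradient(params, train, test, eps=1e-4):
    """Meta-gradient w.r.t. [theta] + psi, differentiating through the inner step."""
    grad = []
    for k in range(len(params)):
        hi = params[:k] + [params[k] + eps] + params[k + 1:]
        lo = params[:k] + [params[k] - eps] + params[k + 1:]
        diff = outer_objective(hi, train, test) - outer_objective(lo, train, test)
        grad.append(diff / (2.0 * eps))
    return grad


# One task with targets y = 2x + 3, split into training and test samples.
train = [(-1.0, 1.0), (1.0, 5.0)]
test = [(0.0, 3.0), (2.0, 7.0)]
params = [2.0, 0.0, 0.0, 0.1, 0.0, 1.0]  # [theta, w_y, w_pred, w_loss, bias, out_weight]
grad = meta_gradient(params, train, test)
beta = 0.01  # illustrative meta-learning rate
params = [p - beta * g for p, g in zip(params, grad)]  # one joint meta-update
```

The point of the sketch is the split of roles: the inner step only ever sees the learned loss, while the pre-defined loss appears solely in the outer objective, so the loss network is trained to produce inner-loop gradients that lower the pre-defined loss after adaptation. In practice an automatic-differentiation library would replace the finite differences.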
Relation Loss Network. Note that the pre-specified loss function and the aforementioned simple loss network naively calculate an independent loss per sample and then average, ignoring any possible relationships between datapoints. For example, in the case of an outlier with a large gradient that disagrees with those of the other samples, simply averaging the gradients may negatively impact the model's performance post-update. In addition, there is substantial evidence in few-shot learning showing that incorporating relational information between samples improves predictions [11, 17, 23, 26]. Thus, we believe that loss functions can improve upon gradient-based methods by providing the prediction network with relational information between samples, especially in gradient-based methods like MAML, which treat their datapoints as independent during prediction. To show this, we introduce a relation loss network which takes as input the pairwise combinations of $x$, $y$, $\hat{y}$, and $\mathcal{L}_{\mathcal{T}_i}(\hat{y}, y)$ across samples. Thus, we replace CAVIA's inner-loop update (see (1)) with:
$$\phi_i = \phi_0 - \alpha \nabla_{\phi}\, \mathcal{L}^{\text{rel}}_{\psi}\Big(\big\{(z_m, z_n)\big\}_{m, n = 1}^{M^{\text{train}}}\Big), \qquad z_m = \big(x_m, y_m, \hat{y}_m, \mathcal{L}_{\mathcal{T}_i}(\hat{y}_m, y_m)\big). \qquad (5)$$
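The pairwise construction can be sketched as follows. This is an illustrative assumption rather than the paper's implementation: each sample is summarised by the tuple (input, target, prediction, squared error), every ordered pair of samples (including a sample paired with itself) is scored by a small hypothetical module, and the scores are averaged. The averaging, the hidden size, and the names `init_psi`, `pair_score`, and `relation_loss` are all choices made for this sketch.

```python
import math
import random
from itertools import product


def sample_features(x, y, y_hat):
    """Per-sample inputs: (x, target, prediction, pre-defined squared-error loss)."""
    return (x, y, y_hat, (y_hat - y) ** 2)


def init_psi(hidden=4, seed=0):
    """Random weights for a one-hidden-layer relation module over 8 inputs."""
    rng = random.Random(seed)
    return {
        "W": [[rng.uniform(-0.5, 0.5) for _ in range(8)] for _ in range(hidden)],
        "b": [0.0] * hidden,
        "v": [rng.uniform(-0.5, 0.5) for _ in range(hidden)],
    }


def pair_score(psi, a, b):
    """Score one ordered pair by feeding both samples' features through the module."""
    pair_input = a + b  # tuple concatenation: 4 + 4 = 8 inputs
    score = 0.0
    for w_row, bias, v in zip(psi["W"], psi["b"], psi["v"]):
        score += v * math.tanh(sum(w * z for w, z in zip(w_row, pair_input)) + bias)
    return score


def relation_loss(psi, xs, ys, y_hats):
    """Average the pair scores over all ordered pairs, self-pairs included."""
    feats = [sample_features(x, y, p) for x, y, p in zip(xs, ys, y_hats)]
    pairs = list(product(feats, repeat=2))
    return sum(pair_score(psi, a, b) for a, b in pairs) / len(pairs)


psi = init_psi()
loss_three = relation_loss(psi, [0.0, 1.0, 2.0], [0.0, 1.0, 0.0], [0.1, 0.9, 2.5])
loss_one = relation_loss(psi, [1.0], [2.0], [0.5])  # one sample paired with itself
```

Because the loss is an average over all ordered pairs, it does not depend on the order in which the samples are listed, and each sample's contribution to the gradient can depend on how it relates to the others, which a per-sample loss cannot express.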
In this section, we evaluate the benefits of replacing the existing loss function in meta-gradient-based meta-learning methods with an adapted loss trained with VIABLE. We show that: 1) a loss function that leverages relational information between samples yields a substantial increase in performance over loss functions without relational information, 2) VIABLE improves sample efficiency and reduces underfitting in a simple regression task, and 3) VIABLE is scalable, by evaluating on the Mini-Imagenet dataset. For these experiments, simVIABLE denotes VIABLE with a simple loss network applied to CAVIA, and relVIABLE denotes VIABLE with a relation loss network applied to CAVIA. Note that we do not evaluate against the learned-loss method of Chebotar et al. [5], since it is not designed for few-shot learning and thus would require more samples. We describe the specifics of our implementation in the Appendix.
We begin with the regression problem of fitting sine curves from Finn et al. [7]. A task is defined by the amplitude and phase of the sine curve, each of which is sampled uniformly from a fixed range. During training, for each task, $K$ datapoints are sampled uniformly from the input range and given to the model to perform inner-loop updates. The task-specific loss is the mean-squared-error (MSE) loss. In these experiments, we perform a single inner-loop update.
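A task sampler for this setup might look like the sketch below. The sampling ranges are not stated in this text; the defaults used here (amplitude in [0.1, 5.0], phase in [0, π], inputs in [-5, 5]) follow the commonly used sine benchmark of Finn et al. and should be read as assumptions, as should the form `amplitude * sin(x - phase)`, the value `k=10`, and the function names.

```python
import math
import random


def sample_sine_task(rng, amp_range=(0.1, 5.0), phase_range=(0.0, math.pi)):
    """Draw one task: a sine curve with random amplitude and phase (assumed ranges)."""
    return rng.uniform(*amp_range), rng.uniform(*phase_range)


def sample_points(rng, amplitude, phase, k, x_range=(-5.0, 5.0)):
    """Draw k (x, y) pairs from the task's sine curve (assumed input range)."""
    xs = [rng.uniform(*x_range) for _ in range(k)]
    return [(x, amplitude * math.sin(x - phase)) for x in xs]


rng = random.Random(0)
amplitude, phase = sample_sine_task(rng)
train = sample_points(rng, amplitude, phase, k=10)  # k is illustrative
```

Each call to `sample_sine_task` yields a new task, and `sample_points` yields the few training points that the inner loop is allowed to see for that task.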
Improved performance. Both versions of VIABLE significantly outperform CAVIA. With 2 context parameters, CAVIA achieves a loss of 0.21, simVIABLE achieves 0.14, and relVIABLE achieves 0.02, which suggests that leveraging relational information between samples can substantially improve the effectiveness of the loss function. See Appendix C.2 for the full results.
Improved data efficiency. For this experiment, we uniformly sample $K$ (the number of training sample points) during training. We observe in Table 1 that relVIABLE achieves better performance with 4 sample points than CAVIA does with 20. In Figure 2, we see that with only a single gradient update, CAVIA underfits on the 4 test points while relVIABLE fits the curve closely.
[Table 1: MSE on the sine regression task as a function of the number of sample points.]
We show that this method can scale to problems which require larger networks by testing it on the few-shot image classification benchmark Mini-Imagenet.
Setup. In Rusu et al. [17], a Wide Residual Network (WRN) [29] is trained with supervised classification on the meta-train set; the network is then frozen and feature representations of the Mini-Imagenet dataset are extracted. Following their training protocol, we use the same embeddings and meta-learn on both the meta-train and meta-validation sets, with early stopping on meta-validation.
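[Table 2: Few-shot classification accuracy on Mini-Imagenet for 5-way 1-shot and 5-way 5-shot tasks; the baselines listed include Matching Networks.]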
Results. Table 2 shows that, in the 5-way 5-shot experiments, simVIABLE offers a notable improvement over CAVIA while relVIABLE offers a substantial increase in accuracy. For both variants of VIABLE, the 5-way 1-shot results are within the confidence intervals of CAVIA. We suspect that learning a loss for 1-shot experiments does not offer a significant advantage because a single sample is all the information the model is given about a class of images. For example, there is no concept of an outlier with a single sample. In the regression experiments, Table 1 shows similar results, where the learned loss provides only minor improvements over CAVIA for a single sample point.
5 Conclusion and Future Work
We proposed VIABLE, a general-purpose meta-learning extension applicable to existing meta-gradient-based meta-learning methods. We show that learning a loss capable of leveraging relations between samples through VIABLE improves upon CAVIA by mitigating underfitting and yielding substantial improvements to sample efficiency and performance. Furthermore, we show VIABLE is scalable by evaluating on the Mini-Imagenet dataset. For future work, we are interested in applying this extension to other existing meta-learning methods such as MAML and LEO, and evaluating variants of loss networks which utilise more than just pairwise relations such as an attention network.
We thank Andrei Rusu for useful feedback on working with the LEO image embeddings [17]. This work was supported by a generous equipment grant from NVIDIA. Luisa Zintgraf is supported by the Microsoft Research PhD Scholarship Program. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713).
- [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- [2] A. Antoniou, H. Edwards, and A. Storkey. How to train your MAML. arXiv preprint arXiv:1810.09502, 2018.
- [3] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. In Fifth International Conference on Learning Representations (ICLR 2017), 2017.
- [4] H. S. Behl, A. G. Baydin, and P. H. Torr. Alpha MAML: Adaptive model-agnostic meta-learning. arXiv preprint arXiv:1905.07435, 2019.
- [5] Y. Chebotar, A. Molchanov, S. Bechtle, L. Righetti, F. Meier, and G. Sukhatme. Meta-learning via learned loss. In ICML Multi-Task and Lifelong Reinforcement Learning Workshop, 2019.
- [6] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
- [7] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017a.
- [8] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017b.
- [9] C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
- [10] R. Houthooft, Y. Chen, P. Isola, B. Stadie, F. Wolski, O. J. Ho, and P. Abbeel. Evolved policy gradients. In Advances in Neural Information Processing Systems, pages 5400–5409, 2018.
- [11] G. Koch. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.
- [12] K. Lee, S. Maji, A. Ravichandran, and S. Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
- [13] Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
- [14] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In Sixth International Conference on Learning Representations (ICLR 2018), 2018.
- [15] T. Nguyen and S. Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pages 1085–1093, 2013.
- [16] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In Fifth International Conference on Learning Representations (ICLR 2017), 2017.
- [17] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In Seventh International Conference on Learning Representations (ICLR 2019), 2019.
- [18] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4967–4976, 2017.
- [19] C. N. d. Santos, K. Wadhawan, and B. Zhou. Learning loss functions for semi-supervised learning via discriminative adversarial networks. In NeurIPS Learning with Limited Data Workshop, 2017.
- [20] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433, 2015.
- [21] Y. Song, A. Schwing, R. Urtasun, et al. Training deep neural networks via direct loss minimization. In International Conference on Machine Learning, pages 2169–2177, 2016.
- [22] F. Sung, L. Zhang, T. Xiang, T. Hospedales, and Y. Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.
- [23] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
- [24] M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: Optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 77–86. ACM, 2008.
- [25] V. Veeriah, M. Hessel, Z. Xu, J. Rajendran, R. L. Lewis, J. Oh, H. P. van Hasselt, D. Silver, and S. Singh. Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems, pages 9306–9317, 2019.
- [26] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
- [27] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. In CogSci, 2017.
- [28] L. Wu, F. Tian, Y. Xia, Y. Fan, T. Qin, L. Jian-Huang, and T.-Y. Liu. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems, pages 6466–6477, 2018.
- [29] S. Zagoruyko and N. Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.
- [30] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pages 7693–7702, 2019.
Appendix A Pseudocode
Appendix B Additional Related Work
Meta-gradient-based Methods. A common form of meta-learning is to adapt parameters in two interleaving phases that can be characterised as the task-specific updates (often referred to as the "inner loop") and the meta-updates (often referred to as the "outer loop"). At test time, on a new task, only the task-specific updates are applied. Finn et al. [7] introduce a meta-gradient-based method (MAML) that aims to learn a model initialisation that allows for fast adaptation to a new task given a few task-specific updates. Many methods that are inspired by or built on top of MAML can also be classified as meta-gradient-based [2, 4, 8, 9, 13, 30]. Another meta-gradient-based method, CAVIA [30], extends MAML by splitting the model parameters into task-specific (context) parameters and task-agnostic parameters, resulting in fewer parameters to optimise at test time. Rusu et al. [17] introduce LEO, a meta-gradient-based method that learns to produce network weights from task-specific embeddings. In this paper, we focus on CAVIA because its structure is simple and easy to optimise.
Learning a Loss Function. Specially designed loss functions have been important in improving performance on many tasks such as classification, machine translation [3, 20], ranking, and object detection. In recent years, there has been interest in exploring methods for learning a good loss function automatically in a variety of machine learning fields [1, 5, 6, 10, 19, 22, 25, 27, 28], including reinforcement learning and semi-supervised learning. In this work, we focus on meta-learning, specifically the few-shot supervised learning setting. Closely related are meta-critics [22] and the learned-loss method of Chebotar et al. [5], both of which learn a form of loss network. In contrast to these works, we are not required to learn our prediction network from scratch per task. Furthermore, VIABLE is applicable to any meta-gradient-based meta-learning technique designed for few-shot learning and, in contrast to meta-critics, we do not require adaptation of our loss network at test time. Most importantly, compared to the above methods, we propose to learn a loss function that is designed to operate on the entire dataset at once, thus leveraging relational information between datapoints. We achieve this by using a relation network that looks at pairwise combinations of datapoints. As we show in this paper, this leads to a significant improvement in performance.
Appendix C Regression
In the sine curve regression task, we follow the architecture used in the original paper for CAVIA [30] (a neural network with two hidden layers of 40 nodes each). Unless otherwise stated, we use 5 context parameters by default. In addition, a batch of 25 tasks is used per meta-update. We train for 50,000 iterations, with early stopping on a meta-validation set of 100 newly sampled tasks. During testing, we present the model with $K$ datapoints from each of 1000 newly sampled tasks and measure the MSE over 100 linearly spaced test points. In the meta-update step, the task-agnostic parameters of the prediction network are updated using the Adam optimiser with its standard learning rate, which is annealed every 5,000 steps by a constant multiplicative factor.
To allow a fair comparison, in VIABLE we use the same architecture as CAVIA for the prediction network. For both the relation loss network and the simple loss network, we use a neural network with three hidden layers of 32 nodes each. In the meta-update step, the parameters of the loss network are learned along with the task-agnostic parameters of the prediction network using the Adam optimiser with its standard learning rate, which is annealed every 5,000 steps by a constant multiplicative factor.
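The annealing schedule described above amounts to a step decay. A minimal sketch follows; the function name `annealed_lr` is hypothetical, and the base learning rate and decay factor in the usage line are illustrative placeholders, since their values are not given in this text.

```python
def annealed_lr(base_lr, gamma, step, every=5000):
    """Step decay: multiply base_lr by gamma once per `every` completed steps."""
    return base_lr * gamma ** (step // every)


# Illustrative values only: base_lr and gamma are assumptions, not the paper's settings.
schedule = [annealed_lr(1e-3, 0.9, step) for step in (0, 4999, 5000, 10000)]
```

The learning rate stays constant within each block of 5,000 meta-updates and drops by the factor `gamma` at each block boundary.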
Both VIABLE and CAVIA are trained with a single inner-loop gradient step with an inner loop learning rate of 1.0.
C.2 Additional Results
Number of Context Parameters
|Method|1|2|3|4|5|
|MAML|0.29 (0.02)|0.24 (0.02)|0.24 (0.02)|0.23 (0.02)|0.23 (0.02)|
|CAVIA|0.84 (0.06)|0.21 (0.02)|0.20 (0.02)|0.19 (0.02)|0.19 (0.02)|
|simVIABLE|0.75 (0.05)|0.14 (0.01)|0.15 (0.01)|0.14 (0.01)|0.16 (0.01)|
|relVIABLE|0.57 (0.05)|0.02 (0.00)|0.04 (0.00)|0.03 (0.00)|0.01 (0.00)|
Appendix D Classification
D.1 Problem Setting
In $N$-way $K$-shot classification, a task is a random selection of $N$ classes. The model gets to see $K$ examples per class, from which it is expected to learn to classify unseen images from the $N$ classes. The Mini-Imagenet dataset is divided into training, validation, and test metasets with 64, 16, and 20 classes respectively, with 600 images per class. We use an open-source dataset of Mini-Imagenet embeddings made available by Rusu et al. [17]. The embeddings are each of size 640.
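The task construction can be sketched as below, assuming the embeddings are held in a dictionary keyed by class name. The function `sample_episode`, the `n_query` parameter, and the toy data are hypothetical and serve only to illustrate how one N-way K-shot task is assembled; they are not part of the paper's code.

```python
import random


def sample_episode(embeddings_by_class, n_way, k_shot, n_query, rng):
    """Build one N-way K-shot task from a dict {class_name: [embedding, ...]}."""
    classes = rng.sample(sorted(embeddings_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint support and query examples for this class.
        items = rng.sample(embeddings_by_class[cls], k_shot + n_query)
        support += [(emb, label) for emb in items[:k_shot]]
        query += [(emb, label) for emb in items[k_shot:]]
    return support, query


# Toy stand-in data: 6 classes with 20 two-dimensional "embeddings" each.
rng = random.Random(0)
data = {f"class_{c}": [(float(c), float(i)) for i in range(20)] for c in range(6)}
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=3, rng=rng)
```

The support set plays the role of the task's training data for the inner loop, and the query set plays the role of its test data for the meta-update; class labels are re-indexed from 0 to N-1 within each task.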
D.2 Model Details
In CAVIA, our model uses a single hidden layer of size 800 and 100 context parameters. To ensure fairness, we use the same architecture for the prediction network in VIABLE. In simVIABLE, our loss network consists of two hidden layers of 64 nodes each, and in relVIABLE, it consists of two hidden layers of 1500 nodes each. Both VIABLE and CAVIA are trained with two inner-loop gradient steps and an inner-loop learning rate of 1.0. In the meta-update step, VIABLE (prediction network and loss network) and CAVIA are both trained using the Adam optimiser with its standard learning rate, which is also annealed every 5,000 steps by a constant multiplicative factor.
D.3 Further Experiments
We perform an additional experiment that evaluates the ability of CAVIA and VIABLE to generalise to a different number of shots than seen during training. In this experiment, we train on 5-way 5-shot tasks and evaluate on 5-way k-shot tasks, where k varies from 1 to 9. Table 4 shows that both variants of VIABLE significantly outperform CAVIA in generalising at test time to tasks which have a different amount of data than during meta-training. In the case of $k = 1$, the relation loss network calculates a loss using the same input in a pair with itself.
[Table 4: Accuracy on 5-way k-shot tasks for k from 1 to 9, for models trained on 5-way 5-shot tasks.]