Abstract
Model distillation aims to distill the knowledge of a complex model into asimpler one. In this paper, we consider an alternative formulation called {\emdataset distillation}: we keep the model fixed and instead attempt to distillthe knowledge from a large training dataset into a small one. The idea is to{\em synthesize} a small number of data points that do not need to come fromthe correct data distribution, but will, when given to the learning algorithmas training data, approximate the model trained on the original data. Forexample, we show that it is possible to compress $60,000$ MNIST training imagesinto just $10$ synthetic {\em distilled images} (one per class) and achieveclose to original performance with only a few steps of gradient descent, givena particular fixed network initialization. We evaluate our method in a widerange of initialization settings and with different learning objectives.Experiments on multiple datasets show the advantage of our approach compared toalternative methods in most settings.
Quick Read (beta)
Dataset Distillation
Abstract
Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called dataset distillation: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress $60,000$ MNIST training images into just $10$ synthetic distilled images (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. We evaluate our method in a wide range of initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods in most settings.
Input: \algrenewcommand\algorithmicensureOutput:
Dataset Distillation
Tongzhou Wang 
Facebook AI Research 
[email protected] 
JunYan Zhu 
Massachusetts Institute of Technology 
[email protected] 
Antonio Torralba 
Massachusetts Institute of Technology 
[email protected] 
Alexei A. Efros 
University of California, Berkeley 
[email protected] 
1 Introduction
Hinton et al. (2015) proposed network distillation as a way to transfer the knowledge from an ensemble of many separatelytrained networks into a single, typically compact network, performing a type of model compression. In this paper, we are considering a related but orthogonal task: rather than distilling the model, we propose to distill the dataset. Unlike network distillation, we keep the model fixed but encapsulate the knowledge of the entire training dataset, which typically contains thousands to millions of images, into a small number of synthetic training images. In fact, we show that we can go as low as one synthetic image per category, training the same model to reach surprisingly good performance on these synthetic images. For example in Fig. 1a, we compress $60,000$ training images of MNIST digit dataset into only $10$ synthetic images (one per class), given a fixed network initialization. Training the standard LeNet (LeCun et al., 1998) architecture on these $10$ images yields testtime MNIST recognition performance of $94\%$, compared to $99\%$ for the original task. For networks with unknown random weights, $100$ synthetic images train to $80\%$ with a few gradient descent steps. We name our method Dataset Distillation and these images distilled images.
But why is dataset distillation useful? There is the purely scientific question of how much data is really encoded in a given training set and how compressible it is? Moreover, given a few distilled images, we can now “load up" a given network with an entire datasetworth of knowledge much more efficiently, compared to traditional training that often uses tens of thousands of gradient descent steps.
A key question is whether it is even possible to compress a dataset into a small set of synthetic data samples. For example, is it possible to train an image classification model on synthetic images that are not on the natural image manifold? Conventional wisdom would suggest that the answer is no, as the synthetic training data may not follow the same distribution as the real test data. Yet, in this work, we show that this is indeed possible. We present a new optimization algorithm for synthesizing a small number of synthetic data samples not only capturing much of the original training data but also tailored explicitly for fast model training in only a few gradient steps. To achieve our goal, we first derive the network weights as a differentiable function of our synthetic training data. Given this connection, instead of optimizing the network weights for a particular training objective, we can optimize the pixel values of our distilled images. However, this formulation requires access to the initial network weights of the network. To relax this assumption, we develop a method for generating distilled images for networks with random initializations from a certain distribution. To further boost performance, we propose an iterative version, where we obtain a sequence of distilled images to train a model and each distilled image can be trained with multiple passes. Finally, we study the case of a simple linear model, deriving a lower bound on the size of distilled data required to achieve the same performance as training on the full dataset.
We demonstrate that a handful of distilled images can be used to train a model with a fixed initialization to achieve surprisingly high performance. For a network with unknown random weights pretrained on other tasks, our method can still find distilled images for fast model finetuning. We further test our method on a wide range of initialization settings: fixed initialization, random initialization, fixed pretrained weights, and random pretrained weights, as well as two training objectives: image classification and malicious dataset poisoning attack. Extensive experiments on four publicly available datasets, MNIST (LeCun, 1998), CIFAR10 (Krizhevsky & Hinton, 2009), PASCALVOC (Everingham et al., 2010) and CUB200 (Wah et al., 2011), show that our method often performs better than alternative methods and existing baselines. Our code and models will be available upon publication.
2 Related Work
Knowledge distillation. The main inspiration for this paper is network distillation (Hinton et al., 2015), a widely used technique in ensemble learning (Radosavovic et al., 2018) and model compression (Ba & Caruana, 2014; Romero et al., 2015; Howard et al., 2017). While network distillation aims to distill the knowledge of multiple networks into a single model, our goal is to compress the knowledge of an entire dataset into a few synthetic training images. Our method is also related to the theoretical concept of teaching dimension, which specifies the size of dataset necessary to teach a target model (oracle) to a learner (Goldman & Kearns, 1995; Shinohara & Miyano, 1991). While these methods do not enforce the training data to be real, they need the existence of oracle models, which our method does not require.
Dataset pruning, coreset construction, and instance selection. Another way to distill knowledge is to summarize the entire dataset by a small subset, either by only using the “valuable” data for model training (Angelova et al., 2005; Lapedriza et al., 2013; Felzenszwalb et al., 2010) or by only labeling the “valuable” data via active learning (Cohn et al., 1996; Tong & Koller, 2001). Similarly, coreset construction (Bachem et al., 2017; Tsang et al., 2005; HarPeled & Kushal, 2007; Sener & Savarese, 2018) and instance selection (OlveraLópez et al., 2010) methods aim to select a subset of the entire training data, such that models trained on the subset will perform as closely well as possible to the model trained on full dataset for faster training time. For example, solutions to many classical linear learning algorithms, e.g., Perceptron (Rosenblatt, 1957) and support vector machine (SVMs) (Hearst et al., 1998), are weighted sums of a subset of training examples, which can be viewed as coresets. However, algorithms constructing these subsets require many more training examples per category than we do, in part because their “valuable” images have to be real, whereas our distilled images are exempt from this constraint.
Gradientbased hyperparameter optimization. Our work bears similarity with the gradientbased hyperparameter optimization techniques, which compute the gradient of hyperparameter w.r.t. the final validation loss by reversing the entire training procedure (Bengio, 2000; Domke, 2012; Pedregosa, 2016; Maclaurin et al., 2015). We also backpropagate errors through optimization steps. However, we use only training set data and focus much more heavily on learning synthetic training data rather than tuning hyperparameters. To our knowledge, this direction has only been slightly touched on previously (Maclaurin et al., 2015). We explore it in much greater depth and demonstrate the idea of dataset distillation through various settings. More crucially, our distilled images can work well across random initialization weights, which cannot be achieved by any prior work.
Understanding datasets. Researchers have presented various approaches for understanding and visualizing learned models (Zeiler & Fergus, 2014; Zhou et al., 2015; Mahendran & Vedaldi, 2015; Bau et al., 2017; Koh & Liang, 2017). Unlike these approaches, we are interested in understanding the intrinsic properties of the training data rather than a specific trained model. Analyzing training datasets has, in the past, been mainly focused on the investigation of bias in datasets (Ponce et al., 2006; Torralba & Efros, 2011). For example, Torralba & Efros (2011) proposed to quantify the “value” of dataset samples using crossdataset generalization. Our method offers a new perspective for understanding datasets by distilling full datasets into few synthetic samples.
3 Approach
Given a model and a dataset, we aim to obtain a new, muchreduced synthetic dataset which performs almost as well as the original dataset. We first present our main optimization algorithm for training a network with a fixed initialization with one gradient descent (GD) step (Sec. 3.1). In Sec. 3.2, we derive the resolution to a more challenging case, where the initial weight is random rather than fixed. We also discuss the initial weights distribution where our method can work well. Furthermore, we study a linear network case to help the readers understand both the solution and limits of our method in Sec. 3.3. In Sec. 3.4, we extend our approach to more than one gradient descent steps and more than one passes. Finally, Sec. 3.5 and Sec. 3.6 demonstrate how to obtain distilled images with different initialization distributions and learning objectives.
Consider a training dataset $\mathbf{x}={\{{x}_{i}\}}_{i=1}^{N}$. We parameterize our neural network as $\theta $ and denote $\mathrm{\ell}({x}_{i},\theta )$ as the loss function that represents the loss of this network on a data point ${x}_{i}$. Our task is to find the minimizer of the empirical error over the entire training data:
$${\theta}^{*}=\underset{\theta}{\mathrm{arg}\mathrm{min}}\frac{1}{N}\sum _{i=1}^{N}\mathrm{\ell}({x}_{i},\theta )=\underset{\theta}{\mathrm{arg}\mathrm{min}}\mathrm{\ell}(\mathbf{x},\theta ),$$  (1) 
where for notation simplicity we overload the $\mathrm{\ell}(\cdot )$ notation so that $\mathrm{\ell}(\mathbf{x},\theta )$ represents the average error of $\theta $ over the entire dataset $\mathbf{x}={\{{x}_{i}\}}_{i=1}^{N}$. We make the mild assumption that $\mathrm{\ell}$ is twicedifferentiable, which holds for the majority of modern machine learning models (e.g., most neural networks) and tasks.
3.1 Optimizing Distilled Data
Standard training usually applies minibatch stochastic gradient descent (SGD) or its variants. At each step $t$, we sample a minibatch of training data ${\mathbf{x}}_{t}={\{{x}_{t,j}\}}_{j=1}^{n}$ and update the current parameters as
${\theta}_{t+1}$  $={\theta}_{t}\eta {\nabla}_{{\theta}_{t}}\mathrm{\ell}({\mathbf{x}}_{t},{\theta}_{t}),$ 
where $\eta $ is the learning rate. Such a training process often takes tens of thousands or even millions of above update steps to converge. Instead, we aim to learn a tiny set of synthetic distilled training data $\stackrel{~}{\mathbf{x}}={\{{\stackrel{~}{x}}_{i}\}}_{i=1}^{M}$ with $M\ll N$ and a corresponding learning rate $\stackrel{~}{\eta}$ so that a single GD step like
$${\theta}_{1}={\theta}_{0}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0})$$  (2) 
using these learned synthetic data $\stackrel{~}{\mathbf{x}}$ greatly boosts performance on the real training dataset.
Given an initialization ${\theta}_{0}$, we obtain these synthetic data and $\stackrel{~}{\eta}$ that minimize the objective below $\mathcal{L}$:
${\stackrel{~}{\mathbf{x}}}^{*},{\stackrel{~}{\eta}}^{*}=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}\mathcal{L}(\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta};{\theta}_{0})=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}\mathrm{\ell}(\mathbf{x},{\theta}_{1})=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}\mathrm{\ell}(\mathbf{x},{\theta}_{0}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0})),$  (3) 
where we derive the new weights ${\theta}_{1}$ as a function of distilled images $\stackrel{~}{\mathbf{x}}$ and learning rate $\stackrel{~}{\eta}$ using Eqn. 2 and then evaluate the new weights over all the training images $\mathbf{x}$. Note that the loss $\mathcal{L}(\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta};{\theta}_{0})$ is differentiable w.r.t. $\stackrel{~}{\mathbf{x}}$ and $\stackrel{~}{\eta}$, and can thus be optimized using standard gradientbased algorithms. In many classification tasks, the data $\mathbf{x}$ may contain discrete parts, e.g., the class labels in datalabel pairs. For such cases, we fix the discrete parts rather than learn them.
3.2 Distillation for Random Initializations
Unfortunately, the above distilled data optimized for a given initialization do not generalize well to other initialization weights. The distilled data often look like random noise (e.g., in Fig. 1(a)) as it encodes the information of both training dataset $\mathbf{x}$ and a particular network initialization ${\theta}_{0}$. To address the above issue, we turn to calculate a small number of distilled data that can work for networks with random initializations from a specific distribution. We formulate the optimization problem as follows:
$${\stackrel{~}{\mathbf{x}}}^{*},{\stackrel{~}{\eta}}^{*}=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}{\mathbb{E}}_{{\theta}_{0}\sim p({\theta}_{0})}\mathcal{L}(\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta};{\theta}_{0}),$$  (4) 
where ${\theta}_{0}$ is a randomly sampled network initialization from the distribution $p({\theta}_{0})$. Algorithm 3.2 illustrates our main method. During optimization, the distilled data are optimized to work well for multiple networks whose initial weights are sampled from $p({\theta}_{0})$. In practice, we observe that the final distilled data generalize well to the unseen initializations. Besides, these distilled images usually look quite informative, encoding the discriminative features of each category (Fig. 3).
For distilled data to be properly learned, it turns out to be crucial for $\mathrm{\ell}(\mathbf{x},\cdot )$ to share similar local conditions (e.g., output values, gradient magnitudes) over ${\theta}_{0}$ sampled from $p({\theta}_{0})$. In the next section, we derive a lower bound on the number of distilled data needed for a simple model with arbitrary initial ${\theta}_{0}$, and discuss its implications on choosing $p({\theta}_{0})$.
[t] {algorithmic}[1] \Require$p({\theta}_{0})$: distribution of initial weights; $M$: the number of distilled data \Require$\alpha $: step size; $n$: batch size; $T$: the number of optimization iterations; ${\stackrel{~}{\eta}}_{0}$: initial value for $\stackrel{~}{\eta}$ \StateInitialize $\stackrel{~}{\mathbf{x}}={\{{\stackrel{~}{x}}_{i}\}}_{i=1}^{M}$ randomly, $\stackrel{~}{\eta}\leftarrow {\stackrel{~}{\eta}}_{0}$ \Foreach training step $t=1$ to $T$ \StateGet a minibatch of real data ${\mathbf{x}}_{t}={\{{x}_{t,j}\}}_{j=1}^{n}$ \StateSample a batch of initial weights ${\theta}_{0}^{(j)}\sim p({\theta}_{0})$ \Foreach sampled ${\theta}_{0}^{(j)}$ \StateCompute updated parameter with GD: ${\theta}_{1}^{(j)}={\theta}_{0}^{(j)}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}^{(j)}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0}^{(j)})$ \StateEvaluate the objective function on real data: ${\mathcal{L}}^{(j)}=\mathrm{\ell}({\mathbf{x}}_{t},{\theta}_{1}^{(j)})$ \EndFor\StateUpdate $\stackrel{~}{\mathbf{x}}\leftarrow \stackrel{~}{\mathbf{x}}\alpha {\nabla}_{\stackrel{~}{\mathbf{x}}}{\sum}_{j}{\mathcal{L}}^{(j)}$, and $\stackrel{~}{\eta}\leftarrow \stackrel{~}{\eta}\alpha {\nabla}_{\stackrel{~}{\eta}}{\sum}_{j}{\mathcal{L}}^{(j)}$ \EndFor\Ensuredistilled data $\stackrel{~}{\mathbf{x}}$ and the optimized learning rate $\stackrel{~}{\eta}$
3.3 Analysis of a Simple Linear Case with Quadratic Loss
This section studies our formulation in a simple linear regression case. We derive the lower bound of the number of distilled images needed to achieve the same performance as training on full dataset for arbitrary initialization with one GD step. Consider a dataset $\mathbf{x}$ containing $N$ datatarget pairs ${\{({d}_{i},{t}_{i})\}}_{i=1}^{N}$, where ${d}_{i}\in {\mathbb{R}}^{D}$ and ${t}_{i}\in \mathbb{R}$, which we represent as two matrices: an $N\times D$ data matrix $\mathbf{d}$ and an $N\times 1$ target matrix $\mathbf{t}$. Given the mean squared error and a $D\times 1$ weight matrix $\theta $, we have
$$\mathrm{\ell}(\mathbf{x},\theta )=\mathrm{\ell}((\mathbf{d},\mathbf{t}),\theta )=\frac{1}{2N}{\parallel \mathbf{d}\theta \mathbf{t}\parallel}^{2}.$$  (5) 
We aim to learn $M$ synthetic datatarget pairs $\stackrel{~}{\mathbf{x}}=(\stackrel{~}{\mathbf{d}},\stackrel{~}{\mathbf{t}})$, where $\stackrel{~}{\mathbf{d}}$ is an $M\times D$ matrix, $\stackrel{~}{\mathbf{t}}$ an $M\times 1$ matrix ($M\ll N$), and $\stackrel{~}{\eta}$ the learning rate, to minimize $\mathrm{\ell}(\mathbf{x},{\theta}_{0}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0}))$. The updated weight matrix after one GD step with these distilled data is
$${\theta}_{1}={\theta}_{0}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0})={\theta}_{0}\frac{\stackrel{~}{\eta}}{M}{\stackrel{~}{\mathbf{d}}}^{T}(\stackrel{~}{\mathbf{d}}{\theta}_{0}\stackrel{~}{\mathbf{t}})=(\mathbf{I}\frac{\stackrel{~}{\eta}}{M}{\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{d}}){\theta}_{0}+\frac{\stackrel{~}{\eta}}{M}{\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{t}}.$$  (6) 
Note that for such quadratic loss, there always exists some learned distilled data $\stackrel{~}{\mathbf{x}}$ allowing us to achieve the same performance as training on full dataset $\mathbf{x}$ (i.e., attaining the global minimum) for any initialization ${\theta}_{0}$.^{*}^{*} * One choice is to pick any global minimum ${\theta}^{*}$, and choose $\stackrel{~}{\mathbf{d}}=N\cdot \mathbf{I}$ and $\stackrel{~}{\mathbf{t}}=N\cdot {\theta}^{*}$. But how small can $M$, the size of distilled data, be? For such models, the global minimum is attained at any ${\theta}^{*}$ satisfying ${\mathbf{d}}^{T}\mathbf{d}{\theta}^{*}={\mathbf{d}}^{T}\mathbf{t}$. Substituting Eqn. (6) in, we have
$${\mathbf{d}}^{T}\mathbf{d}(\mathbf{I}\frac{\stackrel{~}{\eta}}{M}{\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{d}}){\theta}_{0}+\frac{\stackrel{~}{\eta}}{M}{\mathbf{d}}^{T}\mathbf{d}{\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{t}}={\mathbf{d}}^{T}\mathbf{t}.$$  (7) 
Here we make the mild assumption that the feature columns of the data matrix $\mathbf{d}$ are independent (i.e., ${\mathbf{d}}^{T}\mathbf{d}$ has full rank). For a $\stackrel{~}{\mathbf{x}}=(\stackrel{~}{\mathbf{d}},\stackrel{~}{\mathbf{t}})$ to satisfy the above equation for any ${\theta}_{0}$, we must have
$$\mathbf{I}\frac{\stackrel{~}{\eta}}{M}{\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{d}}=\mathrm{\U0001d7ce},$$  (8) 
which implies that ${\stackrel{~}{\mathbf{d}}}^{T}\stackrel{~}{\mathbf{d}}$ has full rank and $M\ge D$.
Discussion. The analysis considers only a simple case but suggests that any small number of distilled data fails to generalize to arbitrary starting ${\theta}_{0}$. This is intuitively expected as the optimization target $\mathrm{\ell}(\mathbf{x},{\theta}_{1})=\mathrm{\ell}(\mathbf{x},{\theta}_{0}\stackrel{~}{\eta}{\nabla}_{{\theta}_{0}}\mathrm{\ell}(\stackrel{~}{\mathbf{x}},{\theta}_{0}))$ depends on the local behavior of $\mathrm{\ell}(\mathbf{x},\cdot )$ around ${\theta}_{0}$, which can be drastically different across various ${\theta}_{0}$ values. We note that the lower bound $M\ge D$ is a quite restricting one, considering that real datasets often have thousands to even hundreds of thousands of dimensions (e.g., image classification). This analysis motivates us to focus on $p({\theta}_{0})$ distributions that yield similar local conditions over the support. Sec. 3.5 discusses several practical choices explored in this paper. Additionally, to address the limitation of using a single GD step, we extend our method to multiple GD steps in the next section. In Sec. 4.1, we empirically verify that using multiple steps is much more effective than using just one on deep convolutional networks, with the total amount of distilled data fixed.
3.4 Multiple Gradient Descent Steps and Multiple Epochs
We can extend Algorithm 3.2 to more than one gradient descent steps by changing Line 3.2 to multiple sequential GD steps each on a different batch of distilled data and learning rate, i.e., each step $i$ is
$${\theta}_{i+1}={\theta}_{i}{\stackrel{~}{\eta}}_{i}{\nabla}_{{\theta}_{i}}\mathrm{\ell}({\stackrel{~}{\mathbf{x}}}_{i},{\theta}_{i}),$$  (9) 
and changing Line 3.2 to backpropagate through all steps. However, naively computing gradients is both memoryintensive and computationallyexpensive. Therefore, we exploit a recent technique called backgradient optimization, which allows for significantly faster gradient calculation of such updates in reversemode differentiation (i.e., backpropagation). Specifically, backgradient optimization formulates the necessary second order terms into efficient Hessianvector products (Pearlmutter, 1994), which can be easily calculated with modern automatic differentiation systems such as PyTorch (Paszke et al., 2017). For further algorithm details in this aspect, we refer readers to prior work (Domke, 2012; Maclaurin et al., 2015).
Multiple epochs. To further improve the performance, we can train the network with the same distilled images for multiple epochs (passes) of the GD step(s). In particular, we tie the image pixels for the same distilled images used in different epochs. In other words, for each epoch, our method cycles through all GD steps, where each step is associated with a different batch of distilled data. We do not tie the trained learning rates across epochs as later epochs often use smaller learning rates.
3.5 Distillation with Different Initializations
Inspired by the analysis of the simple linear case in Sec. 3.3, we aim to focus on initial weights distributions $p(\theta )$ that yield similar local conditions over the support. In this work, we focus on the following four practical choices:
 •

•
Fixed initialization: A fixed initial weights sampled using the method above.
 •

•
Fixed pretrained weights: A fixed model weights pretrained on other tasks and datasets.
Distillation for pretrained weights. Such learned distilled data essentially finetunes weights pretrained on one task to perform well for a new task, thus bridging the gap between two domains. Domain mismatch and dataset bias represent a challenging problem in machine learning today (Torralba & Efros, 2011). Extensive prior work has been proposed to adapt models to new tasks and datasets (Daume III, 2007; Saenko et al., 2010). In this work, we characterize the domain mismatch via distilled data. In Sec. 4.2, we show that a very small number of distilled images are sufficient to quickly adapt CNN models to new classification tasks.
3.6 Distillation with Different Objectives
Previous sections show that we can train distilled data to minimize the loss of the distilled task $\mathrm{\ell}(\mathbf{x},{\theta}_{1})$ defined on the final updated weights ${\theta}_{1}$ (Line 3.2 in Algorithm 3.2). Distilled images trained with different final learning objectives can train models to exhibit different desired behaviours. We have already mentioned image classification as one of the applications, where distilled images help train accurate classifiers. Below, we introduce a quite different training objective to further demonstrate the flexibility of our method.
Distillation for a malicious datapoisoning objective. For example, our approach can be used to construct a new form of data poisoning attack. To illustrate this idea, we consider the following scenario. When a single GD step is applied with our synthetic adversarial data, a wellbehaved image classifier catastrophically forgets a category but still maintains high performance on other categories.
Formally, given an attacked category $K$ and a target category $T$, we want the classifier to misclassify images from category $K$ to category $T$. To achieve this, we consider a new final objective function ${\mathrm{\ell}}_{K\to T}(\mathbf{x},{\theta}_{1})$, which is a classification loss encouraging ${\theta}_{1}$ to classify category $K$ images mistakenly as category $T$ while correctly predicting other images, e.g., a cross entropy loss with target labels of $K$ modified to $T$. Then, the attacking distilled images can be obtained via optimizing
$${\stackrel{~}{\mathbf{x}}}^{*},{\stackrel{~}{\eta}}^{*}=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}{\mathbb{E}}_{{\theta}_{0}\sim p({\theta}_{0})}{\mathcal{L}}_{K\to T}(\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta};{\theta}_{0})=\underset{\stackrel{~}{\mathbf{x}},\stackrel{~}{\eta}}{\mathrm{arg}\mathrm{min}}{\mathbb{E}}_{{\theta}_{0}\sim p({\theta}_{0})}{\mathrm{\ell}}_{K\to T}(\mathbf{x},{\theta}_{1}),$$  (10) 
where $p({\theta}_{0})$ is the distribution over random pretrained weights of welloptimized classifiers.
Compared to prior data poisoning attacks (Biggio et al., 2012; Li et al., 2016; MuñozGonzález et al., 2017; Koh & Liang, 2017), our approach crucially does not require the poisoned training data to be stored and trained on repeatedly. Instead, our method attacks the model training just in one iteration and with only a few data. This advantage makes our method effective for many online training algorithms and useful for the case where malicious users hijack the data feeding pipeline for only one gradient step (e.g., one network transmission). In Sec. 4.2, we show that a single batch of distilled data applied in one step can successfully attack welloptimized neural network models. This setting can be viewed as distilling dataset knowledge of a specific category into data.
4 Experiments
We report image classification results on MNIST (LeCun, 1998) and CIFAR10 (Krizhevsky & Hinton, 2009). For MNIST, distilled images are trained with LeNet (LeCun et al., 1998), which achieves about $99\%$ test accuracy if fully trained. For CIFAR10, we use a network architecture following Krizhevsky (2012) which achieves around $80\%$ test accuracy if fully trained. For random initializations and random pretrained weights, we report means and standard deviations on $200$ heldout models, unless otherwise specified.
Baselines. For each experiment, in addition to baselines specific to the setting, we generally compare our method against baselines trained with data derived or selected from real images:

•
Random real images: We randomly sample the same number of real training images per category.

•
Optimized real images: We sample sets of real images as above, and choose on the top $20\%$ sets that perform the best training images.

•
$k$means: For each category, we use $k$means to extract the same number of cluster centroids as the number of distilled images in our method.

•
Average real images: We compute the average image of all the images in each category, which is reused in different GD steps.
For these baselines, we perform each evaluation on $200$ holdout models with all combinations of $\text{learning rate}\in \{\text{learned learning rate with our method},0.001,0.003,0.01,0.03,0.1,0.3\}$ and $\text{\#epochs}\in \{1,3,5\}$. We report results from the best performing combination. We run all the experiments on NVIDIA Titan Xp and V100 GPUs. We use one GPU for fixed initial weights and four GPUs for random initial weights. Each training typically takes $1$ to $4$ hours. Please see supplemental material Sec. S6.1 for more training and baseline details.
4.1 Dataset Distillation
Fixed initialization. With access to initial network weights, distilled images can directly train a particular network to reach high performance. For example, $10$ learned distilled images can boost the test accuracy of a neural network with an initial accuracy $\text{FPset}\text{a}12.9\text{FPround}\text{a}\text{a}2\text{a}\%$ to the final accuracy $\text{FPset}\text{a}93.76\text{FPround}\text{a}\text{a}2\text{a}\%$ on MNIST (Fig. 1(a)). Similarly, $100$ images can train a network with an initial accuracy $\text{FPset}\text{a}8.82\text{FPround}\text{a}\text{a}2\text{a}\%$ to $\text{FPset}\text{a}54.03\text{FPround}\text{a}\text{a}2\text{a}\%$ test accuracy on CIFAR10 (Fig. 1(b)). This result suggests that even only a few distilled images have enough capacity to distill part of the dataset.
Random initialization. Trained with randomly sampled initializations using Xavier initialization (Glorot & Bengio, 2010), the learned distilled images do not need to encode information tailored for a particular starting point and thus can represent meaningful content independent of network initializations. In Fig. 3, we see that such distilled images reveal discriminative features of the corresponding categories: e.g., the ship image in Fig. 2(b). These $100$ images can train randomly initialized networks to $\text{FPset}\text{a}36.79\text{FPround}\text{a}\text{a}2\text{a}\%$ average test accuracy on CIFAR10. Similarly, for MNIST, the $100$ distilled images shown in Fig. 2(a) can train randomly initialized networks to $\text{FPset}\text{a}79.50\text{FPround}\text{a}\text{a}2\text{a}\%$ test accuracy.
Multiple gradient descent steps and multiple epochs. In Fig. 3, we learn distilled images for $10$ GD steps applied in $3$ epochs, leading to a total of $100$ images (with each step containing one image per category). In each epoch, these $10$ steps are sequentially applied once. The early steps tend to look noisier, likely regularizing random weights to point easier for further optimization. In later steps, the images gradually look like real data and share the discriminative features for these categories. Fig. 3(a) shows that using more steps significantly improves the results. Fig. 3(b) shows a similar but slower trend as the number of epochs increases. We observe that longer training (i.e., more epochs) can help the model learn all the knowledge from the distilled images, but the performance is eventually limited by the capacity of the images (i.e., the number of total images). Alternatively, we can train the model with one GD step but a big batch size. Sec. 3.3 has shown theoretical limitations of using only one step in a simple linear case. In Fig. 5, we empirically verify that with convolutional networks, using multiple steps drastically outperforms single step method, with the same number of distilled images.
Ours  Baselines  
Fixed init.  Random init.  Used as training data in same number of GD steps  Used as data for $\U0001d67a$NN classification  
Random real  Optimized real  $k$means  Average real  Random real  $k$means  
MNIST  $\text{FPset}\text{a}\mathbf{96.62}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}79.50\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}8.08\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}68.55\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}9.78\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}73.01\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}7.63\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}76.43\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}9.51\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}77.09\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.70\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}71.53\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.06\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}\mathbf{92.19}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{0.14}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$ 
CIFAR10  $\text{FPset}\text{a}\mathbf{54.03}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}\mathbf{36.79}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{1.18}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}21.30\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.47\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}23.40\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.33\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}22.48\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}3.09\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}22.34\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.65\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}18.82\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.28\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}29.42\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.28\text{FPround}\text{a}\text{a}1\text{a}\%$ 
Table 1 compares our method against several baselines. Our method with both fixed and random initialization outperform all the baselines on CIFAR10 and most of the baselines on MNIST.
4.2 Distillation for Different Initializations and Objectives
Next, we show two extended settings of our main algorithm discussed in Sec. 3.5 and Sec. 3.6. Both cases assume that the initial weights are random but pretrained on a different dataset. We train the distilled images on $2000$ random pretrained models, and then apply them on unseen models.
Fixed and random pretrained weights on digits. As shown in Sec. 3.5, we can optimize distilled images to quickly finetune pretrained models for a new dataset. Table 3 shows that our method is more effective compared to various baseline on adaptation among three digits datasets: MNIST, USPS (Hull, 1994), and SVHN (Netzer et al., 2011). We also compared against a stateoftheart fewshort supervised domain adaptation method (Motiian et al., 2017). Although our method uses the entire training set to compute the distilled images, both methods use the same number of images to distill the knowledge of target dataset. Prior work (Motiian et al., 2017) is outperformed by our method with fixed pretrained weights on all the tasks, and by our method with random pretrained weights on two of the three tasks. This result shows that our distilled images indeed convey compressed information of the full dataset.
Ours with fixed pretrained  Ours with random pretrained  Random real  Optimized real  $k$means  Average real  Domain adaptation Motiian et al. (2017)  No adaptation  Train on full destination training set  
$\mathcal{M}\to \mathcal{U}$  $\text{FPset}\text{a}\mathbf{97.90}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}\mathbf{95.38}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{1.81}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}94.89\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.80\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}95.16\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.69\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}92.18\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.64\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}93.89\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.83\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}\mathbf{96.65}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{0.45}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}90.43\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.97\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}97.32\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.27\text{FPround}\text{a}\text{a}1\text{a}\%$ 
$\mathcal{U}\to \mathcal{M}$  $\text{FPset}\text{a}\mathbf{93.19}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}\mathbf{92.74}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{1.38}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}87.05\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.88\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}87.59\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.14\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}85.62\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}3.13\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}78.42\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}4.97\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}89.15\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.37\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}67.54\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}3.91\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}98.60\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.53\text{FPround}\text{a}\text{a}1\text{a}\%$ 
$\mathcal{S}\to \mathcal{M}$  $\text{FPset}\text{a}\mathbf{96.15}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}85.21\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}4.73\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}84.63\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.13\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}85.19\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.19\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}\mathbf{85.75}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%\pm \text{FPset}\text{a}\mathbf{1.20}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7cf}\text{a}\%$  $\text{FPset}\text{a}74.89\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.60\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}74.03\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}1.50\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}51.64\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}2.77\text{FPround}\text{a}\text{a}1\text{a}\%$  $\text{FPset}\text{a}98.60\text{FPround}\text{a}\text{a}1\text{a}\%\pm \text{FPset}\text{a}0.53\text{FPround}\text{a}\text{a}1\text{a}\%$ 
Destination dataset  Ours  Random real  Optimized real  Average real  Finetune on full destination training set 
PASCALVOC  $\text{FPset}\text{a}\mathbf{70.75}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7d0}\text{a}\%$  $\text{FPset}\text{a}19.41\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}3.73\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}23.82\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}3.66\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}9.94\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}75.57\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}0.18\text{FPround}\text{a}\text{a}2\text{a}\%$ 
CUB200  $\text{FPset}\text{a}\mathbf{38.76}\text{FPround}\text{a}\text{a}\mathrm{\U0001d7d0}\text{a}\%$  $\text{FPset}\text{a}7.11\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}0.66\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}7.23\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}0.78\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}2.88\text{FPround}\text{a}\text{a}2\text{a}\%$  $\text{FPset}\text{a}41.21\text{FPround}\text{a}\text{a}2\text{a}\%\pm \text{FPset}\text{a}0.51\text{FPround}\text{a}\text{a}2\text{a}\%$ 
Fixed pretrained weights on ImageNet. In Table 3, we adapt a widelyused AlexNet model (Krizhevsky, 2014) pretrained on ImageNet (Deng et al., 2009) to perform image classification on PASCALVOC (Everingham et al., 2010) and CUB200 (Wah et al., 2011) datasets. Using only $1$ distilled image per category, our method outperforms the baselines significantly. Our result is also comparable to the accuracy of finetuning on the full datasets which contain thousands of images.
Random Pretrained weights and a malicious datapoisoning objective. Sec. 3.6 shows that our method can construct a new type of data poisoning, where the attacker can apply just one GD step with a few malicious data to manipulate a welltrained model. We train distilled images to make welloptimized neural networks to misclassify a particular attacked category as another target category within only one GD step. Our method requires no access to the exact weights of the model. In Fig. 5(b), we evaluate our method on $200$ heldout models, against various baselines using data derived from real images and incorrect labels. While some baselines perform similarly well as our method on MNIST, our method significantly outperforms all the baselines on CIFAR10.
5 Discussion
In this paper, we present dataset distillation for compressing the knowledge of entire training data into a few synthetic training images. We can train a network to reach high performance with a small number of distilled images and several gradient descent steps. Finally, we demonstrate two applications including fast domain adaptation and effective data poisoning attack. In the future, we plan to extend our method to compress largescale visual datasets such as ImageNet (Deng et al., 2009) and other types of data (e.g., audio and text). Also, our current method is sensitive to the initial weights distribution. We would like to investigate more on various initialization strategies, with which distilled images can work well.
References
 Angelova et al. (2005) Anelia Angelova, Yaser AbuMostafam, and Pietro Perona. Pruning training sets for learning of object categories. In CVPR, volume 1, pp. 494–501. IEEE, 2005.
 Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, pp. 2654–2662, 2014.
 Bachem et al. (2017) Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
 Bau et al. (2017) David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, pp. 3319–3327. IEEE, 2017.
 Bengio (2000) Yoshua Bengio. Gradientbased optimization of hyperparameters. Neural computation, 12(8):1889–1900, 2000.
 Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In ICML, 2012.
 Cohn et al. (1996) David A Cohn, Zoubin Ghahramani, and Michael I Jordan. Active learning with statistical models. Journal of artificial intelligence research, 4:129–145, 1996.
 Daume III (2007) Hal Daume III. Frustratingly easy domain adaptation. In ACL, 2007.
 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. Imagenet: A largescale hierarchical image database. In CVPR, 2009.
 Domke (2012) Justin Domke. Generic methods for optimizationbased modeling. In Artificial Intelligence and Statistics, pp. 318–326, 2012.
 Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
 Felzenszwalb et al. (2010) Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained partbased models. PAMI, 32(9):1627–1645, 2010.
 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
 Goldman & Kearns (1995) Sally A Goldman and Michael J Kearns. On the complexity of teaching. Journal of Computer and System Sciences, 50(1):20–31, 1995.
 HarPeled & Kushal (2007) Sariel HarPeled and Akash Kushal. Smaller coresets for kmedian and kmeans clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In ICCV, 2015.
 Hearst et al. (1998) Marti A. Hearst, Susan T Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. Support vector machines. IEEE Intelligent Systems and their applications, 13(4):18–28, 1998.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.
 Hull (1994) Jonathan J. Hull. A database for handwritten text recognition research. PAMI, 16(5):550–554, 1994.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding blackbox predictions via influence functions. In ICML, 2017.
 Krizhevsky (2012) Alex Krizhevsky. cudaconvnet: Highperformance c++/cuda implementation of convolutional neural networks. Source code available at https://github. com/akrizhevsky/cudaconvnet2 [March, 2017], 2012.
 Krizhevsky (2014) Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 Lapedriza et al. (2013) Agata Lapedriza, Hamed Pirsiavash, Zoya Bylinskii, and Antonio Torralba. Are all training examples equally valuable? arXiv preprint arXiv:1311.6510, 2013.
 LeCun (1998) Yann LeCun. The mnist database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Li et al. (2016) Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on factorizationbased collaborative filtering. In NIPS, 2016.
 Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradientbased hyperparameter optimization through reversible learning. In ICML, 2015.
 Mahendran & Vedaldi (2015) Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
 Motiian et al. (2017) Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. Fewshot adversarial domain adaptation. In NIPS, 2017.
 MuñozGonzález et al. (2017) Luis MuñozGonzález, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with backgradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 27–38. ACM, 2017.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop, 2011.
 OlveraLópez et al. (2010) J Arturo OlveraLópez, J Ariel CarrascoOchoa, J Francisco MartínezTrinidad, and Josef Kittler. A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In ICLR Workshop, 2017.
 Pearlmutter (1994) Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.
 Pedregosa (2016) Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In ICML, 2016.
 Ponce et al. (2006) Jean Ponce, Tamara L Berg, Mark Everingham, David A Forsyth, Martial Hebert, Svetlana Lazebnik, Marcin Marszalek, Cordelia Schmid, Bryan C Russell, Antonio Torralba, et al. Dataset issues in object recognition. In Toward categorylevel object recognition, pp. 29–48. 2006.
 Radosavovic et al. (2018) Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omnisupervised learning. In CVPR, 2018.
 Romero et al. (2015) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
 Rosenblatt (1957) Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
 Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In ECCV, 2010.
 Sener & Savarese (2018) Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A coreset approach. In ICLR, 2018.
 Shinohara & Miyano (1991) Ayumi Shinohara and Satoru Miyano. Teachability in computational learning. New Generation Computing, 8(4):337–347, 1991.
 Tong & Koller (2001) Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. JMLR, 2(Nov):45–66, 2001.
 Torralba & Efros (2011) Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, pp. 1521–1528. IEEE, 2011.
 Tsang et al. (2005) Ivor W Tsang, James T Kwok, and PakMing Cheung. Core vector machines: Fast svm training on very large data sets. JMLR, 6(Apr):363–392, 2005.
 Wah et al. (2011) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The CaltechUCSD Birds2002011 Dataset. Technical Report CNSTR2011001, California Institute of Technology, 2011.
 Zeiler & Fergus (2014) Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
 Zhou et al. (2015) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. In ICLR, 2015.
S6 Supplementary Material
S6.1 Experiment Details
For the networks used in our experiments, we disable dropout layers due to the randomness and computational cost they introduce in distillation. Moreover, we initialize the distilled learning rates as $0.02$ and use Adam optimizer (Kingma & Ba, 2014) with a learning rate of $0.001$. For random initialization and random pretrained weights, we sample $4$ to $16$ initial weights in each step.
Details of the baselines are listed below.

•
Random real images: We randomly sample the same number of real training images per category. $10$ such set of sampled images are evaluated.

•
Optimized real images: We sampled $50$ sets of real images using above procedure, and evaluate $10$ sets that achieve best performance on $20$ heldout models and $1024$ training images.

•
$k$means: For each category, we use $k$means to extract the same number of cluster centroids as the number of distilled images in our method. $10$ such set of sampled images are evaluated.

•
Average real images: We compute the average image of all the images in each category, which is reused in different GD steps. We evaluate the model only once because average images are deterministic.