Implicit Semantic Data Augmentation for Deep Networks
Abstract
In this paper, we propose a novel implicit semantic data augmentation (ISDA) approach to complement traditional augmentation techniques like flipping, translation or rotation. Our work is motivated by the intriguing property that deep networks are surprisingly good at linearizing features, such that certain directions in the deep feature space correspond to meaningful semantic transformations, e.g., adding sunglasses or changing backgrounds. As a consequence, translating training samples along many semantic directions in the feature space can effectively augment the dataset to improve generalization. To implement this idea effectively and efficiently, we first perform an online estimate of the covariance matrix of deep features for each class, which captures the intra-class semantic variations. Then random vectors are drawn from a zero-mean normal distribution with the estimated covariance to augment the training data in that class. Importantly, instead of augmenting the samples explicitly, we can directly minimize an upper bound of the expected cross-entropy (CE) loss on the augmented training set, leading to a highly efficient algorithm. In fact, we show that the proposed ISDA amounts to minimizing a novel robust CE loss, which adds negligible extra computational cost to a normal training procedure. Although simple, ISDA consistently improves the generalization performance of popular deep models (ResNets and DenseNets) on a variety of datasets, e.g., CIFAR-10, CIFAR-100 and ImageNet. Code for reproducing our results is available at https://github.com/blackfeather-wang/ISDA-for-Deep-Networks.
Yulin Wang$^{1*}$, Xuran Pan$^{1*}$, Shiji Song$^{1}$, Hong Zhang$^{2}$, Cheng Wu$^{1}$, Gao Huang$^{1\dagger}$
$^{1}$Department of Automation, Tsinghua University, Beijing, China; Beijing National Research Center for Information Science and Technology (BNRist)
$^{2}$Baidu Inc., China
{yulin.bh, fykalviny}@gmail.com, [email protected], {shijis, wuc, gaohuang}@tsinghua.edu.cn
$^{*}$Equal contribution. $^{\dagger}$Corresponding author.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
1 Introduction
Data augmentation is an effective technique to alleviate the overfitting problem in training deep networks [1, 2, 3, 4, 5]. In the context of image recognition, this usually corresponds to applying content-preserving transformations, e.g., cropping, horizontal mirroring, rotation and color jittering, to the input samples. Although effective, these augmentation techniques are not capable of performing semantic transformations, such as changing the background of an object or the texture of a foreground object. Recent work has shown that data augmentation can be more powerful if (class identity preserving) semantic transformations are allowed [6, 7, 8]. For example, by training a generative adversarial network (GAN) for each class in the training set, one could then draw an unlimited number of samples from the generator. Unfortunately, this procedure is computationally intensive because training generative models and running inference with them to obtain augmented samples are both nontrivial tasks. Moreover, due to the extra augmented data, the training procedure is also likely to be prolonged.
In this paper, we propose an implicit semantic data augmentation (ISDA) algorithm for training deep image recognition networks. ISDA is highly efficient as it requires neither training or running inference with auxiliary networks nor explicitly generating extra training samples. Our approach is motivated by the intriguing observation made in recent work that the features deep in a network are usually linearized [9, 10]. Specifically, there exist many semantic directions in the deep feature space, such that translating a data sample in the feature space along one of these directions results in a feature representation corresponding to another sample with the same class identity but different semantics. For example, a certain direction corresponds to the semantic translation of “make-bespectacled”. When the feature of a person who does not wear glasses is translated along this direction, the new feature may correspond to the same person but with glasses (the new image can be explicitly reconstructed using proper algorithms, as shown in [9]). Therefore, by searching for many such semantic directions, we can effectively augment the training set in a way complementary to traditional data augmentation techniques.
However, explicitly finding semantic directions is not a trivial task, and usually requires extensive human annotations [9]. In contrast, sampling directions randomly is efficient but may result in meaningless transformations. For example, it makes no sense to apply the “make-bespectacled” transformation to the “car” class. In this paper, we adopt a simple method that achieves a good balance between effectiveness and efficiency. Specifically, we perform an online estimate of the covariance matrix of the features for each class, which captures the intra-class variations. Then we sample directions from a zero-mean multivariate normal distribution with the estimated covariance, and apply them to the features of training samples in that class to augment the dataset. In this way, the chance of generating meaningless semantic transformations can be significantly reduced.
To further improve the efficiency, we derive a closed-form upper bound of the expected cross-entropy (CE) loss with the proposed data augmentation scheme. Therefore, instead of performing the augmentation procedure explicitly, we can directly minimize the upper bound, which is in fact a novel robust loss function. As there is no need to generate explicit data samples, we call our algorithm implicit semantic data augmentation (ISDA). Compared to existing semantic data augmentation algorithms, the proposed ISDA can be conveniently implemented on top of most deep models without introducing auxiliary models or noticeable extra computational cost.
Although simple, the proposed ISDA algorithm is surprisingly effective, and complements existing non-semantic data augmentation techniques quite well. Extensive empirical evaluations on several competitive image classification benchmarks show that ISDA consistently improves the generalization performance of popular deep networks, especially when training data is scarce and when powerful traditional augmentation techniques are applied.
2 Related Work
In this section, we briefly review existing research on related topics.
Data augmentation is a widely used technique to alleviate overfitting in training deep networks. For example, in image recognition tasks, data augmentation techniques like random flipping, mirroring and rotation are applied to enforce certain invariance in convolutional networks [4, 5, 3, 11]. Recently, automatic data augmentation techniques, e.g., AutoAugment [12], have been proposed to search for a better augmentation strategy among a large pool of candidates. Similar to our method, learning with marginalized corrupted features [13] can be viewed as an implicit data augmentation technique, but it is limited to simple linear models. Complementarily, recent research shows that semantic data augmentation techniques, which apply class identity preserving transformations (e.g., changing backgrounds of objects or varying visual angles) to the training data, are effective as well [14, 15, 6, 8]. This is usually achieved by generating extra semantically transformed training samples with specialized deep structures such as DAGAN [8], domain adaptation networks [15] or other GAN-based generators [14, 6]. Although effective, these approaches are nontrivial to implement and computationally expensive, due to the need to train generative models beforehand and run inference with them during training.
Robust loss function. As shown in this paper, ISDA amounts to minimizing a novel robust loss function. Therefore, we give a brief review of related work on this topic. Recently, several robust loss functions have been proposed for deep learning. For example, the $L_{q}$ loss [16] is a balanced noise-robust form for the cross entropy (CE) loss and mean absolute error (MAE) loss, derived from the negative Box-Cox transformation. Focal loss [17] attaches high weights to a sparse set of hard examples to prevent the vast number of easy samples from dominating the training of the network. The idea of introducing a large margin for the CE loss has been proposed in [18, 19, 20]. In [21], the CE loss and the contrastive loss are combined to learn more discriminative features. From a similar perspective, center loss [22] simultaneously learns a center for deep features of each class and penalizes the distances between the samples and their corresponding class centers in the feature space, enhancing the intra-class compactness and inter-class separability.
Semantic transformations in deep feature space. Our work is motivated by the fact that high-level representations learned by deep convolutional networks can potentially capture abstractions with semantics [23, 10]. In fact, translating deep features along certain directions has been shown to correspond to performing meaningful semantic transformations on the input images. For example, deep feature interpolation [9] leverages simple interpolations of deep features from pre-trained neural networks to achieve semantic image transformations. Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) based methods [24, 25, 26] establish a latent representation corresponding to the abstractions of images, which can be manipulated to edit the semantics of images. Generally, these methods reveal that certain directions in the deep feature space correspond to meaningful semantic transformations, and can be leveraged to perform semantic data augmentation.
3 Method
Deep networks are known to excel at forming high-level representations in the deep feature space [4, 5, 9, 27], where the semantic relations between samples can be captured by the relative positions of their features [10]. Previous work has demonstrated that translating features towards specific directions corresponds to meaningful semantic transformations when the features are mapped to the input space [9, 28, 10]. Based on this observation, we propose to directly augment the training data in the feature space, and integrate this procedure into the training of deep models.
The proposed implicit semantic data augmentation (ISDA) has two important components, i.e., online estimation of class-conditional covariance matrices and optimization with a robust loss function. The first component aims to find a distribution from which we can sample meaningful semantic transformation directions for data augmentation, while the second saves us from explicitly generating a large amount of extra training data, leading to remarkable efficiency compared to existing data augmentation techniques.
3.1 Semantic Transformations in Deep Feature Space
As mentioned above, certain directions in the deep feature space correspond to meaningful semantic transformations like “make-bespectacled” or “change-view-angle”. This motivates us to augment the training set by applying such semantic transformations to deep features. However, manually searching for semantic directions is infeasible for large scale problems. To address this problem, we propose to approximate the procedure by sampling random vectors from a normal distribution with zero mean and a covariance that is proportional to the intra-class covariance matrix, which captures the variance of samples in that class and is thus likely to contain rich semantic information. Intuitively, features for the person class may vary along the “wear-glasses” direction, while having nearly zero variance along the “has-propeller” direction, which only occurs for other classes like the plane class. We hope that directions corresponding to meaningful transformations for each class are well represented by the principal components of the covariance matrix of that class.
Consider training a deep network $G$ with weights $\mathbf{\Theta}$ on a training set $\mathcal{D}={\{({\bm{x}}_{i},{y}_{i})\}}_{i=1}^{N}$, where ${y}_{i}\in \{1,\mathrm{\dots},C\}$ is the label of the $i$-th sample ${\bm{x}}_{i}$ over $C$ classes. Let the $A$-dimensional vector ${\bm{a}}_{i}={[{a}_{i1},\mathrm{\dots},{a}_{iA}]}^{T}=G({\bm{x}}_{i},\mathbf{\Theta})$ denote the deep features of ${\bm{x}}_{i}$ learned by $G$, and ${a}_{ij}$ indicate the $j$th element of ${\bm{a}}_{i}$.
To obtain semantic directions to augment ${\bm{a}}_{i}$, we randomly sample vectors from a zero-mean multi-variate normal distribution $\mathcal{N}(0,{\mathrm{\Sigma}}_{{y}_{i}})$, where ${\mathrm{\Sigma}}_{{y}_{i}}$ is the class-conditional covariance matrix estimated from the features of all the samples in class ${y}_{i}$. In implementation, the covariance matrix is computed in an online fashion by aggregating statistics from all mini-batches. The online estimation algorithm is given in Section A in the supplementary.
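The supplementary algorithm is not reproduced in this section, so the following is only a minimal sketch of one standard way to aggregate per-class statistics across mini-batches: merging the running mean and (biased) covariance with those of each new batch. The class name `OnlineClassCovariance` and its attributes are hypothetical, and the authors' update rule may differ in detail. Note that during real training the features drift as the network changes, so the running estimate is only approximate.

```python
import numpy as np

class OnlineClassCovariance:
    """Running per-class mean and covariance of deep features (illustrative sketch)."""

    def __init__(self, num_classes, feat_dim):
        self.count = np.zeros(num_classes)
        self.mean = np.zeros((num_classes, feat_dim))
        self.cov = np.zeros((num_classes, feat_dim, feat_dim))

    def update(self, feats, labels):
        """Merge one mini-batch (feats: [B, A], integer labels: [B])."""
        for c in np.unique(labels):
            x = feats[labels == c]
            m = x.shape[0]
            mu_b = x.mean(axis=0)
            cov_b = (x - mu_b).T @ (x - mu_b) / m      # biased batch covariance
            n = self.count[c]
            d = (self.mean[c] - mu_b)[:, None]          # difference of the two means
            # Pooled covariance of the n old samples and the m new samples.
            pooled = (n * self.cov[c] + m * cov_b) / (n + m)
            self.cov[c] = pooled + n * m * (d @ d.T) / (n + m) ** 2
            self.mean[c] = (n * self.mean[c] + m * mu_b) / (n + m)
            self.count[c] = n + m
```

For fixed features, feeding all mini-batches through `update` reproduces the batch covariance of each class, which is the property the sketch is meant to illustrate.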
During training, $C$ covariance matrices are computed, one for each class. The augmented feature ${\stackrel{~}{\bm{a}}}_{i}$ is obtained by translating ${\bm{a}}_{i}$ along a random direction sampled from $\mathcal{N}(0,\lambda {\mathrm{\Sigma}}_{{y}_{i}})$. Equivalently, we have
$$\tilde{\bm{a}}_{i}\sim \mathcal{N}(\bm{a}_{i},\lambda \Sigma_{y_{i}}), \tag{1}$$
where $\lambda $ is a positive coefficient that controls the strength of semantic data augmentation. As the covariances are computed dynamically during training, the estimates in the first few epochs are not very informative, because the network is not yet well trained. To address this issue, we let $\lambda =(t/T)\times {\lambda}_{0}$ be a function of the current iteration $t$, which reduces the impact of the estimated covariances on our algorithm early in training.
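As an illustration of Eq. (1) and the schedule for $\lambda$, the sketch below draws one explicitly augmented feature per sample. The function names are hypothetical, and $T$ is assumed here to be the total number of training iterations, which the text above does not state explicitly.

```python
import numpy as np

def isda_lambda(t, T, lambda0):
    """Linear schedule lambda = (t / T) * lambda0 (T assumed to be the total iterations)."""
    return (t / T) * lambda0

def sample_augmented(feats, labels, class_cov, lam, rng):
    """Draw one augmented feature per sample from N(a_i, lam * Sigma_{y_i}).

    feats: [N, A] float array, labels: [N] integer array, class_cov: [C, A, A].
    """
    out = np.empty_like(feats)
    for i, (a, y) in enumerate(zip(feats, labels)):
        out[i] = rng.multivariate_normal(a, lam * class_cov[y])
    return out
```

At $t=0$ the schedule gives $\lambda=0$, so the sampled features coincide with the original ones and the early, unreliable covariance estimates have no effect.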
3.2 Implicit Semantic Data Augmentation (ISDA)
A naive method to implement ISDA is to explicitly augment each ${\bm{a}}_{i}$ $M$ times, forming an augmented feature set ${\{({\bm{a}}_{i}^{1},{y}_{i}),\mathrm{\dots},({\bm{a}}_{i}^{M},{y}_{i})\}}_{i=1}^{N}$ of size $MN$, where ${\bm{a}}_{i}^{k}$ is the $k$-th copy of augmented features for sample ${\bm{x}}_{i}$. Then the networks are trained by minimizing the cross-entropy (CE) loss:
$$\mathcal{L}_{M}(\bm{W},\bm{b},\mathbf{\Theta})=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{M}\sum_{k=1}^{M}-\log\left(\frac{e^{\bm{w}_{y_{i}}^{T}\bm{a}_{i}^{k}+b_{y_{i}}}}{\sum_{j=1}^{C}e^{\bm{w}_{j}^{T}\bm{a}_{i}^{k}+b_{j}}}\right), \tag{2}$$
where $\bm{W}={[{\bm{w}}_{1},\mathrm{\dots},{\bm{w}}_{C}]}^{T}\in {\mathcal{R}}^{C\times A}$ and $\bm{b}={[{b}_{1},\mathrm{\dots},{b}_{C}]}^{T}\in {\mathcal{R}}^{C}$ are the weight matrix and biases corresponding to the final fully connected layer, respectively.
Obviously, the naive implementation is computationally inefficient when $M$ is large, as the feature set is enlarged by $M$ times. In the following, we consider the case that $M$ grows to infinity, and find that an easy-to-compute upper bound can be derived for the loss function, leading to a highly efficient implementation.
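To make the cost of the naive scheme concrete, here is a minimal sketch of Eq. (2) as a Monte Carlo loop over $M$ explicitly sampled copies of each feature. All function and argument names are illustrative and not taken from the released code.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean CE loss from logits [N, C] and integer labels [N]."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilize the softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

def naive_augmented_loss(feats, labels, W, b, class_cov, lam, M, rng):
    """Eq. (2): average the CE loss over M explicitly augmented copies."""
    zero = np.zeros(feats.shape[1])
    total = 0.0
    for _ in range(M):
        # One zero-mean perturbation per sample, drawn with its class covariance.
        noise = np.stack([rng.multivariate_normal(zero, lam * class_cov[y])
                          for y in labels])
        total += cross_entropy((feats + noise) @ W.T + b, labels)
    return total / M
```

The work grows linearly with $M$, which is the inefficiency that motivates taking $M\to\infty$ analytically.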
Upper bound of the loss function. In the case $M\to \mathrm{\infty}$, we are in fact considering the expectation of the CE loss under all possible augmented features. Specifically, ${\mathcal{L}}_{\mathrm{\infty}}$ is given by:
$$\mathcal{L}_{\infty}(\bm{W},\bm{b},\mathbf{\Theta}|\mathbf{\Sigma})=\frac{1}{N}\sum_{i=1}^{N}\mathrm{E}_{\tilde{\bm{a}}_{i}}\left[-\log\left(\frac{e^{\bm{w}_{y_{i}}^{T}\tilde{\bm{a}}_{i}+b_{y_{i}}}}{\sum_{j=1}^{C}e^{\bm{w}_{j}^{T}\tilde{\bm{a}}_{i}+b_{j}}}\right)\right]. \tag{3}$$
If ${\mathcal{L}}_{\mathrm{\infty}}$ can be computed efficiently, then we can directly minimize it without explicitly sampling augmented features. However, Eq. (3) is difficult to compute in its exact form. Alternatively, we find that it is possible to derive an easy-to-compute upper bound for ${\mathcal{L}}_{\mathrm{\infty}}$, as given by the following proposition.
Proposition 1.
Suppose that $\tilde{\bm{a}}_{i}\sim \mathcal{N}(\bm{a}_{i},\lambda \Sigma_{y_{i}})$, then we have an upper bound of $\mathcal{L}_{\infty}$, given by
$$\mathcal{L}_{\infty}(\bm{W},\bm{b},\mathbf{\Theta}|\mathbf{\Sigma})\le \frac{1}{N}\sum_{i=1}^{N}-\log\left(\frac{e^{\bm{w}_{y_{i}}^{T}\bm{a}_{i}+b_{y_{i}}}}{\sum_{j=1}^{C}e^{\bm{w}_{j}^{T}\bm{a}_{i}+b_{j}+\frac{\lambda}{2}(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\Sigma_{y_{i}}(\bm{w}_{j}-\bm{w}_{y_{i}})}}\right)\triangleq \overline{\mathcal{L}}_{\infty}. \tag{4}$$
Proof.
According to the definition of ${\mathcal{L}}_{\mathrm{\infty}}$ in (3), we have:
$$\begin{aligned}
\mathcal{L}_{\infty}(\bm{W},\bm{b},\mathbf{\Theta}|\mathbf{\Sigma}) &= \frac{1}{N}\sum_{i=1}^{N}\mathrm{E}_{\tilde{\bm{a}}_{i}}\left[\log\left(\sum_{j=1}^{C}e^{(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\tilde{\bm{a}}_{i}+(b_{j}-b_{y_{i}})}\right)\right] && \text{(5)}\\
&\le \frac{1}{N}\sum_{i=1}^{N}\log\left(\sum_{j=1}^{C}\mathrm{E}_{\tilde{\bm{a}}_{i}}\left[e^{(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\tilde{\bm{a}}_{i}+(b_{j}-b_{y_{i}})}\right]\right) && \text{(6)}\\
&= \frac{1}{N}\sum_{i=1}^{N}\log\left(\sum_{j=1}^{C}e^{(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\bm{a}_{i}+(b_{j}-b_{y_{i}})+\frac{\lambda}{2}(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\Sigma_{y_{i}}(\bm{w}_{j}-\bm{w}_{y_{i}})}\right) && \text{(7)}\\
&= \overline{\mathcal{L}}_{\infty}. && \text{(8)}
\end{aligned}$$
In the above, Inequality (6) follows from Jensen’s inequality $\mathrm{E}[\log X]\le \log\mathrm{E}[X]$, as the logarithmic function $\log(\cdot )$ is concave. Eq. (7) is obtained by evaluating the moment-generating function of a Gaussian random variable at $t=1$:
$$\mathrm{E}[e^{tX}]=e^{t\mu +\frac{1}{2}\sigma^{2}t^{2}},\quad X\sim \mathcal{N}(\mu ,\sigma^{2}),$$
which applies because $(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\tilde{\bm{a}}_{i}+(b_{j}-b_{y_{i}})$ is a Gaussian random variable, i.e.,
$$(\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\tilde{\bm{a}}_{i}+(b_{j}-b_{y_{i}})\sim \mathcal{N}\left((\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\bm{a}_{i}+(b_{j}-b_{y_{i}}),\;\lambda (\bm{w}_{j}^{T}-\bm{w}_{y_{i}}^{T})\Sigma_{y_{i}}(\bm{w}_{j}-\bm{w}_{y_{i}})\right).$$ ∎
Essentially, Proposition 1 provides a surrogate loss for our implicit data augmentation algorithm. Instead of minimizing the exact loss function ${\mathcal{L}}_{\mathrm{\infty}}$, we can optimize its upper bound ${\overline{\mathcal{L}}}_{\mathrm{\infty}}$ in a much more efficient way. Therefore, the proposed ISDA boils down to a novel robust loss function, which can be easily adopted by most deep models. In addition, we can observe that when $\lambda \to 0$, which means no features are augmented, ${\overline{\mathcal{L}}}_{\mathrm{\infty}}$ reduces to the standard CE loss.
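The surrogate loss of Eq. (4) can be sketched in a few lines: it is a standard CE loss applied to logits whose $j$-th entry is shifted by $\frac{\lambda}{2}(\bm{w}_{j}-\bm{w}_{y_{i}})^{T}\Sigma_{y_{i}}(\bm{w}_{j}-\bm{w}_{y_{i}})$. The function name and array layout below are assumptions made for illustration, and this dense version favors clarity over memory efficiency.

```python
import numpy as np

def isda_loss(feats, labels, W, b, class_cov, lam):
    """Upper bound of Eq. (4).

    feats: [N, A], labels: [N] integers, W: [C, A], b: [C], class_cov: [C, A, A].
    """
    logits = feats @ W.T + b                        # [N, C]
    dW = W[None, :, :] - W[labels][:, None, :]      # [N, C, A], entry (i, j) is w_j - w_{y_i}
    sigma = class_cov[labels]                       # [N, A, A], covariance of each sample's class
    quad = np.einsum('nca,nab,ncb->nc', dW, sigma, dW)
    aug = logits + 0.5 * lam * quad                 # the shift is zero for j = y_i
    z = aug - aug.max(axis=1, keepdims=True)        # stabilize the softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because the shift vanishes for the true class and is non-negative for all others when the covariances are positive semi-definite, the loss equals the standard CE loss at $\lambda=0$ and can only grow as $\lambda$ increases.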
In summary, the proposed ISDA can be simply plugged into deep networks as a robust loss function, and efficiently optimized with the stochastic gradient descent (SGD) algorithm. We present the pseudo code of ISDA in Algorithm 2. Details of estimating covariance matrices and computing gradients are presented in Appendix A.
4 Experiments
In this section, we empirically validate the proposed algorithm on several widely used image classification benchmarks, i.e., CIFAR-10, CIFAR-100 [1] and ImageNet [29]. We first evaluate the effectiveness of ISDA with different deep network architectures on these datasets. Second, we apply several recently proposed non-semantic image augmentation methods in addition to the standard baseline augmentation, and investigate the performance of ISDA. Third, we present comparisons with state-of-the-art robust loss functions and generator-based semantic data augmentation algorithms. Finally, an ablation study is conducted to examine the effectiveness of each component. We also visualize the augmented samples in the original input space with the aid of a generative network.
4.1 Datasets and Baselines
Datasets. We use three image recognition benchmarks in the experiments. (1) The two CIFAR datasets consist of 32x32 colored natural images in 10 classes for CIFAR-10 and 100 classes for CIFAR-100, each with 50,000 images for training and 10,000 images for testing. In our experiments, we hold out 5,000 images from the training set as the validation set to search for the hyper-parameter ${\lambda}_{0}$. These samples are also used for training after an optimal ${\lambda}_{0}$ is selected, and the results on the test set are reported. Images are normalized with channel means and standard deviations for pre-processing. For the non-semantic data augmentation of the training set, we follow the standard operation in [30]: 4 pixels are padded at each side of the image, followed by a random 32x32 cropping combined with random horizontal flipping. (2) ImageNet is a 1,000-class dataset from ILSVRC2012 [29], providing 1.2 million images for training and 50,000 images for validation. We adopt the same augmentation configurations as in [2, 4, 5].
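The standard CIFAR augmentation described above can be sketched as follows. Zero padding and a flip probability of 0.5 are assumptions, since the text only specifies the pad width, the crop size and that flipping is random; the function name is hypothetical.

```python
import numpy as np

def pad_crop_flip(img, rng, pad=4):
    """Pad an [H, W, C] image, take a random HxW crop, and flip it at random."""
    h, w, _ = img.shape
    # Assumption: constant (zero) padding on the two spatial axes only.
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='constant')
    top = rng.integers(0, 2 * pad + 1)     # crop offsets in [0, 2 * pad]
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:                 # assumption: flip with probability 0.5
        out = out[:, ::-1]
    return out
```

For a 32x32 input the output is again 32x32, shifted by up to 4 pixels in each direction.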
Non-semantic augmentation techniques. To study the complementary effects of ISDA to traditional data augmentation methods, two state-of-the-art non-semantic augmentation techniques are applied, with and without ISDA. (1) Cutout [31] randomly masks out square regions of input during training to regularize the model. (2) AutoAugment [32] automatically searches for the best augmentation policies to yield the highest validation accuracy on a target dataset. All hyper-parameters are the same as reported in the papers introducing them.
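For readers unfamiliar with Cutout, a rough sketch of the masking step is given below. The mask size of 16 pixels, the zero fill value and the choice to sample the mask center uniformly (so the square may be clipped at the border) are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def cutout(img, rng, size=16):
    """Return a copy of an [H, W, C] image with one square region set to zero."""
    h, w = img.shape[:2]
    cy, cx = rng.integers(0, h), rng.integers(0, w)    # mask center
    y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = img.copy()                                   # leave the input untouched
    out[y0:y1, x0:x1] = 0
    return out
```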
Method | Params | CIFAR-10 | CIFAR-100 |
---|---|---|---|
ResNet-32 [4] | 0.5M | 7.39 $\pm $ 0.10% | 31.20 $\pm $ 0.41% |
ResNet-32 + ISDA | 0.5M | 7.09 $\mathrm{\pm}$ 0.12% | 30.27 $\mathrm{\pm}$ 0.34% |
ResNet-110 [4] | 1.7M | 6.76 $\pm $ 0.34% | 28.67 $\pm $ 0.44% |
ResNet-110 + ISDA | 1.7M | 6.33 $\mathrm{\pm}$ 0.19% | 27.57 $\mathrm{\pm}$ 0.46% |
SE-ResNet-110 [33] | 1.7M | 6.14 $\pm $ 0.17% | 27.30 $\pm $ 0.03% |
SE-ResNet-110 + ISDA | 1.7M | 5.96 $\mathrm{\pm}$ 0.21% | 26.63 $\mathrm{\pm}$ 0.21% |
Wide-ResNet-16-8 [34] | 11.0M | 4.25 $\pm $ 0.18% | 20.24 $\pm $ 0.27% |
Wide-ResNet-16-8 + ISDA | 11.0M | 4.04 $\mathrm{\pm}$ 0.29% | 19.91 $\mathrm{\pm}$ 0.21% |
Wide-ResNet-28-10 [34] | 36.5M | 3.82 $\pm $ 0.15% | 18.53 $\pm $ 0.07% |
Wide-ResNet-28-10 + ISDA | 36.5M | 3.58 $\mathrm{\pm}$ 0.15% | 17.98 $\mathrm{\pm}$ 0.15% |
ResNeXt-29, 8x24d [35] | 34.4M | 3.86 $\pm $ 0.14% | 18.16 $\pm $ 0.13% |
ResNeXt-29, 8x24d + ISDA | 34.4M | 3.67 $\mathrm{\pm}$ 0.12% | 17.43 $\mathrm{\pm}$ 0.25% |
DenseNet-BC-100-12 [5] | 0.8M | 4.90 $\pm $ 0.08% | 22.61 $\pm $ 0.10% |
DenseNet-BC-100-12 + ISDA | 0.8M | 4.54 $\mathrm{\pm}$ 0.07% | 22.10 $\mathrm{\pm}$ 0.34% |
DenseNet-BC-190-40 [5] | 15.2M | 3.52% | 17.74% |
DenseNet-BC-190-40 + ISDA | 15.2M | 3.24% | 17.42% |
Dataset | Networks | Cutout [31] | Cutout + ISDA | AA [32] | AA + ISDA |
---|---|---|---|---|---|
CIFAR-10 | Wide-ResNet-28-10 [34] | 2.99 $\pm$ 0.06% | 2.83 $\pm$ 0.04% | 2.65 $\pm$ 0.07% | 2.56 $\pm$ 0.01% |
CIFAR-10 | Shake-Shake (26, 2x32d) [36] | 3.16 $\pm$ 0.09% | 2.93 $\pm$ 0.03% | 2.89 $\pm$ 0.09% | 2.68 $\pm$ 0.12% |
CIFAR-10 | Shake-Shake (26, 2x112d) [36] | 2.36% | 2.25% | 2.01% | 1.82% |
CIFAR-100 | Wide-ResNet-28-10 [34] | 18.05 $\pm$ 0.25% | 17.02 $\pm$ 0.11% | 16.60 $\pm$ 0.40% | 15.62 $\pm$ 0.32% |
CIFAR-100 | Shake-Shake (26, 2x32d) [36] | 18.92 $\pm$ 0.21% | 18.17 $\pm$ 0.08% | 17.50 $\pm$ 0.19% | 17.21 $\pm$ 0.33% |
CIFAR-100 | Shake-Shake (26, 2x112d) [36] | 17.34 $\pm$ 0.28% | 16.24 $\pm$ 0.20% | 15.21 $\pm$ 0.20% | 13.87 $\pm$ 0.26% |
Baselines. Our method is compared to several baselines including state-of-the-art robust loss functions and generator-based semantic data augmentation methods. (1) Dropout [37] is a widely used regularization approach which randomly mutes some neurons during training. (2) Large-margin softmax loss [18] introduces a large decision margin, measured by a cosine distance, to the standard CE loss. (3) Disturb label [38] is a regularization mechanism that randomly replaces a fraction of labels with incorrect ones in each iteration. (4) Focal loss [17] focuses on a sparse set of hard examples to prevent easy samples from dominating the training procedure. (5) Center loss [22] simultaneously learns a center of features for each class and minimizes the distances between the deep features and their corresponding class centers. (6) ${L}_{q}$ loss [16] is a noise-robust loss function, using the negative Box-Cox transformation. (7) For generator-based semantic augmentation methods, we train several state-of-the-art GANs [39, 40, 41, 42], which are then used to generate extra training samples for data augmentation. For fair comparison, all methods are implemented with the same training configurations where possible. Details of the hyper-parameter settings are presented in Appendix B.
Training details. For deep networks, we implement the ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet on the two CIFAR datasets, and ResNet on ImageNet. Detailed configurations for these models are given in Appendix B. The hyper-parameter ${\lambda}_{0}$ for ISDA is selected from the set $\{0.1,0.25,0.5,0.75,1\}$ according to the performance on the validation set. On ImageNet, due to GPU memory limitations, we approximate the covariance matrices by their diagonals, i.e., the variance of each dimension of the features. In this setting, the best hyper-parameter ${\lambda}_{0}$ is selected from $\{1,2.5,5,7.5,10\}$.
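To illustrate why the diagonal approximation saves memory, the sketch below compares the quadratic term of Eq. (4) computed with a full covariance matrix and with a per-dimension variance vector. With a diagonal covariance the term reduces to $\sum_{k}\sigma_{k}^{2}(w_{jk}-w_{y_{i}k})^{2}$, so only $A$ numbers per class need to be stored instead of $A^{2}$. The function names are hypothetical.

```python
import numpy as np

def quad_full(dw, cov):
    """dw^T Sigma dw for each row of dw ([C, A]) with a full covariance ([A, A])."""
    return np.einsum('ca,ab,cb->c', dw, cov, dw)

def quad_diag(dw, var):
    """Same quantity when Sigma = diag(var), with var of shape [A]."""
    return (dw ** 2) @ var
```

The two functions agree whenever the covariance is exactly diagonal; for a general covariance, `quad_diag` drops the cross-dimension terms.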
4.2 Main Results
Table 1 presents the performance of several state-of-the-art deep networks with and without ISDA. It can be observed that ISDA consistently improves the generalization performance of these models, especially with fewer training samples per class. On CIFAR-100, for relatively small models like ResNet-32 and ResNet-110, ISDA reduces test errors by about $1\%$, while for larger models like Wide-ResNet-28-10 and ResNeXt-29, 8x24d, our method outperforms the competitive baselines by nearly $0.7\%$. Compared to ResNets, DenseNets generally suffer less from overfitting due to their architecture design, and thus appear to benefit less from our algorithm.
Table 2 shows experimental results with recently proposed powerful traditional image augmentation methods (i.e., Cutout [31] and AutoAugment [32]). Interestingly, ISDA seems to be even more effective when these techniques are applied. For example, when applying AutoAugment, ISDA achieves performance gains of $1.34\%$ and $0.98\%$ on CIFAR-100 with the Shake-Shake (26, 2x112d) and the Wide-ResNet-28-10, respectively. Notice that these improvements are larger than those obtained with the standard augmentation alone. A plausible explanation for this phenomenon is that non-semantic augmentation methods help to learn a better feature representation, which makes semantic transformations in the deep feature space more reliable. The curves of test errors during training on CIFAR-100 with Wide-ResNet-28-10 are presented in Figure 4. It is clear that ISDA achieves a significant improvement after the third learning rate drop, and shows even better performance after the fourth drop.
Method | ResNet-110, CIFAR-10 | ResNet-110, CIFAR-100 | Wide-ResNet-28-10, CIFAR-10 | Wide-ResNet-28-10, CIFAR-100 |
---|---|---|---|---|
Large Margin [18] | 6.46 $\pm$ 0.20% | 28.00 $\pm$ 0.09% | 3.69 $\pm$ 0.10% | 18.48 $\pm$ 0.05% |
Disturb Label [38] | 6.61 $\pm$ 0.04% | 28.46 $\pm$ 0.32% | 3.91 $\pm$ 0.10% | 18.56 $\pm$ 0.22% |
Focal Loss [17] | 6.68 $\pm$ 0.22% | 28.28 $\pm$ 0.32% | 3.62 $\pm$ 0.07% | 18.22 $\pm$ 0.08% |
Center Loss [22] | 6.38 $\pm$ 0.20% | 27.85 $\pm$ 0.10% | 3.76 $\pm$ 0.05% | 18.50 $\pm$ 0.25% |
$L_{q}$ Loss [16] | 6.69 $\pm$ 0.07% | 28.78 $\pm$ 0.35% | 3.78 $\pm$ 0.08% | 18.43 $\pm$ 0.37% |
WGAN [39] | 6.63 $\pm$ 0.23% | - | 3.81 $\pm$ 0.08% | - |
CGAN [40] | 6.56 $\pm$ 0.14% | 28.25 $\pm$ 0.36% | 3.84 $\pm$ 0.07% | 18.79 $\pm$ 0.08% |
ACGAN [41] | 6.32 $\pm$ 0.12% | 28.48 $\pm$ 0.44% | 3.81 $\pm$ 0.11% | 18.54 $\pm$ 0.05% |
infoGAN [42] | 6.59 $\pm$ 0.12% | 27.64 $\pm$ 0.14% | 3.81 $\pm$ 0.05% | 18.44 $\pm$ 0.10% |
Basic | 6.76 $\pm$ 0.34% | 28.67 $\pm$ 0.44% | - | - |
Basic + Dropout | 6.23 $\pm$ 0.11% | 27.11 $\pm$ 0.06% | 3.82 $\pm$ 0.15% | 18.53 $\pm$ 0.07% |
ISDA | 6.33 $\pm$ 0.19% | 27.57 $\pm$ 0.46% | - | - |
ISDA + Dropout | 5.98 $\pm$ 0.20% | 26.35 $\pm$ 0.30% | 3.58 $\pm$ 0.15% | 17.98 $\pm$ 0.15% |
Method | Top-1 | Top-5 |
---|---|---|
ResNet-50 [4] | 23.58% | 6.92% |
ResNet-50 + ISDA | 23.30% | 6.82% |
ResNet-152 [4] | 21.65% | 6.01% |
ResNet-152 + ISDA | 21.20% | 5.67% |
Table 4 presents the performance of ISDA on the large scale ImageNet dataset. It can be observed that ISDA reduces Top-1 error rate by $0.45\%$ for the ResNet-152 model. The training and test error curves are shown in Figure 4. Notably, ISDA achieves a slightly higher training error but a lower test error, indicating that ISDA performs effective regularization on deep networks.
4.3 Comparison with Other Approaches
We compare ISDA with a number of competitive baselines described in Section 4.1, ranging from robust loss functions to semantic data augmentation algorithms based on generative models. The results are summarized in Table 3, and the training curves are presented in Appendix D. One can observe that ISDA compares favorably with all the competitive baseline algorithms. With ResNet-110, the best test errors among the other robust loss functions are 6.38% and 27.85% on CIFAR-10 and CIFAR-100, respectively (center loss), while ISDA achieves 6.33% and 27.57% on its own, and 5.98% and 26.35% when combined with Dropout (Table 3).
Among all GAN-based semantic augmentation methods, ACGAN gives the best performance, especially on CIFAR-10. However, these models generally suffer a performance reduction on CIFAR-100, which does not contain enough samples to learn a valid generator for each class. In contrast, ISDA shows consistent improvements on all the datasets. In addition, GAN-based methods require additional computation to train the generators, and introduce significant overhead to the training process. In comparison, ISDA not only leads to lower generalization error, but is also simpler and more efficient.
4.4 Visualization Results
To demonstrate that our method is able to generate meaningful semantically augmented samples, we introduce an approach to map the augmented features back to the pixel space to explicitly show semantic changes of the images. Due to space limitations, we defer the detailed description of the mapping algorithm to Appendix C.
Figure 5 shows the visualization results. The first and second columns show the original images and the images reconstructed without any augmentation, respectively. The remaining columns present the images augmented by the proposed ISDA. It can be observed that ISDA is able to alter the semantics of images, e.g., backgrounds, viewing angles, colors and types of cars, and skin color, which is not possible with traditional data augmentation techniques.
4.5 Ablation Study
Setting | CIFAR-10 | CIFAR-100 |
---|---|---|
Basic | 3.82$\pm $0.15% | 18.58$\pm $0.10% |
Identity matrix | 3.63$\pm $0.12% | 18.53$\pm $0.02% |
Diagonal matrix | 3.70$\pm $0.15% | 18.23$\pm $0.02% |
Single covariance matrix | 3.67$\pm $0.07% | 18.29$\pm $0.13% |
Constant ${\lambda}_{0}$ | 3.69$\pm $0.08% | 18.33$\pm $0.16% |
ISDA | 3.58$\mathrm{\pm}$0.15% | 17.98$\mathrm{\pm}$0.15% |
To get a better understanding of the effectiveness of different components in ISDA, we conduct a series of ablation studies. Specifically, several variants are considered: (1) Identity matrix means replacing the covariance matrix ${\mathrm{\Sigma}}_{c}$ by the identity matrix. (2) Diagonal matrix means using only the diagonal elements of the covariance matrix ${\mathrm{\Sigma}}_{c}$. (3) Single covariance matrix means using a global covariance matrix computed from the features of all classes. (4) Constant ${\lambda}_{\mathrm{0}}$ means using a constant ${\lambda}_{0}$ without setting it as a function of the training iterations.
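Variants (1) to (3) can be illustrated with a short NumPy sketch. This is a minimal illustration rather than the released implementation: the function name `covariance_variants`, its arguments, and the use of the biased (divide-by-N) covariance estimator are assumptions made for the example.

```python
import numpy as np

def covariance_variants(features, labels, c):
    """Return the matrix used for class c under each ablation setting.

    features: (N, A) array of deep features; labels: (N,) integer class ids.
    All names here are illustrative assumptions, not the authors' API.
    """
    class_feats = features[labels == c]
    # Full ISDA: class-conditional covariance Sigma_c (biased estimator).
    sigma_c = np.cov(class_feats, rowvar=False, bias=True)
    return {
        "isda": sigma_c,
        # (1) Identity matrix: ignore the estimated statistics entirely.
        "identity": np.eye(features.shape[1]),
        # (2) Diagonal matrix: keep per-dimension variances, drop correlations.
        "diagonal": np.diag(np.diag(sigma_c)),
        # (3) Single covariance matrix: one global estimate over all classes.
        "single": np.cov(features, rowvar=False, bias=True),
    }

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))      # toy features, not real network outputs
labs = rng.integers(0, 3, size=200)
variants = covariance_variants(feats, labs, c=1)
```

Only the "isda" variant retains both class-specific statistics and cross-feature correlations, which is the information the other settings discard.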
Table 5 presents the ablation results. Adopting the identity matrix increases the test error by 0.05% on CIFAR-10 and nearly 0.56% on CIFAR-100. Using a single covariance matrix greatly degrades the generalization performance as well. This is likely because both variants fail to find proper directions in the deep feature space for meaningful semantic transformations. Adopting a diagonal matrix also hurts the performance, as it does not consider correlations between features.
5 Conclusion
In this paper, we proposed an efficient implicit semantic data augmentation algorithm (ISDA) to complement existing data augmentation techniques. Different from existing approaches leveraging generative models to augment the training set with semantically transformed samples, our approach is considerably more efficient and easier to implement. In fact, we showed that ISDA can be formulated as a novel robust loss function, which is compatible with any deep network with the cross-entropy loss. Extensive results on several competitive image classification datasets demonstrate the effectiveness and efficiency of the proposed algorithm.
Acknowledgments
Gao Huang is supported in part by Beijing Academy of Artificial Intelligence (BAAI) under grant BAAI2019QN0106 and Tencent AI Lab Rhino-Bird Focused Research Program under grant JR201914.
References
- [1] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012, pp. 1097–1105.
- [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
- [4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [5] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in CVPR, 2017, pp. 2261–2269.
- [6] A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré, “Learning to compose domain-specific transformations for data augmentation,” in NeurIPS, 2017, pp. 3236–3246.
- [7] C. Bowles, L. J. Chen, R. Guerrero, P. Bentley, R. N. Gunn, A. Hammers, D. A. Dickie, M. del C. Valdés Hernández, J. M. Wardlaw, and D. Rueckert, “Gan augmentation: Augmenting training data using generative adversarial networks,” CoRR, vol. abs/1810.10863, 2018.
- [8] A. Antoniou, A. J. Storkey, and H. A. Edwards, “Data augmentation generative adversarial networks,” CoRR, vol. abs/1711.04340, 2018.
- [9] P. Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger, “Deep feature interpolation for image content changes,” in CVPR, 2017, pp. 6090–6099.
- [10] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, “Better mixing via deep representations,” in ICML, 2013, pp. 552–560.
- [11] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in NeurIPS, 2015, pp. 2377–2385.
- [12] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” CoRR, vol. abs/1805.09501, 2018.
- [13] L. Maaten, M. Chen, S. Tyree, and K. Weinberger, “Learning with marginalized corrupted features,” in ICML, 2013, pp. 410–418.
- [14] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading text in the wild with convolutional neural networks,” International Journal of Computer Vision, vol. 116, no. 1, pp. 1–20, 2016.
- [15] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in CVPR, 2017, pp. 3722–3731.
- [16] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in NeurIPS, 2018.
- [17] T.-Y. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017, pp. 2999–3007.
- [18] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for convolutional neural networks,” in ICML, 2016.
- [19] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li, “Soft-margin softmax for deep classification,” in ICONIP, 2017.
- [20] X. Wang, S. Zhang, Z. Lei, S. Liu, X. Guo, and S. Z. Li, “Ensemble soft-margin softmax loss for image classification,” in IJCAI, 2018.
- [21] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in NeurIPS, 2014.
- [22] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in ECCV, 2016, pp. 499–515.
- [23] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
- [24] Y. Choi, M.-J. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in CVPR, 2018, pp. 8789–8797.
- [25] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in ICCV, 2017, pp. 2223–2232.
- [26] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “Attgan: Facial attribute editing by only changing what you want,” CoRR, vol. abs/1711.10678, 2017.
- [27] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NeurIPS, 2015, pp. 91–99.
- [28] M. Li, W. Zuo, and D. Zhang, “Convolutional network for attribute-driven and identity-preserving human face generation,” CoRR, vol. abs/1608.06434, 2016.
- [29] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
- [30] A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” CoRR, vol. abs/1312.5402, 2014.
- [31] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
- [32] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” in CVPR, 2019.
- [33] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018, pp. 7132–7141.
- [34] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in BMVC, 2017.
- [35] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017, pp. 1492–1500.
- [36] X. Gastaldi, “Shake-shake regularization,” arXiv preprint arXiv:1705.07485, 2017.
- [37] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
- [38] L. Xie, J. Wang, Z. Wei, M. Wang, and Q. Tian, “Disturblabel: Regularizing cnn on the loss layer,” in CVPR, 2016, pp. 4753–4762.
- [39] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” CoRR, vol. abs/1701.07875, 2017.
- [40] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” CoRR, vol. abs/1411.1784, 2014.
- [41] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in ICML, 2017, pp. 2642–2651.
- [42] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in NeurIPS, 2016, pp. 2172–2180.
- [43] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in ECCV, 2016, pp. 646–661.
- [44] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in CVPR, 2015, pp. 5188–5196.
Appendix A Implementation Details of ISDA.
Dynamic estimation of covariance matrices. During the training process using ${\overline{\mathcal{L}}}_{\mathrm{\infty}}$, covariance matrices are estimated by:
$${\bm{\mu}}_{j}^{(t)}=\frac{{n}_{j}^{(t-1)}{\bm{\mu}}_{j}^{(t-1)}+{m}_{j}^{(t)}\bm{\mu}^{\prime}{}_{j}{}^{(t)}}{{n}_{j}^{(t-1)}+{m}_{j}^{(t)}},$$ | (9) |
$${\mathrm{\Sigma}}_{j}^{(t)}=\frac{{n}_{j}^{(t-1)}{\mathrm{\Sigma}}_{j}^{(t-1)}+{m}_{j}^{(t)}\mathrm{\Sigma}^{\prime}{}_{j}{}^{(t)}}{{n}_{j}^{(t-1)}+{m}_{j}^{(t)}}+\frac{{n}_{j}^{(t-1)}{m}_{j}^{(t)}({\bm{\mu}}_{j}^{(t-1)}-\bm{\mu}^{\prime}{}_{j}{}^{(t)}){({\bm{\mu}}_{j}^{(t-1)}-\bm{\mu}^{\prime}{}_{j}{}^{(t)})}^{T}}{{({n}_{j}^{(t-1)}+{m}_{j}^{(t)})}^{2}},$$ | (10) |
$${n}_{j}^{(t)}={n}_{j}^{(t-1)}+{m}_{j}^{(t)}$$ | (11) |
where ${\bm{\mu}}_{j}^{(t)}$ and ${\mathrm{\Sigma}}_{j}^{(t)}$ are the estimates of the mean and the covariance matrix of the features of the ${j}^{th}$ class at the ${t}^{th}$ step. $\bm{\mu}^{\prime}{}_{j}{}^{(t)}$ and $\mathrm{\Sigma}^{\prime}{}_{j}{}^{(t)}$ are the mean and the covariance matrix of the features of the ${j}^{th}$ class in the ${t}^{th}$ mini-batch. ${n}_{j}^{(t)}$ denotes the total number of training samples belonging to the ${j}^{th}$ class in the first $t$ mini-batches, and ${m}_{j}^{(t)}$ denotes the number of training samples belonging to the ${j}^{th}$ class in the ${t}^{th}$ mini-batch alone.
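The update rules in Eqs. (9)-(11) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the class name `OnlineClassStats` is hypothetical, the mini-batch covariance is taken to be the biased (divide-by-$m$) estimate, and classes absent from a mini-batch are simply skipped. For fixed features, the merge reproduces the statistics of the concatenated data exactly; during training the features drift as the network is updated, so the running values are only estimates.

```python
import numpy as np

class OnlineClassStats:
    """Running per-class mean and covariance, following Eqs. (9)-(11)."""

    def __init__(self, num_classes, feat_dim):
        self.n = np.zeros(num_classes)                            # n_j
        self.mu = np.zeros((num_classes, feat_dim))               # mu_j
        self.sigma = np.zeros((num_classes, feat_dim, feat_dim))  # Sigma_j

    def update(self, feats, labels):
        for j in np.unique(labels):          # classes present in this batch
            batch = feats[labels == j]
            m = batch.shape[0]               # m_j^(t)
            mu_b = batch.mean(axis=0)        # mini-batch mean mu'_j^(t)
            centered = batch - mu_b
            sigma_b = centered.T @ centered / m   # mini-batch covariance (biased)
            n = self.n[j]
            total = n + m
            diff = self.mu[j] - mu_b         # uses the OLD running mean
            # Eq. (10): weighted average plus a mean-shift correction term.
            self.sigma[j] = ((n * self.sigma[j] + m * sigma_b) / total
                             + n * m * np.outer(diff, diff) / total ** 2)
            # Eq. (9): weighted average of the means.
            self.mu[j] = (n * self.mu[j] + m * mu_b) / total
            # Eq. (11): accumulate the sample count.
            self.n[j] = total

# Toy usage: stream fixed random features in mini-batches of 50.
rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 3))
labels = rng.integers(0, 2, size=300)
stats = OnlineClassStats(num_classes=2, feat_dim=3)
for start in range(0, 300, 50):
    stats.update(feats[start:start + 50], labels[start:start + 50])
```

Note that the covariance is updated before the mean, because Eq. (10) uses ${\bm{\mu}}_{j}^{(t-1)}$ rather than the new mean.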
Gradient computation. In backward propagation, gradients of ${\overline{\mathcal{L}}}_{\mathrm{\infty}}$ are given by:
$$\frac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {b}_{j}}=\frac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {z}_{j}}=\begin{cases}\dfrac{{e}^{{z}_{{y}_{i}}}}{{\sum}_{j=1}^{C}{e}^{{z}_{j}}}-1, & j={y}_{i}\\[6pt] \dfrac{{e}^{{z}_{j}}}{{\sum}_{j=1}^{C}{e}^{{z}_{j}}}, & j\ne {y}_{i}\end{cases},$$ | (12) |
$$\frac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {\bm{w}}_{j}^{T}}=\begin{cases}\left({\bm{a}}_{i}+{\sum}_{n=1}^{C}\left[({\bm{w}}_{n}^{T}-{\bm{w}}_{{y}_{i}}^{T}){\mathrm{\Sigma}}_{i}\right]\right)\dfrac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {z}_{j}}, & j={y}_{i}\\[6pt] \left({\bm{a}}_{i}+({\bm{w}}_{j}^{T}-{\bm{w}}_{{y}_{i}}^{T}){\mathrm{\Sigma}}_{i}\right)\dfrac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {z}_{j}}, & j\ne {y}_{i}\end{cases},$$ | (13) |
$$\frac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {a}_{k}}=\sum _{j=1}^{C}{w}_{jk}\frac{\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}}{\partial {z}_{j}},1\le k\le A,$$ | (14) |
where ${w}_{jk}$ denotes the ${k}^{th}$ element of ${\bm{w}}_{j}$. $\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}/\partial \mathbf{\Theta}$ can then be obtained through the backward propagation algorithm using $\partial {\overline{\mathcal{L}}}_{\mathrm{\infty}}/\partial \bm{a}$.
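Eqs. (12) and (14) can be sanity-checked numerically with finite differences. The sketch below rests on an assumption that is not stated in this appendix: the augmented logits are taken to be $z_j = \bm{w}_j^T\bm{a} + b_j + \tfrac{1}{2}(\bm{w}_j-\bm{w}_{y_i})^T\mathrm{\Sigma}(\bm{w}_j-\bm{w}_{y_i})$, with any scaling factor absorbed into $\mathrm{\Sigma}$, and the loss is the cross-entropy of these logits. All function names and the toy dimensions are illustrative, and Eq. (13) is not checked here.

```python
import numpy as np

def isda_loss(W, b, a, sigma, y):
    """Single-sample loss under the ASSUMED augmented-logit form."""
    diff = W - W[y]                                    # rows: w_j - w_y
    quad = 0.5 * np.einsum("jk,kl,jl->j", diff, sigma, diff)
    z = W @ a + b + quad                               # augmented logits z_j
    z_shift = z - z.max()                              # numerical stability
    log_prob = z_shift - np.log(np.exp(z_shift).sum())
    return -log_prob[y], z

def analytic_grads(W, b, a, sigma, y):
    _, z = isda_loss(W, b, a, sigma, y)
    p = np.exp(z - z.max())
    p /= p.sum()
    dz = p.copy()
    dz[y] -= 1.0           # Eq. (12): softmax(z)_j - 1[j = y_i]
    da = W.T @ dz          # Eq. (14): sum_j w_jk * dL/dz_j
    return dz, da

def numeric_grad(fn, x, eps=1e-6):
    """Central finite differences of a scalar function fn at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (fn(x + step) - fn(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
C, A, y = 4, 3, 2                                      # toy sizes and label
W, b, a = rng.normal(size=(C, A)), rng.normal(size=C), rng.normal(size=A)
M = rng.normal(size=(A, A))
sigma = M @ M.T / A                                    # symmetric PSD stand-in
dz, da = analytic_grads(W, b, a, sigma, y)
db_num = numeric_grad(lambda v: isda_loss(W, v, a, sigma, y)[0], b)
da_num = numeric_grad(lambda v: isda_loss(W, b, v, sigma, y)[0], a)
```

Under this assumed form, the bias enters each logit with unit coefficient and the feature vector enters only through $\bm{w}_j^T\bm{a}$, which is why the gradients with respect to $b_j$ and $z_j$ coincide in Eq. (12) and why Eq. (14) involves no covariance term.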
Appendix B Training Details
On CIFAR, we implement ResNet, SE-ResNet, Wide-ResNet, ResNeXt, DenseNet and PyramidNet. All models are trained using SGD with Nesterov momentum. The specific training hyper-parameters are presented in Table 6.
Network | Total Epochs | Batch Size | Weight Decay | Momentum | Initial ${l}_{r}$ | ${l}_{r}$ Schedule |
---|---|---|---|---|---|---|
ResNet | 160 | 128 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 in ${80}^{th}$ and ${120}^{th}$ epoch. |
SE-ResNet | 200 | 128 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 in ${80}^{th}$, ${120}^{th}$ and ${160}^{th}$ epoch. |
Wide-ResNet | 240 | 128 | 5e-4 | 0.9 | 0.1 | Multiplied by 0.2 in ${60}^{th}$, ${120}^{th}$, ${160}^{th}$ and ${200}^{th}$ epoch. |
DenseNet-BC | 300 | 64 | 1e-4 | 0.9 | 0.1 | Multiplied by 0.1 in ${150}^{th}$, ${200}^{th}$ and ${250}^{th}$ epoch. |
ResNeXt | 350 | 128 | 5e-4 | 0.9 | 0.05 | Multiplied by 0.1 in ${150}^{th}$, ${225}^{th}$ and ${300}^{th}$ epoch. |
Shake-Shake | 1800 | 64 | 1e-4 | 0.9 | 0.1 | Cosine learning rate. |
PyramidNet | 1800 | 128 | 1e-4 | 0.9 | 0.1 | Cosine learning rate. |
On ImageNet, we train ResNet for 120 epochs using the same l2 weight decay and momentum as CIFAR, following [43]. The initial learning rate is set as 0.1 and divided by 10 every 30 epochs. The size of mini-batch is set as 256.
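The step schedules described above can be expressed with a small helper. This is a sketch, not the training script: the function name `step_lr` is hypothetical, and it assumes epochs are 0-indexed with the decay taking effect from the milestone epoch onward, which the text does not specify.

```python
import math

def step_lr(epoch, base_lr, milestones, gamma):
    """Learning rate at a 0-indexed epoch for a multi-step decay schedule."""
    passed = sum(1 for m in milestones if epoch >= m)  # milestones reached
    return base_lr * gamma ** passed

# ResNet on CIFAR (Table 6): multiplied by 0.1 at epochs 80 and 120.
resnet_lrs = [step_lr(e, 0.1, (80, 120), 0.1) for e in range(160)]
# ResNet on ImageNet: divided by 10 every 30 epochs over 120 epochs.
imagenet_lrs = [step_lr(e, 0.1, (30, 60, 90), 0.1) for e in range(120)]
```

The other step schedules in Table 6 follow by changing `milestones` and `gamma`, e.g. `(60, 120, 160, 200)` with `gamma=0.2` for Wide-ResNet.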
All baselines are implemented with the same training configurations described above. When dropout is not already applied in the basic model, the dropout rate is set to 0.3 for comparison, following the instructions in [37]. For the noise rate in DisturbLabel, 0.05 is adopted for Wide-ResNet-28-10 on both CIFAR-10 and CIFAR-100 and for ResNet-110 on CIFAR-10, while 0.1 is used for ResNet-110 on CIFAR-100. Focal loss contains two hyper-parameters, $\alpha $ and $\gamma $. We tested numerous combinations on the validation set and ultimately chose $\alpha =0.5$ and $\gamma =1$ for all four experiments. For the L${}_{q}$ loss, although [16] states that $q=0.7$ achieves the best performance under most conditions, we find that $q=0.4$ is more suitable in our experiments and therefore adopt it. For center loss, we find that its performance is strongly affected by the learning rate of the center loss module; its initial learning rate is therefore set to 0.5 for the best generalization performance.
For generator-based augmentation methods, we apply the GAN structures introduced in [39, 40, 41, 42] to train the generators. For WGAN, a separate generator is trained for each class of the CIFAR-10 dataset. For CGAN, ACGAN and infoGAN, a single model suffices to generate images of all classes. A 100-dimensional noise vector drawn from a standard normal distribution is used as input, and the generated images correspond to the given labels. In particular, infoGAN takes an additional two-dimensional input, which represents specific attributes of the whole training set. Synthetic images are mixed into every mini-batch at a fixed ratio. Based on experiments on the validation set, the proportion of generated images is set to $1/6$.
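Mixing synthetic images into each mini-batch at a fixed ratio can be sketched as below. The function `mix_batch`, the rounding of the number of synthetic samples, and the sampling without replacement are assumptions for illustration; the text only fixes the proportion at $1/6$.

```python
import numpy as np

def mix_batch(real_x, real_y, fake_x, fake_y, batch_size,
              fake_ratio=1 / 6, rng=None):
    """Build one mini-batch containing a fixed share of generated samples."""
    if rng is None:
        rng = np.random.default_rng()
    n_fake = round(batch_size * fake_ratio)   # rounding is an assumption
    n_real = batch_size - n_fake
    ri = rng.choice(len(real_x), size=n_real, replace=False)
    fi = rng.choice(len(fake_x), size=n_fake, replace=False)
    x = np.concatenate([real_x[ri], fake_x[fi]])
    y = np.concatenate([real_y[ri], fake_y[fi]])
    perm = rng.permutation(batch_size)        # shuffle real and fake together
    return x[perm], y[perm], n_fake

# Toy data: zeros stand in for real images, ones for generated images.
real_x, real_y = np.zeros((500, 2)), np.zeros(500, dtype=int)
fake_x, fake_y = np.ones((100, 2)), np.ones(100, dtype=int)
x, y, n_fake = mix_batch(real_x, real_y, fake_x, fake_y, batch_size=128,
                         rng=np.random.default_rng(0))
```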
Appendix C Reversing Convolutional Networks
To explicitly demonstrate the semantic changes generated by ISDA, we propose an algorithm to map deep features back to the pixel space. Some extra visualization results are shown in Figure 7.
An overview of the algorithm is presented in Figure 6. As there is no closed-form inverse function for convolutional networks like ResNet or DenseNet, the mapping algorithm acts in a similar way to [44] and [9], by fixing the model and adjusting the inputs to find images corresponding to the given features. However, given that ISDA in essence augments the semantics of images, we find it ineffective to directly optimize the inputs in the pixel space. Therefore, we add a fixed pre-trained generator $\mathcal{G}$, obtained by training a Wasserstein GAN [39], to produce images for the classification model, and optimize the inputs of the generator instead. This approach makes it possible to effectively reconstruct images with augmented semantics.
The mapping algorithm can be divided into two steps:
Step I. Assume a random variable $\bm{z}$ is normalized to $\widehat{\bm{z}}$ and input to $\mathcal{G}$, generating fake image $\mathcal{G}(\widehat{\bm{z}})$. ${\bm{x}}_{i}$ is a real image sampled from the dataset (such as CIFAR). $\mathcal{G}(\widehat{\bm{z}})$ and ${\bm{x}}_{i}$ are forwarded through a pre-trained convolutional network to obtain deep feature vectors $f(\mathcal{G}(\widehat{\bm{z}}))$ and ${\bm{a}}_{i}$. The first step of the algorithm is to find the input noise variable ${\bm{z}}_{i}$ corresponding to ${\bm{x}}_{i}$, namely
$${\bm{z}}_{i}=\arg\min_{\bm{z}}\ {\parallel f(\mathcal{G}(\widehat{\bm{z}}))-{\bm{a}}_{i}\parallel}_{2}^{2}+\eta {\parallel \mathcal{G}(\widehat{\bm{z}})-{\bm{x}}_{i}\parallel}_{2}^{2},\quad \mathrm{s.t.}\ \widehat{\bm{z}}=\frac{\bm{z}-\overline{\bm{z}}}{\mathrm{std}(\bm{z})},$$ | (15) |
where $\overline{\bm{z}}$ and $\mathrm{std}(\bm{z})$ are the average value and the standard deviation of $\bm{z}$, respectively. Consistency in both the pixel space and the deep feature space is considered in the loss function, and we introduce a hyper-parameter $\eta $ to adjust the relative importance of the two objectives.
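The objective of Eq. (15) can be written down directly. The sketch below only evaluates the objective and does not perform the optimization; the linear maps standing in for the generator $\mathcal{G}$ and the feature extractor $f$ are hypothetical toy stand-ins, and the function names are illustrative.

```python
import numpy as np

def normalize(z):
    """z_hat = (z - mean(z)) / std(z), the constraint in Eq. (15)."""
    return (z - z.mean()) / z.std()

def step1_objective(z, G, f, a_i, x_i, eta):
    """Feature-space error plus eta times pixel-space error, as in Eq. (15)."""
    image = G(normalize(z))
    return np.sum((f(image) - a_i) ** 2) + eta * np.sum((image - x_i) ** 2)

# Toy linear stand-ins for the generator and the feature extractor.
rng = np.random.default_rng(0)
G_mat, F_mat = rng.normal(size=(8, 5)), rng.normal(size=(3, 8))
G = lambda z_hat: G_mat @ z_hat        # "image" with 8 pixels
f = lambda img: F_mat @ img            # 3-dimensional "deep feature"

z_true = rng.normal(size=5)
x_i = G(normalize(z_true))             # target image
a_i = f(x_i)                           # target deep feature
loss_at_truth = step1_objective(z_true, G, f, a_i, x_i, eta=0.1)
loss_elsewhere = step1_objective(rng.normal(size=5), G, f, a_i, x_i, eta=0.1)
```

Because of the normalization, the objective is invariant to shifting and positively rescaling $\bm{z}$, so the minimizer is not unique in $\bm{z}$ even in this toy setting.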
Step II. We augment ${\bm{a}}_{i}$ with ISDA, forming $\tilde{\bm{a}}_{i}$, and reconstruct it in the pixel space. Specifically, we search for ${\bm{z}}_{i}^{\prime}$ corresponding to $\tilde{\bm{a}}_{i}$ in the deep feature space, starting from the point ${\bm{z}}_{i}$ found in Step I:
$${\bm{z}}_{i}^{\prime}=\arg\min_{{\bm{z}}^{\prime}}\ {\parallel f(\mathcal{G}({\widehat{\bm{z}}}^{\prime}))-\tilde{\bm{a}}_{i}\parallel}_{2}^{2},\quad \mathrm{s.t.}\ {\widehat{\bm{z}}}^{\prime}=\frac{{\bm{z}}^{\prime}-\overline{{\bm{z}}^{\prime}}}{\mathrm{std}({\bm{z}}^{\prime})}.$$ | (16) |
As the mean square error in the deep feature space is optimized to 0, $\mathcal{G}({\widehat{{\bm{z}}_{i}}}^{\prime})$ is supposed to represent the image corresponding to $\tilde{\bm{a}}_{i}$.
The proposed algorithm is performed on a single batch. In practice, a ResNet-32 network is used as the convolutional network. We solve Eqs. (15) and (16) with a standard gradient descent (GD) algorithm for 10000 iterations. The initial learning rate is set to 10 and 1 for Step I and Step II, respectively, and is divided by 10 every 2500 iterations. We apply a momentum of 0.9 and an l2 weight decay of 1e-4.
Appendix D Extra Experimental Results
Curves of test errors of state-of-the-art methods and ISDA are presented in Figure 8. ISDA consistently outperforms the other methods and shows the best generalization performance in all situations. Notably, ISDA decreases test errors more evidently on CIFAR-100, which demonstrates that our method is more suitable for datasets with fewer samples. This observation is consistent with the results in the paper. In addition, among the other methods, center loss shows performance competitive with ISDA on CIFAR-10, but it fails to significantly improve generalization on CIFAR-100.