An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation

  • 2019-07-31 19:37:16
  • Vincent Michalski, Vikram Voleti, Samira Ebrahimi Kahou, Anthony Ortiz, Pascal Vincent, Chris Pal, Doina Precup


Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has been shown to yield similar, in some domains even superior performance to batch normalization. All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional batch normalization. We evaluate performances on the tasks of visual question answering, few-shot learning, and conditional image generation.



Vincent Michalski Vikram Voleti Samira Ebrahimi Kahou Anthony Ortiz University of Texas - El Paso

Pascal Vincent
Chris Pal Doina Precup








Preprint. Work in progress.

1 Introduction

In machine learning, the parameters of a model are typically optimized using a fixed training set. The model is then evaluated on a separate partition of the data to estimate its generalization capability. In practice, even under the i.i.d. assumption (all data samples are assumed to be drawn independently from an identical distribution), the distributions of these two finite sets can appear quite different to the learning algorithm, making it challenging to achieve strong and robust generalization. This difference often arises because a training set of limited size cannot adequately cover the cross-product of all relevant factors of variation. This issue can be addressed by making strong assumptions that simplify discovering a family of patterns from limited data. Bahdanau et al. (2018), for example, show that their proposed synthetic relational reasoning task can be solved by a Neural Module Network (NMN) (Andreas et al., 2016) with fixed tree structure, while models without this structural prior fail.

Recent studies propose different benchmarks for evaluating the generalization capacity of task-specific models (Johnson et al., 2017; Kahou et al., 2017; Bahdanau et al., 2018). While we focus in this paper on visual question answering (VQA), few-shot learning and generative models, any improvement in this direction can also benefit other domains such as reinforcement learning. Some of the best-performing models for each of these tasks are deep neural networks that employ Conditional Batch Normalization (CBN) (De Vries et al., 2017) for modulating normalized activations with contextual information. For Batch Normalization (BN), one usually has to precompute activation statistics over the training set to be used during inference. Since BN (Ioffe and Szegedy, 2015) (and thus also CBN) relies on dataset statistics, it may be vulnerable to significant domain shifts between training and test data. A recent study by Galloway et al. (2019) indicates that BN is also vulnerable to adversarial examples.

The recently proposed Group Normalization (GN) (Wu and He, 2018) normalizes across groups of feature maps instead of across batch samples. Here, we explore whether a conditional formulation of GN is a viable alternative to CBN. GN is conceptually simpler than BN, as its function is the same during training and inference. Further, GN can be used with small batch sizes, which may help in applications with particularly large feature maps, such as medical imaging or video processing, in which the available memory can be a constraint.

We compare Conditional Group Normalization (CGN) and CBN in a variety of tasks to see whether there are any significant performance differences. Section 2 reviews some basic concepts that our work builds upon. Section 3 describes the setup and results of our experiments. Finally, we draw conclusions and present some directions for future work in Section 4.

2 Background

2.1 Normalization Layers

Several normalization methods have been proposed to stabilize and speed up the training of deep neural networks (Ioffe and Szegedy, 2015; Wu and He, 2018; Lei Ba et al., 2016; Ulyanov et al., 2016). To stabilize the range of variation of network activations $x_i$, methods such as BN (Ioffe and Szegedy, 2015) first normalize the activations by subtracting the mean $\mu_i$ and dividing by the standard deviation $\sigma_i$:

$$\hat{x}_i = \frac{1}{\sigma_i}\,(x_i - \mu_i) \tag{1}$$

The distinction between different methods lies in how exactly these statistics are computed. Wu and He (2018) aptly summarize several methods using the following notation. Let $i = (i_N, i_C, i_H, i_W)$ be a four-dimensional vector, whose elements index the features along the batch, channel, height and width axes, respectively. The computation of the statistics can then be written as

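$$\mu_i = \frac{1}{m} \sum_{k \in \mathcal{S}_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m} \sum_{k \in \mathcal{S}_i} (x_k - \mu_i)^2 + \epsilon}, \tag{2}$$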

where the set $\mathcal{S}_i$ of size $m$ is defined differently for each method and $\epsilon$ is a small constant for numerical stability. BN, for instance, corresponds to:

$$\text{BN} \implies \mathcal{S}_i = \{k \mid k_C = i_C\}, \tag{3}$$

i.e. $\mathcal{S}_i$ is the set of all pixels sharing the same channel index, resulting in $\mu_i$ and $\sigma_i$ being computed along the $(N, H, W)$ axes.

As Lei Ba et al. (2016) point out, the performance of BN is highly affected by the batch size hyperparameter. This insight led to the introduction of several alternative normalization schemes that normalize per sample, i.e. not along the batch axis $N$. Layer Normalization (LN) (Lei Ba et al., 2016), which normalizes activations within each layer, corresponds to the following set definition:

$$\text{LN} \implies \mathcal{S}_i = \{k \mid k_N = i_N\}. \tag{4}$$

Ulyanov et al. (2016) introduce Instance Normalization (IN) in the context of image stylization. IN normalizes separately for each sample and each channel along the spatial dimensions:

$$\text{IN} \implies \mathcal{S}_i = \{k \mid k_N = i_N,\; k_C = i_C\}. \tag{5}$$

Recently, Wu and He (2018) introduced GN, which draws inspiration from classical features such as HOG (Dalal and Triggs, 2005). It normalizes features per sample, separately within each of $G$ groups, along the channel axis:

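$$\text{GN} \implies \mathcal{S}_i = \left\{k \;\middle|\; k_N = i_N,\; \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\} \tag{6}$$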

GN can be seen as a way to interpolate between the two extremes of LN (corresponding to $G = 1$, i.e. all channels are in a single group) and IN (corresponding to $G = C$, i.e. each channel is in its own group).
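The set definitions in Equations 3 to 6 can be illustrated with a small sketch. It is not the implementation used in the experiments: the tensor is stored as a plain dictionary from index tuples $(n, c, h, w)$ to values, and the helper names (`normalize`, `bn_key`, `ln_key`, `in_key`, `make_gn_key`) are hypothetical. Each key function maps an index to an identifier of its set $\mathcal{S}_i$, so two positions share statistics exactly when their keys are equal.

```python
import math
from collections import defaultdict


def normalize(x, set_key, eps=1e-5):
    """Apply Equations 1 and 2 to a tensor stored as {(n, c, h, w): value}.

    `set_key` maps an index to an identifier of its set S_i; positions with
    the same identifier share one mean and one standard deviation.
    """
    groups = defaultdict(list)
    for idx, value in x.items():
        groups[set_key(idx)].append(value)

    stats = {}
    for key, values in groups.items():
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        stats[key] = (mu, math.sqrt(var + eps))

    out = {}
    for idx, value in x.items():
        mu, sigma = stats[set_key(idx)]
        out[idx] = (value - mu) / sigma
    return out


def bn_key(idx):
    return idx[1]  # BN: same channel (Equation 3)


def ln_key(idx):
    return idx[0]  # LN: same sample (Equation 4)


def in_key(idx):
    return (idx[0], idx[1])  # IN: same sample and channel (Equation 5)


def make_gn_key(num_channels, num_groups):
    channels_per_group = num_channels // num_groups
    # GN: same sample and same channel group (Equation 6)
    return lambda idx: (idx[0], idx[1] // channels_per_group)


# Toy tensor with N=2 samples, C=4 channels and 2x2 feature maps.
N, C, H, W = 2, 4, 2, 2
x = {(n, c, h, w): float(100 * n + 10 * c + 2 * h + w)
     for n in range(N) for c in range(C) for h in range(H) for w in range(W)}

y_gn = normalize(x, make_gn_key(C, num_groups=2))
y_ln = normalize(x, ln_key)
y_in = normalize(x, in_key)
```

With `num_groups=1` the GN key partitions the indices exactly like `ln_key`, and with `num_groups=C` exactly like `in_key`, which mirrors the interpolation described above.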

After normalization, all of the above-mentioned methods insert a scaling and shifting operation using learnable per-channel parameters $\gamma$ and $\beta$:

$$y_i = \gamma \hat{x}_i + \beta \tag{7}$$

This “de-normalization” is done to restore the representational power of the normalized network layer (Ioffe and Szegedy, 2015).


CBN (De Vries et al., 2017; Perez et al., 2018) is a conditional variant of BN, in which the learnable parameters $\gamma$ and $\beta$ in Equation 7 are replaced by learnable functions

$$\gamma(c_k) = W_\gamma c_k + b_\gamma, \qquad \beta(c_k) = W_\beta c_k + b_\beta \tag{8}$$

of some per-sample conditioning input $c_k$ to the network, with parameters $W_\gamma$, $W_\beta$, $b_\gamma$, $b_\beta$. In a VQA model, $c_k$ would for instance be an embedding of the question (Perez et al., 2018). Dumoulin et al. (2017) introduce Conditional Instance Normalization (CIN), a conditional variant of IN similar to CBN, replacing BN with IN. In our experiments, we also explore a conditional variant of GN.
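As a rough illustration of how Equations 6 to 8 combine into a conditional variant of GN, the following sketch normalizes the features of a single sample group-wise and then applies a scale and shift predicted from the conditioning vector. It is a minimal sketch under stated assumptions, not the code used in the experiments: the function names (`affine`, `conditional_group_norm`), the list-based data layout (one list of spatial values per channel) and the toy numbers are all hypothetical.

```python
import math


def affine(W, b, c):
    """Compute W c + b for a matrix W (list of rows), bias b and vector c."""
    return [sum(w * c_j for w, c_j in zip(row, c)) + b_i
            for row, b_i in zip(W, b)]


def conditional_group_norm(x, c, W_gamma, b_gamma, W_beta, b_beta,
                           num_groups, eps=1e-5):
    """Group-normalize one sample x (list of channels, each a list of
    spatial values), then scale and shift each channel with gamma(c) and
    beta(c) as in Equation 8."""
    num_channels = len(x)
    channels_per_group = num_channels // num_groups
    gamma = affine(W_gamma, b_gamma, c)  # one scale per channel
    beta = affine(W_beta, b_beta, c)     # one shift per channel

    out = []
    for g in range(num_groups):
        channels = range(g * channels_per_group, (g + 1) * channels_per_group)
        values = [v for ch in channels for v in x[ch]]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        sigma = math.sqrt(var + eps)
        for ch in channels:
            out.append([gamma[ch] * (v - mu) / sigma + beta[ch]
                        for v in x[ch]])
    return out


# Toy example: 4 channels with 3 spatial positions each, 2-d conditioning.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 1.0, 0.0], [2.0, 2.0, 5.0]]
c = [1.0, -1.0]
W_gamma = [[0.5, 0.0]] * 4  # gamma(c) = 0.5 * c[0] + 1.0 = 1.5 per channel
b_gamma = [1.0] * 4
W_beta = [[0.0, 1.0]] * 4   # beta(c) = c[1] = -1.0 per channel
b_beta = [0.0] * 4

y = conditional_group_norm(x, c, W_gamma, b_gamma, W_beta, b_beta,
                           num_groups=2)
```

Because the statistics are computed per sample, the same function can be used unchanged at training and inference time, which is the property that motivates GN in the first place.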

2.2 Visual Question Answering

In VQA (Malinowski and Fritz, 2014; Antol et al., 2015), the task is to answer a question about an image. This task is usually approached by feeding both image and question to a parametric model, which is trained to predict the correct answer, for instance via classification among all possible answers in the dataset. One recent successful model for VQA is the Feature-wise Linear Modulation (FiLM) architecture (Perez et al., 2018), which employs CBN to modulate visual features based on an embedding of the question.

2.3 Few-Shot Classification

The task of few-shot classification consists of classifying data given only a small set of support samples for each class. In episodic M-way, k-shot classification tasks, meta-learning models (Ravi and Larochelle, 2016) learn to adapt a classifier given multiple M-class classification tasks, with k support samples for each class. The meta-learner thus has to solve the problem of generalizing between these tasks given the limited number of training samples. In this work we experiment with the recently proposed Task dependent adaptive metric (TADAM) architecture (Oreshkin et al., 2018). It belongs to the family of meta-learners that employ nearest neighbor classification within a learned embedding space. In the case of TADAM, the network providing this embedding is modulated by a task embedding using CBN.

2.4 Conditional Image Generation

Some of the most successful models for generating images are Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). This approach involves training a neural network (Generator) to generate an image, while the only supervisory signal is that from another neural network (Discriminator) which indicates whether the image looks real or not. Several variants of GANs (Mirza and Osindero, 2014; Odena et al., 2017) have been proposed to condition the image generation process on a class label. More recently, the generators that work best stack multiple ResNet-style (He et al., 2016) architectural blocks, involving two CBN-ReLU-Conv operations and an upsampling operation. These blocks are followed by a BN-ReLU-Conv operation to transform the last features into the shape of an image.

Such models can be trained as Wasserstein GANs using gradient penalty (WGAN-GP) as proposed by Gulrajani et al. (2017), which gives mathematically sound arguments for an optimization framework. We adopt this framework for our experiments. More recently, two of the most noteworthy GAN architectures, Self-Attention GAN (SAGAN) (Zhang et al., 2018a) and BigGAN (Brock et al., 2019), use architectures similar to WGAN-GP, with some important changes. SAGAN inserts a self-attention mechanism (Parikh et al., 2016; Vaswani et al., 2017; Cheng et al., 2016) to attend over important parts of features during the generation process. In addition, it uses spectral normalization (Miyato et al., 2018) to stabilize training. The architecture of BigGAN is the same as for SAGAN, with the exception of an increase in batch size and channel widths, as well as some architectural changes to improve memory and computational efficiency. Both these models have been successfully used in generating high quality natural images. In our experiments, we compare performance metrics of WGAN-GP networks using two types of normalization.
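For reference, the critic objective of WGAN-GP can be written as follows. The notation is an assumption that follows the usual presentation of Gulrajani et al. (2017) and is not taken from this paper: $D$ is the discriminator (critic), $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated distributions, $\lambda$ is the penalty coefficient, and $\hat{x}$ is a random interpolate between a real and a generated sample.

```latex
L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[ D(\tilde{x}) \right]
  - \mathbb{E}_{x \sim \mathbb{P}_r}\left[ D(x) \right]
  + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
    \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the Wasserstein distance between the two distributions, while the last term softly enforces the Lipschitz constraint on the critic by penalizing gradient norms that deviate from one.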

3 Experiments

3.1 Visual Question Answering

We study whether substituting CGN for CBN in the VQA architecture FiLM (Perez et al., 2018) yields comparable performance. We run experiments on several recently proposed benchmarks for compositional generalization.

3.1.1 Datasets


CLEVR-CoGenT (Johnson et al., 2017), the CLEVR Compositional Generalization Test, is a variant of the popular Compositional Language and Elementary Visual Reasoning (CLEVR) dataset (Johnson et al., 2017) that tests for compositional generalization. The images consist of rendered three-dimensional scenes containing several shapes (small and large cubes, spheres and cylinders) of differing material properties (metal or rubber) and colors. Questions involve queries for object attributes, comparisons, counting of sets and combinations thereof. In contrast to the regular CLEVR dataset, the training set of CLEVR-CoGenT explicitly combines some shapes only with different subsets of four out of eight colors, and provides two validation sets: one with the same combinations (valA) and one in which the shape-color assignments are swapped (valB). To perform well on valB, the model has to generalize to unseen combinations of shapes and colors, i.e. it needs to somewhat capture the compositionality of the task. Figure 1(a) shows an example from this dataset.


FigureQA (Kahou et al., 2017) is a VQA dataset consisting of mathematical plots with templated yes/no question-answer pairs that address relations between plot elements. The dataset contains plots of five types (vertical/horizontal bar plots, line plots, pie charts and dot-line plots). Each plot has between 2 and 10 elements, each of which has one of 100 colors. Plot elements (e.g. a slice in a pie chart) are identified by their color names in the questions. Questions query for one-vs-one or one-vs-all attribute relations, e.g. "Is Lime Green less than WebGray?" or "Does Cadet Blue have the minimum area under the curve?". Similar to CLEVR-CoGenT, FigureQA requires compositional generalization. The overall 100 colors are split into two sets A and B, each containing 50 unique colors. During training, colors of certain plot types are sampled from set A, while the remaining plot types use colors from set B (scheme 1). There are two validation sets, one using the same color scheme, and one for which the plot-type to color assignments are swapped (scheme 2). See Figure 1(b) for a sample from the dataset.


Spatial Queries On Object Pairs (SQOOP) (Bahdanau et al., 2018) is a recently introduced dataset that tests for systematic generalization. It consists of images containing five randomly chosen and arranged objects (digits and characters). Questions concern the four spatial relations LEFT OF, RIGHT OF, ABOVE and BELOW and the queries are all of the format "X R Y?", where X and Y are left-hand and right-hand objects and R is a relationship between them, e.g. "nine LEFT OF a?". To test for systematic generalization, only a limited number of combinations of each left-hand object X with different right-hand objects Y are shown during training. In the hardest version of the task (1 rhs/lhs), only a single right-hand side object is combined with each left-hand side object. For instance, the training set of this version may contain images with the query "A RIGHT OF B", but no images with queries about relations of left-hand object A with any other object than B. The test set contains images and questions about all combinations, i.e. it evaluates generalization to relations between novel object combinations. Figure 1(c) shows an example from the training set.
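The restriction on training combinations can be sketched as follows. This is only an illustration of the "k rhs/lhs" idea, not the generation procedure of Bahdanau et al. (2018): the object vocabulary of 36 symbols (ten digits and 26 letters), the exclusion of queries that pair an object with itself, and the helper names `sample_allowed_pairs` and `training_questions` are all assumptions made for this example.

```python
import random
import string

RELATIONS = ("LEFT OF", "RIGHT OF", "ABOVE", "BELOW")


def sample_allowed_pairs(objects, rhs_per_lhs, rng):
    """For each left-hand object, pick the right-hand objects that may
    appear together with it in training questions."""
    allowed = {}
    for lhs in objects:
        candidates = [obj for obj in objects if obj != lhs]
        allowed[lhs] = set(rng.sample(candidates, rhs_per_lhs))
    return allowed


def training_questions(allowed):
    """Enumerate all (X, R, Y) queries permitted during training."""
    return [(lhs, rel, rhs)
            for lhs in sorted(allowed)
            for rhs in sorted(allowed[lhs])
            for rel in RELATIONS]


# Assumed vocabulary: 10 digits and 26 upper-case letters (36 objects).
objects = list(string.digits + string.ascii_uppercase)
rng = random.Random(0)

hardest = sample_allowed_pairs(objects, rhs_per_lhs=1, rng=rng)
easiest = sample_allowed_pairs(objects, rhs_per_lhs=35, rng=rng)
```

Under these assumptions the hardest split exposes 36 × 1 × 4 = 144 distinct queries during training, whereas the 35 rhs/lhs split covers all 36 × 35 × 4 = 5040 queries that the test set can ask.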

(a) CLEVR-CoGenT: Are there any gray things made of the same material as the big cyan cylinder? - No
(b) FigureQA: Does Medium Seafoam intersect Light Gold? - Yes
(c) SQOOP: X right_of J? - no
Figure 1: Examples of the VQA datasets used in our experiments.

3.1.2 Model

We experiment with several small variations of the FiLM architecture (Perez et al., 2018). The original architecture in Perez et al. (2018) consists of an unconditional stem network, a core of four ResNet (He et al., 2016) blocks with CBN (De Vries et al., 2017) and a classifier. The stem network is either a sequence of residual blocks trained from scratch or a fixed pre-trained feature extractor followed by a learnable layer of 3×3 convolutions. The scaling and shifting parameters of the core layers are affine transforms of a question embedding provided by a gated recurrent unit (GRU) (Cho et al., 2014). The output of the last residual block is fed to the classifier, which consists of a layer of 512 1×1 convolutions, global max-pooling, followed by a fully-connected ReLU (Nair and Hinton, 2010) layer using (unconditional) BN and a softmax layer, which outputs the probability of each possible answer. We always set the number of groups to 4, as Wu and He (2018) showed that this hyperparameter does not have a large influence on the performance; this number was selected using uniform sampling from the set {2, 4, 8, 16}. We train the following three variants that include CGN:

  1. All conditional and regular BN layers are replaced with corresponding conditional or regular GN layers.

  2. All CBN layers are replaced with CGN; regular BN layers are left unchanged.

  3. All CBN layers are replaced with CGN; regular BN layers are left unchanged, except the fully-connected hidden layer in the classifier, for which we remove normalization.

Besides the described changes in the normalization layers, the architecture and hyperparameters are the same as used in Perez et al. (2018) for all experiments, except for SQOOP where they are the same as in Bahdanau et al. (2018). The only difference is that we set the constant ϵ of the Adam optimizer (Kingma and Ba, 2014) to 1e-5 to improve training stability (the authors of Perez et al. (2018) confirmed occasional gradient explosions with the original setting of 1e-8). For SQOOP, the input to the residual network is the raw image pixels. For all other networks, the input is features extracted from layer conv4 of a ResNet-101 (He et al., 2016), pre-trained on ImageNet (Russakovsky et al., 2015), following Perez et al. (2018).

3.1.3 Results

Tables 1, 2 and 3 show the results of training FiLM with CBN and CGN on the three considered datasets. In the experiments on CLEVR-CoGenT, all three CGN variants of FiLM achieve a slightly higher average accuracy. On FigureQA, CBN outperforms CGN slightly. In the hardest SQOOP variant with only one right-hand side object per left-hand side object (1 rhs/lhs), all three variants of CGN achieve a higher performance than CBN. For SQOOP variants whose training sets contain more combinations, CGN did not converge in some cases. Learning curves of models successfully trained on SQOOP seem to follow the same pattern: for a relatively large number of gradient updates there is no significant improvement. Then, at some point, the model achieves 100% training accuracy almost instantly. It is possible that a hyperparameter search or additional regularization is required to guarantee convergence.

Table 1: Classification accuracy on CLEVR-CoGenT valB. Mean and standard deviation of three runs with early stopping on valA are reported for the models we trained.

| Model | Accuracy (%) |
|---|---|
| CBN (FiLM (Perez et al., 2018)) | 75.600 |
| CBN (FiLM, our results) | 75.539±0.671 |
| CGN (all GN) | 75.758±0.356 |
| CGN (BN in stem, classifier no norm) | 75.703±0.571 |
| CGN (BN in stem and classifier) | 75.807±0.511 |

Table 2: Classification accuracy on FigureQA validation2, mean and standard deviation of three runs after early stopping on validation1.

| Model | Accuracy (%) |
|---|---|
| CBN (FiLM, our results) | 91.618±0.132 |
| CGN (all GN) | 91.343±0.436 |
| CGN (BN in stem, classifier no norm) | 91.080±0.166 |
| CGN (BN in stem and classifier) | 91.317±0.514 |

Table 3: Test accuracies on several versions of SQOOP. Mean and standard deviation of three runs after early stopping on the validation set are reported for the models we trained.

| Dataset | Model | Accuracy (%) |
|---|---|---|
| 1 rhs/lhs | CBN (FiLM (Bahdanau et al., 2018)) | 65.270±4.610 |
| | CBN (FiLM, our results) | 72.369±0.529 |
| | CGN (all GN) | 74.020±2.814 |
| | CGN (BN in stem, classifier no norm) | 73.824±0.334 |
| | CGN (BN in stem and classifier) | 74.929±3.888 |
| 2 rhs/lhs | CBN (FiLM (Bahdanau et al., 2018)) | 80.200±4.320 |
| | CBN (FiLM, our results) | 84.966±4.165 |
| | CGN (all GN) | 86.689±6.308 |
| | CGN (BN in stem, classifier no norm) | 83.109±0.381 |
| | CGN (BN in stem and classifier) | 85.859±5.318 |
| 4 rhs/lhs | CBN (FiLM (Bahdanau et al., 2018)) | 90.420±1.000 |
| | CBN (FiLM, our results) | 97.043±1.958 |
| | CGN (all GN) | 91.404±0.318 |
| | CGN (BN in stem, classifier no norm) | 91.601±1.937 |
| | CGN (BN in stem and classifier) | 99.474±0.254 |
| 35 rhs/lhs | CBN (FiLM (Bahdanau et al., 2018)) | 99.803±0.219 |
| | CBN (FiLM, our results) | 99.841±0.043 |
| | CGN (all GN) | 99.755±0.025 |
| | CGN (BN in stem, classifier no norm) | 99.815±0.122 |
| | CGN (BN in stem and classifier) | 99.782±0.155 |

3.2 Few-Shot Learning


CBN has also been used in recent methods for few-shot learning (Oreshkin et al., 2018; Jiang et al., 2018). We replicate the experiments of Oreshkin et al. (2018) on Mini-ImageNet and Fewshot-CIFAR100 (FC100) using their code for TADAM and compare the results with a version that uses CGN instead of CBN.

3.2.1 Datasets

Mini-ImageNet was proposed by Vinyals et al. (2016) as a benchmark for few-shot classification. It contains 100 classes, for each of which there are 600 images of resolution 84×84. To generate five-way five-shot classification tasks, five classes and five support samples for each class are sampled uniformly. The remaining images of the sampled classes are used to compute the accuracy. Using the split proposed by Ravi and Larochelle (2016), we uniformly sample training tasks from a subset of 64 classes. The remaining 36 classes are divided into 16 for meta-validation and 20 for meta-testing.
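The task sampling described above can be sketched in a few lines. This is an illustrative sketch rather than the sampling code of Oreshkin et al. (2018): the function name `sample_episode`, the dictionary layout of the dataset and the placeholder image identifiers are hypothetical, and using all remaining images of the sampled classes as queries is an assumption.

```python
import random


def sample_episode(images_by_class, num_ways=5, num_shots=5, rng=random):
    """Sample one M-way, k-shot task.

    Returns (support, query): dictionaries mapping each sampled class to
    its k support images and to its remaining images, respectively.
    """
    classes = rng.sample(sorted(images_by_class), num_ways)
    support, query = {}, {}
    for cls in classes:
        images = list(images_by_class[cls])
        rng.shuffle(images)
        support[cls] = images[:num_shots]
        query[cls] = images[num_shots:]
    return support, query


# Placeholder training partition: 64 classes with 600 image ids each.
dataset = {f"class_{i}": [f"class_{i}/img_{j}" for j in range(600)]
           for i in range(64)}
rng = random.Random(0)
support, query = sample_episode(dataset, num_ways=5, num_shots=5, rng=rng)
```

Passing a seeded `random.Random` instance keeps episodes reproducible, which is convenient when comparing two normalization variants on identical tasks.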

Fewshot-CIFAR100 (Oreshkin et al., 2018) is a few-shot classification version of the popular CIFAR100 data set (Krizhevsky, 2009). Similarly to Mini-ImageNet, it contains 100 classes and 600 samples per class. The resolution of the images is 32×32. The classes are split by superclasses to reduce information overlap between data set partitions, which makes the task more challenging than Mini-ImageNet. The training partition contains 60 classes belonging to 12 superclasses. The validation and test partitions contain 20 classes belonging to 5 superclasses each. The tasks are sampled uniformly as in Mini-ImageNet.

3.2.2 Model

Figure 2: Architecture of TADAM (Oreshkin et al., 2018). Boxes with dashed border share parameters. Figure adapted from Oreshkin et al. (2018).

TADAM (Oreshkin et al., 2018) is a metric-based few-shot classifier, i.e. it learns a measure of similarity between query samples and class representations. The metric is based on a learned image embedding $f_\phi(x, c)$ provided by a residual network. Figure 2 shows a diagram of the overall architecture. Each class template is computed as the average embedding of all support samples for the respective class. The Euclidean distances between the embedding of a query sample and each of the class templates, weighted by a learned scaling factor $\alpha$, are then used to classify the query sample $x$. The embedding network $f_\phi$ (see the dashed boxes in Figure 2) is modulated using CBN with a conditioning input $c$. In the computation of the similarity metric, $c$ is given by a task embedding $\Gamma$ provided by a task embedding network (TEN), which reads the average embeddings of support samples from all classes of the task. Note that $f_\phi$ is evaluated without conditioning, i.e. with $c$ set to a zero vector, in the computation of the task embedding $\Gamma$ (see bottom of Figure 2). The conditioning input is implemented as a deviation from the identity transform (unity scaling and zero shift), so setting it to zero does not change the normalized activations. For the GN version we replaced all conditional and regular BN layers with their corresponding conditional or regular GN version (with the number of groups set to 4). For a complete description of the experimental setup, including all other hyperparameters, we refer the reader to Oreshkin et al. (2018).
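The metric-based classification step can be illustrated with a small sketch that operates on precomputed embeddings. It is a simplified sketch, not the TADAM implementation: the embedding network and its conditioning are left out, the use of a softmax over negative scaled distances is an assumption about how the distances are turned into class probabilities, and the function names (`class_templates`, `classify`) and toy embeddings are hypothetical.

```python
import math


def class_templates(support):
    """Average the support embeddings of each class into one template."""
    templates = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        templates[label] = [sum(e[d] for e in embeddings) / len(embeddings)
                            for d in range(dim)]
    return templates


def classify(query, templates, alpha=1.0):
    """Return class probabilities from scaled Euclidean distances.

    Closer templates get higher probability; alpha controls how sharply
    the distribution concentrates on the nearest template.
    """
    logits = {label: -alpha * math.dist(query, template)
              for label, template in templates.items()}
    max_logit = max(logits.values())  # subtract the max for stability
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: v / total for label, v in exps.items()}


# Toy 2-d embeddings for a two-way, two-shot task.
support = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[4.0, 4.0], [6.0, 4.0]],
}
templates = class_templates(support)
probs = classify([1.0, 1.0], templates, alpha=1.0)
```

In this toy example the query lies at distance 1 from the "cat" template and distance 5 from the "dog" template, so "cat" receives most of the probability mass, and increasing `alpha` makes the prediction more confident.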

3.2.3 Results

Table 4: Five-way five-shot classification accuracy on Fewshot-CIFAR100 (Oreshkin et al., 2018) and Mini-ImageNet (Vinyals et al., 2016), mean and standard deviation of ten runs.

| Dataset | Model | Accuracy (%) |
|---|---|---|
| FC100 | TADAM (CBN) (Oreshkin et al., 2018) | 52.996±0.610 |
| | TADAM (CGN) | 52.807±0.509 |
| Mini-ImageNet | TADAM (CBN) (Oreshkin et al., 2018) | 76.414±0.499 |
| | TADAM (CGN) | 74.032±0.373 |

We see that using CGN instead of CBN yields only slightly reduced performance on FC100, while there is a considerable gap of about 2.4 percentage points on Mini-ImageNet. Note that we simply reuse the hyperparameters from Oreshkin et al. (2018), which were tuned for CBN.

3.3 Conditional Image Generation

Here we compare CBN and CGN on the task of generating images conditioned on their class label using the WGAN-GP (Gulrajani et al., 2017) architecture.

3.3.1 Dataset

CIFAR-10 (Krizhevsky, 2009) is a data set containing 60000 32×32 images, 6000 for each of 10 classes. The dataset is split into 50000 training and 10000 test samples.

3.3.2 Model

We replicated the WGAN-GP (Gulrajani et al., 2017) architecture from the original paper, which uses CBN. As in other tasks, we also train the CGN variants, where we substitute conditional and unconditional BN layers with the corresponding conditional or unconditional GN layers, with the number of groups set to 4. We use the optimization setup from Gulrajani et al. (2017): a learning rate of 2e-4 for both generator and discriminator, five discriminator updates per generator update, and the Adam optimizer (Kingma and Ba, 2014). We train using a single GPU (NVIDIA P100) and a batch size of 64.

3.3.3 Results

Figure 3 shows samples from WGAN-GP trained using each of the two normalization methods.

(a) CBN (b) CGN
Figure 3: Samples from models trained with different normalization techniques. The images in each column belong to the same class, ordered as ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. Samples are not cherry-picked.

For both normalization methods, in addition to a qualitative check of the generated samples, we calculate two scores that are widely used in the community to evaluate image generation: Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). We use publicly available code to calculate IS and FID. The computed values for real data may differ slightly from the original ones since these use PyTorch (Paszke et al., 2017) implementations, while the original papers use TensorFlow (Abadi et al., 2015). However, we compare the same implementation of these metrics for true and generated data.


IS is meant to measure the naturalness of an image by checking the embedding of the generated images on a pre-trained Inception network (Szegedy et al., 2016). Although the suitability of the IS for this purpose has been rightfully put into question (Barratt and Sharma, 2018), it continues to be used frequently. FID measures how similar two sets of images are, by computing the Fréchet distance between two multivariate Gaussians fitted to the embeddings of the images from the two sets. The embeddings are obtained from a pre-trained InceptionV3 network (Szegedy et al., 2016). In this case, we measure the distance between the real CIFAR-10 images and the generated ones. This is a better metric than IS, since there is no constraint on the images being natural, and it is able to quantify not only their similarity to the real images, but also diversity in the generated images.
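For reference, the two scores are commonly defined as follows. The notation is an assumption and not taken from this paper: $p(y \mid x)$ is the label distribution predicted by the Inception network for a generated image $x$, $p(y)$ is its marginal over generated images, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception embeddings of real and generated images, respectively.

```latex
\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g}
  \left[ D_{\mathrm{KL}}\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)

\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The FID compares generated images directly against statistics of the real data, whereas the IS only looks at the generated images, which is one way to see why the FID is also sensitive to a lack of diversity.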

We first calculate the IS of the true images of CIFAR-10, for each class separately. Then, during training of a model, we sample images from the generator at regular intervals, and calculate the IS and FID of those images for each class separately. This allows us to see the effect of the different normalization techniques on the conditional generation process. We average our results from four runs with different seeds, shown in Figure 4.

(a) IS (b) FID
Figure 4: (a) Inception score (IS, higher is better) and (b) FID (lower is better) of samples generated by WGAN-GP model while training on CIFAR-10.

Figure 5: Classification Accuracy Score (CAS) using a ResNet classifier trained on samples generated while training on CIFAR-10 with WGAN-GP using (blue) CBN and (green) CGN, while (black) is the value when trained with true data. All classifiers have been trained with the same hyperparameters.

We also calculate the recently proposed Classification Accuracy Score (CAS) (Ravuri and Vinyals, 2019) for one instance of training using WGAN-GP with CBN and CGN each, shown in Figure 5. In the computation of this metric, a ResNet (He et al., 2016) classifier is trained on data sampled from the generative model being evaluated. Then the accuracy of this classifier on the true validation data is calculated. Ravuri and Vinyals (2019) mention that this could indicate the closeness of the generated data distribution to the true data distribution. All three metrics indicate that CBN is better than CGN in conditional generative models of images such as WGAN-GP.

The WGAN-GP model architecture consists of a series of residual blocks followed by a BN-ReLU-Conv layer. Each residual block contains two BN-ReLU-Conv modules. Since the architectures of more recent models such as SAGAN (Zhang et al., 2018a) and BigGAN (Brock et al., 2019) are similar to the one we used, it is likely that the conclusions we draw from the WGAN-GP experiments transfer to them.

4 Conclusion

Because the performance of CBN heavily depends on the batch size and on how well training and test statistics match, we investigate the use of CGN as a potential alternative to CBN. We consider a set of experiments for VQA, few-shot learning and image generation tasks in which some of the best models rely on CBN for conditional computation. We experimentally show that the effect of this substitution is task-dependent, with performance increases in some VQA tasks that focus on systematic generalization, but a clear decrease in performance in conditional image generation. CGN's simpler implementation, its consistent behaviour during training and inference time, as well as its independence from batch sizes, are all good reasons to explore its adoption instead of CBN in tasks that require systematic generalization. That being said, further analysis is required to be able to confidently suggest one method over the other. For instance, a hyperparameter search for each of the normalization methods would be required to provide a better performance comparison. Also, we would like to characterize the sensitivity of CBN's performance to the batch size and focus on domains, such as medical imaging or video processing, for which efficient large-batch training becomes nontrivial. Lastly, since some of the success of BN (and consequently also CBN) can be attributed to the regularization effect introduced by noisy batch statistics, it seems worthwhile to explore combinations of CGN with additional regularization as suggested for GN by Wu and He (2018). The latter is also motivated by recent successful attempts at replacing (unconditional) BN with careful network initialization (Zhang et al., 2019), which relies on additional regularization (Zhang et al., 2018b) to match generalization performance.


Acknowledgements

We thank Boris Oreshkin, Eugene Belilovsky, Matthew Scicluna, Mahdi Ebrahimi Kahou, Kris Sankaran and Alex Lamb for helpful discussions. This research was enabled in part by support provided by Compute Canada.

References


  • Bahdanau et al. [2018] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, and Aaron Courville. Systematic generalization: What is required and can it be learned? arXiv preprint arXiv:1811.12889, 2018.
  • Andreas et al. [2016] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48, 2016.
  • Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • Kahou et al. [2017] Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. Figureqa: An annotated figure dataset for visual reasoning. Workshop in the International Conference on Learning Representations, 2017.
  • De Vries et al. [2017] Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron C Courville. Modulating early visual processing by language. In Advances in Neural Information Processing Systems, pages 6594–6604, 2017.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
  • Galloway et al. [2019] Angus Galloway, Anna Golubeva, Thomas Tanay, Medhat Moussa, and Graham W Taylor. Batch normalization is a cause of adversarial vulnerability. arXiv preprint arXiv:1905.02161, 2019.
  • Wu and He [2018] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • Lei Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.
  • Dalal and Triggs [2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE Computer Society, 2005.
  • Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Dumoulin et al. [2017] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. International Conference on Learning Representations (ICLR), 2017.
  • Malinowski and Fritz [2014] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in neural information processing systems, pages 1682–1690, 2014.
  • Antol et al. [2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
  • Ravi and Larochelle [2016] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
  • Oreshkin et al. [2018] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • Mirza and Osindero [2014] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
  • Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International Conference on Machine Learning (ICML), 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pages 770–778, 2016.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
  • Zhang et al. [2018a] Han Zhang, Ian J. Goodfellow, Dimitris N. Metaxas, and Augustus Odena. Self-attention generative adversarial networks. International Conference on Learning Representations (ICLR), 2018a.
  • Brock et al. [2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. International Conference on Learning Representations, 2019.
  • Parikh et al. [2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933, 2016.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Cheng et al. [2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733, 2016.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations (ICLR), 2018.
  • Cho et al. [2014] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • Nair and Hinton [2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2014.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • Jiang et al. [2018] Xiang Jiang, Mohammad Havaei, Farshid Varno, Gabriel Chartrand, Nicolas Chapados, and Stan Matwin. Learning to learn with conditional class dependencies. 2018.
  • Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
  • Krizhevsky [2009] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  • Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • Barratt and Sharma [2018] Shane Barratt and Rishi Kant Sharma. A note on the inception score. CoRR, abs/1801.01973, 2018.
  • Ravuri and Vinyals [2019] Suman Ravuri and Oriol Vinyals. Classification accuracy score for conditional generative models. arXiv preprint arXiv:1905.10887, 2019.
  • Zhang et al. [2019] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. In International Conference on Learning Representations, 2019.
  • Zhang et al. [2018b] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018b.