Large Scale Adversarial Representation Learning
Abstract
Adversarially trained generative models (GANs) have recently achieved compelling image synthesis results. But despite early successes in using GANs for unsupervised representation learning, they have since been superseded by approaches based on self-supervision. In this work we show that progress in image generation quality translates to substantially improved representation learning performance. Our approach, BigBiGAN, builds upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. We extensively evaluate the representation learning and generation capabilities of these BigBiGAN models, demonstrating that these generation-based models achieve the state of the art in unsupervised representation learning on ImageNet, as well as in unconditional image generation.
Jeff Donahue (DeepMind), Karen Simonyan (DeepMind)
1 Introduction
In recent years we have seen rapid progress in generative models of visual data. While these models were previously confined to domains with single or few modes, simple structure, and low resolution, with advances in both modeling and hardware they have since gained the ability to convincingly generate complex, multimodal, high resolution image distributions biggan ; stylegan ; glow .
Intuitively, the ability to generate data in a particular domain necessitates a high-level understanding of the semantics of said domain. This idea has long-standing appeal as raw data is both cheap – readily available in virtually infinite supply from sources like the Internet – and rich, with images comprising far more information than the class labels that typical discriminative machine learning models are trained to predict from them. Yet, while the progress in generative models has been undeniable, nagging questions persist: what semantics have these models learned, and how can they be leveraged for representation learning?
The dream of generation as a means of true understanding from raw data alone has hardly been realized. Instead, the most successful approaches for unsupervised learning leverage techniques adopted from the field of supervised learning, a class of methods known as self-supervised learning carl ; splitbrain ; cpc ; rotation . These approaches typically involve changing or holding back certain aspects of the data in some way, and training a model to predict or generate aspects of the missing information. For example, colorful ; splitbrain proposed colorization as a means of unsupervised learning, where a model is given a subset of the color channels in an input image, and trained to predict the missing channels.
Generative models as a means of unsupervised learning offer an appealing alternative to self-supervised tasks in that they are trained to model the full data distribution without requiring any modification of the original data. One class of generative models that has been applied to representation learning is generative adversarial networks (GANs) gan . The generator in the GAN framework is a feed-forward mapping from randomly sampled latent variables (also called “noise”) to generated data, with learning signal provided by a discriminator trained to distinguish between real and generated data samples, guiding the generator’s outputs to follow the data distribution. The adversarially learned inference (ALI) ali or bidirectional GAN (BiGAN) bigan approaches were proposed as extensions to the GAN framework that augment the standard GAN with an encoder module mapping real data to latents, the inverse of the mapping learned by the generator.
In the limit of an optimal discriminator, bigan showed that a deterministic BiGAN behaves like an autoencoder minimizing $\ell_{0}$ reconstruction costs; however, the shape of the reconstruction error surface is dictated by a parametric discriminator, as opposed to simple pixel-level measures like the $\ell_{2}$ error. Since the discriminator is usually a powerful neural network, the hope is that it will induce an error surface which emphasizes “semantic” errors in reconstructions, rather than low-level details.
In bigan it was demonstrated that the encoder learned via the BiGAN or ALI framework is an effective means of visual representation learning on ImageNet for downstream tasks. However, it used a DCGAN dcgan style generator, incapable of producing high-quality images on this dataset, so the semantics the encoder could model were in turn quite limited. In this work we revisit this approach using BigGAN biggan as the generator, a modern model that appears capable of capturing many of the modes and much of the structure present in ImageNet images. Our contributions are as follows:

• We show that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet.
• We propose a more stable version of the joint discriminator for BigBiGAN.
• We perform a thorough empirical analysis and ablation study of model design choices.
• We show that the representation learning objective also helps unconditional image generation, and demonstrate state-of-the-art results in unconditional ImageNet generation.
2 BigBiGAN
The BiGAN bigan or ALI ali approaches were proposed as extensions of the GAN gan framework which enable the learning of an encoder that can be employed as an inference model ali or feature representation bigan . Given a distribution $P_{\mathbf{x}}$ of data $\mathbf{x}$ (e.g., images), and a distribution $P_{\mathbf{z}}$ of latents $\mathbf{z}$ (usually a simple continuous distribution like an isotropic Gaussian $\mathcal{N}(0,I)$), the generator $\mathcal{G}$ models a conditional distribution $P(\mathbf{x}|\mathbf{z})$ of data $\mathbf{x}$ given latent inputs $\mathbf{z}$ sampled from the latent prior $P_{\mathbf{z}}$, as in the standard GAN generator gan . The encoder $\mathcal{E}$ models the inverse conditional distribution $P(\mathbf{z}|\mathbf{x})$, predicting latents $\mathbf{z}$ given data $\mathbf{x}$ sampled from the data distribution $P_{\mathbf{x}}$.
Besides the addition of $\mathcal{E}$, the other modification to the GAN in the BiGAN framework is a joint discriminator $\mathcal{D}$, which takes as input data-latent pairs $(\mathbf{x},\mathbf{z})$ (rather than just data $\mathbf{x}$ as in a standard GAN), and learns to discriminate between pairs from the data distribution and encoder, versus the generator and latent distribution. Concretely, its inputs are pairs $(\mathbf{x}\sim P_{\mathbf{x}},\widehat{\mathbf{z}}\sim \mathcal{E}(\mathbf{x}))$ and $(\widehat{\mathbf{x}}\sim \mathcal{G}(\mathbf{z}),\mathbf{z}\sim P_{\mathbf{z}})$, and the goal of $\mathcal{G}$ and $\mathcal{E}$ is to “fool” the discriminator by making the two joint distributions $P_{\mathbf{x}\mathcal{E}}$ and $P_{\mathcal{G}\mathbf{z}}$ from which these pairs are sampled indistinguishable. The adversarial minimax objective in bigan ; ali , analogous to that of the GAN framework gan , was defined as follows:
$\min_{\mathcal{G}\mathcal{E}}\max_{\mathcal{D}}\left\{\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\mathbf{z}\sim \mathcal{E}_{\Phi}(\mathbf{x})}\left[\log(\sigma(\mathcal{D}(\mathbf{x},\mathbf{z})))\right]+\mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\mathbf{x}\sim \mathcal{G}_{\Phi}(\mathbf{z})}\left[\log(1-\sigma(\mathcal{D}(\mathbf{x},\mathbf{z})))\right]\right\}$
Under this objective, bigan ; ali showed that with an optimal $\mathcal{D}$, $\mathcal{G}$ and $\mathcal{E}$ minimize the Jensen-Shannon divergence between the joint distributions $P_{\mathbf{x}\mathcal{E}}$ and $P_{\mathcal{G}\mathbf{z}}$, and therefore at the global optimum, the two joint distributions $P_{\mathbf{x}\mathcal{E}}=P_{\mathcal{G}\mathbf{z}}$ match, analogous to the results from standard GANs gan . Furthermore, bigan showed that in the case where $\mathcal{E}$ and $\mathcal{G}$ are deterministic functions (i.e., the learned conditional distributions $P_{\mathcal{G}}(\mathbf{x}|\mathbf{z})$ and $P_{\mathcal{E}}(\mathbf{z}|\mathbf{x})$ are Dirac $\delta$ functions), these two functions are inverses at the global optimum: e.g., $\forall_{\mathbf{x}\in \mathrm{supp}(P_{\mathbf{x}})}\,\mathbf{x}=\mathcal{G}(\mathcal{E}(\mathbf{x}))$, with the optimal joint discriminator effectively imposing $\ell_{0}$ reconstruction costs on $\mathbf{x}$ and $\mathbf{z}$.
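The Jensen-Shannon result can be made explicit with a short derivation sketch. It follows the standard GAN argument applied to the joint distributions and assumes that both joints admit densities, written here as $p_{\mathbf{x}\mathcal{E}}$ and $p_{\mathcal{G}\mathbf{z}}$; the symbols $\mathcal{D}^{*}$ and $C(\mathcal{E},\mathcal{G})$ are notation introduced only for this illustration.

```latex
% Optimal discriminator for fixed E and G (pointwise maximization of the objective):
\sigma\!\left(\mathcal{D}^{*}(\mathbf{x},\mathbf{z})\right)
  = \frac{p_{\mathbf{x}\mathcal{E}}(\mathbf{x},\mathbf{z})}
         {p_{\mathbf{x}\mathcal{E}}(\mathbf{x},\mathbf{z}) + p_{\mathcal{G}\mathbf{z}}(\mathbf{x},\mathbf{z})}

% Substituting D^* back into the minimax objective gives the value minimized by E and G:
C(\mathcal{E},\mathcal{G})
  = 2\,\mathrm{JSD}\!\left(P_{\mathbf{x}\mathcal{E}} \,\|\, P_{\mathcal{G}\mathbf{z}}\right) - \log 4
```

Since the Jensen-Shannon divergence is non-negative and zero only when its two arguments coincide, the minimum value $-\log 4$ is attained exactly when $P_{\mathbf{x}\mathcal{E}}=P_{\mathcal{G}\mathbf{z}}$.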
While the crux of our approach, BigBiGAN, remains the same as that of BiGAN bigan ; ali , we have adopted the generator and discriminator architectures from the state-of-the-art BigGAN biggan generative image model. Beyond that, we have found that an improved discriminator structure leads to better representation learning results without compromising generation (Figure 1). Namely, in addition to the joint discriminator loss proposed in bigan ; ali which ties the data and latent distributions together, we propose additional unary terms in the learning objective, which are functions only of either the data $\mathbf{x}$ or the latents $\mathbf{z}$. Although bigan ; ali prove that the original BiGAN objective already enforces that the learnt joint distributions match at the global optimum, implying that the marginal distributions of $\mathbf{x}$ and $\mathbf{z}$ match as well, these unary terms intuitively guide optimization in the “right direction” by explicitly enforcing this property. For example, in the context of image generation, the unary loss term on $\mathbf{x}$ matches the original GAN objective and provides a learning signal which steers only the generator to match the image distribution independently of its latent inputs. (In our evaluation we will demonstrate empirically that the addition of these terms results in both improved generation and representation learning.)
Concretely, the discriminator loss $\mathcal{L}_{\mathcal{D}}$ and the encoder-generator loss $\mathcal{L}_{\mathcal{E}\mathcal{G}}$ are defined as follows, based on scalar discriminator “score” functions $s_{*}$ and the corresponding per-sample losses $\ell_{*}$:
$s_{\mathbf{x}}(\mathbf{x})=\theta_{\mathbf{x}}^{\top}F_{\Theta}(\mathbf{x})$
$s_{\mathbf{z}}(\mathbf{z})=\theta_{\mathbf{z}}^{\top}H_{\Theta}(\mathbf{z})$
$s_{\mathbf{x}\mathbf{z}}(\mathbf{x},\mathbf{z})=\theta_{\mathbf{x}\mathbf{z}}^{\top}J_{\Theta}(F_{\Theta}(\mathbf{x}),H_{\Theta}(\mathbf{z}))$
$\ell_{\mathcal{E}\mathcal{G}}(\mathbf{x},\mathbf{z},y)=y\left(s_{\mathbf{x}}(\mathbf{x})+s_{\mathbf{z}}(\mathbf{z})+s_{\mathbf{x}\mathbf{z}}(\mathbf{x},\mathbf{z})\right),\quad y\in \{-1,+1\}$
$\mathcal{L}_{\mathcal{E}\mathcal{G}}(P_{\mathbf{x}},P_{\mathbf{z}})=\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\widehat{\mathbf{z}}\sim \mathcal{E}_{\Phi}(\mathbf{x})}\left[\ell_{\mathcal{E}\mathcal{G}}(\mathbf{x},\widehat{\mathbf{z}},+1)\right]+\mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\widehat{\mathbf{x}}\sim \mathcal{G}_{\Phi}(\mathbf{z})}\left[\ell_{\mathcal{E}\mathcal{G}}(\widehat{\mathbf{x}},\mathbf{z},-1)\right]$
$\ell_{\mathcal{D}}(\mathbf{x},\mathbf{z},y)=h(y\,s_{\mathbf{x}}(\mathbf{x}))+h(y\,s_{\mathbf{z}}(\mathbf{z}))+h(y\,s_{\mathbf{x}\mathbf{z}}(\mathbf{x},\mathbf{z})),\quad y\in \{-1,+1\}$
$\mathcal{L}_{\mathcal{D}}(P_{\mathbf{x}},P_{\mathbf{z}})=\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}},\widehat{\mathbf{z}}\sim \mathcal{E}_{\Phi}(\mathbf{x})}\left[\ell_{\mathcal{D}}(\mathbf{x},\widehat{\mathbf{z}},+1)\right]+\mathbb{E}_{\mathbf{z}\sim P_{\mathbf{z}},\widehat{\mathbf{x}}\sim \mathcal{G}_{\Phi}(\mathbf{z})}\left[\ell_{\mathcal{D}}(\widehat{\mathbf{x}},\mathbf{z},-1)\right]$
where $h(t)=\max(0,1-t)$ is a “hinge” used to regularize the discriminator geometricgan ; tran , also used in BigGAN biggan . (Footnote 1: We also considered an alternative discriminator loss $\ell_{\mathcal{D}}^{\prime}$ which invokes the “hinge” $h$ just once on the sum of the three loss terms – $\ell_{\mathcal{D}}^{\prime}(\mathbf{x},\mathbf{z},y)=h(y\left(s_{\mathbf{x}}(\mathbf{x})+s_{\mathbf{z}}(\mathbf{z})+s_{\mathbf{x}\mathbf{z}}(\mathbf{x},\mathbf{z})\right))$ – but found that this performed significantly worse than $\ell_{\mathcal{D}}$ above, which clamps each of the three loss terms separately.) The discriminator $\mathcal{D}$ includes three submodules: $F$, $H$, and $J$. $F$ takes only $\mathbf{x}$ as input and $H$ takes only $\mathbf{z}$, and learned projections of their outputs with parameters $\theta_{\mathbf{x}}$ and $\theta_{\mathbf{z}}$ respectively give the scalar unary scores $s_{\mathbf{x}}$ and $s_{\mathbf{z}}$. In our experiments, the data $\mathbf{x}$ are images and latents $\mathbf{z}$ are unstructured flat vectors; accordingly, $F$ is a ConvNet and $H$ is an MLP. The joint score $s_{\mathbf{x}\mathbf{z}}$ tying $\mathbf{x}$ and $\mathbf{z}$ is given by the remaining $\mathcal{D}$ submodule, $J$, a function of the outputs of $F$ and $H$.
The $\mathcal{E}$ and $\mathcal{G}$ parameters $\mathrm{\Phi}$ are optimized to minimize the loss ${\mathcal{L}}_{\mathcal{E}\mathcal{G}}$, and the $\mathcal{D}$ parameters $\mathrm{\Theta}$ are optimized to minimize loss ${\mathcal{L}}_{\mathcal{D}}$. As usual, the expectations $\mathbb{E}$ are estimated by Monte Carlo samples taken over minibatches.
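As an illustration only (not the authors' implementation), the per-sample losses and their minibatch estimates can be sketched in plain Python. The sketch assumes the three scalar scores $s_{\mathbf{x}}$, $s_{\mathbf{z}}$, $s_{\mathbf{x}\mathbf{z}}$ have already been computed by the discriminator; the function names and the tuple layout are hypothetical.

```python
def hinge(t):
    """Hinge h(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def d_loss(s_x, s_z, s_xz, y):
    """Per-sample discriminator loss: each of the three scores is clamped separately."""
    return hinge(y * s_x) + hinge(y * s_z) + hinge(y * s_xz)


def eg_loss(s_x, s_z, s_xz, y):
    """Per-sample encoder/generator loss: y times the unclamped sum of scores."""
    return y * (s_x + s_z + s_xz)


def batch_losses(real_scores, fake_scores):
    """Monte Carlo estimates of L_D and L_EG over one minibatch.

    real_scores: (s_x, s_z, s_xz) triples for pairs (x, E(x)), labelled y = +1.
    fake_scores: (s_x, s_z, s_xz) triples for pairs (G(z), z), labelled y = -1.
    """
    n_real, n_fake = len(real_scores), len(fake_scores)
    loss_d = (sum(d_loss(*s, +1) for s in real_scores) / n_real
              + sum(d_loss(*s, -1) for s in fake_scores) / n_fake)
    loss_eg = (sum(eg_loss(*s, +1) for s in real_scores) / n_real
               + sum(eg_loss(*s, -1) for s in fake_scores) / n_fake)
    return loss_d, loss_eg
```

Minimizing `loss_d` pushes scores of encoder pairs above $+1$ and scores of generator pairs below $-1$, while minimizing `loss_eg` moves both kinds of scores in the opposite direction, which is the adversarial relationship described above.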
3 Evaluation
Most of our experiments follow the standard protocol used to evaluate unsupervised learning techniques, first proposed in colorful . We train a BigBiGAN on unlabeled ImageNet, freeze its learned representation, and then train a linear classifier on its outputs, fully supervised using all of the training set labels. We also measure image generation performance, reporting Inception Score improvedgan (IS) and Fréchet Inception Distance frechet (FID) as the standard metrics there.
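For readers unfamiliar with the Inception Score, the following minimal sketch computes it from class probabilities using the commonly stated definition $\mathrm{IS}=\exp\left(\mathbb{E}_{\mathbf{x}}\left[\mathrm{KL}(p(y|\mathbf{x})\,\|\,p(y))\right]\right)$. It assumes the per-image class distributions have already been produced by a pretrained classifier (an Inception network in practice); the function name and input format are hypothetical, and practical implementations typically also average over several splits of the samples.

```python
import math


def inception_score(probs):
    """Inception Score from per-image class distributions.

    probs: list of lists, one distribution p(y|x) per generated image,
           each summing to 1 over the same set of classes.
    """
    n = len(probs)
    num_classes = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(num_classes)]
    # Mean KL divergence between p(y|x) and p(y); zero-probability terms contribute 0.
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(kl_sum / n)
```

Confident predictions spread evenly over the classes give a score close to the number of classes, while identical predictions for every image give a score of 1.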
3.1 Ablation
We begin with an extensive ablation study in which we directly evaluate a number of modeling choices, with results presented in Table 1. Where possible we performed three runs of each variant with different seeds and report the mean and standard deviation for each metric.
We start with a relatively fully-fledged version of the model at $128\times 128$ resolution (row Base), with the $\mathcal{G}$ architecture and the $F$ component of $\mathcal{D}$ taken from the corresponding $128\times 128$ architectures in BigGAN, including the skip connections and shared noise embedding proposed in biggan . $\mathbf{z}$ is 120 dimensions, split into six groups of 20 dimensions fed into each of the six layers of $\mathcal{G}$ as in biggan . The remaining components of $\mathcal{D}$ – $H$ and $J$ – are 8-layer MLPs with ResNet-style skip connections (four residual blocks with two layers each) and size 2048 hidden layers. The $\mathcal{E}$ architecture is the ResNet-v2-50 ConvNet originally proposed for image classification in resnetv2 , followed by a 4-layer MLP (size 4096) with skip connections (two residual blocks) after ResNet’s globally average pooled output. The unconditional BigGAN training setup corresponds to the “Single Label” setup proposed in zurichfewer , where a single “dummy” label is used for all images (theoretically equivalent to learning a bias in place of the class-conditional batch norm inputs). We then ablate several aspects of the model, with results detailed in the following paragraphs. Additional architectural and optimization details are provided in Appendix A. Full learning curves for many results are included in Appendix D.
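The latent split described above can be illustrated with a few lines of Python. This is only a sketch: it assumes the six groups are contiguous slices of $\mathbf{z}$, which the text does not specify, and the helper name `split_latent` is hypothetical.

```python
import random


def split_latent(z, num_groups=6):
    """Split a flat latent vector into equally sized contiguous groups."""
    assert len(z) % num_groups == 0, "latent size must be divisible by num_groups"
    size = len(z) // num_groups
    return [z[i * size:(i + 1) * size] for i in range(num_groups)]


random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(120)]  # z ~ N(0, I), 120 dimensions
chunks = split_latent(z)  # six groups of 20 dimensions, one per generator layer
```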
Latent distribution ${P}_{\mathbf{z}}$ and stochastic $\mathcal{E}$.
As in ALI ali , the encoder $\mathcal{E}$ of our Base model is non-deterministic, parametrizing a distribution $\mathcal{N}(\mu ,\sigma )$. $\mu$ and $\widehat{\sigma}$ are given by a linear layer at the output of the model, and the final standard deviation $\sigma$ is computed from $\widehat{\sigma}$ using a non-negative “softplus” non-linearity $\sigma =\log(1+\exp(\widehat{\sigma}))$ softplus . The final $\mathbf{z}$ uses the reparametrized sampling from kingmavae , with $\mathbf{z}=\mu +\epsilon \sigma$, where $\epsilon \sim \mathcal{N}(0,I)$. Compared to a deterministic encoder (row Deterministic $\mathrm{E}$) which predicts $\mathbf{z}$ directly without sampling (effectively modeling $P(\mathbf{z}|\mathbf{x})$ as a Dirac $\delta$ distribution), the non-deterministic Base model achieves significantly better classification performance (at no cost to generation). We also compared to using a uniform $P_{\mathbf{z}}=\mathcal{U}(-1,1)$ (row Uniform $P_{\mathrm{z}}$) with $\mathcal{E}$ deterministically predicting $\mathbf{z}=\tanh(\widehat{\mathbf{z}})$ given a linear output $\widehat{\mathbf{z}}$, as done in BiGAN bigan . This also achieves worse classification results than the non-deterministic Base model.
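The sampling step of the non-deterministic encoder can be sketched as follows. This is a minimal illustration under the assumption that $\mu$ and $\widehat{\sigma}$ are already available as plain lists of floats; the function names are hypothetical and the softplus is written in a numerically stable form.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is always non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_latent(mu, sigma_hat, rng):
    """Reparametrized sample z = mu + eps * sigma with eps ~ N(0, I)."""
    sigma = [softplus(s) for s in sigma_hat]
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]


rng = random.Random(0)
z_sample = sample_latent([0.5, -1.0, 2.0], [0.0, 0.0, 0.0], rng)
```

Because the noise enters only through `eps`, the sample is a differentiable function of `mu` and `sigma_hat`, which is what allows gradients to flow back into the encoder.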
Unary loss terms.
We evaluate the effect of removing one or both unary terms of the loss function proposed in Section 2, $s_{\mathbf{x}}$ and $s_{\mathbf{z}}$. Removing both unary terms (row No Unaries) corresponds to the original objective proposed in bigan ; ali . It is clear that the $\mathbf{x}$ unary term has a large positive effect on generation performance, with the Base and $\mathbf{x}$ Unary Only rows having significantly better IS and FID than the $\mathbf{z}$ Unary Only and No Unaries rows. This result makes intuitive sense as it matches the standard generator loss. It also marginally improves classification performance. The $\mathbf{z}$ unary term makes a smaller difference, likely due to the relative ease of modeling simple distributions like isotropic Gaussians, though it also results in slightly improved classification and generation in terms of FID – especially without the $\mathbf{x}$ term ($\mathbf{z}$ Unary Only vs. No Unaries). On the other hand, IS is worse with the $\mathbf{z}$ term. This may be due to IS roughly measuring the generator’s coverage of the major modes of the distribution (the classes) rather than the distribution in its entirety, the latter of which may be better captured by FID and more likely to be promoted by a good encoder $\mathcal{E}$. The requirement of invertibility in a (Big)BiGAN could be encouraging the generator to produce distinguishable outputs across the entire latent space, rather than “collapsing” large volumes of latent space to a single mode of the data distribution.
$\mathcal{G}$ capacity.
To address the question of the importance of the generator $\mathcal{G}$ in representation learning, we vary the capacity of $\mathcal{G}$ (with $\mathcal{E}$ and $\mathcal{D}$ fixed) in the Small $\mathrm{G}$ rows. With a third of the capacity of the Base $\mathcal{G}$ model (Small $\mathrm{G}$ (32)), the overall model is quite unstable and achieves significantly worse classification results than the higher capacity base model. (Footnote 2: Though the generation performance by IS and FID in row Small $\mathrm{G}$ (32) is very poor at the point we measured – when its best validation classification performance (43.59%) is achieved – this model was performing more reasonably for generation earlier in training, reaching IS 14.69 and FID 60.67.) With two-thirds capacity (Small $\mathrm{G}$ (64)), generation performance is substantially worse (matching the results in biggan ) and classification performance is modestly worse. These results confirm that a powerful image generator is indeed important for learning good representations via the encoder. Assuming this relationship holds in the future, we expect that better generative models are likely to lead to further improvements in representation learning.
Standard GAN.
We also compare BigBiGAN’s image generation performance against a standard unconditional BigGAN with no encoder $\mathcal{E}$ and only the standard $F$ ConvNet in the discriminator, with only the ${s}_{\mathbf{x}}$ term in the loss (row No $\mathrm{E}$ (GAN)). While the standard GAN achieves a marginally better IS, the BigBiGAN FID is about the same, indicating that the addition of the BigBiGAN $\mathcal{E}$ and joint $\mathcal{D}$ does not compromise generation with the newly proposed unary loss terms described in Section 2. (In comparison, the versions of the model without unary loss term on $\mathbf{x}$ – rows $\mathbf{z}$ Unary Only and No Unaries – have substantially worse generation performance in terms of FID than the standard GAN.) We conjecture that the IS is worse for similar reasons that the ${s}_{\mathbf{z}}$ unary loss term leads to worse IS. Next we will show that with an enhanced $\mathcal{E}$ taking higher input resolutions, generation with BigBiGAN in terms of FID is substantially improved over the standard GAN.
High resolution $\mathcal{E}$ with varying resolution $\mathcal{G}$.
BiGAN bigan proposed an asymmetric setup in which $\mathcal{E}$ takes higher resolution images than $\mathcal{G}$ outputs and $\mathcal{D}$ takes as input, showing that an $\mathcal{E}$ taking $128\times 128$ inputs with a $64\times 64$ $\mathcal{G}$ outperforms a $64\times 64$ $\mathcal{E}$ for downstream tasks. We experiment with this setup in BigBiGAN, raising the $\mathcal{E}$ input resolution to $256\times 256$ – matching the resolution used in typical supervised ImageNet classification setups – and varying the $\mathcal{G}$ output and $\mathcal{D}$ input resolution in $\{64,128,256\}$. Our results in Table 1 (rows High Res $\mathrm{E}$ (256) and Low/High Res $\mathrm{G}$ (*)) show that BigBiGAN achieves better representation learning results as the $\mathcal{G}$ resolution increases, up to the full $\mathcal{E}$ resolution of $256\times 256$. However, because the overall model is much slower to train with $\mathcal{G}$ at $256\times 256$ resolution, the remainder of our results use the $128\times 128$ resolution for $\mathcal{G}$.
Interestingly, with the higher resolution $\mathcal{E}$, generation improves significantly (especially by FID), despite $\mathcal{G}$ operating at the same resolution (row High Res $\mathrm{E}$ (256) vs. Base). This is an encouraging result for the potential of BigBiGAN as a means of improving adversarial image synthesis itself, besides its use in representation learning and inference.
$\mathcal{E}$ architecture.
Keeping the $\mathcal{E}$ input resolution fixed at 256, we experiment with varied and often larger $\mathcal{E}$ architectures, including several of the ResNet-50 variants explored in revisiting . In particular, we expand the capacity of the hidden layers by a factor of $2$ or $4$, as well as swap the residual block structure to a reversible variant called RevNet revnet with the same number of layers and capacity as the corresponding ResNets. (We use the version of RevNet described in revisiting .) We find that the base ResNet-50 model (row High Res $\mathrm{E}$ (256)) outperforms RevNet-50 (row RevNet), but as the network widths are expanded, we begin to see improvements from RevNet-50, with double-width RevNet outperforming a ResNet of the same capacity (rows RevNet $\times 2$ and ResNet $\times 2$). We see further gains with an even larger quadruple-width RevNet model (row RevNet $\times 4$), which we use for our final results in Section 3.2.
Decoupled $\mathcal{E}$/$\mathcal{G}$ optimization.
As a final improvement, we decoupled the $\mathcal{E}$ optimizer from that of $\mathcal{G}$, and found that simply using a $10\times$ higher learning rate for $\mathcal{E}$ dramatically accelerates training and improves final representation learning results. For ResNet-50 this improves linear classifier accuracy by nearly 3% (ResNet ($\uparrow \mathrm{E}$ LR) vs. High Res $\mathrm{E}$ (256)). We also applied this to our largest $\mathcal{E}$ architecture, RevNet-50 $\times 4$, and saw similar gains (RevNet $\times 4$ ($\uparrow \mathrm{E}$ LR) vs. RevNet $\times 4$).
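The idea of decoupling the two learning rates can be sketched in a few lines. Plain gradient descent is used here purely as a stand-in, since the optimizer is not specified in this section; the base learning rate value and all function names are hypothetical, and only the $10\times$ multiplier comes from the text.

```python
def sgd_step(params, grads, lr):
    """One plain gradient-descent step on a flat list of parameters."""
    return [p - lr * g for p, g in zip(params, grads)]


def decoupled_update(g_params, g_grads, e_params, e_grads,
                     base_lr=1e-4, e_lr_mult=10.0):
    """Update G with the base learning rate and E with a multiplied one.

    base_lr is a placeholder value; e_lr_mult=10 mirrors the 10x higher
    encoder learning rate described above.
    """
    new_g = sgd_step(g_params, g_grads, base_lr)
    new_e = sgd_step(e_params, e_grads, e_lr_mult * base_lr)
    return new_g, new_e
```

Both updates still minimize the same loss $\mathcal{L}_{\mathcal{E}\mathcal{G}}$; only the step size applied to the encoder parameters differs.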
Table 1: Results of the ablation study. Column groups: Encoder ($\mathcal{E}$), Gen. ($\mathcal{G}$), Loss $\mathcal{L}_{*}$, Results. Parenthesized entries differ from the Base row.

| Model | $\mathcal{E}$: A. | $\mathcal{E}$: D. | $\mathcal{E}$: C. | $\mathcal{E}$: R. | $\mathcal{E}$: Var. | $\mathcal{E}$: $\eta$ | $\mathcal{G}$: C. | $\mathcal{G}$: R. | $s_{\mathbf{x}\mathbf{z}}$ | $s_{\mathbf{x}}$ | $s_{\mathbf{z}}$ | $P_{\mathbf{z}}$ | IS ($\uparrow$) | FID ($\downarrow$) | Cls. ($\uparrow$) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 22.66 $\pm$ 0.18 | 31.19 $\pm$ 0.37 | 48.10 $\pm$ 0.13 |
| Deterministic $\mathcal{E}$ | S | 50 | 1 | 128 | (-) | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 22.79 $\pm$ 0.27 | 31.31 $\pm$ 0.30 | 46.97 $\pm$ 0.35 |
| Uniform $P_{\mathbf{z}}$ | S | 50 | 1 | 128 | (-) | 1 | 96 | 128 | ✓ | ✓ | ✓ | ($\mathcal{U}$) | 22.83 $\pm$ 0.24 | 31.52 $\pm$ 0.28 | 45.11 $\pm$ 0.93 |
| $\mathbf{x}$ Unary Only | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | ✓ | (-) | $\mathcal{N}$ | 23.19 $\pm$ 0.28 | 31.99 $\pm$ 0.30 | 47.74 $\pm$ 0.20 |
| $\mathbf{z}$ Unary Only | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | (-) | ✓ | $\mathcal{N}$ | 19.52 $\pm$ 0.39 | 39.48 $\pm$ 1.00 | 47.78 $\pm$ 0.28 |
| No Unaries (BiGAN) | S | 50 | 1 | 128 | ✓ | 1 | 96 | 128 | ✓ | (-) | (-) | $\mathcal{N}$ | 19.70 $\pm$ 0.30 | 42.92 $\pm$ 0.92 | 46.71 $\pm$ 0.88 |
| Small $\mathcal{G}$ (32) | S | 50 | 1 | 128 | ✓ | 1 | (32) | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 3.28 $\pm$ 0.18 | 247.30 $\pm$ 10.31 | 43.59 $\pm$ 0.34 |
| Small $\mathcal{G}$ (64) | S | 50 | 1 | 128 | ✓ | 1 | (64) | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 19.96 $\pm$ 0.15 | 38.93 $\pm$ 0.39 | 47.54 $\pm$ 0.33 |
| No $\mathcal{E}$ (GAN) * | (-) | (-) | (-) | (-) | (-) | (-) | 96 | 128 | (-) | ✓ | (-) | $\mathcal{N}$ | 23.56 $\pm$ 0.37 | 30.91 $\pm$ 0.23 | - |
| High Res $\mathcal{E}$ (256) | S | 50 | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.45 $\pm$ 0.14 | 27.86 $\pm$ 0.13 | 50.80 $\pm$ 0.30 |
| Low Res $\mathcal{G}$ (64) | S | 50 | 1 | (256) | ✓ | 1 | 96 | (64) | ✓ | ✓ | ✓ | $\mathcal{N}$ | 19.40 $\pm$ 0.19 | 15.82 $\pm$ 0.06 | 47.51 $\pm$ 0.09 |
| High Res $\mathcal{G}$ (256) | S | 50 | 1 | (256) | ✓ | 1 | 96 | (256) | ✓ | ✓ | ✓ | $\mathcal{N}$ | 24.70 | 38.58 | 51.49 |
| ResNet-101 | S | (101) | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.29 | 28.01 | 51.21 |
| ResNet $\times 2$ | S | 50 | (2) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.68 | 27.81 | 52.66 |
| RevNet | (V) | 50 | 1 | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.33 $\pm$ 0.09 | 27.78 $\pm$ 0.06 | 49.42 $\pm$ 0.18 |
| RevNet $\times 2$ | (V) | 50 | (2) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.21 | 27.96 | 54.40 |
| RevNet $\times 4$ | (V) | 50 | (4) | (256) | ✓ | 1 | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.23 | 28.15 | 57.15 |
| ResNet ($\uparrow \mathcal{E}$ LR) | S | 50 | 1 | (256) | ✓ | (10) | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.27 $\pm$ 0.22 | 28.51 $\pm$ 0.44 | 53.70 $\pm$ 0.15 |
| RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR) | (V) | 50 | (4) | (256) | ✓ | (10) | 96 | 128 | ✓ | ✓ | ✓ | $\mathcal{N}$ | 23.08 | 28.54 | 60.15 |
3.2 Comparison with prior methods
Representation learning.
Table 2: Linear classification accuracy (%) on the official ImageNet validation set.

| Method | Architecture | Feature | Top-1 | Top-5 |
|---|---|---|---|---|
| BiGAN bigan ; splitbrain | AlexNet | conv3 | 31.0 | - |
| Motion Segmentation (MS) motionseg ; carl | ResNet-101 | AvePool | 27.6 | 48.3 |
| Exemplar (Ex) exemplar ; carl | ResNet-101 | AvePool | 31.5 | 53.1 |
| Relative Position (RP) carlorig ; carl | ResNet-101 | AvePool | 36.2 | 59.2 |
| Colorization (Col) colorful ; carl | ResNet-101 | AvePool | 39.6 | 62.5 |
| Combination of MS+Ex+RP+Col carl | ResNet-101 | AvePool | - | 69.3 |
| CPC cpc | ResNet-101 | AvePool | 48.7 | 73.6 |
| Rotation rotation ; revisiting | RevNet-50 $\times 4$ | AvePool | 55.4 | - |
| Efficient CPC cpcplusplus | ResNet-170 | AvePool | 61.0 | 83.0 |
| BigBiGAN (ours) | ResNet-50 | AvePool | 55.4 | 77.4 |
| BigBiGAN (ours) | ResNet-50 | BN+CReLU | 56.6 | 78.6 |
| BigBiGAN (ours) | RevNet-50 $\times 4$ | AvePool | 60.8 | 81.4 |
| BigBiGAN (ours) | RevNet-50 $\times 4$ | BN+CReLU | 61.3 | 81.9 |
We now take our best model by $\mathrm{train}_{\mathrm{val}}$ classification accuracy from the above ablations and present results on the official ImageNet validation set, comparing against the state of the art in recent unsupervised learning literature. For comparison, we also present classification results for our best performing variant with the smaller ResNet-50-based $\mathcal{E}$. These models correspond to the last two rows of Table 1, ResNet ($\uparrow \mathrm{E}$ LR) and RevNet $\times 4$ ($\uparrow \mathrm{E}$ LR).
Results are presented in Table 2. (For reference, the fully supervised accuracy of these architectures is given in Appendix A, Table 4.) Compared with a number of modern self-supervised approaches motionseg ; carlorig ; colorful ; cpc ; rotation ; cpcplusplus and combinations thereof carl , our BigBiGAN approach based purely on generative models performs well for representation learning, achieving state-of-the-art results among recent unsupervised learning methods. It improves top-1 accuracy from 55.4%, a recently published result from revisiting obtained with rotation prediction pre-training, to 60.8% using the same representation learning architecture and feature (labeled AvePool in Table 2), and matches the results of the concurrent work in cpcplusplus based on contrastive predictive coding (CPC). (Footnote 3: Our RevNet $\times 4$ architecture matches the widest architectures used in revisiting , labeled as $\times 16$ there.)
We also experiment with learning linear classifiers on a different rendering of the AvePool feature, labeled BN+CReLU, which boosts our best results with RevNet $\times 4$ to 61.3% top-1 accuracy. Given the global average pooling output $a$, we first compute $h=\mathrm{BatchNorm}(a)$, and the final feature is computed by concatenating $[\mathrm{ReLU}(h),\mathrm{ReLU}(-h)]$, sometimes called a “CReLU” (concatenated ReLU) non-linearity crelu . $\mathrm{BatchNorm}$ denotes parameter-free Batch Normalization batchnorm , where the scale ($\gamma$) and offset ($\beta$) parameters are not learned, so training a linear classifier on this feature does not involve any additional learning. The CReLU non-linearity retains all the information in its inputs and doubles the feature dimension, each of which likely contributes to the improved results.
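The BN+CReLU feature can be sketched in plain Python as follows. This is an illustration under stated assumptions: the normalization uses the statistics of the batch it is given, and the small constant `eps` is a conventional choice, neither of which is specified in the text; the function names are hypothetical.

```python
import math


def batchnorm_no_params(features, eps=1e-5):
    """Parameter-free batch normalization (no learned scale or offset).

    features: list of N feature vectors, each of length C.
    Each dimension is normalized to zero mean and (nearly) unit variance
    using the statistics of this batch.
    """
    n, c = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(c)]
    variances = [sum((f[j] - means[j]) ** 2 for f in features) / n for j in range(c)]
    return [[(f[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(c)]
            for f in features]


def crelu(h):
    """Concatenated ReLU: [ReLU(h), ReLU(-h)], doubling the feature dimension."""
    return [max(v, 0.0) for v in h] + [max(-v, 0.0) for v in h]


avepool = [[1.0, 2.0], [3.0, 6.0]]           # toy stand-in for AvePool outputs
normalized = batchnorm_no_params(avepool)
bn_crelu = [crelu(h) for h in normalized]     # features fed to the linear classifier
```

The original value can be recovered as the difference of the two halves, $\mathrm{ReLU}(h)-\mathrm{ReLU}(-h)=h$, which is the sense in which CReLU retains all the information in its input.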
Unsupervised image generation.
Table 3: Unsupervised image generation results.

| Method | Steps | IS ($\uparrow$) | FID vs. Train ($\downarrow$) | FID vs. Val. ($\downarrow$) |
|---|---|---|---|---|
| BigGAN + SL zurichfewer | 500K | 20.4 (15.4 $\pm$ 7.57) | - | 25.3 (71.7 $\pm$ 66.32) |
| BigGAN + Clustering zurichfewer | 500K | 22.7 (22.8 $\pm$ 0.42) | - | 23.2 (22.7 $\pm$ 0.80) |
| BigBiGAN + SL (ours) | 500K | 25.38 (25.33 $\pm$ 0.17) | 22.78 (22.63 $\pm$ 0.23) | 23.60 (23.56 $\pm$ 0.12) |
| BigBiGAN High Res $\mathcal{E}$ + SL (ours) | 500K | 25.43 (25.45 $\pm$ 0.04) | 22.34 (22.36 $\pm$ 0.04) | 22.94 (23.00 $\pm$ 0.15) |
| BigBiGAN High Res $\mathcal{E}$ + SL (ours) | 1M | 27.94 (27.80 $\pm$ 0.21) | 20.32 (20.27 $\pm$ 0.09) | 21.61 (21.62 $\pm$ 0.09) |
In Table 3 we show results for unsupervised generation with BigBiGAN, comparing to the BigGAN-based biggan unsupervised generation results from zurichfewer . Note that these results differ from those in Table 1 due to the use of the data augmentation method of zurichfewer (rather than ResNet-style preprocessing used for all results in our Table 1 ablation study). (Footnote 4: See the “distorted” preprocessing method from the Compare GAN framework: https://github.com/google/compare_gan/blob/master/compare_gan/datasets.py.) The lighter augmentation from zurichfewer results in better image generation performance under the IS and FID metrics. The improvements are likely due in part to the fact that this augmentation, on average, crops larger portions of the image, thus yielding generators that typically produce images encompassing most or all of a given object, which tends to result in more representative samples of any given class (giving better IS) and more closely matching the statistics of full center crops (as used in the real data statistics to compute FID). Besides this preprocessing difference, the approaches in Table 3 have the same configurations as used in the Base or High Res $\mathrm{E}$ (256) row of Table 1.
These results show that BigBiGAN significantly improves both IS and FID over the baseline unconditional BigGAN generation results with the same (unsupervised) “labels” (a single fixed label in the SL (Single Label) approach – row BigBiGAN + SL vs. BigGAN + SL). We see further improvements using a high resolution $\mathcal{E}$ (row BigBiGAN High Res $\mathrm{E}$ + SL), surpassing the previous unsupervised state of the art (row BigGAN + Clustering) under both IS and FID. (Note that the image generation results remain comparable: the generated image resolution is still $128\times 128$ here, despite the higher resolution $\mathcal{E}$ input.) The alternative “pseudolabeling” approach from zurichfewer , Clustering, which uses labels derived from unsupervised clustering, is complementary to BigBiGAN and combining both could yield further improvements. Finally, observing that results continue to improve significantly with training beyond 500K steps, we also report results at 1M steps in the final row of Table 3.
3.3 Reconstruction
As shown in [4, 7], the (Big)BiGAN $\mathcal{E}$ and $\mathcal{G}$ can reconstruct data instances $\mathbf{x}$ by computing the encoder’s predicted latent representation $\mathcal{E}(\mathbf{x})$ and then passing this predicted latent back through the generator to obtain the reconstruction $\mathcal{G}(\mathcal{E}(\mathbf{x}))$. We present BigBiGAN reconstructions in Figure 2. These reconstructions are far from pixel-perfect, likely due in part to the fact that no reconstruction cost is explicitly enforced by the objective – reconstructions are not even computed at training time. However, they may provide some intuition for what features the encoder $\mathcal{E}$ learns to model. For example, when the input image contains a dog, person, or a food item, the reconstruction is often a different instance of the same “category” with similar pose, position, and texture – for example, a similar species of dog facing the same direction. The extent to which these reconstructions tend to retain the high-level semantics of the inputs rather than the low-level details suggests that BigBiGAN training encourages the encoder to model the former more so than the latter. Additional reconstructions are presented in Appendix B.
4 Related work
A number of approaches to unsupervised representation learning from images based on self-supervision have proven very successful. Self-supervision generally involves learning from tasks designed to resemble supervised learning in some way, but in which the “labels” can be created automatically from the data itself with no manual effort. An early example is relative location prediction [2], where a model is trained on input pairs of image patches and predicts their relative locations. Contrastive predictive coding (CPC) [35, 14] is a recent related approach where, given an image patch, a model predicts which patches occur in other image locations. Other approaches include colorization [37, 38], motion segmentation [27], rotation prediction [8], and exemplar matching [5]. Rigorous empirical comparisons of many of these approaches have also been conducted [3, 21]. A key advantage offered by BigBiGAN and other approaches based on generative models, relative to most self-supervised approaches, is that their input may be the full-resolution image or other signal, with no cropping or modification of the data needed (though such modifications may be beneficial as data augmentation). This means the resulting representation can typically be applied directly to full data in the downstream task with no domain shift.
A number of relevant autoencoder and GAN variants have also been proposed. Associative compression networks (ACNs) [12] learn to compress at the dataset level by conditioning data on other previously transmitted data which are similar in code space, resulting in models that can “daydream” semantically similar samples, similar to BigBiGAN reconstructions. VQ-VAEs [36] pair a discrete (vector quantized) encoder with an autoregressive decoder to produce faithful reconstructions with a high compression factor and demonstrate representation learning results in reinforcement learning settings. In the adversarial space, adversarial autoencoders [25] proposed an autoencoder-style encoder-decoder pair trained with pixel-level reconstruction cost, replacing the KL-divergence regularization of the prior used in VAEs [20] with a discriminator. In another proposed VAE-GAN hybrid [22], the pixel-space reconstruction error used in most VAEs is replaced with feature space distance from an intermediate layer of a GAN discriminator. Other hybrid approaches like AGE [34] and $\alpha$-GAN [29] add an encoder to stabilize GAN training. An interesting difference between many of these approaches and the BiGAN [7, 4] framework is that BiGAN does not train the encoder or generator with an explicit reconstruction cost. Though it can be shown that (Big)BiGAN implicitly minimizes a reconstruction cost, qualitative reconstruction results (Section 3.3) suggest that this reconstruction cost is of a different flavor, emphasizing high-level semantics over pixel-level details.
5 Discussion
We have shown that BigBiGAN, an unsupervised learning approach based purely on generative models, achieves state-of-the-art results in image representation learning on ImageNet. Our ablation study lends further credence to the hope that powerful generative models can be beneficial for representation learning, and in turn that learning an inference model can improve large-scale generative models. In the future we hope that representation learning can continue to benefit from further advances in generative models and inference models alike, as well as scaling to larger image databases.
Acknowledgments
The authors would like to thank Aidan Clark, Olivier Hénaff, Aäron van den Oord, Sander Dieleman, and many other colleagues at DeepMind for useful discussions and feedback on this work.
References
 (1) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
 (2) Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
 (3) Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
 (4) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
 (5) Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. In NeurIPS, 2014.
 (6) Charles Dugas, Yoshua Bengio, François Belisle, Claude Nadeau, and Rene Garcia. Incorporating second-order functional knowledge for better option pricing. In NeurIPS, 2000.
 (7) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
 (8) Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
 (9) Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In NeurIPS, 2017.
 (10) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
 (11) Google. Cloud TPU. https://cloud.google.com/tpu/. Accessed: 2019.
 (12) Alex Graves, Jacob Menick, and Aäron van den Oord. Associative compression networks. In arXiv:1804.02476, 2018.
 (13) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 (14) Olivier J. Hénaff, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. Data-efficient image recognition with contrastive predictive coding. In arXiv:1905.09272, 2019.
 (15) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
 (16) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015.
 (17) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
 (18) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 (19) Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In arXiv:1807.03039, 2018.
 (20) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In arXiv:1312.6114, 2013.
 (21) Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised visual representation learning. In arXiv:1901.09005, 2019.
 (22) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
 (23) Jae Hyun Lim and Jong Chul Ye. Geometric GAN. In arXiv:1705.02894, 2017.
 (24) Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. In ICML, 2019.
 (25) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In ICLR, 2016.
 (26) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 (27) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. In CVPR, 2017.
 (28) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 (29) Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. In arXiv:1706.04987, 2017.
 (30) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 (31) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In arXiv:1606.03498, 2016.
 (32) Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML, 2016.
 (33) Dustin Tran, Rajesh Ranganath, and David M. Blei. Hierarchical implicit models and likelihood-free variational inference. In NeurIPS, 2017.
 (34) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In arXiv:1704.02304, 2017.
 (35) Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In arXiv:1807.03748, 2018.
 (36) Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In arXiv:1711.00937, 2017.
 (37) Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
 (38) Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2016.
Appendix A Model and optimization details
Our optimizer matches that of BigGAN [1] – we use Adam [18] with batch size 2048 and the same learning rates and other hyperparameters, using the $\mathcal{G}$ optimizer to update $\mathcal{E}$ simultaneously, with the same alternating optimization: two $\mathcal{D}$ updates followed by a single joint update of $\mathcal{G}$ and $\mathcal{E}$. (We do not use the orthogonal regularization used in [1], finding it gave worse results in the unconditional setting, matching the findings of [24].) Spectral normalization [26] is used in $\mathcal{G}$ and $\mathcal{D}$, but not in $\mathcal{E}$. Full cross-replica batch normalization is used in both $\mathcal{G}$ and $\mathcal{E}$ (including for the linear classifier training on $\mathcal{E}$ features used for evaluations). We also apply exponential moving averaging (EMA) with a decay of 0.9999 to the $\mathcal{G}$ and $\mathcal{E}$ weights in all evaluations. (We find this results in only a small improvement for $\mathcal{E}$ evaluations, but a substantial one for $\mathcal{G}$ evaluations.)
At BigBiGAN training time, as well as linear classification evaluation training time, we preprocess inputs with ResNet-style [13] data augmentation, though with crops of size 128 or 256 rather than 224. (Preprocessing code from the TensorFlow ResNet TPU model: https://github.com/tensorflow/tpu/tree/master/models/official/resnet.)
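The update order and the weight averaging described above can be sketched as follows. This is an illustration under simplifying assumptions, not our training code: `update_schedule` and `ema_update` are hypothetical helper names, the weights are plain floats in a dictionary rather than network parameters, and the actual $\mathcal{D}$, $\mathcal{G}$ and $\mathcal{E}$ gradient steps are omitted.

```python
def update_schedule(num_iterations, d_steps=2):
    """Yield the update order: d_steps D updates, then one joint G/E update."""
    for _ in range(num_iterations):
        for _ in range(d_steps):
            yield "D"
        yield "G+E"


def ema_update(ema_weights, weights, decay=0.9999):
    """Return the exponential moving average after seeing new weights."""
    return {
        name: decay * ema_weights[name] + (1.0 - decay) * value
        for name, value in weights.items()
    }


# Two iterations of the alternating schedule.
steps = list(update_schedule(num_iterations=2))

# One EMA step on a toy scalar "weight" (hypothetical values).
ema = ema_update({"w": 0.0}, {"w": 1.0})
```

With a decay of 0.9999 each new set of weights contributes only a fraction of $10^{-4}$ to the average, so the averaged weights used at evaluation time change slowly relative to the raw training weights.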
For linear classification evaluations in the ablations reported in Table 1, we hold out 10K randomly selected images from the official ImageNet [30] training set as a validation set and report accuracy on that validation set, which we call train${}_{\mathrm{val}}$. All results in Table 1 are run for 500K steps, with early stopping based on linear classifier accuracy on our train${}_{\mathrm{val}}$ split. In all of these models the linear classifier is initialized to 0 and trained for 5K Adam steps with a (high) learning rate of 0.01 and EMA smoothing with decay 0.9999. We have found it helpful to monitor representation learning progress during BigBiGAN training by periodically rerunning this linear classification evaluation from scratch given the current $\mathcal{E}$ weights, resetting the classifier weights to 0 before each evaluation.
In Table 2 we extend the BigBiGAN training time to 1M steps, and report results on the official validation set of 50K images for comparison with prior work. The classifier in these results is trained for 100K Adam steps, sweeping over learning rates $\{10^{-4},3\cdot 10^{-4},10^{-3},3\cdot 10^{-3},10^{-2}\}$, again applying EMA with decay 0.9999 to the classifier weights. Hyperparameter selection and early stopping are again based on classification accuracy on train${}_{\mathrm{val}}$. As in [1], FID is reported against statistics over the full ImageNet training set, preprocessed by resizing the minor axis to the $\mathcal{G}$ output resolution and taking the center crop along the major axis, except as noted in Table 3, where we also report FID against the validation set for comparison with [24].
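The geometry of the real-data preprocessing used for the FID statistics (resize the minor axis to the $\mathcal{G}$ output resolution, then center crop along the major axis) can be sketched as below. The helper name `center_crop_geometry`, the rounding of the major axis, and the integer division used to place the crop are assumptions made for illustration; the interpolation method is not specified here and is left out.

```python
def center_crop_geometry(height, width, resolution):
    """Return ((new_h, new_w), (top, left)) for resize-then-center-crop.

    The minor axis is resized to exactly `resolution`; the major axis is
    scaled by the same factor (rounded, an assumption) and then cropped
    to `resolution` around its center.
    """
    scale = resolution / min(height, width)
    new_h = resolution if height <= width else round(height * scale)
    new_w = resolution if width <= height else round(width * scale)
    top = (new_h - resolution) // 2
    left = (new_w - resolution) // 2
    return (new_h, new_w), (top, left)


# A 300 x 500 (height x width) image prepared for a 128 x 128 generator.
size, offset = center_crop_geometry(300, 500, 128)
```

The crop then covers rows `top` to `top + resolution` and columns `left` to `left + resolution` of the resized image.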
All models were trained with data parallelism on TPU pod slices [11] using 32 to 512 cores.
Supervised model performance.
In Table 4 we present the results of fully supervised training with the model architectures used in our experiments in Section 3 for comparison purposes.
Architecture  Top-1 (%)  Top-5 (%)

ResNet50  76.3  93.1 
ResNet101  77.8  93.8 
RevNet50  71.8  90.5 
RevNet50 $\times 2$  74.9  92.2 
RevNet50 $\times 4$  76.6  93.1 
First layer convolutional filters.
In Figure 3 we visualize the learned convolutional filters for the first convolutional layer of our BigBiGAN encoders $\mathcal{E}$ using the largest RevNet $\times 4$ $\mathcal{E}$ architecture. Note the difference between the filters in (a) and (b) (corresponding to rows RevNet $\times 4$ and RevNet $\times 4$ ($\uparrow \mathcal{E}$ LR) in Table 1). In (b) we use the higher $\mathcal{E}$ learning rate and see a corresponding qualitative improvement in the appearance of the learned filters, with less noise and more Gabor-like and color filters, as observed in BiGAN [4]. This suggests that examining the convolutional filters of the input layer can serve as a diagnostic for undertrained models.
Appendix B Samples and reconstructions
Samples  Reconstructions  

Model  Image  IS ($\uparrow $)  FID ($\downarrow $)  Image  Rel. ${\mathrm{\ell}}_{1}$ Error % ($\downarrow $) 
Base  Figure 4  24.10  30.14  Figure 5  70.54 
Light Augmentation  Figure 6  27.09  20.96  Figure 7  72.53 
High Res $\mathcal{E}$ (256)  Figure 8  24.91  26.56  Figure 9  70.60 
High Res $\mathcal{G}$ (256)  Figure 10  25.73  37.21  Figure 11  77.70 
In this Appendix we present BigBiGAN samples and reconstructions from several variants of the method. Table 5 includes pointers to samples and reconstruction images, as well as relevant metrics. The samples were selected by best FID vs. training set statistics, and we show the IS and FID along with sample images at that point. The reconstructions were selected by best (lowest) relative pixel-wise $\ell_1$ error, the error metric presented in Table 5, computed as:
$$E_{\mathrm{Rel}\,\ell_1}=\frac{\mathbb{E}_{\mathbf{x}\sim P_{\mathbf{x}}}\left\|\mathbf{x}-\mathcal{G}(\mathcal{E}(\mathbf{x}))\right\|_1}{\mathbb{E}_{\mathbf{x},\mathbf{x}'\sim P_{\mathbf{x}}}\left\|\mathbf{x}'-\mathcal{G}(\mathcal{E}(\mathbf{x}))\right\|_1},$$
where $\mathbf{x}$ and $\mathbf{x}'$ are independent data samples, and $\left\|\mathbf{x}'-\mathcal{G}(\mathcal{E}(\mathbf{x}))\right\|_1$ serves as a “baseline” reconstruction error relative to a “random” input. For example, with a random initialization of $\mathcal{G}$ and $\mathcal{E}$, we have $E_{\mathrm{Rel}\,\ell_1}\approx 1$. This relative metric penalizes degenerate reconstructions, such as the mean image, which would sometimes achieve low absolute reconstruction error despite having no perceptual similarity to the inputs. In practice, given $N$ data samples $\mathbf{x}_0,\mathbf{x}_1,\dots,\mathbf{x}_{N-1}$ (we use $N=$ 50K), we estimate the denominator by comparing each sample $\mathbf{x}_i$ with a single neighbor $\mathbf{x}_{(i+1)\bmod N}$, computing:
$$E_{\mathrm{Rel}\,\ell_1}\approx\frac{\sum_{i=0}^{N-1}\left\|\mathbf{x}_i-\mathcal{G}(\mathcal{E}(\mathbf{x}_i))\right\|_1}{\sum_{i=0}^{N-1}\left\|\mathbf{x}_{(i+1)\bmod N}-\mathcal{G}(\mathcal{E}(\mathbf{x}_i))\right\|_1}$$
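A minimal NumPy sketch of this estimate follows. It takes precomputed reconstructions as input, so `recon[i]` stands in for $\mathcal{G}(\mathcal{E}(\mathbf{x}_i))$; the function name `relative_l1_error` and the random toy data are illustrative assumptions. The numerator pairs each sample with its own reconstruction, and the denominator pairs the same reconstruction with the neighboring sample at index $(i+1) \bmod N$.

```python
import numpy as np


def relative_l1_error(x, recon):
    """Estimate the relative l1 reconstruction error.

    x:     array of shape (N, ...), the data samples.
    recon: array of the same shape, with recon[i] = G(E(x[i])).
    """
    n = x.shape[0]
    x_flat = x.reshape(n, -1)
    r_flat = recon.reshape(n, -1)
    # Numerator: l1 distance between each sample and its own reconstruction.
    numerator = np.abs(x_flat - r_flat).sum()
    # Denominator: reconstruction i compared with the unrelated sample
    # x[(i + 1) mod N]; np.roll with shift=-1 gives exactly that pairing.
    neighbors = np.roll(x_flat, shift=-1, axis=0)
    denominator = np.abs(neighbors - r_flat).sum()
    return numerator / denominator


rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))            # toy stand-in for images
unrelated = rng.normal(size=(1000, 16))    # "reconstructions" unrelated to x
err_unrelated = relative_l1_error(x, unrelated)
err_perfect = relative_l1_error(x, x.copy())
```

Reconstructions that carry no information about their inputs give a value close to 1, and perfect reconstructions give 0. Table 5 reports this quantity as a percentage, i.e. multiplied by 100.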
Appendix C Nearest neighbors
Top-1 / Top-5 Acc. (%)

Metric  $k=1$  $k=5$  $k=25$  $k=50$
$D_1$  38.09 / -  41.28 / 58.56  43.32 / 65.12  42.73 / 66.22
$D_2$  35.68 / -  38.61 / 55.59  40.65 / 62.23  40.15 / 63.42
In this Appendix we consider an alternative way of evaluating representations – by means of $k$ nearest neighbors classification, which does not involve learning any parameters during evaluation and is even simpler than learning a linear classifier as done in Section 3. For all results in this section, we use the outputs of the global average pooling layer (a flat 8192-D feature) of our best performing model, RevNet $\times 4$, $\uparrow \mathcal{E}$ LR. We do not do any data augmentation for either the training or validation sets: we simply crop each image at the center of its larger axis and resize to $256\times 256$.
We use a normalized $\ell_1$ or $\ell_2$ distance metric as our nearest neighbors criterion, defined as $D_p(a,b)=\left\|\frac{a}{\|a\|_p}-\frac{b}{\|b\|_p}\right\|_p$, for $p\in \{1,2\}$. ($D_2$ corresponds to cosine distance.) For label predictions with multiple neighbors ($k>1$), we use a simple counting scheme: the label with the most votes is selected as the prediction. Ties (multiple labels with the same number of votes) are broken by $k=1$ nearest neighbor classification among the data with the tied labels.
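The classification rule above can be sketched in NumPy as follows, assuming the distance is $D_p(a,b)=\|a/\|a\|_p-b/\|b\|_p\|_p$. The function names and the tiny 2-D feature set are illustrative assumptions; the actual evaluation uses 8192-D features over the full training set. The tie-break returns the label of the closest neighbor whose label is among the tied ones, which is equivalent to running $k=1$ classification restricted to the tied labels, since every tied label has at least one member among the $k$ nearest neighbors.

```python
from collections import Counter

import numpy as np


def normalized_distance(query, train, p):
    """D_p between one query vector and every row of `train`."""
    q = query / np.linalg.norm(query, ord=p)
    t = train / np.linalg.norm(train, ord=p, axis=1, keepdims=True)
    return np.linalg.norm(t - q, ord=p, axis=1)


def knn_predict(query, train, labels, k, p=1):
    """Majority vote over the k nearest neighbors, with 1-NN tie-breaking."""
    dists = normalized_distance(query, train, p)
    order = np.argsort(dists)[:k]  # indices of the k nearest, closest first
    votes = Counter(labels[i] for i in order)
    best = max(votes.values())
    tied = {label for label, count in votes.items() if count == best}
    # Walk the neighbors from closest to farthest and return the first
    # label that belongs to the set of top-voted labels.
    for i in order:
        if labels[i] in tied:
            return labels[i]


# Toy data: two classes pointing along different axes (hypothetical).
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = [0, 0, 1, 1]
pred_a = knn_predict(np.array([0.8, 0.2]), train, labels, k=3, p=1)
pred_b = knn_predict(np.array([0.2, 0.8]), train, labels, k=4, p=2)
```

In the second call all four neighbors are used, so the vote is tied two to two and the prediction falls back to the label of the single nearest neighbor.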
Quantitative results.
In Table 6 we present $k$ nearest neighbors classification results for $k\in \{1,5,25,50\}$. Across all $k$, the $\ell_1$-based metric $D_1$ outperforms $D_2$, and the remainder of our discussion refers to the $D_1$ results. With just a single neighbor ($k=1$) we achieve a top-1 accuracy around 38%. Top-1 accuracy reaches 43% with $k=25$, dropping off slightly at $k=50$ as votes from more distant neighbors are added.
Qualitative results.
Figure 12 shows sample nearest neighbors in the ImageNet training set for query images in the validation set. Despite being fully unsupervised, the neighbors in many cases match the query image in terms of high-level semantic content such as the category of the object of interest, demonstrating BigBiGAN’s ability to capture high-level attributes of the data in its unsupervised representations. Where applicable, the object’s pose and position in the image appear to be important as well – for example, the nearest neighbors of the RV (row 2, column 2) are all RVs facing roughly the same direction. In other cases, the nearest neighbors appear to be selected primarily based on the background or color scheme.
Discussion.
While our quantitative $k$ nearest neighbors classification results are far from the state of the art for ImageNet classification and significantly below the linear classifier-based results reported in Table 2, note that in this setup, no supervised learning of model parameters from labels occurs at any point: labels are predicted purely based on distance in a feature space learned from BigBiGAN training on image pixels alone. We believe this makes nearest neighbors classification an interesting additional benchmark for future approaches to unsupervised representation learning.
Appendix D Learning curves
In this Appendix we present learning curves showing how the image generation and representation learning metrics that we measured evolve throughout training, as a more detailed view of the results in Section 3, Table 1. We include plots for the following results:

• Image generation (Figure 13)
• Latent distribution $P_{\mathbf{z}}$ and stochastic $\mathcal{E}$ (Figure 14)
• Unary loss terms (Figure 15)
• $\mathcal{G}$ capacity (Figure 16)
• High resolution $\mathcal{E}$ with varying resolution $\mathcal{G}$ (Figure 17)
• $\mathcal{E}$ architecture (Figure 18)
• Decoupled $\mathcal{E}$/$\mathcal{G}$ learning rates (Figure 19)