Your classifier is secretly an energy based model and you should treat it like one
Abstract
We propose to reinterpret a standard discriminative classifier of $p(y\mid \mathbf{x})$ as an energy based model for the joint distribution $p(\mathbf{x},y)$. In this setting, the standard class probabilities can be easily computed as well as unnormalized values of $p(\mathbf{x})$ and $p(\mathbf{x}\mid y)$. Within this framework, standard discriminative architectures may be used and the model can also be trained on unlabeled data. We demonstrate that energy based training of the joint distribution improves calibration, robustness, and out-of-distribution detection while also enabling our models to generate samples rivaling the quality of recent GAN approaches. We improve upon recently proposed techniques for scaling up the training of energy based models and present an approach which adds little overhead compared to standard classification training. Our approach is able to achieve performance rivaling the state-of-the-art in both generative and discriminative learning within one hybrid model.
Will Grathwohl
University of Toronto & Vector Institute
Google Research

Kuan-Chieh Wang† & Jörn-Henrik Jacobsen†
University of Toronto & Vector Institute
† Equal contribution

David Duvenaud
University of Toronto & Vector Institute

Kevin Swersky & Mohammad Norouzi
Google Research
{kswersky, mnorouzi}@google.com
1 Introduction
For decades, research on generative models has been motivated by the promise that generative models can benefit downstream problems such as semi-supervised learning, imputation of missing data, and calibration of uncertainty (e.g., Chapelle et al. (2006); Dempster et al. (1977)). Yet, most recent research on deep generative models ignores these problems, and instead focuses on qualitative sample quality and log-likelihood on held-out validation sets.
Currently, there is a large performance gap between the strongest generative modeling approach to downstream tasks of interest and hand-tailored solutions for each specific problem. One potential explanation is that most downstream tasks are discriminative in nature and state-of-the-art generative models have diverged quite heavily from state-of-the-art discriminative architectures. Thus, even when trained solely as classifiers, the performance of generative models is far below the performance of the best discriminative models. Hence, the potential benefit from the generative component of the model is far outweighed by the decrease in discriminative performance. Recent work (Behrmann et al., 2018; Chen et al., 2019) attempts to improve the discriminative performance of generative models by leveraging invertible architectures but, when jointly trained as generative models, these methods still underperform their purely discriminative counterparts.
This paper advocates the use of energy based models (EBMs) to help realize the potential of generative models on downstream discriminative problems. While EBMs are currently challenging to work with, they fit more naturally within a discriminative framework than other generative models and facilitate the use of modern classifier architectures. Figure 1 illustrates an overview of the architecture, where the logits of a classifier are reinterpreted to define the joint density of data points and labels and the density of data points alone.
The contributions of this paper can be summarized as: 1) We present a novel and intuitive framework for joint modeling of labels and data. 2) Our models considerably outperform previous state-of-the-art hybrid models at both generative and discriminative modeling. 3) We show that the incorporation of generative modeling gives our models improved calibration, out-of-distribution detection, and adversarial robustness, performing on par with or better than hand-tailored methods for multiple tasks.
2 Energy Based Models
Energy based models (LeCun et al., 2006) hinge on the observation that any probability density $p(\mathbf{x})$ for $\mathbf{x}\in {\mathbb{R}}^{D}$ can be expressed as
${p}_{\theta}(\mathbf{x})={\displaystyle \frac{\mathrm{exp}(-{E}_{\theta}(\mathbf{x}))}{Z(\theta )}},$  (1) 
where ${E}_{\theta}(\mathbf{x}):{\mathbb{R}}^{D}\to \mathbb{R}$, known as the energy function, maps each point to a scalar, and $Z(\theta )={\int}_{\mathbf{x}}\mathrm{exp}(-{E}_{\theta}(\mathbf{x}))$ is the normalizing constant (with respect to $\mathbf{x}$) also known as the partition function. Thus, one can parameterize an EBM using any function that takes $\mathbf{x}$ as the input and returns a scalar.
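To make the role of the partition function concrete, here is a minimal one-dimensional sketch. It assumes the convention $p(\mathbf{x})=\mathrm{exp}(-E(\mathbf{x}))/Z$ and uses a toy energy of our own choosing, $E(x)={x}^{2}/2$, for which $Z=\sqrt{2\pi}$ is known in closed form. In one dimension $Z$ can be approximated on a grid; this is exactly what becomes infeasible for high-dimensional inputs.

```python
import math


def energy(x):
    """Toy energy E(x) = x^2 / 2 (a standard Gaussian up to a constant)."""
    return 0.5 * x * x


def partition_function(energy_fn, lo=-10.0, hi=10.0, n=20001):
    """Approximate Z = integral of exp(-E(x)) dx with the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        weight = 0.5 if i in (0, n - 1) else 1.0  # half weight at the endpoints
        total += weight * math.exp(-energy_fn(lo + i * h))
    return total * h


Z = partition_function(energy)
density_at_zero = math.exp(-energy(0.0)) / Z  # normalized density p(0)
```

The grid has 20001 points for a single dimension; the same resolution in $D$ dimensions would need $20001^{D}$ evaluations, which is why the rest of the paper works with unnormalized quantities.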
For most choices of ${E}_{\theta}$, one cannot compute or even reliably estimate $Z(\theta )$, which means estimating normalized densities is intractable and standard maximum likelihood estimation of the parameters, $\theta $, is not straightforward. Thus, we must rely on other methods to train EBMs. We note that the derivative of the log-likelihood for a single example $\mathbf{x}$ with respect to $\theta $ can be expressed as
${\displaystyle \frac{\partial \mathrm{log}{p}_{\theta}(\mathbf{x})}{\partial \theta}}={\mathbb{E}}_{{p}_{\theta}({\mathbf{x}}^{\prime})}\left[{\displaystyle \frac{\partial {E}_{\theta}({\mathbf{x}}^{\prime})}{\partial \theta}}\right]-{\displaystyle \frac{\partial {E}_{\theta}(\mathbf{x})}{\partial \theta}},$  (2) 
where the expectation is over the model distribution. Unfortunately, we cannot easily draw samples from ${p}_{\theta}(\mathbf{x})$, so we must resort to MCMC to use this gradient estimator. This approach was used to train some of the earliest EBMs. For example, Restricted Boltzmann Machines (Hinton, 2002) were trained using a block Gibbs sampler to approximate the expectation in Eq. (2).
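For readers who want the intermediate step, the gradient in Eq. (2) follows from differentiating the log-density directly. The sketch below assumes the convention ${p}_{\theta}(\mathbf{x})=\mathrm{exp}(-{E}_{\theta}(\mathbf{x}))/Z(\theta )$ and that differentiation and integration can be exchanged:

```latex
\begin{align*}
\log p_\theta(\mathbf{x}) &= -E_\theta(\mathbf{x}) - \log Z(\theta), \\
\frac{\partial \log Z(\theta)}{\partial \theta}
  &= \frac{1}{Z(\theta)} \int_{\mathbf{x}'} \frac{\partial}{\partial \theta}
     \exp(-E_\theta(\mathbf{x}'))
   = -\int_{\mathbf{x}'} p_\theta(\mathbf{x}')
     \frac{\partial E_\theta(\mathbf{x}')}{\partial \theta}
   = -\mathbb{E}_{p_\theta(\mathbf{x}')}\left[
     \frac{\partial E_\theta(\mathbf{x}')}{\partial \theta}\right], \\
\frac{\partial \log p_\theta(\mathbf{x})}{\partial \theta}
  &= \mathbb{E}_{p_\theta(\mathbf{x}')}\left[
     \frac{\partial E_\theta(\mathbf{x}')}{\partial \theta}\right]
   - \frac{\partial E_\theta(\mathbf{x})}{\partial \theta}.
\end{align*}
```

The first term raises the energy of points the model currently favors, and the second lowers the energy of the observed example.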
Despite a long period of little development, there has been recent work using this method to train large-scale EBMs on high-dimensional data, parameterized by deep neural networks (Nijkamp et al., 2019b; a; Du and Mordatch, 2019; Xie et al., 2016). These recent successes have approximated the expectation in Eq. (2) using a sampler based on Stochastic Gradient Langevin Dynamics (SGLD) (Welling and Teh, 2011) which draws samples following
${\mathbf{x}}_{0}\sim {p}_{0}(\mathbf{x}),\qquad {\mathbf{x}}_{i+1}={\mathbf{x}}_{i}-{\displaystyle \frac{\alpha}{2}}{\displaystyle \frac{\partial {E}_{\theta}({\mathbf{x}}_{i})}{\partial {\mathbf{x}}_{i}}}+\epsilon ,\qquad \epsilon \sim \mathcal{N}(0,\alpha )$  (3) 
where ${p}_{0}(\mathbf{x})$ is typically a Uniform distribution over the input domain and the step-size $\alpha $ should be decayed following a polynomial schedule. In practice the step-size, $\alpha $, and the standard deviation of $\epsilon $ are often chosen separately, leading to a biased sampler which allows for faster training. See Appendix H.1 for further discussion of samplers for EBM training.
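The update in Eq. (3) can be sketched in a few lines. This is a toy illustration rather than the training sampler: it assumes a one-dimensional energy of our choosing, $E(x)={(x-2)}^{2}/2$, a fixed step-size instead of a decayed one, and noise with variance $\alpha$ as written in Eq. (3). The names `sgld_sample` and `grad_energy` are hypothetical.

```python
import math
import random


def grad_energy(x):
    """Gradient of the toy energy E(x) = (x - 2)^2 / 2."""
    return x - 2.0


def sgld_sample(grad_energy_fn, x0, step_size=0.01, n_steps=2000, rng=None):
    """Run a Langevin chain: x <- x - (alpha / 2) * dE/dx + eps, eps ~ N(0, alpha)."""
    rng = rng or random.Random(0)
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, math.sqrt(step_size))  # std = sqrt(alpha)
        x = x - 0.5 * step_size * grad_energy_fn(x) + noise
    return x


rng = random.Random(0)
# Seed each chain from a uniform p_0 over [-1, 1], then keep the final state.
samples = [sgld_sample(grad_energy, rng.uniform(-1.0, 1.0), rng=rng) for _ in range(100)]
mean = sum(samples) / len(samples)
```

With this energy the chains should concentrate around $x=2$, the minimum of $E$. Choosing the noise scale independently of `step_size`, as the text describes, would trade this correctness for speed.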
3 What your classifier is hiding
In modern machine learning, a classification problem with $K$ classes is typically addressed using a parametric function, ${f}_{\theta}:{\mathbb{R}}^{D}\to {\mathbb{R}}^{K}$, which maps each data point $\mathbf{x}\in {\mathbb{R}}^{D}$ to $K$ real-valued numbers known as logits. These logits are used to parameterize a categorical distribution using the so-called Softmax transfer function:
${p}_{\theta}(y\mid \mathbf{x})={\displaystyle \frac{\mathrm{exp}({f}_{\theta}(\mathbf{x})[y])}{{\sum}_{{y}^{\prime}}\mathrm{exp}({f}_{\theta}(\mathbf{x})[{y}^{\prime}])}},$  (4) 
where ${f}_{\theta}(\mathbf{x})[y]$ indicates the ${y}^{\text{th}}$ index of ${f}_{\theta}(\mathbf{x})$, i.e., the logit corresponding to the ${y}^{\text{th}}$ class label.
Our key observation in this work is that one can slightly reinterpret the logits obtained from ${f}_{\theta}$ to define $p(\mathbf{x},y)$ and $p(\mathbf{x})$ as well. Without changing ${f}_{\theta}$, one can reuse the logits to define an energy based model of the joint distribution of data point $\mathbf{x}$ and labels $y$ via:
${p}_{\theta}(\mathbf{x},y)={\displaystyle \frac{\mathrm{exp}({f}_{\theta}(\mathbf{x})[y])}{Z(\theta )}},$  (5) 
where $Z(\theta )$ is the unknown normalizing constant and ${E}_{\theta}(\mathbf{x},y)=-{f}_{\theta}(\mathbf{x})[y]$.
By marginalizing out $y$, we obtain an unnormalized density model for $\mathbf{x}$ as well,
${p}_{\theta}(\mathbf{x})={\displaystyle \sum _{y}}{p}_{\theta}(\mathbf{x},y)={\displaystyle \frac{{\sum}_{y}\mathrm{exp}({f}_{\theta}(\mathbf{x})[y])}{Z(\theta )}}.$  (6) 
Notice now that the $\text{LogSumExp}(\cdot )$ of the logits of any classifier can be reused to define the energy function at a data point $\mathbf{x}$ as
${E}_{\theta}(\mathbf{x})=-{\text{LogSumExp}}_{y}({f}_{\theta}(\mathbf{x})[y])=-\mathrm{log}{\sum}_{y}\mathrm{exp}({f}_{\theta}(\mathbf{x})[y]).$  (7) 
Unlike typical classifiers, where shifting the logits ${f}_{\theta}(\mathbf{x})$ by an arbitrary scalar does not affect the model at all, in our framework, shifting the logits for a data point $\mathbf{x}$ will affect $\mathrm{log}{p}_{\theta}(\mathbf{x})$. Thus, we are making use of the extra degree of freedom hidden within the logits to define the density function over input examples as well as the joint density among examples and labels. Finally, when we compute ${p}_{\theta}(y\mid \mathbf{x})$ via ${p}_{\theta}(\mathbf{x},y)/{p}_{\theta}(\mathbf{x})$ by dividing Eq. (5) by Eq. (6), the normalizing constant cancels out, yielding the standard Softmax parameterization in Eq. (4). Thus, we have found a generative model hidden within every standard discriminative model! Since our approach proposes to reinterpret a classifier as a Joint Energy-based Model we refer to it throughout this work as JEM.
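The reinterpretation can be checked numerically on a single vector of logits. The sketch below uses made-up logits and a hypothetical helper `jem_quantities`; it computes the unnormalized $\mathrm{log}{p}_{\theta}(\mathbf{x})$ as the LogSumExp of the logits, recovers $\mathrm{log}{p}_{\theta}(y\mid \mathbf{x})$ by subtraction, and then shifts every logit by the same scalar.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def jem_quantities(logits):
    """Return (unnormalized log p(x), list of log p(y | x)) from classifier logits."""
    log_px_unnorm = logsumexp(logits)  # equals log p(x) + log Z
    # log p(x, y) - log p(x): the unknown log Z cancels in the difference.
    log_py_given_x = [f - log_px_unnorm for f in logits]
    return log_px_unnorm, log_py_given_x


logits = [2.0, 0.5, -1.0]             # illustrative values only
shifted = [f + 3.0 for f in logits]   # add the same scalar to every logit

log_px, cond = jem_quantities(logits)
log_px_shifted, cond_shifted = jem_quantities(shifted)
```

Here `cond` and `cond_shifted` agree, so the classifier is unchanged, while `log_px_shifted` exceeds `log_px` by exactly the shift of 3. That difference is the extra degree of freedom the text refers to.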
4 Optimization
We now wish to take advantage of our new interpretation of classifier architectures to gain the benefits of generative models while retaining strong discriminative performance. Since our model’s parameterization of $p(y\mid \mathbf{x})$ is normalized over $y$, it is simple to maximize its likelihood as in standard classifier training. Since our models for $p(\mathbf{x})$ and $p(\mathbf{x},y)$ are unnormalized, maximizing their likelihood is not as easy. There are many ways we could train ${f}_{\theta}$ to maximize the likelihood of the data under this model. We could apply the gradient estimator of Equation 2 to the likelihood under the joint distribution of Equation 5. Using Equations 6 and 4, we can also factor the likelihood as
$\mathrm{log}{p}_{\theta}(\mathbf{x},y)=\mathrm{log}{p}_{\theta}(\mathbf{x})+\mathrm{log}{p}_{\theta}(y\mid \mathbf{x}).$  (8) 
The estimator of Equation 2 is biased when using an MCMC sampler with a finite number of steps. Given that the goal of our work is to incorporate EBM training into the standard classification setting, the distribution of interest is $p(y\mid \mathbf{x})$. For this reason we propose to train using the factorization of Equation 8 to ensure this distribution is being optimized with an unbiased objective. We optimize $p(y\mid \mathbf{x})$ using standard cross-entropy and optimize $\mathrm{log}p(\mathbf{x})$ using Equation 2 with SGLD where gradients are taken with respect to ${\text{LogSumExp}}_{y}({f}_{\theta}(\mathbf{x})[y])$. We find alternative factorings of the likelihood lead to considerably worse performance as can be seen in Section 5.1.
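One way to read this training signal is as a sum of two per-example losses. The sketch below is an interpretation under stated assumptions: a single SGLD sample $\hat{\mathbf{x}}$ stands in for the expectation in Equation 2 and is treated as a constant with respect to $\theta $, the two terms are weighted equally, and the names $\mathcal{L}_{\text{clf}}$ and $\mathcal{L}_{\text{gen}}$ are ours.

```latex
\begin{align*}
\mathcal{L}(\theta) &= \mathcal{L}_{\text{clf}}(\theta) + \mathcal{L}_{\text{gen}}(\theta), \\
\mathcal{L}_{\text{clf}}(\theta) &= -\log p_\theta(y \mid \mathbf{x}), \\
\mathcal{L}_{\text{gen}}(\theta) &= E_\theta(\mathbf{x}) - E_\theta(\hat{\mathbf{x}})
  = \text{LogSumExp}_y\big(f_\theta(\hat{\mathbf{x}})[y]\big)
  - \text{LogSumExp}_y\big(f_\theta(\mathbf{x})[y]\big), \\
-\frac{\partial \mathcal{L}_{\text{gen}}(\theta)}{\partial \theta}
  &= \frac{\partial E_\theta(\hat{\mathbf{x}})}{\partial \theta}
   - \frac{\partial E_\theta(\mathbf{x})}{\partial \theta}
  \approx \frac{\partial \log p_\theta(\mathbf{x})}{\partial \theta}.
\end{align*}
```

The cross-entropy term is exact, so any bias from the finite-length sampler enters only through $\mathcal{L}_{\text{gen}}$.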
Following Du and Mordatch (2019) we use persistent contrastive divergence (Tieleman, 2008) to estimate the expectation in the right-hand side of Equation 2 since it gives an order of magnitude savings in computation compared to seeding new chains at each iteration as in Nijkamp et al. (2019b). This comes at the cost of decreased training stability. These trade-offs are discussed in Appendix H.2.
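The bookkeeping behind persistent chains can be sketched as a small buffer of past samples. Everything here is illustrative: the class name `ReplayBuffer`, the occasional re-initialization from ${p}_{0}$, and the value `reinit_prob=0.05` are assumptions for the sketch and are not specified in this section.

```python
import random


class ReplayBuffer:
    """Stores chain states so sampling can resume instead of restarting from p_0."""

    def __init__(self, size, sample_init, reinit_prob=0.05, seed=0):
        self.rng = random.Random(seed)
        self.sample_init = sample_init  # draws a fresh state from p_0
        self.reinit_prob = reinit_prob  # hypothetical restart probability
        self.items = [sample_init(self.rng) for _ in range(size)]

    def draw(self):
        """Pick a slot; return (slot index, starting state for the next chain)."""
        idx = self.rng.randrange(len(self.items))
        if self.rng.random() < self.reinit_prob:
            return idx, self.sample_init(self.rng)  # occasionally restart from p_0
        return idx, self.items[idx]                 # otherwise continue an old chain

    def store(self, idx, state):
        """Write the chain's final state back so later iterations continue from it."""
        self.items[idx] = state


buffer = ReplayBuffer(size=10, sample_init=lambda rng: rng.uniform(-1.0, 1.0))
idx, start = buffer.draw()
# ... run a few sampler steps from `start` here, then persist the result:
buffer.store(idx, start + 0.1)
```

Because most chains continue from earlier states, each training iteration needs only a few sampler steps, which is where the computational savings come from.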
5 Applications
We completed a thorough empirical investigation to demonstrate the benefits of JEM over standard classifiers. First, we achieved performance rivaling the state of the art in both discriminative and generative modeling. Even more interesting, we observed a number of benefits related to the practical application of discriminative models including improved uncertainty quantification, out-of-distribution detection, and robustness to adversarial examples. Generative models have long been expected to provide these benefits but have never been demonstrated to do so at this scale.
All architectures used are based on Wide Residual Networks (Zagoruyko and Komodakis, 2016) where we have removed batch normalization$^{1}$ to ensure that our models’ outputs are deterministic functions of the input. This slightly increases classification error of a WRN-28-10 from $4.2\%$ to $6.4\%$ on CIFAR10 and from $2.3\%$ to $3.4\%$ on SVHN.
$^{1}$ This was done to remove sources of stochasticity in early experiments. Since then we have been able to successfully train Joint-EBMs with Batch Normalization and other forms of stochastic regularization (such as dropout) without issue. We leave the incorporation of these methods to further work.
All models were trained in the same way with the same hyperparameters which were tuned on CIFAR10. Intriguingly, the SGLD sampler parameters found here generalized well across datasets and model architectures. All models are trained on a single GPU in approximately 36 hours. Full experimental details can be found in Appendix A.
5.1 Hybrid modeling
[Figure: samples from JEM (images omitted). Panel titles: SVHN, CIFAR100.]
First, we show that a given classifier architecture can be trained as an EBM to achieve competitive performance as both a classifier and a generative model. We train JEM on CIFAR10, SVHN, and CIFAR100 and compare against other hybrid models as well as standalone generative and discriminative models. We find JEM performs near the state of the art in both tasks simultaneously, outperforming other hybrid models (Table 5).
Given that we cannot compute normalized likelihoods, we present Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017) as a proxy for this quantity. We find that JEM is competitive with SOTA generative models at these metrics. These metrics are not commonly reported on CIFAR100 and SVHN so we present accuracy and qualitative samples on these datasets. Our models achieve 96.7% and 72.2% accuracy on SVHN and CIFAR100, respectively. Samples from JEM can be seen in Figures 2, 3 and in Appendix C.
JEM is trained to maximize the likelihood factorization shown in Eq. 8. This was to ensure that no bias is added into our estimate of $\mathrm{log}p(y\mid \mathbf{x})$ which can be computed exactly in our setup. Prior work (Du and Mordatch, 2019; Xie et al., 2016) proposes to factorize the objective as $\mathrm{log}p(\mathbf{x}\mid y)+\mathrm{log}p(y)$. In these works, each $p(\mathbf{x}\mid y)$ is a separate EBM with a distinct, unknown normalizing constant, meaning that their model cannot be used to compute $p(y\mid \mathbf{x})$ or $p(\mathbf{x})$. This explains why the model of Du and Mordatch (2019) (we will refer to this model as IGEBM) is not a competitive classifier. As an ablation, we trained JEM to maximize this objective and found a considerable decrease in discriminative performance (see Table 5, row 4).
5.2 Calibration
[Figure 4: calibration results (plots omitted). Axes: accuracy against confidence.]
A classifier is considered calibrated if its predictive confidence, ${\mathrm{max}}_{y}p(y\mid \mathbf{x})$, aligns with its misclassification rate. Thus, when a calibrated classifier predicts label $y$ with confidence $0.9$ it should have a $90\%$ chance of being correct. This is an important feature for a model to have when deployed in real-world scenarios where outputting an incorrect decision can have catastrophic consequences. The classifier’s confidence can be used to decide when to output a prediction or defer to a human, for example. Here, a well-calibrated but less accurate classifier can be considerably more useful than a more accurate but less-calibrated model.
While classifiers have grown more accurate in recent years, they have also grown considerably less calibrated (Guo et al., 2017). Contrary to this behavior, we find that JEM notably improves calibration while retaining high accuracy.
We focus on CIFAR100 since SOTA classifiers achieve approximately $80\%$ accuracy. We train JEM on this dataset and compare to a baseline of the same architecture without EBM training. Our baseline model achieves $74.2\%$ accuracy and JEM achieves $72.2\%$ (for reference, a ResNet-110 achieves $74.8\%$ accuracy (Zagoruyko and Komodakis, 2016)). We find the baseline model is very poorly calibrated, outputting highly overconfident predictions. Conversely, we find JEM produces a nearly perfectly calibrated classifier when measured with Expected Calibration Error (see Appendix E.1). Compared to other calibration methods such as Platt scaling (Guo et al., 2017), JEM requires no additional training data. Results can be seen in Figure 4 and additional results can be found in Appendix E.2.
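Expected Calibration Error is defined in Appendix E.1, which is outside this section. As a rough guide, the sketch below implements the commonly used binned form: predictions are grouped into equal-width confidence bins and the gap between accuracy and mean confidence is averaged, weighted by bin size. The bin count of 10 and the toy inputs are assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy case 1: confidence 0.75 and 3 of 4 correct, so the gap is zero.
ece_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
# Toy case 2: confidence 1.0 but only half correct, so the gap is 0.5.
ece_overconfident = expected_calibration_error([1.0] * 4, [True, True, False, False])
```

A lower value means confidence tracks accuracy more closely; an overconfident model like the second toy case scores high even though its predictions may often be right.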
5.3 Out-Of-Distribution Detection
In general, out-of-distribution (OOD) detection is a binary classification problem, where the model is required to produce a score
$${s}_{\theta}(\mathbf{x})\in \mathbb{R},$$ 
where $\mathbf{x}$ is the query, and $\theta $ is the set of learnable parameters. We desire that the scores for in-distribution examples are higher than those for out-of-distribution examples. Typically for evaluation, threshold-free metrics are used, such as the area under the receiver-operating curve (AUROC) (Hendrycks and Gimpel, 2016). There exist a number of distinct OOD detection approaches to which JEM can be applied. We expand on them below. Further results and experimental details can be found in Appendix F.2.
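AUROC has a simple pairwise reading that matches the requirement above: it is the probability that a randomly chosen in-distribution example receives a higher score than a randomly chosen OOD example, with ties counted as one half. A minimal sketch with made-up scores follows; the quadratic loop is fine for illustration but not for large evaluation sets.

```python
def auroc(in_scores, ood_scores):
    """Fraction of (in-distribution, OOD) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for s_in in in_scores:
        for s_ood in ood_scores:
            if s_in > s_ood:
                wins += 1.0
            elif s_in == s_ood:
                wins += 0.5
    return wins / (len(in_scores) * len(ood_scores))


perfect = auroc([3.0, 4.0], [1.0, 2.0])   # every in-distribution score is higher
inverted = auroc([1.0, 2.0], [3.0, 4.0])  # every in-distribution score is lower
mixed = auroc([1.0, 3.0], [2.0, 4.0])     # only the pair (3.0, 2.0) is ranked correctly
```

A value of 0.5 corresponds to a score that carries no information, and values below 0.5 mean the score ranks OOD examples above in-distribution ones.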
5.3.1 Input Density
A natural approach to OOD detection is to fit a density model on the data and consider examples with low likelihood to be OOD. While intuitive, this approach is currently not competitive on high-dimensional data. Nalisnick et al. (2018) showed that tractable deep generative models such as Kingma and Dhariwal (2018) and Salimans et al. (2017) can assign higher densities to OOD examples than in-distribution examples. Further work (Nalisnick et al., 2019) shows examples where the densities of an OOD dataset are completely indistinguishable from the in-distribution set, e.g., see Table 2, column 1. Conversely, Du and Mordatch (2019) have shown that the likelihoods from EBMs can be reliably used as a predictor for OOD inputs. As can be seen in Table 2 column 2, JEM consistently assigns higher likelihoods to in-distribution data than OOD data. One possible explanation for JEM’s further improvement over IGEBM is its ability to incorporate labeled information during training while also being able to derive a principled model of $p(\mathbf{x})$. Intriguingly, Glow does not appear to benefit in the same way from this supervision as is demonstrated by the little difference between our unconditional and class-conditional Glow results. Quantitative results can be found in Table 3 (top).
[Table 2: histogram panels (images omitted). Columns: Glow $\mathrm{log}p(x)$, JEM $\mathrm{log}p(x)$, Approx. Mass JEM. Rows: SVHN, CIFAR100, CelebA. Legend entry: CIFAR10.]

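Table 3: OOD detection results.

| ${s}_{\theta}(\mathbf{x})$ | Model | SVHN | Interp | CIFAR100 | CelebA |
|---|---|---|---|---|---|
| $\mathrm{log}p(\mathbf{x})$ | Unconditional Glow | .05 | .51 | .55 | .57 |
| $\mathrm{log}p(\mathbf{x})$ | Class-Conditional Glow | .07 | .45 | .51 | .53 |
| $\mathrm{log}p(\mathbf{x})$ | IGEBM | .63 | $\mathbf{.70}$ | .50 | .70 |
| $\mathrm{log}p(\mathbf{x})$ | JEM (Ours) | $\mathbf{.67}$ | .65 | $\mathbf{.67}$ | $\mathbf{.75}$ |
| ${\mathrm{max}}_{y}p(y\mid \mathbf{x})$ | Wide-ResNet | $\mathbf{.93}$ | $\mathbf{.77}$ | .85 | .62 |
| ${\mathrm{max}}_{y}p(y\mid \mathbf{x})$ | Class-Conditional Glow | .64 | .61 | .65 | .54 |
| ${\mathrm{max}}_{y}p(y\mid \mathbf{x})$ | IGEBM | .43 | .69 | .54 | .69 |
| ${\mathrm{max}}_{y}p(y\mid \mathbf{x})$ | JEM (Ours) | .89 | .75 | $\mathbf{.87}$ | $\mathbf{.79}$ |
| $\left\Vert \frac{\partial \mathrm{log}p(\mathbf{x})}{\partial \mathbf{x}}\right\Vert $ | Unconditional Glow | $\mathbf{.95}$ | .27 | .46 | .29 |
| $\left\Vert \frac{\partial \mathrm{log}p(\mathbf{x})}{\partial \mathbf{x}}\right\Vert $ | Class-Conditional Glow | .47 | .01 | .52 | .59 |
| $\left\Vert \frac{\partial \mathrm{log}p(\mathbf{x})}{\partial \mathbf{x}}\right\Vert $ | IGEBM | .84 | .65 | .55 | .66 |
| $\left\Vert \frac{\partial \mathrm{log}p(\mathbf{x})}{\partial \mathbf{x}}\right\Vert $ | JEM (Ours) | .83 | $\mathbf{.78}$ | $\mathbf{.82}$ | $\mathbf{.79}$ |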
5.3.2 Predictive Distribution
Many successful approaches have utilized a classifier’s predictive distribution for OOD detection (Gal and Ghahramani, 2016; Wang et al., 2018; Liang et al., 2017). A useful OOD score that can be derived from this distribution is the maximum prediction probability: ${s}_{\theta}(\mathbf{x})={\mathrm{max}}_{y}{p}_{\theta}(y\mid \mathbf{x})$ (Hendrycks and Gimpel, 2016). It has been demonstrated that OOD performance using this score is highly correlated with a model’s classification accuracy. Since JEM is a competitive classifier, we find it performs on par with (or better than) a strong baseline classifier and considerably outperforms other generative models. Results can be seen in Table 3 (middle).
5.3.3 A new score: Approximate Mass
It has been recently proposed that likelihood may not be enough for OOD detection in high dimensions (Nalisnick et al., 2019). It is possible for a point to have high likelihood under a distribution yet be nearly impossible to sample. Real samples from a distribution lie in what is known as the “typical” set. This is the area of high probability mass. A single point may have high density but if the surrounding areas have very low density, then that point is likely not in the typical set and therefore likely not a sample from the data distribution. For a high-likelihood datapoint outside of the typical set, we expect the density to change rapidly around it, thus the norm of the gradient of the log-density will be large compared to examples in the typical set (otherwise it would be in an area of high mass). We propose an alternative OOD score based on this quantity:
${s}_{\theta}(\mathbf{x})=-{\left\Vert {\displaystyle \frac{\partial \mathrm{log}{p}_{\theta}(\mathbf{x})}{\partial \mathbf{x}}}\right\Vert}_{2}.$  (9) 
For EBMs (JEM and IGEBM), we find this predictor greatly outperforms our own and other generative models’ likelihoods; see Table 2 column 3. For tractable likelihood methods we find this predictor is anti-correlated with the model’s likelihood and neither is reliable for OOD detection. Results can be seen in Table 3 (bottom).
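The gradient-norm score only needs the log-density up to an additive constant, since $\mathrm{log}Z(\theta )$ does not depend on $\mathbf{x}$ and vanishes under differentiation. The sketch below illustrates the computation under several assumptions: the norm is negated so that higher scores mean "more in-distribution" (the convention of Section 5.3), the gradient is approximated by central finite differences rather than automatic differentiation, and the toy log-density is ours.

```python
import math


def grad_norm_score(log_density, x, h=1e-5):
    """Return -||d log p(x) / dx||_2, estimated with central finite differences."""
    squared_norm = 0.0
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        partial = (log_density(up) - log_density(down)) / (2.0 * h)
        squared_norm += partial * partial
    return -math.sqrt(squared_norm)


def toy_log_density(x):
    """Unnormalized toy log-density, -||x||^2 / 2; its gradient at x is -x."""
    return -0.5 * sum(v * v for v in x)


score = grad_norm_score(toy_log_density, [3.0, 4.0])  # gradient norm is 5, score about -5
```

For a real network one would obtain the input gradient by backpropagating through ${\text{LogSumExp}}_{y}({f}_{\theta}(\mathbf{x})[y])$ instead of using finite differences, which cost two forward passes per input dimension.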
5.4 Robustness
Recent work (Athalye et al., 2017) has demonstrated that classifiers trained to be adversarially robust can be repurposed to generate convincing images, do inpainting, and translate examples from one class to another. This is done through an iterative refinement procedure, quite similar to the SGLD used to sample from EBMs. We also note that adversarial training (Goodfellow et al., 2014) bears many similarities to SGLD training of EBMs. In both settings, we use a gradient-based optimization procedure to generate examples which activate a specific high-level network activation, then optimize the weights of the network to minimize the generated example’s effect on that activation. Further connections have been drawn between adversarial training and regularizing the gradients of the network’s activations around the data (Simon-Gabriel et al., 2018). This is similar to the objective of Score Matching (Hyvärinen, 2005) which can also be used to train EBMs (Kingma and Lecun, 2010; Song and Ermon, 2019).
Given these connections one may wonder if a classifier derived from an EBM would be more robust to adversarial examples than a standard model. This behavior has been demonstrated in prior work on EBMs (Du and Mordatch, 2019) but their work did not produce a competitive discriminative model and is therefore of limited practical application for this purpose. Similarly, we find JEM achieves considerable robustness without sacrificing discriminative performance.
5.4.1 Improved Robustness Through EBM Training
[Figure 5: results from the PGD experiments (plots omitted). Panels: (a) ${L}_{\mathrm{\infty}}$ Robustness, (b) ${L}_{2}$ Robustness.]
A common threat model for adversarial robustness is that of perturbation-based adversarial examples with an ${L}_{p}$-norm constraint (Goodfellow et al., 2014). They are defined as perturbed inputs $\tilde{\mathbf{x}}=\mathbf{x}+\delta $, which change a model’s prediction subject to ${\Vert \tilde{\mathbf{x}}-\mathbf{x}\Vert}_{p}<\epsilon $. These examples exploit semantically meaningless perturbations to which the model is overly sensitive. However, closeness to real inputs in terms of a given metric does not imply that adversarial examples reside within areas of high density according to the model distribution, hence it is not surprising that the model makes mistakes when asked to classify inputs it has rarely or never encountered during training.
This insight has been used to detect and robustly classify adversarial examples with generative models (Song et al., 2017; Li et al., 2018; Fetaya et al., 2019). The state-of-the-art method for adversarial robustness on MNIST classifies by comparing an input to samples generated from a class-conditional generative model (Schott et al., 2018). This can be thought of as classifying an example similar to the input but from an area of higher density under the model’s learned distribution. This refined input resides in areas where the model has already “seen” sufficient data and is thus able to accurately classify. Albeit promising, this family of methods has not been able to scale beyond MNIST due to a lack of sufficiently powerful conditional generative models. We believe JEM can help close this gap. We propose to run a few iterations of our model’s sampling procedure seeded at a given input. This should be able to transform low-probability inputs to a nearby point of high probability, “undoing” any adversarial attack and enabling the model to classify robustly.
Perturbation Robustness. We run a number of powerful adversarial attacks on our CIFAR10 models. We run a white-box PGD attack, giving the attacker access to the gradients through our sampling procedure$^{2}$. Because our sampling procedure is stochastic, we compute the “expectation over transformations” (Athalye et al., 2018), the expected gradient over multiple runs of the sampling procedure. We also run gradient-free black-box attacks: the boundary attack (Brendel et al., 2017) and the brute-force pointwise attack (Rauber et al., 2017). All attacks are run with respect to the ${L}_{2}$ and ${L}_{\mathrm{\infty}}$ norms and we test JEM with 0, 1, and 10 steps of sampling seeded at the input.
$^{2}$ In Du and Mordatch (2019) the attacker was not given access to the gradients of the refinement procedure. We re-run these stronger attacks on their model as well and provide a comparison in Appendix G.
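For readers unfamiliar with PGD, the core loop under an ${L}_{\mathrm{\infty}}$ constraint is short: take a signed gradient step that increases the loss, then project back into the $\epsilon $-ball around the clean input. The sketch below is a generic illustration, not the attack configuration used in the experiments (those details are in Appendix G). The step size, step count, and the toy linear loss are assumptions, and no sampling or expectation over transformations is included.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def project_linf(x_adv, x, eps):
    """Clip each coordinate of x_adv into [x_i - eps, x_i + eps]."""
    return [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]


def pgd_linf(x, grad_loss, eps=0.1, step=0.02, n_steps=20):
    """Projected gradient ascent on the loss within an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(n_steps):
        grad = grad_loss(x_adv)
        x_adv = [a + step * sign(g) for a, g in zip(x_adv, grad)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv


# Toy loss L(x) = -(w . x): lowering the true-class margin w . x raises the loss.
w = [1.0, -2.0]
x_clean = [0.5, 0.5]
x_attacked = pgd_linf(x_clean, grad_loss=lambda x: [-wi for wi in w])
```

With a stochastic defense such as sampling-based refinement, `grad_loss` would itself be an average of gradients over several runs, which is the role of the expectation over transformations mentioned above.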
Results from the PGD experiments can be seen in Figure 5. Experimental details and remaining results, including gradient-free attacks, can be found in Appendix G. Our model is considerably more robust than a baseline with standard classifier training. With respect to both norms, JEM delivers considerably improved robustness when compared to the baseline but for many epsilons falls below state-of-the-art adversarial training (Madry et al., 2017; Santurkar et al., 2019) and the state-of-the-art certified robustness method of Salman et al. (2019) (“RandAdvSmooth” in Figure 5). We note that each of these baseline methods is trained to be robust to the norm through which it is being attacked and it has been shown that attacking an ${L}_{\mathrm{\infty}}$ adversarially trained model with an ${L}_{2}$ adversary decreases robustness considerably (Madry et al., 2017). However, we attack the same JEM model with both norms and observe competitive robustness in both cases.
JEM with 0 steps of refinement is noticeably more robust than the baseline model trained as a standard classifier, thus simply adding EBM training can produce more robust models. We also find that increasing the number of refinement steps further increases robustness to levels comparable to those of robustness-specific approaches. We expect that further increasing the number of refinement steps will lead to more robust models but due to computational constraints we could not run attacks in this setting.
Distal Adversarials. Another common failure mode of non-robust models is their tendency to classify nonsensical inputs with high confidence. To analyze this property, we follow Schott et al. (2018). Starting from noise we generate images to maximize $p(y=\text{“car”}\mid \mathbf{x})$. Results are shown in Figure 6. The baseline confidently classifies unstructured noise images. The ${L}_{2}$ adversarially trained ResNet with $\epsilon =0.5$ (Santurkar et al., 2019) confidently classifies somewhat structured, but unrealistic images. JEM does not confidently classify nonsensical images, so instead, car attributes and natural image properties visibly emerge.
6 Limitations
Energy based models can be very challenging to work with. Since normalized likelihoods cannot be computed, it can be hard to verify that learning is taking place at all. When working in domains such as images, samples can be drawn and checked to assess learning, but this is far from a generalizable strategy. Even so, these samples are only samples from an approximation to the model so they can only be so useful. Furthermore, the gradient estimators we use to train JEM are quite unstable and are prone to diverging if the sampling and optimization parameters are not tuned correctly. Regularizers may be added (Du and Mordatch, 2019) to increase stability but it is not clear what effect they have on the final model. The models used to generate the results in this work regularly diverged throughout training, requiring them to be restarted with lower learning rates or with increased regularization. See Appendix H.3 for a detailed description of how these difficulties were handled.
While this may seem prohibitive, we believe the results presented in this work are sufficient to motivate the community to find solutions to these issues as any improvement in the training of energy based models will further improve the results we have presented in this work.
7 Related Work
Prior work (Xie et al., 2016) made a similar observation to ours about classifiers and EBMs but defines the model differently. They reinterpret the logits to define a class-conditional EBM $p(\mathbf{x}\mid y)$, similar to Du and Mordatch (2019). This setting requires additional parameters to be learned to derive a classifier and an unconditional model. We believe this subtle distinction is responsible for our model’s success. The model of Song and Ou (2018) is similar as well but is trained using a GAN-like generator and is applied to different applications. Also related are Introspective Networks (Jin et al., 2017; Lee et al., 2018) which have drawn a similar connection between discriminative classifiers and generative models. They derive a generative model from a classifier which learns to distinguish between data and negative examples generated via an MCMC-like procedure. Training in this way has also been shown to improve adversarial robustness.
Our work builds heavily on Nijkamp et al. (2019b; a) and Du and Mordatch (2019), which scale the training of EBMs to high-dimensional data using Contrastive Divergence and SGLD. While these works have pushed the boundaries of the types of data to which we can apply EBMs, many issues still exist. These methods require many steps of SGLD to take place at each training iteration. Each step requires approximately the same amount of computation as one iteration of standard discriminative model training, therefore training EBMs at this scale is orders of magnitude slower than training a classifier, limiting the size of problems we can attack with these methods. There exist orthogonal approaches to training EBMs which we believe have promise to scale more gracefully.
Score matching (Hyvärinen, 2005) attempts to match the derivative of the model’s density with the derivative of the data density. This approach saw some development towards high-dimensional data (Kingma and Lecun, 2010) and recently has been successfully applied to large natural images (Song and Ermon, 2019). This approach requires a model to output the derivatives of the density function, not the density function itself, so it is unclear what utility this model can provide to the applications we have discussed in this work. Regardless, we believe this is a promising avenue for further research. Noise Contrastive Estimation (Gutmann and Hyvärinen, 2010) rephrases the density estimation problem as a classification problem, attempting to distinguish data from a known noise distribution. If the classifier is properly structured, then once the classification problem is solved, an unnormalized density estimator can be derived from the classifier and noise distribution. While this method has been recently extended (Ceylan and Gutmann, 2018), these methods are challenging to extend to high-dimensional data.
8 Conclusion and Further Work
In this work we have presented JEM, a novel reinterpretation of standard classifier architectures which retains the strong performance of SOTA discriminative models while adding the benefits of generative modeling approaches. Our work is enabled by recent work scaling techniques for training EBMs to high-dimensional data. We have demonstrated the utility of incorporating this type of training into discriminative models. While there exist many issues in training EBMs, we hope the results presented here will encourage the community to improve upon current approaches.
9 Acknowledgements
We would like to thank Ying Nian Wu and Mitch Hill for providing some EBM training tips and tricks which were crucial in getting this project off the ground. We would also like to thank Jeremy Cohen for his useful feedback which greatly strengthened our adversarial robustness results. We would like to thank Lukas Schott for feedback on the robustness evaluation, Alexander Meinke and Francesco Croce for spotting some typos and suggesting the transfer attack. We would also like to thank Zhuowen Tu and Kwonjoon Lee for bringing related work to our attention.
References
 Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §G.1, §5.4.1.
 Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §5.4.
 A note on the inception score. arXiv preprint arXiv:1801.01973. Cited by: Table 6, Appendix B.
 Invertible residual networks. arXiv preprint arXiv:1811.00995. Cited by: §1.
 Decision-based adversarial attacks: reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248. Cited by: §5.4.1.
 Conditional noise-contrastive estimation of unnormalised models. arXiv preprint arXiv:1806.03664. Cited by: §7.
 Semi-supervised learning. MIT Press. Cited by: §1.
 Residual flows for invertible generative modeling. arXiv preprint arXiv:1906.02735. Cited by: Table 6, Appendix B, §1, Table 1.
 Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §1.
 Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689. Cited by: Table 6, Appendix B, §F.1, Figure 15, Appendix G, §H.1, §2, §4, §5.1, §5.3.1, §5.4, Table 1, §6, §7, §7, footnote 2.
 Conditional generative models are not robust. arXiv preprint arXiv:1906.01171. Cited by: §5.4.1.
 Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), pp. 1050–1059. Cited by: §5.3.2.
 Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §5.4.1, §5.4.
 On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1321–1330. Cited by: Figure 4, §5.2, §5.2.
 Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §7.
 A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §5.3.2, §5.3.
 Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: Table 6, Appendix B, §5.1.
 Training products of experts by minimizing contrastive divergence. Neural computation 14 (8), pp. 1771–1800. Cited by: §2.
 Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709. Cited by: §5.4, §7.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
 Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §5.3.1, Table 1.
 Regularized estimation of image statistics by score matching. In Advances in neural information processing systems, pp. 1126–1134. Cited by: §5.4, §7.
 A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §2.
 Are generative classifiers more robust to adversarial attacks?. arXiv preprint arXiv:1802.06552. Cited by: §5.4.1.
 Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §5.3.2.
 Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §5.4.1.
 Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: Table 1.
 Do deep generative models know what they don’t know?. arXiv preprint arXiv:1810.09136. Cited by: §5.3.1.
 Detecting out-of-distribution inputs to deep generative models using a test for typicality. arXiv preprint arXiv:1906.02994. Cited by: §5.3.1, §5.3.3.
 On the anatomy of mcmc-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370. Cited by: §H.1, §H.2, §H.3, §2, §7.
 On learning non-convergent short-run mcmc toward energy-based model. arXiv preprint arXiv:1904.09770. Cited by: §H.1, §H.3, §2, §4, §7.
 Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: Appendix G, §5.4.1.
 Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §5.1.
 Pixelcnn++: improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517. Cited by: §5.3.1.
 Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §5.4.1.
 Computer vision with a single (robust) classifier. CoRR abs/1906.09453. External Links: Link, 1906.09453 Cited by: §5.4.1, §5.4.1.
 Towards the first adversarially robust neural network model on mnist. arXiv preprint arXiv:1805.09190. Cited by: §5.4.1, §5.4.1.
 Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421. Cited by: §5.4.
 Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600. Cited by: §H.1, §5.4, Table 1, §7.
 Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §5.4.1.
 Learning neural random fields with inclusive auxiliary generators. arXiv preprint arXiv:1806.00271. Cited by: §7.
 Training restricted boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning, pp. 1064–1071. Cited by: §4.
 Adversarial distillation of bayesian neural network posteriors. In International Conference on Machine Learning (ICML), Cited by: §5.3.2.
 Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML11), pp. 681–688. Cited by: §H.1, §2.
 A theory of generative convnet. In International Conference on Machine Learning, pp. 2635–2644. Cited by: §2, §5.1, §7.
 Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §5.2, §5.
Appendix A Training Details
We train all models with the Adam optimizer (Kingma and Ba, 2014) for 150 epochs through the dataset using a staircase decay schedule. All network architectures are based on Wide-ResNet-28-10 with no batch normalization. We generate samples using PCD with hyperparameters in Table 4. We evolve the chains with 20 steps of SGLD per iteration and with probability $.05$ we reinitialize the chains with uniform random noise. For preprocessing, we scale images to the range $[-1,1]$ and add Gaussian noise of stddev = $.03$. Pseudocode for our training procedure is in Algorithm A.
When training via contrastive divergence there are a few different ways one could potentially draw samples from ${p}_{\theta}(\mathbf{x})$. We could:

1. Sample $y\sim p(y)$, then sample $\mathbf{x}\sim {p}_{\theta}(\mathbf{x}|y)$ via SGLD with energy $E(\mathbf{x}|y)=-{f}_{\theta}(\mathbf{x})[y]$, then throw away $y$.

2. Sample $\mathbf{x}\sim {p}_{\theta}(\mathbf{x})$ via SGLD with energy $E(\mathbf{x})=-{\text{LogSumExp}}_{y}{f}_{\theta}(\mathbf{x})[y]$.
We experimented with both methods during training and found that while method 1 produced more visually appealing samples (from a human’s perspective), method 2 produced slightly stronger discriminative performance – 92.9% vs. 91.2% accuracy on CIFAR10. For this reason we use method 2 in all results presented.
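The two energies above differ only in how the logits are reduced. The following minimal sketch, in plain Python, computes both from a vector of logits together with the usual softmax class probabilities. The function names and the example logits are hypothetical and chosen for illustration; this is not the training code used for the reported results.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def energy_conditional(logits, y):
    """Energy used by method 1: E(x|y) = -f(x)[y]."""
    return -logits[y]

def energy_marginal(logits):
    """Energy used by method 2: E(x) = -LogSumExp_y f(x)[y]."""
    return -logsumexp(logits)

def class_probs(logits):
    """Softmax class probabilities p(y|x) computed from the same logits."""
    lse = logsumexp(logits)
    return [math.exp(v - lse) for v in logits]

# Hypothetical logits f(x) for a 3-class problem (illustrative values only).
logits = [2.0, 0.5, -1.0]
probs = class_probs(logits)
```

With these definitions, $\log p(y|\mathbf{x}) = E(\mathbf{x}) - E(\mathbf{x}|y)$, so the class probabilities are unchanged whichever energy is used for sampling.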
1: while not converged do
2:   Sample $\mathbf{x}$ and $y$ from dataset
3:   ${L}_{\text{clf}}(\theta )=\text{xent}({f}_{\theta}(\mathbf{x}),y)$
4:   Sample ${\widehat{\mathbf{x}}}_{0}\sim B$ with probability $1-\rho $, else ${\widehat{\mathbf{x}}}_{0}\sim \mathcal{U}(-1,1)$  ▷ Initialize SGLD
5:   for $t\in [1,2,\mathrm{\dots},\eta ]$ do  ▷ SGLD
6:     ${\widehat{\mathbf{x}}}_{t}={\widehat{\mathbf{x}}}_{t-1}+\alpha \cdot \frac{\partial {\text{LogSumExp}}_{{y}^{\prime}}({f}_{\theta}({\widehat{\mathbf{x}}}_{t-1})[{y}^{\prime}])}{\partial {\widehat{\mathbf{x}}}_{t-1}}+\sigma \cdot \mathcal{N}(0,I)$
7:   end for
8:   ${L}_{\text{gen}}(\theta )={\text{LogSumExp}}_{{y}^{\prime}}(f(\mathbf{x})[{y}^{\prime}])-{\text{LogSumExp}}_{{y}^{\prime}}(f({\widehat{\mathbf{x}}}_{t})[{y}^{\prime}])$  ▷ Surrogate for Eq 2
9:   $L(\theta )={L}_{\text{clf}}(\theta )+{L}_{\text{gen}}(\theta )$
10:   Obtain gradients $\frac{\partial L(\theta )}{\partial \theta}$ for training
11:   Add ${\widehat{\mathbf{x}}}_{t}$ to $B$
12: end while
Variable  Values 
initial learning rate  .0001 
learning epochs  150 
learning rate decay  .3 
learning rate decay epochs  50, 100 
SGLD steps $\eta $  20 
Buffer size  10000 
reinitialization frequency $\rho $  .05 
SGLD stepsize $\alpha $  1 
SGLD noise $\sigma $  .01 
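As an illustration of the staircase schedule listed in the table, the sketch below computes the learning rate for a given epoch. It assumes that the decay factor is applied multiplicatively at the start of each listed decay epoch and that epochs are counted from 0; the function name is hypothetical.

```python
def staircase_lr(epoch, base_lr=1e-4, decay=0.3, milestones=(50, 100)):
    """Learning rate at a 0-indexed epoch under a staircase decay schedule.

    Assumption: the rate is multiplied by `decay` once for every milestone
    epoch that has been reached.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed
```

Under these assumptions the rate is $10^{-4}$ for epochs 0 to 49, $3\times 10^{-5}$ for epochs 50 to 99, and $9\times 10^{-6}$ for the remaining epochs.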
Appendix B Sample Quality Evaluation
In this section we describe the details for reproducing the Inception Score (IS) and FID results reported in the paper. First we note that both IS and FID are scores computed based on a pretrained classifier network, and thus can be very dependent on the exact model/code repository used. For a more detailed discussion on the variability of IS, please refer to Barratt and Sharma (2018). To gauge our model against the other papers, we document our attempt to fairly compare the scores across papers in Table 6. As a direct comparison of IS, we obtained 8.76 using the code provided by Du and Mordatch (2019), which is better than their best reported score of 8.3. For FID, we used the official implementation from Heusel et al. (2017). Note that the FID computed with this repository is much worse than the one reported in Chen et al. (2019).
Conditional vs unconditional samples.
Since we are interested in training a hybrid model, our model, by definition, is a conditional generative model as it has access to label information. In Table 5, unconditional samples are samples obtained directly from running SGLD using $p(\mathbf{x})$. Conditional samples are obtained by taking the max of our $p(y|\mathbf{x})$ model. The reported scores are obtained by keeping the top 10 percentile samples with the highest $p(y|\mathbf{x})$ values. Scores obtained on a “single” model are computed directly on the training replay buffer of the last checkpoint. “Ensemble” scores are obtained by pooling 5 buffers from the last few epochs of training. As we initialize SGLD with uniform noise, using the training buffer is exactly the same as resampling from the model.
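The filtering step described above can be sketched as follows. This is an illustrative reading in which "top 10 percentile" means the 10% of samples with the largest $\max_y p(y|\mathbf{x})$; the function name and return format are hypothetical.

```python
def top_confident(samples, probs, keep_fraction=0.1):
    """Keep the samples whose max_y p(y|x) is in the top `keep_fraction`.

    `samples` is a list of items and `probs` a parallel list of class
    probability vectors. Returns (sample, confidence, predicted_class) tuples,
    most confident first.
    """
    scored = sorted(zip(samples, probs), key=lambda sp: max(sp[1]), reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [(s, max(p), p.index(max(p))) for s, p in scored[:n_keep]]
```

The predicted class returned for each kept sample is the label under which that sample would be treated as a conditional sample.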
Method  Conditional (single)  Conditional (ensemble)  Unconditional (single)  Unconditional (ensemble) 
JEM (Ours)  -  8.76  7.82  7.79 
EBM (D&M)  8.3  X  6.02  6.78 
Method  IS (from paper)  IS (B&S)  IS (D&M)  FID (from paper)  FID (H)  FID (D&M) 
Residual Flow  X  3.6  -  46.4  -  - 
Glow  X  -  3.9  48.9*  107  - 
JEM (Ours)  X  7.13  8.76  X  38.4  - 
JEM $p(\mathbf{x}|y)$ factored  X  -  6.36  X  61.8  - 
EBM (D&M)  8.3  -  8.3  37.9  -  37.9 
SNGAN  8.59  -  -  25.5  -  - 
NCSN  8.91  -  -  25.3  -  - 
Appendix C Further Hybrid Model Samples
Additional samples from CIFAR10 and SVHN can be seen in Figure 7, and samples from CIFAR100 can be seen in Figure 8.
Appendix D Qualitative Analysis of Samples
Visual quality is difficult to quantify. For known metrics such as IS and FID, using samples with higher $p(y|\mathbf{x})$ values results in higher scores, but this is not necessarily the case for samples with higher $\log p(\mathbf{x})$. However, this likely reflects shortcomings of the evaluation metrics themselves rather than true sample quality.
Based on our analysis (below), we find:

1. Our $\log p(\mathbf{x})$ model assigns values that cluster around different means for different classes. The automobile class has the highest $\log p(\mathbf{x})$: of all generated samples, the top 100 are all of this class.

2. Given the class, the samples with higher $\log p(\mathbf{x})$ values all have a white background and a centered object, while samples with lower $\log p(\mathbf{x})$ have colorful (e.g., forest-like) backgrounds.

3. Across all samples, higher $p(y|\mathbf{x})$ values correspond to clearly centered objects, and lower $p(y|\mathbf{x})$ values to the opposite.
Appendix E Calibration
E.1 Expected Calibration Error
Expected Calibration Error (ECE) is a metric to measure the calibration of a classifier. It works by first computing the confidence, ${\mathrm{max}}_{y}p(y|{\mathbf{x}}_{i})$, for each ${\mathbf{x}}_{i}$ in some dataset. We then group the items into equally spaced buckets ${\{{B}_{m}\}}_{m=1}^{M}$ based on the classifier’s output confidence. For example, if $M=20$, then ${B}_{1}$ would represent all examples for which the classifier’s confidence was between $0.0$ and $0.05$.
We then define:
$\text{ECE}=\sum_{m=1}^{M}\frac{|{B}_{m}|}{n}\left|\text{acc}({B}_{m})-\text{conf}({B}_{m})\right|$  (10) 
where $n$ is the number of examples in the dataset, acc$({B}_{m})$ is the averaged accuracy of the classifier of all examples in ${B}_{m}$ and conf$({B}_{m})$ is the averaged confidence over all examples in ${B}_{m}$.
For a perfectly calibrated classifier, this value will be 0 for any choice of $M$. In our analysis, we choose $M=20$ throughout.
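The computation in Eq. (10) can be sketched in a few lines of plain Python: the per-bucket gap between average accuracy and average confidence, weighted by the fraction of examples in the bucket. The function name and the half-open bucketing convention are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, num_bins=20):
    """ECE over equally spaced confidence buckets.

    `confidences` holds max_y p(y|x_i) for each example and `correct` holds
    whether the corresponding prediction was right. Bucket m (0-indexed here)
    covers [m/M, (m+1)/M); a confidence of exactly 1.0 is clamped into the
    last bucket. Empty buckets contribute nothing.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        acc = sum(1 for _, ok in bucket if ok) / len(bucket)
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(acc - avg_conf)
    return ece
```

For instance, a classifier that reports confidence 1.0 on four examples but is right on only two of them has a single non-empty bucket with accuracy 0.5 and confidence 1.0, giving an ECE of 0.5.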
E.2 Further results
We find that JEM also improves calibration on CIFAR10, as can be seen in Figure 13. There we see an improvement in calibration, but both classifiers are well calibrated because their accuracy is so high. In a more interesting experiment, we limit the size of the training set to 4,000 labeled examples. In this setting the accuracy drops to 78.0% and 74.9% in the baseline and JEM, respectively. Given that JEM can be trained on unlabeled data, we treat the remainder of the training set as unlabeled and train in a semi-supervised manner. We find this gives a noticeable boost in the classifier’s calibration, as seen in Figure 13. Surprisingly, this did not improve generalization. We leave exploring this phenomenon for future work.
(a) CIFAR10 Baseline  (b) CIFAR10 JEM 
(c) CIFAR100 Baseline (4k labels)  (d) CIFAR100 JEM (4k labels) 
Appendix F Out-Of-Distribution Detection
F.1 Experimental details
To obtain OOD results for unconditional Glow, we used the pretrained model and implementation of https://github.com/y0ast/GlowPyTorch. We also trained a class-conditional model using this codebase, which was used to generate the class-conditional OOD results.
We obtained the IGEBM of Du and Mordatch (2019) from their open-source implementation at https://github.com/openai/ebm_code_release. For likelihood and likelihood-gradient OOD scores we used their pretrained cifar10_large_model_uncond model. We were able to replicate the likelihood-based OOD results presented in their work. We implemented our likelihood-gradient approximate-mass score on top of their codebase. For predictive-distribution-based OOD scores we used their cifar_cond model, which was the model used in their work to generate their robustness results.
F.2 Further results
Table 7 contains results on two datasets, Constant and Uniform, which were omitted for space. Most models perform very well on the Uniform dataset. On the Constant dataset (all examples = 0) generative models mainly fail – with JEM being the only one whose likelihoods can be used to derive a predictive score function for OOD detection. Interestingly, we could not obtain approximate mass scores on this dataset from the Glow models due to numerical stability issues.
CIFAR10  
Score  Model  SVHN  Uniform  Constant  Interp  CIFAR100  CelebA 
$\log p(\mathbf{x})$  Unconditional Glow  .05  1.0  0.0  .51  .55  .57 
$\log p(\mathbf{x})$  Glow Supervised  .07  1.0  0.0  .45  .51  .53 
$\log p(\mathbf{x})$  IGEBM  .63  1.0  .30  $\mathbf{.70}$  .50  .70 
$\log p(\mathbf{x})$  JEM (Ours)  $\mathbf{.67}$  1.0  $\mathbf{.51}$  .65  $\mathbf{.67}$  $\mathbf{.75}$ 
${\mathrm{max}}_{y}p(y|\mathbf{x})$  WRN-baseline  $\mathbf{.93}$  $\mathbf{.97}$  $\mathbf{.99}$  $\mathbf{.77}$  .85  .62 
${\mathrm{max}}_{y}p(y|\mathbf{x})$  Class-Conditional Glow  .64  0.0  .82  .61  .65  .54 
${\mathrm{max}}_{y}p(y|\mathbf{x})$  IGEBM  .43  .05  .60  .69  .54  .69 
${\mathrm{max}}_{y}p(y|\mathbf{x})$  JEM (Ours)  .89  .41  .84  .75  $\mathbf{.87}$  $\mathbf{.79}$ 
$\left\|\frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}}\right\|$  Unconditional Glow  $\mathbf{.95}$  .99  NaN  .27  .46  .29 
$\left\|\frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}}\right\|$  Class-Conditional Glow  .47  .99  NaN  .01  .52  .59 
$\left\|\frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}}\right\|$  IGEBM  .84  .99  0.0  .65  .55  .66 
$\left\|\frac{\partial \log p(\mathbf{x})}{\partial \mathbf{x}}\right\|$  JEM (Ours)  .83  $\mathbf{1.0}$  $\mathbf{.75}$  $\mathbf{.78}$  $\mathbf{.82}$  $\mathbf{.79}$ 
Appendix G Attack Details and Further Robustness Results
We use foolbox (Rauber et al., 2017) for our experiments. PGD uses binary search to determine minimal epsilons for every input and we plot the resulting robustness-distortion curves. PGD runs with 20 random restarts and 40 iterations. For the boundary attack, we run default foolbox settings with one important difference. The random initialization often fails for JEM and thus we initialize the attack with a correctly classified input of another class. This other class is chosen based on the top-2 prediction for the image to be attacked. As all our attacks are expensive to run, we only attacked 300 randomly chosen inputs. The same randomly chosen inputs were used to attack each model.
In Figure 14 we see the results of the boundary attack and pointwise attack on JEM and a baseline. The main point of running these attacks was to demonstrate that our model was not able to “cheat” by having vanishing gradients through our gradient-based sampling procedure. Since PGD was more successful than these gradient-free methods, this is clearly not the case and the attacker was able to use the gradients of the sampling procedure to attack our model. Further, we observe the same behavior across all attacks; the EBM with 0 steps of sampling is more robust than the baseline and the robustness increases as we add more steps of sampling.
We also compare JEM to the IGEBM of Du and Mordatch (2019) with 10 steps of sampling refinement, see Figure 15. We run the same gradient-based attacks on their model and find that despite not having competitive clean accuracy, it is quite robust to large $\epsilon$ attacks – especially with respect to the ${L}_{\mathrm{\infty}}$ norm. After $\epsilon=12$ their model is more robust than ours and after $\epsilon=18$ it is more robust than the adversarial training baseline. With respect to the ${L}_{2}$ norm their model is more robust than the adversarial training baseline above $\epsilon=280$ but remains less robust than JEM until $\epsilon=525$.
We believe these results demonstrate that EBMs are a compelling class of models to explore for further work on building robust models.
(a) Boundary ${L}_{\mathrm{\infty}}$  (b) Boundary ${L}_{2}$ 
(c) Pointwise ${L}_{\mathrm{\infty}}$  (d) Pointwise ${L}_{2}$ 
(a) PGD ${L}_{\mathrm{\infty}}$  (b) PGD ${L}_{2}$ 
G.1 Expectation Over Transformations
Our SGLD-based refinement procedure is stochastic in nature and it has been shown that stochastic defenses to adversarial attacks can provide a false sense of security (Athalye et al., 2018). To deal with this, when we attack our stochastically refined classifiers, we average the classifier’s predictions over multiple samples of this refinement procedure. This makes the defense more deterministic and easier to attack. We redefine the logits of our classifier as:
$\log {p}_{n}^{k}(y|\mathbf{x})=\frac{1}{n}\sum_{i=1}^{n}\log p(y|{\mathbf{x}}_{i}),\quad {\mathbf{x}}_{i}\sim \text{SGLD}(\mathbf{x},k)$  (11) 
where we have defined SGLD$(\mathbf{x},k)$ as an SGLD chain run for $k$ steps seeded at $\mathbf{x}$. Intuitively, we draw $n$ different samples ${\{{\mathbf{x}}_{i}\}}_{i=1}^{n}$ from our model seeded at input $\mathbf{x}$, then compute $\log p(y|{\mathbf{x}}_{i})$ for each of these samples, then average the results. We then attack these averaged logits with PGD to generate the results in Figure 5. We experimented with different numbers of samples and found that 10 samples yields very similar results to 5 samples on JEM with one refinement step (see Figure 16). Because 10 samples took very long to run on the JEM model with ten refinement steps, we settled on using 5 samples in the results reported in the main body of the paper.
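The averaging in Eq. (11) can be sketched as below. The classifier and the refinement step are passed in as callables; `toy_log_probs` and `toy_refine` are hypothetical stand-ins (a two-class model on a scalar input and a noise-only perturbation) that only exercise the averaging logic and are not the model or the SGLD chain used in the experiments.

```python
import math
import random

def eot_log_probs(x, log_prob_fn, refine_fn, n=5, k=1, rng=None):
    """Average log p(y|x_i) over n stochastic refinements x_i ~ refine_fn(x, k)."""
    if rng is None:
        rng = random.Random(0)
    total = None
    for _ in range(n):
        x_i = refine_fn(x, k, rng)      # one draw of the refinement procedure
        lp = log_prob_fn(x_i)           # log p(y|x_i) for every class y
        total = list(lp) if total is None else [a + b for a, b in zip(total, lp)]
    return [t / n for t in total]

def toy_log_probs(x):
    """Stand-in classifier: log-softmax of the logits [x, -x]."""
    logits = [x, -x]
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def toy_refine(x, k, rng):
    """Stand-in for SGLD(x, k): k steps of small Gaussian noise, no gradient."""
    for _ in range(k):
        x = x + 0.01 * rng.gauss(0.0, 1.0)
    return x

averaged = eot_log_probs(0.5, toy_log_probs, toy_refine, n=5, k=1)
```

In an actual attack, the gradient of these averaged values with respect to the input would be taken through both the classifier and the refinement steps.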
(a) PGD ${L}_{\mathrm{\infty}}$  (b) PGD ${L}_{2}$ 
G.2 Transfer Attacks
We would like to see if JEM’s refinement procedure can correct adversarially perturbed inputs – inputs which cause the model to fail. To do this, we generate a series of adversarial examples for JEM-0, with respect to the ${L}_{\mathrm{\infty}}$ norm, and test the accuracy of JEM-{1,10} on these examples. Ideally, with further refinement the accuracy will increase. The results of this experiment can be seen in Figure 17. We see here that JEM’s refinement procedure can correct for adversarial perturbations.
(a) PGD ${L}_{\mathrm{\infty}}$ 
Appendix H A Discussion on Samplers
H.1 Improper SGLD
Recall the transition kernel of SGLD:
${\mathbf{x}}_{0}\sim {p}_{0}(\mathbf{x})$
${\mathbf{x}}_{i+1}={\mathbf{x}}_{i}-\frac{\alpha}{2}\frac{\partial {E}_{\theta}({\mathbf{x}}_{i})}{\partial {\mathbf{x}}_{i}}+\epsilon,\quad \epsilon\sim \mathcal{N}(0,\alpha )$
In the proper formulation of this sampler (Welling and Teh, 2011), the step-size and the variance of the Gaussian noise are related: $\text{Var}(\epsilon)=\alpha $. If the step-size is decayed with a polynomial schedule, then samples from SGLD converge to samples from our unnormalized density as the number of steps goes to $\mathrm{\infty}$.
In practice, we approximate these samples with a sampler that runs for a finite number of steps. When using the proper step-size to noise ratio, the signal from the gradient is overtaken by the noise when step-sizes are large enough to be informative. In practice the sampler is typically “relaxed” in that different values are used for the step-size and the amount of Gaussian noise added – typically the amount of noise is significantly reduced.
While we are no longer working with a valid MCMC sampler, this approximation has been successfully applied in practice in most recent work scaling EBM training to high-dimensional data (Nijkamp et al., 2019b; a; Du and Mordatch, 2019) with the exception of Song and Ermon (2019) (which develops a clever work-around). The model they train is actually an ensemble of models trained on data with different amounts of noise added. They use a proper SGLD sampler, decaying the step size as they sample, moving from their high-noise models to their low-noise models. This provides one possible explanation for the compelling results of their model.
In our work we have set the step-size $\alpha =2$ and draw $\epsilon\sim \mathcal{N}(0,{.01}^{2})$. We have found these parameters to work well across a variety of datasets, domains, architectures, and sampling procedures (persistent vs. short-run). We believe they are a decent “starting place” for energy functions parameterized by deep neural networks.
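A minimal sketch of this relaxed sampler on a one-dimensional toy problem is given below. The quadratic energy, its gradient `toy_grad`, and the function names are hypothetical and chosen only so that the example runs without an autodiff library; the update follows the transition kernel above with the noise scale decoupled from the step-size.

```python
import random

def sgld(grad_energy, x0, steps=20, alpha=2.0, sigma=0.01, rng=None):
    """Relaxed SGLD: x <- x - (alpha / 2) * dE/dx + sigma * N(0, 1).

    The proper sampler would use sigma = sqrt(alpha); here sigma is a
    separate, much smaller value, so this is not a valid MCMC kernel.
    """
    if rng is None:
        rng = random.Random(0)
    x = x0
    for _ in range(steps):
        x = x - 0.5 * alpha * grad_energy(x) + sigma * rng.gauss(0.0, 1.0)
    return x

def toy_grad(x):
    """Gradient of the toy energy E(x) = (x - 3)^2 / 8, minimized at x = 3."""
    return (x - 3.0) / 4.0

sample = sgld(toy_grad, x0=-1.0)
```

With such a small noise scale the chain behaves much like noisy gradient descent on the energy, which is consistent with the observation above that the relaxed sampler is no longer a valid MCMC procedure.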
H.2 Persistent or Shortrun Chains?
Both persistent and short-run Markov chains have been able to successfully train EBMs. Nijkamp et al. (2019a) presents a careful study of various samplers which can be used and the trade-offs one makes when choosing one sampler over another. In our work we have found that if computation allows, short-run MCMC chains are preferable in terms of training stability. Given that each step of SGLD requires approximately the computation of 1 training iteration of a standard classifier, we are incentivized to find a sampler which can stably train EBMs requiring as few steps as possible per training iteration.
In our experiments we found the smallest number of SGLD steps we could take to stably train an EBM at the scale of this work was 80 steps. Even so, these models eventually would diverge late into training. At 80 steps, we found the cost of training to be prohibitively high compared to a standard classifier.
We found that by using persistent Markov chains, we could further reduce the number of steps per iteration to 20 and still allow for relatively stable training. This gave a 4x speedup over our fastest short-run MCMC sampler. Still, this PCD sampler was noticeably less stable than the fastest short-run sampler we could use, but we found the multiple-factor increase in speed to be a worthwhile trade-off.
If time allows, we recommend using a short-run MCMC sampler with a large enough number of steps to be stable. Given that this is not always possible on problems of scale, PCD can be made to work more efficiently, but at the cost of a greater number of stability-related hyperparameters. These additional parameters include the buffer size and the reinitialization frequency of the Markov chains. We found both to be important for training stability and found no general recipe for how to set them. We ran most of our experiments with the reinitialization frequency at $5\%$.
A particularly interesting observation we made while using PCD is that the model would use the length of the Markov chains to encode semantic information. We found that when training models on CIFAR10, when chains were young they almost always could be identified as frogs. When chains were old they could almost always be identified as cars. This behavior is likely some degeneracy of PCD which would not be possible with short-run MCMC since all chains have the same length.
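The two PCD hyperparameters mentioned above, buffer size and reinitialization frequency, can be made concrete with a small replay buffer sketch. The class name, the choice to fill the buffer with uniform noise at construction, and the choice to write refined samples back into the slots they were drawn from are assumptions made for illustration.

```python
import random

class ReplayBuffer:
    """Persistent chain states for PCD, stored as flat lists of floats."""

    def __init__(self, size, dim, rng=None):
        self.rng = rng if rng is not None else random.Random(0)
        self.dim = dim
        # Assumption: every chain starts from uniform noise in [-1, 1].
        self.items = [self._noise() for _ in range(size)]

    def _noise(self):
        return [self.rng.uniform(-1.0, 1.0) for _ in range(self.dim)]

    def sample(self, batch_size, reinit_prob=0.05):
        """Pick chain starts: fresh noise with probability reinit_prob, else a stored state."""
        idx = [self.rng.randrange(len(self.items)) for _ in range(batch_size)]
        starts = [
            self._noise() if self.rng.random() < reinit_prob else list(self.items[i])
            for i in idx
        ]
        return idx, starts

    def update(self, idx, samples):
        """Store the refined samples back into the slots they were drawn from."""
        for i, s in zip(idx, samples):
            self.items[i] = list(s)
```

A training iteration would call `sample`, run the SGLD steps from the returned starting points, and then call `update` with the final chain states.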
H.3 Dealing with Instability
Training a model with the gradient estimator of Eq. (2) can be quite unstable – especially when combined with other objectives, as was the case with all models presented in this work. There exists a “stable region” of sorts when training these models where the energy values of the true data are in the same range as the energy values of the generated samples. Intuitively, if the generated samples create energies that are not trivially separated from the training data, then real learning has to take place. Nijkamp et al. (2019b; a) provide a careful analysis of this and we refer the reader there for a more in-depth analysis.
We find that when using PCD, occasionally throughout training a sample will be drawn from the replay buffer that has a considerably higher-than-average energy (higher than the energy of a random initialization). This causes the gradients w.r.t. this example to be orders of magnitude larger than gradients w.r.t. the rest of the examples and causes the model to diverge. We tried a number of heuristic approaches such as gradient clipping, energy clipping, ignoring examples with atypical energy values, and many others but could not find an approach that stabilized training and did not hurt generative and discriminative performance.
The only two approaches we found to consistently increase the stability of a model which has diverged are to 1) decrease the learning rate and 2) increase the number of SGLD steps in each PCD iteration. Unfortunately, both of these approaches slow down learning. We also had some success simply restarting models from a saved checkpoint with a different random seed. This was the main approach taken unless the model was late into training. In this case, random restarts were less effective and we increased the number of SGLD steps from 20 to 40, which stabilized training.
While we are very optimistic about the future of large-scale EBMs, we believe these are the most important issues that must be addressed in order for these models to be successful.