Abstract
Adversarial examples have attracted significant attention in machinelearning, but the reasons for their existence and pervasiveness remain unclear.We demonstrate that adversarial examples can be directly attributed to thepresence of nonrobust features: features derived from patterns in the datadistribution that are highly predictive, yet brittle and incomprehensible tohumans. After capturing these features within a theoretical framework, weestablish their widespread existence in standard datasets. Finally, we presenta simple setting where we can rigorously tie the phenomena we observe inpractice to a misalignment between the (humanspecified) notion of robustnessand the inherent geometry of the data.
Quick Read (beta)
Adversarial Examples Are Not Bugs, They Are Features
Abstract
Adversarial examples have attracted significant attention in machine learning, but the reasons for their existence and pervasiveness remain unclear. We demonstrate that adversarial examples can be directly attributed to the presence of nonrobust features: features (derived from patterns in the data distribution) that are highly predictive, yet brittle and (thus) incomprehensible to humans. After capturing these features within a theoretical framework, we establish their widespread existence in standard datasets. Finally, we present a simple setting where we can rigorously tie the phenomena we observe in practice to a misalignment between the (humanspecified) notion of robustness and the inherent geometry of the data.
bibliography/bib.bib \newfloatcommandcapbtabboxtable[][\FBwidth]
1 Introduction
The pervasive brittleness of deep neural networks \citepszegedy2014intriguing,engstrom2019rotation,hendrycks2019benchmarking,athalye2018synthesizing has attracted significant attention in recent years. Particularly worrisome is the phenomenon of adversarial examples \citepbiggio2013evasion,szegedy2014intriguing, imperceptibly perturbed natural inputs that induce erroneous predictions in stateoftheart classifiers. Previous work has proposed a variety of explanations for this phenomenon, ranging from theoretical models \citepschmidt2018adversarially,bubeck2018adversarial to arguments based on concentration of measure in highdimensions \citepgilmer2018adversarial,mahloujifar2018curse,shafahi2019are. These theories, however, are often unable to fully capture behaviors we observe in practice (we discuss this further in Section 5).
More broadly, previous work in the field tends to view adversarial examples as aberrations arising either from the high dimensional nature of the input space or statistical fluctuations in the training data [szegedy2014intriguing, goodfellow2015explaining, gilmer2018adversarial]. From this point of view, it is natural to treat adversarial robustness as a goal that can be disentangled and pursued independently from maximizing accuracy \citepmadry2018towards,stutz2019disentangling,suggala2019adversarial, either through improved standard regularization methods \citeptanay2016boundary or pre/postprocessing of network inputs/outputs [uesato2018adversarial, carlini2017adversarial, he2017adversarial].
In this work, we propose a new perspective on the phenomenon of adversarial examples. In contrast to the previous models, we cast adversarial vulnerability as a fundamental consequence of the dominant supervised learning paradigm. Specifically, we claim that:
Adversarial vulnerability is a direct result of our models’ sensitivity to wellgeneralizing features in the data.
Recall that we usually train classifiers to solely maximize (distributional) accuracy. Consequently, classifiers tend to use any available signal to do so, even those that look incomprehensible to humans. After all, the presence of “a tail” or “ears” is no more natural to a classifier than any other equally predictive feature. In fact, we find that standard ML datasets do admit highly predictive yet imperceptible features. We posit that our models learn to rely on these “nonrobust” features, leading to adversarial perturbations that exploit this dependence.^{2}^{2} 2 It is worth emphasizing that while our findings demonstrate that adversarial vulnerability does arise from nonrobust features, they do not preclude the possibility of adversarial vulnerability also arising from other phenomena [tanay2016boundary, schmidt2018adversarially]. For example, \citetnakkiran2019bugs constructs adversarial examples that do not exploit nonrobust features (and hence do not allow one to learn a generalizing model from them). Still, the mere existence of useful nonrobust features suffices to establish that without explicitly discouraging models from utilizing these features, adversarial vulnerability will remain an issue.
Our hypothesis also suggests an explanation for adversarial transferability: the phenomenon that adversarial perturbations computed for one model often transfer to other, independently trained models. Since any two models are likely to learn similar nonrobust features, perturbations that manipulate such features will apply to both. Finally, this perspective establishes adversarial vulnerability as a humancentric phenomenon, since, from the standard supervised learning point of view, nonrobust features can be as important as robust ones. It also suggests that approaches aiming to enhance the interpretability of a given model by enforcing “priors” for its explanation [mahendran2015understanding, olah2017feature, smilkov2017smoothgrad] actually hide features that are “meaningful” and predictive to standard models. As such, producing humanmeaningful explanations that remain faithful to underlying models cannot be pursued independently from the training of the models themselves.
To corroborate our theory, we show that it is possible to disentangle robust from nonrobust features in standard image classification datasets. Specifically, given any training dataset, we are able to construct:

1.
A “robustified” version for robust classification (Figure 0(a))^{3}^{3} 3 The corresponding datasets for CIFAR10 are publicly available at http://git.io/advdatasets. . We demonstrate that it is possible to effectively remove nonrobust features from a dataset. Concretely, we create a training set (semantically similar to the original) on which standard training yields good robust accuracy on the original, unmodified test set. This finding establishes that adversarial vulnerability is not necessarily tied to the standard training framework, but is also a property of the dataset.

2.
A “nonrobust” version for standard classification (Figure 0(b))^{2}^{2}footnotemark: 2 . We are also able to construct a training dataset for which the inputs are nearly identical to the originals, but all appear incorrectly labeled. In fact, the inputs in the new training set are associated to their labels only through small adversarial perturbations (and hence utilize only nonrobust features). Despite the lack of any predictive humanvisible information, training on this dataset yields good accuracy on the original, unmodified test set. This demonstrates that adversarial perturbations can arise from flipping features in the data that are useful for classification of correct inputs (hence not being purely aberrations).
Finally, we present a concrete classification task where the connection between adversarial examples and nonrobust features can be studied rigorously. This task consists of separating Gaussian distributions, and is loosely based on the model presented in \citettsipras2019robustness, while expanding upon it in a few ways. First, adversarial vulnerability in our setting can be precisely quantified as a difference between the intrinsic data geometry and that of the adversary’s perturbation set. Second, robust training yields a classifier which utilizes a geometry corresponding to a combination of these two. Lastly, the gradients of standard models can be significantly more misaligned with the interclass direction, capturing a phenomenon that has been observed in practice in more complex scenarios \citeptsipras2019robustness.
2 The Robust Features Model
We begin by developing a framework, loosely based on the setting proposed by \citettsipras2019robustness, that enables us to rigorously refer to “robust” and “nonrobust” features. In particular, we present a set of definitions which allow us to formally describe our setup, theoretical results, and empirical evidence.
Setup.
We consider binary classification^{4}^{4} 4 Our framework can be straightforwardly adapted though to the multiclass setting., where inputlabel pairs $(x,y)\in \mathcal{X}\times \{\pm 1\}$ are sampled from a (data) distribution $\mathcal{D}$; the goal is to learn a classifier $C:\mathcal{X}\to \{\pm 1\}$ which predicts a label $y$ corresponding to a given input $x$.
We define a feature to be a function mapping from the input space $\mathcal{X}$ to the real numbers, with the set of all features thus being $\mathcal{F}=\{f:\mathcal{X}\to \mathbb{R}\}$. For convenience, we assume that the features in $\mathcal{F}$ are shifted/scaled to be meanzero and unitvariance (i.e., so that ${\mathbb{E}}_{(x,y)\sim \mathcal{D}}[f(x)]=0$ and ${\mathbb{E}}_{(x,y)\sim \mathcal{D}}[f{(x)}^{2}]=1$), in order to make the following definitions scaleinvariant^{5}^{5} 5 This restriction can be straightforwardly removed by simply shifting/scaling the definitions.. Note that this formal definition also captures what we abstractly think of as features (e.g., we can construct an $f$ that captures how “furry” an image is).
Useful, robust, and nonrobust features.
We now define the key concepts required for formulating our framework. To this end, we categorize features in the following manner:

•
$\rho $useful features: For a given distribution $\mathcal{D}$, we call a feature $f$ $\rho $useful ($\rho >0$) if it is correlated with the true label in expectation, that is if
$${\mathbb{E}}_{(x,y)\sim \mathcal{D}}[y\cdot f(x)]\ge \rho .$$ (1) We then define ${\rho}_{\mathcal{D}}(f)$ as the largest $\rho $ for which feature $f$ is $\rho $useful under distribution $\mathcal{D}$. (Note that if a feature $f$ is negatively correlated with the label, then $f$ is useful instead.) Crucially, a linear classifier trained on $\rho $useful features can attain nontrivial generalization performance.

•
$\gamma $robustly useful features: Suppose we have a $\rho $useful feature $f$ (${\rho}_{\mathcal{D}}(f)>0$). We refer to $f$ as a robust feature (formally a $\gamma $robustly useful feature for $\gamma >0$) if, under adversarial perturbation (for some specified set of valid perturbations $\mathrm{\Delta}$), $f$ remains $\gamma $useful. Formally, if we have that
$${\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[\underset{\delta \in \mathrm{\Delta}(x)}{inf}y\cdot f(x+\delta )\right]\ge \gamma .$$ (2) 
•
Useful, nonrobust features: A useful, nonrobust feature is a feature which is $\rho $useful for some $\rho $ bounded away from zero, but is not a $\gamma $robust feature for any $\gamma \ge 0$. These features help with classification in the standard setting, but may hinder accuracy in the adversarial setting, as the correlation with the label can be flipped.
Classification.
In our framework, a classifier $C=(F,w,b)$ is comprised of a set of features $F\subseteq \mathcal{F}$, a weight vector $w$, and a scalar bias $b$. For a given input $x$, the classifier predicts the label $y$ as
$$C(x)=\text{sgn}\left(b+\sum _{f\in F}{w}_{f}\cdot f(x)\right).$$ 
For convenience, we denote the set of features learned by a classifier $C$ as ${F}_{C}$.
Standard Training.
Training a classifier is performed by minimizing a loss function (via empirical risk minimization (ERM)) that decreases with the correlation between the weighted combination of the features and the label. The simplest example of such a loss is ^{6}^{6} 6 Just as for the other parts of this model, we use this loss for simplicity only—it is straightforward to generalize to more practical loss function such as logistic or hinge loss.
$${\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[{\mathcal{L}}_{\theta}(x,y)\right]={\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[y\cdot \left(b+\sum _{f\in F}{w}_{f}\cdot f(x)\right)\right].$$  (3) 
When minimizing classification loss, no distinction exists between robust and nonrobust features: the only distinguishing factor of a feature is its $\rho $usefulness. Furthermore, the classifier will utilize any $\rho $useful feature in $F$ to decrease the loss of the classifier.
Robust training.
In the presence of an adversary, any useful but nonrobust features can be made anticorrelated with the true label, leading to adversarial vulnerability. Therefore, ERM is no longer sufficient to train classifiers that are robust, and we need to explicitly account for the effect of the adversary on the classifier. To do so, we use an adversarial loss function that can discern between robust and nonrobust features \citepmadry2018towards:
$${\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[\underset{\delta \in \mathrm{\Delta}(x)}{\mathrm{max}}{\mathcal{L}}_{\theta}(x+\delta ,y)\right],$$  (4) 
for an appropriately defined set of perturbations $\mathrm{\Delta}$. Since the adversary can exploit nonrobust features to degrade classification accuracy, minimizing this adversarial loss (as in adversarial training \citepgoodfellow2015explaining,madry2018towards) can be viewed as explicitly preventing the classifier from learning a useful but nonrobust combination of features.
Remark.
We want to note that even though the framework above enables us to formally describe and predict the outcome of our experiments, it does not necessarily capture the notion of nonrobust features exactly as we intuitively might think of them. For instance, in principle, our theoretical framework would allow for useful nonrobust features to arise as combinations of useful robust features and useless nonrobust features [goh2019discussion]. These types of constructions, however, are actually precluded by our experimental results (in particular, the classifiers trained in Section 3 would not generalize). This shows that our experimental findings capture a stronger, more finegrained statement than our formal definitions are able to express. We view bridging this gap as an interesting direction for future work.
3 Finding Robust (and NonRobust) Features
The central premise of our proposed framework is that there exist both robust and nonrobust features that constitute useful signals for standard classification. We now provide evidence in support of this hypothesis by disentangling these two sets of features.
On one hand, we will construct a “robustified” dataset, consisting of samples that primarily contain robust features. Using such a dataset, we are able to train robust classifiers (with respect to the standard test set) using standard (i.e., nonrobust) training. This demonstrates that robustness can arise by removing certain features from the dataset (as, overall, the new dataset contains less information about the original training set). Moreover, it provides evidence that adversarial vulnerability is caused by nonrobust features and is not inherently tied to the standard training framework.
On the other hand, we will construct datasets where the inputlabel association is based purely on nonrobust features (and thus the corresponding dataset appears completely mislabeled to humans). We show that this dataset suffices to train a classifier with good performance on the standard test set. This indicates that natural models use nonrobust features to make predictions, even in the presence of robust features. These features alone are actually sufficient for nontrivial generalizations performance on natural images, which indicates that they are indeed valuable features, rather than artifacts of finitesample overfitting.
A conceptual description of these experiments can be found in Figure 1.
3.1 Disentangling robust and nonrobust features
Recall that the features a classifier learns to rely on are based purely on how useful these features are for (standard) generalization. Thus, under our conceptual framework, if we can ensure that only robust features are useful, standard training should result in a robust classifier. Unfortunately, we cannot directly manipulate the features of very complex, highdimensional datasets. Instead, we will leverage a robust model and modify our dataset to contain only the features that are relevant to that model.
In terms of our formal framework (Section 2), given a robust (i.e., adversarially trained [madry2018towards]) model $C$ we aim to construct a distribution ${\widehat{\mathcal{D}}}_{R}$ which satisfies:
$${\mathbb{E}}_{(x,y)\sim {\widehat{\mathcal{D}}}_{R}}\left[f(x)\cdot y\right]=\{\begin{array}{cc}{\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[f(x)\cdot y\right]\hfill & \text{if}f\in {F}_{C}\hfill \\ 0\hfill & \text{otherwise},\hfill \end{array}$$  (5) 
where ${F}_{C}$ again represents the set of features utilized by $C$. Conceptually, we want features used by $C$ to be as useful as they were on the original distribution $\mathcal{D}$ while ensuring that the rest of the features are not useful under ${\widehat{\mathcal{D}}}_{NR}$.
We will construct a training set for ${\widehat{\mathcal{D}}}_{R}$ via a onetoone mapping $x\mapsto {x}_{r}$ from the original training set for $\mathcal{D}$. In the case of a deep neural network, ${F}_{C}$ corresponds to exactly the set of activations in the penultimate layer (since these correspond to inputs to a linear classifier). To ensure that features used by the model are equally useful under both training sets, we (approximately) enforce all features in ${F}_{C}$ to have similar values for both $x$ and ${x}_{r}$ through the following optimization:
$$\underset{{x}_{r}}{\mathrm{min}}{\parallel g({x}_{r})g(x)\parallel}_{2},$$  (6) 
where $x$ is the original input and $g$ is the mapping from $x$ to the representation layer. We optimize this objective using gradient descent in input space^{7}^{7} 7 We follow [madry2018towards] and normalize gradient steps during this optimization. Experimental details are provided in Appendix C..
Since we don’t have access to features outside ${F}_{C}$, there is no way to ensure that the expectation in (5) is zero for all $f\notin {F}_{C}$. To approximate this condition, we choose the starting point of gradient descent for the optimization in (6) to be an input ${x}_{0}$ which is drawn from $\mathcal{D}$ independently of the label of $x$ (we also explore sampling ${x}_{0}$ from noise in Appendix D.1). This choice ensures that any feature present in that input will not be useful since they are not correlated with the label in expectation over ${x}_{0}$. The underlying assumption here is that, when performing the optimization in (6), features that are not being directly optimized (i.e., features outside ${F}_{C}$) are not affected. We provide pseudocode for the construction in Figure 5 (Appendix C).
Given the new training set for ${\widehat{\mathcal{D}}}_{R}$ (a few random samples are visualized in Figure 1(a)), we train a classifier using standard (nonrobust) training. We then test this classifier on the original test set (i.e. $\mathcal{D}$). The results (Figure 1(b)) indicate that the classifier learned using the new dataset attains good accuracy in both standard and adversarial settings ^{8}^{8} 8 In an attempt to explain the gap in accuracy between the model trained on ${\widehat{\mathcal{D}}}_{R}$ and the original robust classifier $C$, we test distributional shift, by reporting results on the “robustified” test set in Appendix D.3. ^{9}^{9} 9 In order to gain more confidence in the robustness of the resulting model, we attempt several diverse attacks in Appendix D.2..
As a control, we repeat this methodology using a standard (nonrobust) model for $C$ in our construction of the dataset. Sample images from the resulting “nonrobust dataset” ${\widehat{\mathcal{D}}}_{NR}$ are shown in Figure 1(a)—they tend to resemble more the source image of the optimization ${x}_{0}$ than the target image $x$. We find that training on this dataset leads to good standard accuracy, yet yields almost no robustness (Figure 1(b)). We also verify that this procedure is not simply a matter of encoding the weights of the original model—we get the same results for both ${\widehat{\mathcal{D}}}_{R}$ and ${\widehat{\mathcal{D}}}_{NR}$ if we train with different architectures than that of the original models.
Overall, our findings corroborate the hypothesis that adversarial examples can arise from (nonrobust) features of the data itself. By filtering out nonrobust features from the dataset (e.g. by restricting the set of available features to those used by a robust model), one can train a significantly more robust model using standard training.
3.2 Nonrobust features suffice for standard classification
The results of the previous section show that by restricting the dataset to only contain features that are used by a robust model, standard training results in classifiers that are significantly more robust. This suggests that when training on the standard dataset, nonrobust features take on a large role in the resulting learned classifier. Here we set out to show that this role is not merely incidental or due to finitesample overfitting. In particular, we demonstrate that nonrobust features alone suffice for standard generalization— i.e., a model trained solely on nonrobust features can perform well on the standard test set.
To show this, we construct a dataset where the only features that are useful for classification are nonrobust features (or in terms of our formal model from Section 2, all features $f$ that are $\rho $useful are nonrobust). To accomplish this, we modify each inputlabel pair $(x,y)$ as follows. We select a target class $t$ either (a) uniformly at random among classes (hence features become uncorrelated with the labels) or (b) deterministically according to the source class (e.g. using a fixed permutation of labels). Then, we add a small adversarial perturbation to $x$ in order to ensure it is classified as $t$ by a standard model. Formally:
$${x}_{adv}=\underset{\parallel {x}^{\prime}x\parallel \le \epsilon}{\mathrm{arg}\mathrm{min}}{L}_{C}({x}^{\prime},t),$$  (7) 
where ${L}_{C}$ is the loss under a standard (nonrobust) classifier $C$ and $\epsilon $ is a small constant. The resulting inputs are nearly indistinguishable from the originals (Appendix D Figure 9)—to a human observer, it thus appears that the label $t$ assigned to the modified input is simply incorrect. The resulting inputlabel pairs $({x}_{adv},t)$ make up the new training set (pseudocode in Appendix C Figure 6).
Now, since $\parallel {x}_{adv}x\parallel $ is small, by definition the robust features of ${x}_{adv}$ are still correlated with class $y$ (and not $t$) in expectation over the dataset. After all, humans still recognize the original class. On the other hand, since every ${x}_{adv}$ is strongly classified as $t$ by a standard classifier, it must be that some of the nonrobust features are now strongly correlated with $t$ (in expectation).
In the case where $t$ is chosen at random, the robust features are originally uncorrelated with the label $t$ (in expectation), and after the adversarial perturbation can be only slightly correlated (hence being significantly less useful for classification than before) ^{10}^{10} 10 \citetgoh2019leakage provides an approach to quantifying this “robust feature leakage” and finds that one can obtain a (small) amount of test accuracy by leveraging robust feature leakage on ${\widehat{\mathcal{D}}}_{rand}$.. Formally, we aim to construct a dataset ${\widehat{\mathcal{D}}}_{rand}$ where ^{11}^{11} 11 Note that the optimization procedure we describe aims to merely approximate this condition, where we once again use trained models to simulate access to robust and nonrobust features. :
$${\mathbb{E}}_{(x,y)\sim {\widehat{\mathcal{D}}}_{rand}}\left[y\cdot f(x)\right]\{\begin{array}{cc}>0\hfill & \text{if}f\text{nonrobustly useful under}\mathcal{D},\hfill \\ \simeq 0\hfill & \text{otherwise.}\hfill \end{array}$$  (8) 
In contrast, when $t$ is chosen deterministically based on $y$, the robust features actually point away from the assigned label $t$. In particular, all of the inputs labeled with class $t$ exhibit nonrobust features correlated with $t$, but robust features correlated with the original class $y$. Thus, robust features on the original training set provide significant predictive power on the training set, but will actually hurt generalization on the standard test set. Viewing this case again using the formal model, our goal is to construct ${\widehat{\mathcal{D}}}_{det}$ such that
$$  (9) 
We find that standard training on these datasets actually generalizes to the original test set, as shown in Table 1). This indicates that nonrobust features are indeed useful for classification in the standard setting. Remarkably, even training on ${\widehat{\mathcal{D}}}_{det}$ (where all the robust features are correlated with the wrong class), results in a wellgeneralizing classifier. This indicates that nonrobust features can be picked up by models during standard training, even in the presence of robust features that are predictive ^{13}^{13} 13 Additional results and analysis (e.g. training curves, generating ${\widehat{\mathcal{D}}}_{rand}$ and ${\widehat{\mathcal{D}}}_{det}$ with a robust model, etc.) are in App. D.6 and D.5^{14}^{14} 14 We also show that the models trained on ${\widehat{\mathcal{D}}}_{rand}$ and ${\widehat{\mathcal{D}}}_{det}$ generalize to CIFAR10.1 [recht2018cifar10] in Appendix D.7..
Source Dataset  Dataset  

CIFAR10  ImageNet${}_{R}$  
$\mathcal{D}$  95.3%  96.6% 
${\widehat{\mathcal{D}}}_{rand}$  63.3%  87.9% 
${\widehat{\mathcal{D}}}_{det}$  43.7%  64.4% 
3.3 Transferability can arise from nonrobust features
One of the most intriguing properties of adversarial examples is that they transfer across models with different architectures and independently sampled training sets \citepszegedy2014intriguing,papernot2016transferability,charles2019geometric. Here, we show that this phenomenon can in fact be viewed as a natural consequence of the existence of nonrobust features. Recall that, according to our main thesis, adversarial examples can arise as a result of perturbing wellgeneralizing, yet brittle features. Given that such features are inherent to the data distribution, different classifiers trained on independent samples from that distribution are likely to utilize similar nonrobust features. Consequently, an adversarial example constructed by exploiting the nonrobust features learned by one classifier will transfer to any other classifier utilizing these features in a similar manner.
In order to illustrate and corroborate this hypothesis, we train five different architectures on the dataset generated in Section 3.2 (adversarial examples with deterministic labels) for a standard ResNet50 [he2016deep]. Our hypothesis would suggest that architectures which learn better from this training set (in terms of performance on the standard test set) are more likely to learn similar nonrobust features to the original classifier. Indeed, we find that the test accuracy of each architecture is predictive of how often adversarial examples transfer from the original model to standard classifiers with that architecture (Figure 3). In a similar vein, \citetnakkiran2019bugs constructs a set of adversarial perturbations that is explicitly nontransferable and finds that these perturbations cannot be used to learn a good classifier. These findings thus corroborate our hypothesis that adversarial transferability arises when models learn similar brittle features of the underlying dataset.
4 A Theoretical Framework for Studying (Non)Robust Features
The experiments from the previous section demonstrate that the conceptual framework of robust and nonrobust features is strongly predictive of the empirical behavior of stateoftheart models on realworld datasets. In order to further strengthen our understanding of the phenomenon, we instantiate the framework in a concrete setting that allows us to theoretically study various properties of the corresponding model. Our model is similar to that of \citettsipras2019robustness in the sense that it contains a dichotomy between robust and nonrobust features, but extends upon it in a number of ways:

1.
The adversarial vulnerability can be explicitly expressed as a difference between the inherent data metric and the ${\mathrm{\ell}}_{2}$ metric.

2.
Robust learning corresponds exactly to learning a combination of these two metrics.

3.
The gradients of adversarially trained models align better with the adversary’s metric.
Setup.
We study a simple problem of maximum likelihood classification between two Gaussian distributions. In particular, given samples $(x,y)$ sampled from $\mathcal{D}$ according to
$$y\stackrel{\text{u.a.r.}}{\sim}\{1,+1\},x\sim \mathcal{N}(y\cdot {\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*}),$$  (10) 
our goal is to learn parameters $\mathrm{\Theta}=(\bm{\mu},\mathbf{\Sigma})$ such that
$$\mathrm{\Theta}=\mathrm{arg}\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[\mathrm{\ell}(x;y\cdot \bm{\mu},\mathbf{\Sigma})\right],$$  (11) 
where $\mathrm{\ell}(x;\bm{\mu},\mathbf{\Sigma})$ represents the Gaussian negative loglikelihood (NLL) function. Intuitively, we find the parameters $\bm{\mu},\mathbf{\Sigma}$ which maximize the likelihood of the sampled data under the given model. Classification under this model can be accomplished via likelihood test: given an unlabeled sample $x$, we predict $y$ as
$$y=\mathrm{arg}\underset{y}{\mathrm{max}}\mathrm{\ell}(x;y\cdot \bm{\mu},\mathbf{\Sigma})=\text{sign}\left({x}^{\top}{\mathbf{\Sigma}}^{1}\bm{\mu}\right).$$ 
In turn, the robust analogue of this problem arises from replacing $\mathrm{\ell}(x;y\cdot \bm{\mu},\mathbf{\Sigma})$ with the NLL under adversarial perturbation. The resulting robust parameters ${\mathrm{\Theta}}_{r}$ can be written as
${\mathrm{\Theta}}_{r}=\mathrm{arg}\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{(x,y)\sim \mathcal{D}}\left[\underset{{\parallel \delta \parallel}_{2}\le \epsilon}{\mathrm{max}}\mathrm{\ell}(x+\delta ;y\cdot \bm{\mu},\mathbf{\Sigma})\right],$  (12) 
A detailed analysis of this setting is in Appendix E—here we present a highlevel overview of the results.
(1) Vulnerability from metric misalignment (nonrobust features).
Note that in this model, one can rigorously make reference to an inner product (and thus a metric) induced by the features. In particular, one can view the learned parameters of a Gaussian $\mathrm{\Theta}=(\bm{\mu},\mathbf{\Sigma})$ as defining an inner product over the input space given by ${\u27e8x,y\u27e9}_{\mathrm{\Theta}}={(x\bm{\mu})}^{\top}{\mathbf{\Sigma}}^{1}(y\bm{\mu})$. This in turn induces the Mahalanobis distance, which represents how a change in the input affects the features learned by the classifier. This metric is not necessarily aligned with the metric in which the adversary is constrained, the ${\mathrm{\ell}}_{2}$norm. Actually, we show that adversarial vulnerability arises exactly as a misalignment of these two metrics.
Theorem 1 (Adversarial vulnerability from misalignment).
Consider an adversary whose perturbation is determined by the “Lagrangian penalty” form of (12), i.e.
$$\underset{\delta}{\mathrm{max}}\mathrm{\ell}(x+\delta ;y\cdot \bm{\mu},\mathbf{\Sigma})C\cdot {\parallel \delta \parallel}_{2},$$ 
where $C\mathrm{\ge}\frac{\mathrm{1}}{{\sigma}_{m\mathit{}i\mathit{}n}\mathit{}\mathrm{(}{\mathrm{\Sigma}}_{\mathrm{*}}\mathrm{)}}$ is a constant trading off NLL minimization and the adversarial constraint^{15}^{15} 15 The constraint on $C$ is to ensure the problem is concave.. Then, the adversarial loss ${\mathrm{L}}_{a\mathit{}d\mathit{}v}$ incurred by the nonrobustly learned $\mathrm{(}\mathbf{\mu}\mathrm{,}\mathrm{\Sigma}\mathrm{)}$ is given by:
${\mathcal{L}}_{adv}(\mathrm{\Theta})\mathcal{L}(\mathrm{\Theta})$  $=\text{\mathit{t}\mathit{r}}\left[{\left(I+{\left(C\cdot {\mathbf{\Sigma}}_{*}I\right)}^{1}\right)}^{2}\right]d,$ 
and, for a fixed $\text{\mathit{t}\mathit{r}}\mathit{}\mathrm{(}{\mathrm{\Sigma}}_{\mathrm{*}}\mathrm{)}\mathrm{=}k$ the above is minimized by ${\mathrm{\Sigma}}_{\mathrm{*}}\mathrm{=}\frac{k}{d}\mathit{}\mathbf{I}$.
In fact, note that such a misalignment corresponds precisely to the existence of nonrobust features, as it indicates that “small” changes in the adversary’s metric along certain directions can cause large changes under the datadependent notion of distance established by the parameters. This is illustrated in Figure 4, where misalignment in the featureinduced metric is responsible for the presence of a nonrobust feature in the corresponding classification problem.
(2) Robust Learning.
The optimal (nonrobust) maximum likelihood estimate is $\mathrm{\Theta}={\mathrm{\Theta}}^{*}$, and thus the vulnerability for the standard MLE estimate is governed entirely by the true data distribution. The following theorem characterizes the behaviour of the learned parameters in the robust problem. ^{16}^{16} 16 Note: as discussed in Appendix E.3.3, we study a slight relaxation of (12) that approaches exactness exponentially fast as $d\to \mathrm{\infty}$. In fact, we can prove (Section E.3.4) that performing (sub)gradient descent on the inner maximization (also known as adversarial training \citepgoodfellow2015explaining,madry2018towards) yields exactly ${\mathrm{\Theta}}_{r}$. We find that as the perturbation budget $\epsilon $ is increased, the metric induced by the learned features mixes ${\mathrm{\ell}}_{2}$ and the metric induced by the features.
Theorem 2 (Robustly Learned Parameters).
Just as in the nonrobust case, ${\mathbf{\mu}}_{r}\mathrm{=}{\mathbf{\mu}}^{\mathrm{*}}$, i.e. the true mean is learned. For the robust covariance ${\mathrm{\Sigma}}_{r}$, there exists an ${\epsilon}_{\mathrm{0}}\mathrm{>}\mathrm{0}$, such that for any $\epsilon \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{\epsilon}_{\mathrm{0}}\mathrm{)}$,
$${\mathbf{\Sigma}}_{r}=\frac{1}{2}{\mathbf{\Sigma}}_{*}+\frac{1}{\lambda}\cdot \bm{I}+\sqrt{\frac{1}{\lambda}\cdot {\mathbf{\Sigma}}_{*}+\frac{1}{4}{\mathbf{\Sigma}}_{*}^{2}},\mathit{\text{where}}\mathit{\hspace{1em}\hspace{1em}}\mathrm{\Omega}\left(\frac{1+{\epsilon}^{1/2}}{{\epsilon}^{1/2}+{\epsilon}^{3/2}}\right)\le \lambda \le O\left(\frac{1+{\epsilon}^{1/2}}{{\epsilon}^{1/2}}\right).$$ 
The effect of robust optimization under an ${\mathrm{\ell}}_{2}$constrained adversary is visualized in Figure 4. As $\u03f5$ grows, the learned covariance becomes more aligned with identity. For instance, we can see that the classifier learns to be less sensitive in certain directions, despite their usefulness for natural classification.
(3) Gradient Interpretability.
\citettsipras2019robustness observe that gradients of robust models tend to look more semantically meaningful. It turns out that under our model, this behaviour arises as a natural consequence of Theorem 2. In particular, we show that the resulting robustly learned parameters cause the gradient of the linear classifier and the vector connecting the means of the two distributions to better align (in a worstcase sense) under the ${\mathrm{\ell}}_{2}$ inner product.
Theorem 3 (Gradient alignment).
Let $f\mathit{}\mathrm{(}x\mathrm{)}$ and ${f}_{r}\mathit{}\mathrm{(}x\mathrm{)}$ be monotonic classifiers based on the linear separator induced by standard and ${\mathrm{\ell}}_{\mathrm{2}}$robust maximum likelihood classification, respectively. The maximum angle formed between the gradient of the classifier (wrt input) and the vector connecting the classes can be smaller for the robust model:
$$\underset{\bm{\mu}}{\mathrm{min}}\frac{\u27e8\bm{\mu},{\nabla}_{x}{f}_{r}(x)\u27e9}{\parallel \bm{\mu}\parallel \cdot \parallel {\nabla}_{x}{f}_{r}(x)\parallel}>\underset{\bm{\mu}}{\mathrm{min}}\frac{\u27e8\bm{\mu},{\nabla}_{x}f(x)\u27e9}{\parallel \bm{\mu}\parallel \cdot \parallel {\nabla}_{x}f(x)\parallel}.$$ 
Figure 4 illustrates this phenomenon in the twodimensional case. With ${\mathrm{\ell}}_{2}$bounded adversarial training the gradient direction (perpendicular to the decision boundary) becomes increasingly aligned under the ${\mathrm{\ell}}_{2}$ inner product with the vector between the means ($\bm{\mu}$).
Discussion.
Our theoretical analysis suggests that rather than offering any quantitative classification benefits, a natural way to view the role of robust optimization is as enforcing a prior over the features learned by the classifier. In particular, training with an ${\mathrm{\ell}}_{2}$bounded adversary prevents the classifier from relying heavily on features which induce a metric dissimilar to the ${\mathrm{\ell}}_{2}$ metric. The strength of the adversary then allows for a tradeoff between the enforced prior, and the datadependent features.
Robustness and accuracy.
Note that in the setting described so far, robustness can be at odds with accuracy since robust training prevents us from learning the most accurate classifier (a similar conclusion is drawn in \citeptsipras2019robustness). However, we note that there are very similar settings where nonrobust features manifest themselves in the same way, yet a classifier with perfect robustness and accuracy is still attainable. Concretely, consider the distributions pictured in Figure 14 in Appendix D.10. It is straightforward to show that while there are many perfectly accurate classifiers, any standard loss function will learn an accurate yet nonrobust classifier. Only when robust training is employed does the classifier learn a perfectly accurate and perfectly robust decision boundary.
5 Related Work
Several models for explaining adversarial examples have been proposed in prior work, utilizing ideas ranging from finitesample overfitting to highdimensional statistical phenomena \citepgilmer2018adversarial,fawzi2018adversarial,ford2019adversarial,tanay2016boundary, shafahi2019are,mahloujifar2018curse, shamir2019simple,goodfellow2015explaining,bubeck2018adversarial. The key differentiating aspect of our model is that adversarial perturbations arise as wellgeneralizing, yet brittle, features, rather than statistical anomalies or effects of poor statistical concentration. In particular, adversarial vulnerability does not stem from using a specific model class or a specific training method, since standard training on the “robustified” data distribution of Section 3.1 leads to robust models. At the same time, as shown in Section 3.2, these nonrobust features are sufficient to learn a good standard classifier. We discuss the connection between our model and others in detail in Appendix A. We discuss additional related work in Appendix B.
6 Conclusion
In this work, we cast the phenomenon of adversarial examples as a natural consequence of the presence of highly predictive but nonrobust features in standard ML datasets. We provide support for this hypothesis by explicitly disentangling robust and nonrobust features in standard datasets, as well as showing that nonrobust features alone are sufficient for good generalization. Finally, we study these phenomena in more detail in a theoretical setting where we can rigorously study adversarial vulnerability, robust training, and gradient alignment.
Our findings prompt us to view adversarial examples as a fundamentally human phenomenon. In particular, we should not be surprised that classifiers exploit highly predictive features that happen to be nonrobust under a humanselected notion of similarity, given such features exist in realworld datasets. In the same manner, from the perspective of interpretability, as long as models rely on these nonrobust features, we cannot expect to have model explanations that are both humanmeaningful and faithful to the models themselves. Overall, attaining models that are robust and interpretable will require explicitly encoding human priors into the training process.
7 Acknowledgements
We thank Preetum Nakkiran for suggesting the experiment of Appendix D.9 (i.e. replicating Figure 3 but with targeted attacks). We also are grateful to the authors of \citetengstrom2019discussion (Chris Olah, Dan Hendrycks, Justin Gilmer, Reiichiro Nakano, Preetum Nakkiran, Gabriel Goh, Eric Wallace)—for their insights and efforts replicating, extending, and discussing our experimental results.
Work supported in part by the NSF grants CCF1553428, CCF1563880, CNS1413920, CNS1815221, IIS1447786, IIS1607189, the Microsoft Corporation, the Intel Corporation, the MITIBM Watson AI Lab research grant, and an Analog Devices Fellowship.
Appendix A Connections to and Disambiguation from Other Models
Here, we describe other models for adversarial examples and how they relate to the model presented in this paper.
Concentration of measure in highdimensions.
An orthogonal line of work \citepgilmer2018adversarial,fawzi2018adversarial, mahloujifar2018curse,shafahi2019are, argues that the high dimensionality of the input space can present fundamental barriers on classifier robustness. At a high level, one can show that, for certain data distributions, any decision boundary will be close to a large fraction of inputs and hence no classifier can be robust against small perturbations. While there might exist such fundamental barriers to robustly classifying standard datasets, this model cannot fully explain the situation observed in practice, where one can train (reasonably) robust classifiers on standard datasets \citepmadry2018towards,raghunathan2018certified, wong2018provable,xiao2019training,cohen2019certified.
Insufficient data.
\citetschmidt2018adversarially propose a theoretical model under which a single sample is sufficient to learn a good, yet nonrobust classifier, whereas learning a good robust classifier requires $O(\sqrt{d})$ samples. Under this model, adversarial examples arise due to insufficient information about the true data distribution. However, unless the adversary is strong enough (in which case no robust classifier exists), adversarial inputs cannot be utilized as inputs of the opposite class (as done in our experiments in Section 3.2). We note that our model does not explicitly contradict the main thesis of \citetschmidt2018adversarially. In fact, this thesis can be viewed as a natural consequence of our conceptual framework. In particular, since training models robustly reduces the effective amount of information in the training data (as nonrobust features are discarded), more samples should be required to generalize robustly.
Boundary Tilting.
\citettanay2016boundary introduce the “boundary tilting” model for adversarial examples, and suggest that adversarial examples are a product of overfitting. In particular, the model conjectures that “adversarial examples are possible because the class boundary extends beyond the submanifold of sample data and can be—under certain circumstances—lying close to it.” Consequently, the authors suggest that mitigating adversarial examples may be a matter of regularization and preventing finitesample overfitting. In contrast, our empirical results in Section 3.2 suggest that adversarial inputs consist of features inherent to the data distribution, since they can encode generalizing information about the target class.
Inspired by this hypothesis and concurrently to our work, \citetkim2019bridging present a simple classification task comprised of two Gaussian distributions in two dimensions. They experimentally show that the decision boundary tends to better align with the vector between the two means for robust models. This is a special case of our theoretical results in Section 4. (Note that this exact statement is not true beyond two dimensions, as discussed in Section 4.)
Test Error in Noise.
\citetfawzi2016robustness and \citetford2019adversarial argue that the adversarial robustness of a classifier can be directly connected to its robustness under (appropriately scaled) random noise. While this constitutes a natural explanation of adversarial vulnerability given the classifier robustness to noise, these works do not attempt to justify the source of the latter.
At the same time, recent work [lecuyer2018certified, cohen2019certified, ford2019adversarial] utilizes random noise during training or testing to construct adversarially robust classifiers. In the context of our framework, we can expect the added noise to disproportionately affect nonrobust features and thus hinder the model’s reliance on them.
Local Linearity.
\citetgoodfellow2015explaining suggest that the local linearity of DNNs is largely responsible for the existence of small adversarial perturbations. While this conjecture is supported by the effectiveness of adversarial attacks exploiting local linearity (e.g., FGSM \citepgoodfellow2015explaining), it is not sufficient to fully characterize the phenomena observed in practice. In particular, there exist adversarial examples that violate the local linearity of the classifier \citepmadry2018towards, while classifiers that are less linear do not exhibit greater robustness \citepathalye2018obfuscated.
Piecewiselinear decision boundaries.
\citetshamir2019simple prove that the geometric structure of the classifier’s decision boundaries can lead to sparse adversarial perturbations. However, this result does not take into account the distance to the decision boundary along these direction or feasibility constraints on the input domain. As a result, it cannot meaningfully distinguish between classifiers that are brittle to small adversarial perturbations and classifiers that are moderately robust.
Theoretical constructions which incidentally exploit nonrobust features.
\citetbubeck2018adversarial and \citetnakkiran2019adversarial propose theoretical models where the barrier to learning robust classifiers is, respectively, due to computational constraints or model complexity. In order to construct distributions that admit accurate yet nonrobust classifiers they (implicitly) utilize the concept of nonrobust features. Namely, they add a lowmagnitude signal to each input that encodes the true label. This allows a classifier to achieve perfect standard accuracy, but cannot be utilized in an adversarial setting as this signal is susceptible to small adversarial perturbations.
Appendix B Additional Related Work
We describe previously proposed models for the existence of adversarial examples in the previous section. Here we discuss other work that is methodologically or conceptually similar to ours.
Distillation.
The experiments performed in Section 3.1 can be seen as a form of distillation. There is a line of work, known as model distillation \citephinton2015distilling,furlanello2018born, bucilua2006model, where the goal is to train a new model to mimic another already trained model. This is typically achieved by adding some regularization terms to the loss in order to encourage the two models to be similar, often replacing training labels with some other target based on the already trained model. While it might be possible to successfully distill a robust model using these methods, our goal was to achieve it by only modifying the training set (leaving the training process unchanged), hence demonstrating that adversarial vulnerability is mainly a property of the dataset. Closer to our work is dataset distillation \citepwang2018dataset which considers the problem of reconstructing a classifier from an alternate dataset much smaller than the original training set. This method aims to produce inputs that directly encode the weights of the already trained model by ensuring that the classifier’s gradient with respect to these inputs approximates the desired weights. (As a result, the inputs constructed do not resemble natural inputs.) This approach is orthogonal to our goal since we are not interested in encoding the particular weights into the dataset but rather in imposing a structure to its features.
Adversarial Transferabiliy.
In our work, we posit that a potentially natural consequence of the existence of nonrobust features is adversarial transferability \citeppapernot2017practical,liu2017delving,papernot2016transferability. A recent line of work has considered this phenomenon from a theoretical perspective, confined to simple models, or unbounded perturbations [charles2019geometric, zou2017geometric]. \citettramer2017space study transferability empirically, by finding adversarial subspaces, (orthogonal vectors whose linear combinations are adversarial perturbations). The authors find that there is a significant overlap in the adversarial subspaces between different models, and identify this as a source of transferability. In our work, we provide a potential reason for this overlap—these directions correspond to nonrobust features utilized by models in a similar manner.
Universal Adversarial Perturbations
\citetmoosavi2017universal construct perturbations that can cause misclassification when applied to multiple different inputs. More recently, \citetjetley2018friends discover input patterns that are meaningless to humans and can induce misclassification, while at the same time being essential for standard classification. These findings can be naturally cast into our framework by considering these patterns as nonrobust features, providing further evidence about their pervasiveness.
Manipulating dataset features
\citetding2018on perform synthetic transformations on the dataset (e.g., image saturation) and study the performance of models on the transformed dataset under standard and robust training. While this can be seen as a method of restricting the features available to the model during training, it is unclear how well these models would perform on the standard test set. \citetgeirhos2018imagenettrained aim to quantify the relative dependence of standard models on shape and texture information of the input. They introduce a version of ImageNet where texture information has been removed and observe an improvement to certain corruptions.
Appendix C Experimental Setup
C.1 Datasets
For our experimental analysis, we use the CIFAR10 [krizhevsky2009learning] and (restricted) ImageNet [russakovsky2015imagenet] datasets. Attaining robust models for the complete ImageNet dataset is known to be a challenging problem, both due to the hardness of the learning problem itself, as well as the computational complexity. We thus restrict our focus to a subset of the dataset which we denote as restricted ImageNet. To this end, we group together semantically similar classes from ImageNet into 9 superclasses shown in Table 2. We train and evaluate only on examples corresponding to these classes.
Class  Corresponding ImageNet Classes 

“Dog”  151 to 268 
“Cat”  281 to 285 
“Frog”  30 to 32 
“Turtle”  33 to 37 
“Bird”  80 to 100 
“Primate”  365 to 382 
“Fish”  389 to 397 
“Crab”  118 to 121 
“Insect”  300 to 319 
C.2 Models
We use the ResNet50 architecture for our baseline standard and adversarially trained classifiers on CIFAR10 and restricted ImageNet. For each model, we grid search over three learning rates ($0.1$, $0.01$, $0.05$), two batch sizes ($128$, $256$) including/not including a learning rate drop (a single order of magnitude) and data augmentation. We use the standard training parameters for the remaining parameters. The hyperparameters used for each model are given in Table 3.
Dataset  LR  Batch Size  LR Drop  Data Aug.  Momentum  Weight Decay 

${\widehat{\mathcal{D}}}_{R}$ (CIFAR)  0.1  128  Yes  Yes  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{R}$ (Restricted ImageNet)  0.01  128  No  Yes  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{NR}$ (CIFAR)  0.1  128  Yes  Yes  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{rand}$ (CIFAR)  0.01  128  Yes  Yes  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{rand}$ (Restricted ImageNet)  0.01  256  No  No  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{det}$ (CIFAR)  0.1  128  Yes  No  0.9  $5\cdot {10}^{4}$ 
${\widehat{\mathcal{D}}}_{det}$ (Restricted ImageNet)  0.05  256  No  No  0.9  $5\cdot {10}^{4}$ 
C.3 Adversarial training
To obtain robust classifiers, we employ the adversarial training methodology proposed in [madry2018towards]. Specifically, we train against a projected gradient descent (PGD) adversary constrained in ${\mathrm{\ell}}_{2}$norm starting from the original image. Following \citetmadry2018towards we normalize the gradient at each step of PGD to ensure that we move a fixed distance in ${\mathrm{\ell}}_{2}$norm per step. Unless otherwise specified, we use the values of $\u03f5$ provided in Table 4 to train/evaluate our models. We used $7$ steps of PGD with a step size of $\epsilon /5$.
Adversary  CIFAR10  Restricted Imagenet 
${\mathrm{\ell}}_{2}$  0.5  3 
C.4 Constructing a Robust Dataset
In Section 3.1, we describe a procedure to construct a dataset that contains features relevant only to a given (standard/robust) model. To do so, we optimize the training objective in (6). Unless otherwise specified, we initialize ${x}_{r}$ as a different randomly chosen sample from the training set. (For the sake of completeness, we also try initializing with a Gaussian noise instead as shown in Table 7.) We then perform normalized gradient descent (${\mathrm{\ell}}_{2}$norm of gradient is fixed to be constant at each step). At each step we clip the input ${x}_{r}$ to in the $[0,1]$ range so as to ensure that it is a valid image. Details on the optimization procedure are shown in Table 5. We provide the pseudocode for the construction in Figure 5.
CIFAR10  Restricted Imagenet  

step size  0.1  1 
iterations  1000  2000 
C.5 Nonrobust features suffice for standard classification
To construct the dataset as described in Section 3.2, we use the standard projected gradient descent (PGD) procedure described in [madry2018towards] to construct an adversarial example for a given input from the dataset (7). Perturbations are constrained in ${\mathrm{\ell}}_{2}$norm while each PGD step is normalized to a fixed step size. The details for our PGD setup are described in Table 6. We provide pseudocode in Figure 6.
Attack Parameters  CIFAR10  Restricted Imagenet 

$\epsilon $  0.5  3 
step size  0.1  0.1 
iterations  100  100 
Appendix D Omitted Experiments and Figures
D.1 Detailed evaluation of models trained on “robust” dataset
In Section 3.1, we generate a “robust” training set by restricting the dataset to only contain features relevant to a robust model (robust dataset) or a standard model (nonrobust dataset). This is performed by choosing either a random input from the training set or random noise^{17}^{17} 17 We use 10k steps to construct the dataset from noise, instead to using 1k steps done when the input is a different training set image (cf. Table 5). and then performing the optimization procedure described in (6). The performance of these classifiers along with various baselines is shown in Table 7. We observe that while the robust dataset constructed from noise resembles the original, the corresponding nonrobust does not (Figure 7). This also leads to suboptimal performance of classifiers trained on this dataset (only $46\%$ standard accuracy) potentially due to a distributional shift.
Robust Accuracy  
Model  Accuracy  $\epsilon =0.25$  $\epsilon =0.5$ 
Standard Training  95.25 %  4.49%  0.0% 
Robust Training  90.83%  82.48%  70.90% 
Trained on nonrobust dataset (constructed from images)  87.68%  0.82%  0.0% 
Trained on nonrobust dataset (constructed from noise)  45.60%  1.50%  0.0% 
Trained on robust dataset (constructed from images)  85.40%  48.20 %  21.85% 
Trained on robust dataset (constructed from noise)  84.10%  48.27 %  29.40% 
D.2 Adversarial evaluation
To verify the robustness of our classifiers trained on the ‘robust” dataset, we evaluate them with strong attacks \citepcarlini2019on. In particular, we try up to 2500 steps of projected gradient descent (PGD), increasing steps until the accuracy plateaus, and also try the CW${\mathrm{\ell}}_{2}$ loss function \citepcarlini2017towards with 1000 steps. For each attack we search over step size. We find that over all attacks and step sizes, the accuracy of the model does not drop by more than 2%, and plateaus at $48.27\%$ for both PGD and CW${\mathrm{\ell}}_{2}$ (the value given in Figure 2). We show a plot of accuracy in terms of the number of PGD steps used in Figure 8.
D.3 Performance of “robust” training and test set
In Section 3.1, we observe that an ERM classifier trained on a “robust” training dataset ${\widehat{\mathcal{D}}}_{R}$ (obtained by restricting features to those relevant to a robust model) attains nontrivial robustness (cf. Figure 1 and Table 7). In Table 8, we evaluate the adversarial accuracy of the model on the corresponding robust training set (the samples which the classifier was trained on) and test set (unseen samples from ${\widehat{\mathcal{D}}}_{R}$, based on the test set). We find that the drop in robustness comes from a combination of generalization gap (the robustness on the ${\widehat{\mathcal{D}}}_{R}$ test set is worse than it is on the robust training set) and distributional shift (the model performs better on the robust test set consisting of unseen samples from ${\widehat{\mathcal{D}}}_{R}$ than on the standard test set containing unseen samples from $\mathcal{D}$).
Dataset  Robust Accuracy 

Robust training set  77.33% 
Robust test set  62.49% 
Standard test set  48.27% 
D.4 Classification based on nonrobust features
Figure 9 shows sample images from $\mathcal{D}$, ${\widehat{\mathcal{D}}}_{rand}$ and ${\widehat{\mathcal{D}}}_{det}$ constructed using a standard (nonrobust) ERM classifier, and an adversarially trained (robust) classifier.
In Table 9, we repeat the experiments in Table 1 based on datasets constructed using a robust model. Note that using a robust model to generate the ${\widehat{\mathcal{D}}}_{det}$ and ${\widehat{\mathcal{D}}}_{rand}$ datasets will not result in nonrobust features that are strongly predictive of $t$ (since the prediction of the classifier will not change). Thus, training a model on these datasets leads to poor accuracy on the standard test set from $\mathcal{D}$.
Observe from Figure 10 that models trained on datasets derived from the robust model show a decline in test accuracy as training progresses. In Table 9, the accuracy numbers reported correspond to the last iteration, and not the best performance. This is because we have no way to crossvalidate in a meaningful way as the validation set itself comes from ${\widehat{\mathcal{D}}}_{rand}$ or ${\widehat{\mathcal{D}}}_{det}$, and not from the true data distribution $D$. Thus, validation accuracy will not be predictive of the true test accuracy, and thus will not help determine when to early stop.
Model used to construct dataset  Dataset used in training  

$\mathcal{D}$  ${\widehat{\mathcal{D}}}_{rand}$  ${\widehat{\mathcal{D}}}_{det}$  
Robust  95.3%  25.2 %  5.8% 
Standard  95.3%  63.3 %  43.7% 
D.5 Accuracy curves
D.6 Performance of ERM classifiers on relabeled test set
In Table 10), we evaluate the performance of classifiers trained on ${\widehat{\mathcal{D}}}_{det}$ on both the original test set drawn from $\mathcal{D}$, and the test set relabelled using $t(y)=(y+1)modC$. Observe that the classifier trained on ${\widehat{\mathcal{D}}}_{det}$ constructed using a robust model actually ends up learning permuted labels based on robust features (indicated by high test accuracy on the relabelled test set).
Model used to construct training dataset for ${\widehat{\mathrm{D}}}_{d\mathbf{}e\mathbf{}t}$  Dataset used in testing  

$\mathcal{D}$  relabelled$\mathcal{D}$  
Standard  43.7%  16.2% 
Robust  5.8%  65.5% 
D.7 Generalization to CIFAR10.1
\citetrecht2018cifar10 have constructed an unseen but distributionshifted test set for CIFAR10. They show that for many previously proposed models, accuracy on the CIFAR10.1 test set can be predicted as a linear function of performance on the CIFAR10 test set.
As a sanity check (and a safeguard against any potential adaptive overfitting to the test set via hyperparameters, historical test set reuse, etc.) we note that the classifiers trained on ${\widehat{\mathcal{D}}}_{det}$ and ${\widehat{\mathcal{D}}}_{rand}$ achieve $44\%$ and $55\%$ generalization on the CIFAR10.1 test set, respectively. This demonstrates nontrivial generalization, and actually perform better than the linear fit would predict (given their accuracies on the CIFAR10 test set).
D.8 Omitted Results for Restricted ImageNet
D.9 Targeted Transferability
D.10 Robustness vs. Accuracy
Appendix E Gaussian MLE under Adversarial Perturbation
In this section, we develop a framework for studying nonrobust features by studying the problem of maximum likelihood classification between two Gaussian distributions. We first recall the setup of the problem, then present the main theorems from Section 4. First we build the techniques necessary for their proofs.
E.1 Setup
We consider the setup where a learner receives labeled samples from two distributions, $\mathcal{N}({\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*})$, and $\mathcal{N}({\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*})$. The learner’s goal is to be able to classify new samples as being drawn from ${\mathcal{D}}_{1}$ or ${\mathcal{D}}_{2}$ according to a maximum likelihood (MLE) rule.
A simple coupling argument demonstrates that this problem can actually be reduced to learning the parameters $\widehat{\bm{\mu}}$, $\widehat{\mathbf{\Sigma}}$ of a single Gaussian $\mathcal{N}({\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*})$, and then employing a linear classifier with weight ${\widehat{\mathbf{\Sigma}}}^{1}\widehat{\bm{\mu}}$. In the standard setting, maximum likelihoods estimation learns the true parameters, ${\bm{\mu}}_{*}$ and ${\mathbf{\Sigma}}_{*}$, and thus the learned classification rule is $C(x)=\mathrm{\U0001d7d9}\{{x}^{\top}{\mathbf{\Sigma}}^{1}\bm{\mu}>0\}$.
In this work, we consider the problem of adversarially robust maximum likelihood estimation. In particular, rather than simply being asked to classify samples, the learner will be asked to classify adversarially perturbed samples $x+\delta $, where $\delta \in \mathrm{\Delta}$ is chosen to maximize the loss of the learner. Our goal is to derive the parameters $\bm{\mu},\mathbf{\Sigma}$ corresponding to an adversarially robust maximum likelihood estimate of the parameters of $\mathcal{N}({\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*})$. Note that since we have access to ${\mathbf{\Sigma}}_{*}$ (indeed, the learner can just run nonrobust MLE to get access), we work in the space where ${\mathbf{\Sigma}}^{*}$ is a diagonal matrix, and we restrict the learned covariance $\mathbf{\Sigma}$ to the set of diagonal matrices.
Notation.
We denote the parameters of the sampled Gaussian by ${\bm{\mu}}_{*}\in {\mathbb{R}}^{d}$, and ${\mathbf{\Sigma}}_{*}\in \{\text{diag}(\bm{u})\bm{u}\in {\mathbb{R}}^{d}\}$. We use ${\sigma}_{min}(X)$ to represent the smallest eigenvalue of a square matrix $X$, and $\mathrm{\ell}(\cdot ;x)$ to represent the Gaussian negative loglikelihood for a single sample $x$. For convenience, we often use $\bm{v}=x\bm{\mu}$, and $R=\parallel {\bm{\mu}}_{*}\parallel $. We also define the $\mathrm{\u29b0}$ operator to represent the vectorization of the diagonal of a matrix. In particular, for a matrix $X\in {\mathbb{R}}^{d\times d}$, we have that ${X}_{\mathrm{\u29b0}}=v\in {\mathbb{R}}^{d}$ if ${v}_{i}={X}_{ii}$.
E.2 Outline and Key Results
We focus on the case where $\mathrm{\Delta}={\mathcal{B}}_{2}(\u03f5)$ for some $\u03f5>0$, i.e. the ${\mathrm{\ell}}_{2}$ ball, corresponding to the following minimax problem:
$$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\underset{\delta :\parallel \delta \parallel =\epsilon}{\mathrm{max}}\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+\delta )\right]$$  (13) 
We first derive the optimal adversarial perturbation for this setting (Section E.3.1), and prove Theorem 1 (Section E.3.2). We then propose an alternate problem, in which the adversary picks a linear operator to be applied to a fixed vector, rather than picking a specific perturbation vector (Section E.3.3). We argue via Gaussian concentration that the alternate problem is indeed reflective of the original model (and in particular, the two become equivalent as $d\to \mathrm{\infty}$). In particular, we propose studying the following in place of (13):
$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}\underset{M\in \mathcal{M}}{\mathrm{max}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+M(x\bm{\mu}))\right]$  (14)  
$\text{where}\mathcal{M}=\{M\in {\mathbb{R}}^{d\times d}:{M}_{ij}=0\forall i\ne j,{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[{\parallel M\bm{v}\parallel}_{2}^{2}\right]={\u03f5}^{2}\}.$ 
Our goal is to characterize the behavior of the robustly learned covariance $\mathbf{\Sigma}$ in terms of the true covariance matrix ${\mathbf{\Sigma}}_{*}$ and the perturbation budget $\epsilon $. The proof is through Danskin’s Theorem, which allows us to use any maximizer of the inner problem ${M}^{*}$ in computing the subgradient of the inner minimization. After showing the applicability of Danskin’s Theorem (Section E.3.4) and then applying it (Section E.3.5) to prove our main results (Section E.3.7). Our three main results, which we prove in the following section, are presented below.
First, we consider a simplified version of (13), in which the adversary solves a maximization with a fixed Lagrangian penalty, rather than a hard ${\mathrm{\ell}}_{2}$ constraint. In this setting, we show that the loss contributed by the adversary corresponds to a misalignment between the data metric (the Mahalanobis distance, induced by ${\mathbf{\Sigma}}^{1}$), and the ${\mathrm{\ell}}_{2}$ metric: See 1
We then return to studying (14), where we provide upper and lower bounds on the learned robust covariance matrix $\mathbf{\Sigma}$: See 2
Finally, we show that in the worst case over mean vectors ${\bm{\mu}}_{*}$, the gradient of the adversarial robust classifier aligns more with the interclass vector: See 3
E.3 Proofs
In the first section, we have shown that the classification between two Gaussian distributions with identical covariance matrices centered at ${\bm{\mu}}^{*}$ and ${\bm{\mu}}^{*}$ can in fact be reduced to learning the parameters of a single one of these distributions.
Thus, in the standard setting, our goal is to solve the following problem:
$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x)\right]:=\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{log}\left(\mathcal{N}(\bm{\mu},\mathbf{\Sigma};x)\right)\right].$ 
Note that in this setting, one can simply find differentiate $\mathrm{\ell}$ with respect to both $\bm{\mu}$ and $\mathbf{\Sigma}$, and obtain closed forms for both (indeed, these closed forms are, unsurprisingly, ${\bm{\mu}}^{*}$ and ${\mathbf{\Sigma}}^{*}$). Here, we consider the existence of a malicious adversary who is allowed to perturb each sample point $x$ by some $\delta $. The goal of the adversary is to maximize the same loss that the learner is minimizing.
E.3.1 Motivating example: ${\mathrm{\ell}}_{2}$constrained adversary
We first consider, as a motivating example, an ${\mathrm{\ell}}_{2}$constrained adversary. That is, the adversary is allowed to perturb each sampled point by $\delta :{\parallel \delta \parallel}_{2}=\epsilon $. In this case, the minimax problem being solved is the following:
$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\underset{\parallel \delta \parallel =\epsilon}{\mathrm{max}}\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+\delta )\right].$  (15) 
The following Lemma captures the optimal behaviour of the adversary:
Lemma 1.
In the minimax problem captured in (15) (and earlier in (13)), the optimal adversarial perturbation ${\delta}^{\mathrm{*}}$ is given by
$${\delta}^{*}={\left(\lambda \bm{I}{\mathbf{\Sigma}}^{1}\right)}^{1}{\mathbf{\Sigma}}^{1}\bm{v}={\left(\lambda \mathbf{\Sigma}\bm{I}\right)}^{1}\bm{v},$$  (16) 
where $\mathbf{v}\mathrm{=}x\mathrm{}\mathbf{\mu}$, and $\lambda $ is set such that ${\mathrm{\parallel}{\delta}^{\mathrm{*}}\mathrm{\parallel}}_{\mathrm{2}}\mathrm{=}\epsilon $.
Proof.
In this context, we can solve the inner maximization problem with Lagrange multipliers. In the following we write $\mathrm{\Delta}={\mathcal{B}}_{2}(\epsilon )$ for brevity, and discard terms not containing $\delta $ as well as constant factors freely:
$\mathrm{arg}\underset{\delta \in \mathrm{\Delta}}{\mathrm{max}}\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+\delta )$  $=\mathrm{arg}\underset{\delta \in \mathrm{\Delta}}{\mathrm{max}}{(x+\delta \bm{\mu})}^{\top}{\mathbf{\Sigma}}^{1}(x+\delta \bm{\mu})$  
$=\mathrm{arg}\underset{\delta \in \mathrm{\Delta}}{\mathrm{max}}{(x\bm{\mu})}^{\top}{\mathbf{\Sigma}}^{1}(x\bm{\mu})+2{\delta}^{\top}{\mathbf{\Sigma}}^{1}(x\bm{\mu})+{\delta}^{\top}{\mathbf{\Sigma}}^{1}\delta $  
$=\mathrm{arg}\underset{\delta \in \mathrm{\Delta}}{\mathrm{max}}{\delta}^{\top}{\mathbf{\Sigma}}^{1}(x\bm{\mu})+{\displaystyle \frac{1}{2}}{\delta}^{\top}{\mathbf{\Sigma}}^{1}\delta .$  (17) 
Now we can solve (17) using the aforementioned Lagrange multipliers. In particular, note that the maximum of (17) is attained at the boundary of the ${\mathrm{\ell}}_{2}$ ball $\mathrm{\Delta}$. Thus, we can solve the following system of two equations to find $\delta $, rewriting the norm constraint as $\frac{1}{2}{\parallel \delta \parallel}_{2}^{2}=\frac{1}{2}{\epsilon}^{2}$:
$$\{\begin{array}{cc}{\nabla}_{\delta}\left({\delta}^{\top}{\mathbf{\Sigma}}^{1}(x\bm{\mu})+\frac{1}{2}{\delta}^{\top}{\mathbf{\Sigma}}^{1}\delta \right)=\lambda {\nabla}_{\delta}\left({\parallel \delta \parallel}_{2}^{2}{\epsilon}^{2}\right)\u27f9{\mathbf{\Sigma}}^{1}(x\bm{\mu})+{\mathbf{\Sigma}}^{1}\delta =\lambda \delta \hfill & \\ {\parallel \delta \parallel}_{2}^{2}={\epsilon}^{2}.\hfill & \end{array}$$  (18) 
For clarity, we write $\bm{v}=x\bm{\mu}$: then, combining the above, we have that
$${\delta}^{*}={\left(\lambda \bm{I}{\mathbf{\Sigma}}^{1}\right)}^{1}{\mathbf{\Sigma}}^{1}\bm{v}={\left(\lambda \mathbf{\Sigma}\bm{I}\right)}^{1}\bm{v},$$  (19) 
our final result for the maximizer of the inner problem, where $\lambda $ is set according to the norm constraint. ∎
E.3.2 Variant with Fixed Lagrangian (Theorem 1)
To simplify the analysis of Theorem 1, we consider a version of (15) with a fixed Lagrangian penalty, rather than a norm constraint:
$$\mathrm{max}\mathrm{\ell}(x+\delta ;y\cdot \bm{\mu},\mathbf{\Sigma})C\cdot {\parallel \delta \parallel}_{2}.$$ 
Note then, that by Lemma 1, the optimal perturbation ${\delta}^{*}$ is given by
$${\delta}^{*}={\left(C\mathbf{\Sigma}\bm{I}\right)}^{1}.$$ 
Proof.
We begin by expanding the Gaussian negative loglikelihood for the relaxed problem:
${\mathcal{L}}_{adv}(\mathrm{\Theta})\mathcal{L}(\mathrm{\Theta})$  $={\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[2\cdot {\bm{v}}^{\top}{\left(C\cdot \mathbf{\Sigma}\bm{I}\right)}^{\top}{\mathbf{\Sigma}}^{1}\bm{v}+{\bm{v}}^{\top}{\left(C\cdot \mathbf{\Sigma}\bm{I}\right)}^{\top}{\mathbf{\Sigma}}^{1}{\left(C\cdot \mathbf{\Sigma}\bm{I}\right)}^{1}\bm{v}\right]$  
$={\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[2\cdot {\bm{v}}^{\top}{\left(C\cdot \mathbf{\Sigma}\mathbf{\Sigma}\mathbf{\Sigma}\right)}^{1}\bm{v}+{\bm{v}}^{\top}{\left(C\cdot \mathbf{\Sigma}\bm{I}\right)}^{\top}{\mathbf{\Sigma}}^{1}{\left(C\cdot \mathbf{\Sigma}\bm{I}\right)}^{1}\bm{v}\right]$ 
Recall that we are considering the vulnerability at the MLE parameters ${\bm{\mu}}^{*}$ and ${\mathbf{\Sigma}}^{*}$:
${\mathcal{L}}_{adv}(\mathrm{\Theta})\mathcal{L}(\mathrm{\Theta})$  $={\mathbb{E}}_{\bm{v}\sim \mathcal{N}(0,I)}[2\cdot {\bm{v}}^{\top}{\mathbf{\Sigma}}_{*}^{1/2}{(C\cdot {\mathbf{\Sigma}}_{*}^{2}{\mathbf{\Sigma}}_{*})}^{1}{\mathbf{\Sigma}}_{*}^{1/2}\bm{v}$  
$\mathrm{\hspace{1em}\hspace{1em}}+{\bm{v}}^{\top}{\mathbf{\Sigma}}_{*}^{1/2}{(C\cdot {\mathbf{\Sigma}}_{*}\bm{I})}^{\top}{\mathbf{\Sigma}}_{*}^{1}{(C\cdot {\mathbf{\Sigma}}_{*}\bm{I})}^{1}{\mathbf{\Sigma}}_{*}^{1/2}\bm{v}]$  
$={\mathbb{E}}_{\bm{v}\sim \mathcal{N}(0,I)}\left[2\cdot {\bm{v}}^{\top}{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\bm{v}+{\bm{v}}^{\top}{\mathbf{\Sigma}}_{*}^{1/2}{\left({C}^{2}{\mathbf{\Sigma}}_{*}^{3}2C\cdot {\mathbf{\Sigma}}_{*}^{2}+{\mathbf{\Sigma}}_{*}\right)}^{1}{\mathbf{\Sigma}}_{*}^{1/2}\bm{v}\right]$  
$={\mathbb{E}}_{\bm{v}\sim \mathcal{N}(0,I)}\left[2\cdot {\bm{v}}^{\top}{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\bm{v}+{\bm{v}}^{\top}{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{2}\bm{v}\right]$  
$={\mathbb{E}}_{\bm{v}\sim \mathcal{N}(0,I)}\left[{\parallel v\parallel}_{2}^{2}+{\bm{v}}^{\top}\bm{I}\bm{v}+2\cdot {\bm{v}}^{\top}{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\bm{v}+{\bm{v}}^{\top}{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{2}\bm{v}\right]$  
$={\mathbb{E}}_{\bm{v}\sim \mathcal{N}(0,I)}\left[{\parallel v\parallel}_{2}^{2}+{\bm{v}}^{\top}{\left(\bm{I}+{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\right)}^{2}\bm{v}\right]$  
$=\text{tr}\left[{\left(\bm{I}+{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\right)}^{2}\right]d$ 
This shows the first part of the theorem. It remains to show that for a fixed $k=\text{tr}({\mathbf{\Sigma}}_{*})$, the adversarial risk is minimized by ${\mathbf{\Sigma}}_{*}=\frac{k}{d}\bm{I}$:
$\underset{{\mathbf{\Sigma}}_{*}}{\mathrm{min}}{\mathcal{L}}_{adv}(\mathrm{\Theta})\mathcal{L}(\mathrm{\Theta})$  $=\underset{{\mathbf{\Sigma}}_{*}}{\mathrm{min}}\text{tr}\left[{\left(\bm{I}+{\left(C\cdot {\mathbf{\Sigma}}_{*}\bm{I}\right)}^{1}\right)}^{2}\right]$  
$=\underset{\{{\sigma}_{i}\}}{\mathrm{min}}{\displaystyle \sum _{i=1}^{d}}{\left(1+{\displaystyle \frac{1}{C\cdot {\sigma}_{i}1}}\right)}^{2},$ 
where $\{{\sigma}_{i}\}$ are the eigenvalues of ${\mathbf{\Sigma}}_{*}$. Now, we have that $\sum {\sigma}_{i}=k$ by assumption, so by optimality conditions, we have that ${\mathbf{\Sigma}}_{*}$ minimizes the above if ${\nabla}_{\{{\sigma}_{i}\}}\propto \overrightarrow{1}$, i.e. if ${\nabla}_{{\sigma}_{i}}={\nabla}_{{\sigma}_{j}}$ for all $i,j$. Now,
${\nabla}_{{\sigma}_{i}}$  $=2\cdot \left(1+{\displaystyle \frac{1}{C\cdot {\sigma}_{i}1}}\right)\cdot {\displaystyle \frac{C}{{\left(C\cdot {\sigma}_{i}1\right)}^{2}}}$  
$=2\cdot {\displaystyle \frac{{C}^{2}\cdot {\sigma}_{i}}{{(C\cdot {\sigma}_{i}1)}^{3}}}.$ 
Then, by solving analytically, we find that
$2\cdot {\displaystyle \frac{{C}^{2}\cdot {\sigma}_{i}}{{(C\cdot {\sigma}_{i}1)}^{3}}}=2\cdot {\displaystyle \frac{{C}^{2}\cdot {\sigma}_{j}}{{(C\cdot {\sigma}_{j}1)}^{3}}}$ 
admits only one real solution, ${\sigma}_{i}={\sigma}_{j}$. Thus, ${\mathbf{\Sigma}}_{*}\propto \bm{I}$. Scaling to satisfy the trace constraint yields ${\mathbf{\Sigma}}_{*}=\frac{k}{d}\bm{I}$, which concludes the proof. ∎
E.3.3 Real objective
Our motivating example (Section E.3.1) demonstrates that the optimal perturbation for the adversary in the ${\mathrm{\ell}}_{2}$constrained case is actually a linear function of $\bm{v}$, and in particular, that the optimal perturbation can be expressed as $D\bm{v}$ for a diagonal matrix $D$. Note, however, that the problem posed in (15) is not actually a minimax problem, due to the presence of the expectation between the outer minimization and the inner maximization. Motivated by this and (19), we define the following robust problem:
$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}\underset{M\in \mathcal{M}}{\mathrm{max}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+M\bm{v})\right],$  (20)  
$\text{where}\mathcal{M}=\{M\in {\mathbb{R}}^{d\times d}:{M}_{ij}=0\forall i\ne j,{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[{\parallel M\bm{v}\parallel}_{2}^{2}\right]={\u03f5}^{2}\}.$ 
First, note that this objective is slightly different from that
of (15).
In the motivating example, $\delta $ is constrained to always have $\epsilon $norm,
and thus is normalizer on a persample basis inside of the expectation. In contrast,
here the classifier is concerned with being robust to perturbations that are linear
in $\bm{v}$, and of ${\epsilon}^{2}$ squared norm in expectation.
Note, however, that via the result of \citetlaurent2000adaptive showing strong concentration for the norms of Gaussian random variables, in high dimensions this bound on expectation has a corresponding highprobability bound on the norm. In particular, this implies that as $d\to \mathrm{\infty}$, ${\parallel M\bm{v}\parallel}_{2}=\epsilon $ almost surely, and thus the problem becomes identical to that of (15). We now derive the optimal $M$ for a given $(\bm{\mu},\mathbf{\Sigma})$:
Lemma 2.
Consider the minimax problem described by (20), i.e.
$\underset{\bm{\mu},\mathbf{\Sigma}}{\mathrm{min}}\underset{M\in \mathcal{M}}{\mathrm{max}}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{\ell}(\bm{\mu},\mathbf{\Sigma};x+M\bm{v})\right].$ 
Then, the optimal action ${M}^{\mathrm{*}}$ of the inner maximization problem is given by
$$M={\left(\lambda \mathbf{\Sigma}\bm{I}\right)}^{1},$$  (21) 
where again $\lambda $ is set so that $M\mathrm{\in}\mathrm{M}$.
Proof.
We accomplish this in a similar fashion to what was done for ${\delta}^{*}$, using Lagrange multipliers:
${\nabla}_{M}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[{\bm{v}}^{\top}M{\mathbf{\Sigma}}^{1}\bm{v}+{\displaystyle \frac{1}{2}}{\bm{v}}^{\top}M{\mathbf{\Sigma}}^{1}M\bm{v}\right]$  $=\lambda {\nabla}_{M}{\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[{\parallel M\bm{v}\parallel}_{2}^{2}{\epsilon}^{2}\right]$  
${\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[{\mathbf{\Sigma}}^{1}\bm{v}{\bm{v}}^{\top}+{\mathbf{\Sigma}}^{1}M\bm{v}{\bm{v}}^{\top}\right]$  $={\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\lambda M\bm{v}{\bm{v}}^{\top}\right]$  
${\mathbf{\Sigma}}^{1}{\mathbf{\Sigma}}^{*}+{\mathbf{\Sigma}}^{1}M{\mathbf{\Sigma}}^{*}$  $=\lambda M{\mathbf{\Sigma}}^{*}$  
$M$  $={\left(\lambda \mathbf{\Sigma}\bm{I}\right)}^{1},$ 
where $\lambda $ is a constant depending on $\mathbf{\Sigma}$ and $\bm{\mu}$ enforcing the expected squarednorm constraint. ∎
Indeed, note that the optimal $M$ for the adversary takes a nearidentical form to the optimal $\delta $ (19), with the exception that $\lambda $ is not sampledependent but rather varies only with the parameters.
E.3.4 Danskin’s Theorem
The main tool in proving our key results is Danskin’s Theorem \citepdanskin1967theory, a powerful theorem from minimax optimization which contains the following key result:
Theorem 4 (Danskin’s Theorem).
Suppose $\varphi \mathit{}\mathrm{(}x\mathrm{,}z\mathrm{)}\mathrm{:}\mathrm{R}\mathrm{\times}Z\mathrm{\to}\mathrm{R}$ is a continuous function of two arguments, where $Z\mathrm{\subset}{\mathrm{R}}^{m}$ is compact. Define $f\mathit{}\mathrm{(}x\mathrm{)}\mathrm{=}{\mathrm{max}}_{z\mathrm{\in}Z}\mathit{}\varphi \mathit{}\mathrm{(}x\mathrm{,}z\mathrm{)}$. Then, if for every $z\mathrm{\in}Z$, $\varphi \mathit{}\mathrm{(}x\mathrm{,}z\mathrm{)}$ is convex and differentiable in $x$, and $\frac{\mathrm{\partial}\mathit{}\varphi}{\mathrm{\partial}\mathit{}x}$ is continuous:
The subdifferential of $f\mathit{}\mathrm{(}x\mathrm{)}$ is given by
$$\partial f(x)=\mathrm{conv}\{\frac{\partial \varphi (x,z)}{\partial x}:z\in {Z}_{0}(x)\},$$ 
where $\mathrm{conv}\mathit{}\mathrm{(}\mathrm{\cdot}\mathrm{)}$ represents the convex hull operation, and ${Z}_{\mathrm{0}}$ is the set of maximizers defined as
$${Z}_{0}(x)=\{\overline{z}:\varphi (x,\overline{z})=\underset{z\in Z}{\mathrm{max}}\varphi (x,z)\}.$$ 
In short, given a minimax problem of the form ${\mathrm{min}}_{x}{\mathrm{max}}_{y\in C}f(x,y)$ where $C$ is a compact set, if $f(\cdot ,y)$ is convex for all values of $y$, then rather than compute the gradient of $g(x):={\mathrm{max}}_{y\in C}f(x,y)$, we can simply find a maximizer ${y}^{*}$ for the current parameter $x$; Theorem 4 ensures that ${\nabla}_{x}f(x,{y}^{*})\in {\partial}_{x}g(x)$. Note that $\mathcal{M}$ is trivially compact (by the HeineBorel theorem), and differentiability/continuity follow rather straightforwardly from our reparameterization (c.f. (22)), and so it remains to show that the outer minimization is convex for any fixed $M$.
Convexity of the outer minimization.
Note that even in the standard case (i.e. nonadversarial), the Gaussian negative loglikelihood is not convex with respect to $(\bm{\mu},\mathbf{\Sigma})$. Thus, rather than proving convexity of this function directly, we employ the parameterization used by [daskalakis2019efficient]: in particular, we write the problem in terms of $\bm{T}={\mathbf{\Sigma}}^{1}$ and $\bm{m}={\mathbf{\Sigma}}^{1}\bm{\mu}$. Under this parameterization, we show that the robust problem is convex for any fixed $M$.
Lemma 3.
Under the aforementioned parameterization of $\mathbf{T}\mathrm{=}{\mathrm{\Sigma}}^{\mathrm{}\mathrm{1}}$ and $\mathbf{m}\mathrm{=}{\mathrm{\Sigma}}^{\mathrm{}\mathrm{1}}\mathit{}\mathbf{\mu}$, the following “Gaussian robust negative loglikelihood” is convex:
$${\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\mathrm{\ell}(\bm{m},\bm{T};x+M\bm{v})\right].$$ 
Proof.
To prove this, we show that the likelihood is convex even with respect to a single sample $x$; the result follows, since a convex combination of convex functions remains convex. We begin by looking at the likelihood of a single sample $x\sim \mathcal{N}({\bm{\mu}}_{*},{\mathbf{\Sigma}}_{*})$:
$\mathcal{L}(\bm{\mu},\mathbf{\Sigma};x+M(x\bm{\mu}))$  $={\displaystyle \frac{1}{\sqrt{{(2\pi )}^{k}\mathbf{\Sigma}}}}\mathrm{exp}\left({\displaystyle \frac{1}{2}}{(x\bm{\mu})}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}(x\bm{\mu})\right)$  
$={\displaystyle \frac{\frac{1}{\sqrt{{(2\pi )}^{k}\mathbf{\Sigma}}}\mathrm{exp}\left(\frac{1}{2}{(x\bm{\mu})}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}(x\bm{\mu})\right)}{{\displaystyle \int \frac{1}{\sqrt{{(2\pi )}^{k}{\mathbf{(}\bm{I}\mathbf{+}\bm{M}\mathbf{)}}^{\mathbf{}\mathrm{\U0001d7d0}}\mathbf{\Sigma}}}\mathrm{exp}\left(\frac{1}{2}{(x\bm{\mu})}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}(x\bm{\mu})\right)}}}$  
$={\displaystyle \frac{{I+M}^{1}\mathrm{exp}\left(\frac{1}{2}{x}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}x+{\bm{\mu}}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}x\right)}{{\displaystyle \int \mathrm{exp}\left(\frac{1}{2}{x}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}x+{\bm{\mu}}^{\top}{(I+M)}^{2}{\mathbf{\Sigma}}^{1}x\right)}}}$ 
In terms of the aforementioned $\bm{T}$ and $\bm{m}$, and for convenience defining $A={(I+M)}^{2}$:
$\mathrm{\ell}(x)$  $={A}^{1/2}+\left({\displaystyle \frac{1}{2}}{x}^{\top}A\bm{T}x{\bm{m}}^{\top}Ax\right)\mathrm{log}\left({\displaystyle \int \mathrm{exp}\left(\frac{1}{2}{x}^{\top}A\bm{T}x{\bm{m}}^{\top}Ax\right)}\right)$  
$\nabla \mathrm{\ell}(x)$  $=\left[\begin{array}{c}\hfill \frac{1}{2}{(Ax{x}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Ax\hfill \end{array}\right]{\displaystyle \frac{{\displaystyle \int \left[\begin{array}{c}\hfill \frac{1}{2}{(Ax{x}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Ax\hfill \end{array}\right]\mathrm{exp}\left(\frac{1}{2}{x}^{\top}A\bm{T}x{\bm{m}}^{\top}Ax\right)}}{{\displaystyle \int \mathrm{exp}\left(\frac{1}{2}{x}^{\top}A\bm{T}x{\bm{m}}^{\top}Ax\right)}}}$  
$=\left[\begin{array}{c}\hfill \frac{1}{2}{(Ax{x}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Ax\hfill \end{array}\right]{\mathbb{E}}_{z\sim \mathcal{N}({\bm{T}}^{1}\bm{m},{(\bm{A}\bm{T})}^{1})}\left[\begin{array}{c}\hfill \frac{1}{2}{(Az{z}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Az\hfill \end{array}\right].$  (22) 
From here, following an identical argument to [daskalakis2019efficient] Equation (3.7), we find that
${\bm{H}}_{\mathrm{\ell}}={\mathrm{Cov}}_{z\sim \mathcal{N}({\bm{T}}^{1}\bm{m},{(A\bm{T})}^{1})}[\left(\begin{array}{c}\hfill {\left({\displaystyle \frac{1}{2}}Az{z}^{T}\right)}_{\mathrm{\u29b0}}\hfill \\ \hfill z\hfill \end{array}\right),\left(\begin{array}{c}\hfill {\left({\displaystyle \frac{1}{2}}Az{z}^{T}\right)}_{\mathrm{\u29b0}}\hfill \\ \hfill z\hfill \end{array}\right)]\succcurlyeq \mathrm{\U0001d7ce},$ 
i.e. that the loglikelihood is indeed convex with respect to $\left[\begin{array}{c}\hfill \bm{T}\hfill \\ \hfill \bm{m}\hfill \end{array}\right]$, as desired. ∎
E.3.5 Applying Danskin’s Theorem
The previous two parts show that we can indeed apply Danskin’s theorem to the outer minimization, and in particular that the gradient of $f$ at $M={M}^{*}$ is in the subdifferential of the outer minimization problem. We proceed by writing out this gradient explicitly, and then setting it to zero (note that since we have shown $f$ is convex for all choices of perturbation, we can use the fact that a convex function is globally minimized $\iff $ its subgradient contains zero). We continue from above, plugging in (21) for $M$ and using (22) to write the gradients of $\mathrm{\ell}$ with respect to $\bm{T}$ and $\bm{m}$.
$0={\nabla}_{\left[\begin{array}{c}\hfill \bm{T}\hfill \\ \hfill \bm{m}\hfill \end{array}\right]}\mathrm{\ell}$  $={\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\left[\begin{array}{c}\hfill \frac{1}{2}{(Ax{x}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Ax\hfill \end{array}\right]{\mathbb{E}}_{z\sim \mathcal{N}({\bm{T}}^{1}\bm{m},{(\bm{A}\bm{T})}^{1})}\left[\begin{array}{c}\hfill \frac{1}{2}{(Az{z}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Az\hfill \end{array}\right]\right]$  
$={\mathbb{E}}_{x\sim \mathcal{N}({\bm{\mu}}^{*},{\mathbf{\Sigma}}^{*})}\left[\begin{array}{c}\hfill \frac{1}{2}{(Ax{x}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Ax\hfill \end{array}\right]{\mathbb{E}}_{z\sim \mathcal{N}({\bm{T}}^{1}\bm{m},{(\bm{A}\bm{T})}^{1})}\left[\begin{array}{c}\hfill \frac{1}{2}{(Az{z}^{\top})}_{\mathrm{\u29b0}}\hfill \\ \hfill Az\hfill \end{array}\right]$  
$=\left[\begin{array}{c}\hfill \frac{1}{2}{(A{\mathbf{\Sigma}}_{*})}_{\mathrm{\u29b0}}\hfill \\ \hfill A{\bm{\mu}}_{*}\hfill \end{array}\right]{\mathbb{E}}_{z\sim \mathcal{N}({\bm{T}}^{1}\bm{m},{(\bm{A}\bm{T})}^{1})}\left[\begin{array}{c}\hfill \frac{1}{2}{(A{(A\bm{T})}^{1})}_{\mathrm{\u29b0}}\hfill \\ \hfill A{\bm{T}}^{1}\bm{m}\hfill \end{array}\right]$  
$=\left[\begin{array}{c}\hfill \frac{1}{2}A{\mathbf{\Sigma}}_{*}\hfill \\ \hfill A{\bm{\mu}}_{*}\hfill \end{array}\right]\left[\begin{array}{c}\hfill \frac{1}{2}A{(A\bm{T})}^{1}\hfill \\ \hfill A{\bm{T}}^{1}\bm{m}\hfill \end{array}\right]$  
$=\left[\begin{array}{c}\hfill \frac{1}{2}A{\mathbf{\Sigma}}_{*}\frac{1}{2}{\bm{T}}^{1}\hfill \\ \hfill A{\bm{T}}^{1}\bm{m}A{\bm{\mu}}_{*}\hfill \end{array}\right]$  (23) 
Using this fact, we derive an implicit expression for the robust covariance matrix $\mathbf{\Sigma}$. Note that for the sake of brevity, we now use $M$ to denote the optimal adversarial perturbation (previously defined as ${M}^{*}$ in (21)). This implicit formulation forms the foundation of the bounds given by our main results.
Lemma 4.
The minimax problem discussed throughout this work admits the following (implicit) form of solution:
$\mathbf{\Sigma}$  $={\displaystyle \frac{1}{\lambda}}I+{\displaystyle \frac{1}{2}}{\mathbf{\Sigma}}_{*}+\sqrt{{\displaystyle \frac{1}{\lambda}}{\mathbf{\Sigma}}_{*}+{\displaystyle \frac{1}{4}}{\mathbf{\Sigma}}_{*}^{2}},$ 
where $\lambda $ is such that $M\mathrm{\in}\mathrm{M}$, and is thus dependent on $\mathrm{\Sigma}$.
Proof.
Rewriting (23) in the standard parameterization (with respect to $\bm{\mu},\mathbf{\Sigma}$) and reexpanding $A={(I+M)}^{2}$ yields:
$0={\nabla}_{\left[\begin{array}{c}\hfill \bm{T}\hfill \\ \hfill \bm{m}\hfill \end{array}\right]}\mathrm{\ell}=\left[\begin{array}{c}\hfill \frac{1}{2}{(I+M)}^{2}{\mathbf{\Sigma}}_{*}\frac{1}{2}\mathbf{\Sigma}\hfill \\ \hfill {(I+M)}^{2}\bm{\mu}{(I+M)}^{2}{\bm{\mu}}_{*}\hfill \end{array}\right]$ 
Now, note that the equations involving $\bm{\mu}$ and $\mathbf{\Sigma}$ are completely independent, and thus can be solved separately. In terms of $\bm{\mu}$, the relevant system of equations is $A\bm{\mu}A{\bm{\mu}}_{*}=0$, where multiplying by the inverse $A$ gives that
$$\bm{\mu}={\bm{\mu}}_{*}.$$  (24) 
This tells us that the mean learned via ${\mathrm{\ell}}_{2}$robust maximum likelihood estimation is precisely the true mean of the distribution.
Now, in the same way, we set out to find $\mathbf{\Sigma}$ by solving the relevant system of equations:
${\mathbf{\Sigma}}_{*}^{1}$  $={\mathbf{\Sigma}}^{1}{(M+I)}^{2}.$  (25) 
Now, we make use of the Woodbury Matrix Identity in order to write $(I+M)$ as
$$I+{(\lambda \mathbf{\Sigma}I)}^{1}=I+\left(I{\left(\frac{1}{\lambda}{\mathbf{\Sigma}}^{1}I\right)}^{1}\right)={\left(\frac{1}{\lambda}{\mathbf{\Sigma}}^{1}I\right)}^{1}.$$ 
Thus, we can revisit (25) as follows:
${\mathbf{\Sigma}}_{*}^{1}$  $={\mathbf{\Sigma}}^{1}{\left({\displaystyle \frac{1}{\lambda}}{\mathbf{\Sigma}}^{1}I\right)}^{2}$  
$\frac{1}{{\lambda}^{2}}}{\mathbf{\Sigma}}_{*}^{1}{\mathbf{\Sigma}}^{2}\left({\displaystyle \frac{2}{\lambda}}{\mathbf{\Sigma}}_{*}^{1}+I\right){\mathbf{\Sigma}}^{1}+{\mathbf{\Sigma}}_{*}^{1$  $=0$  
$\frac{1}{{\lambda}^{2}}}{\mathbf{\Sigma}}_{*}^{1}\left({\displaystyle \frac{2}{\lambda}}{\mathbf{\Sigma}}_{*}^{1}+I\right)\mathbf{\Sigma}+{\mathbf{\Sigma}}_{*}^{1}{\mathbf{\Sigma}}^{2$  $=0$ 
We now apply the quadratic formula to get an implicit expression for $\mathbf{\Sigma}$ (implicit since technically $\lambda $ depends on $\mathbf{\Sigma}$):
$\mathbf{\Sigma}$  $=\left({\displaystyle \frac{2}{\lambda}}{\mathbf{\Sigma}}_{*}^{1}+I\pm \sqrt{{\displaystyle \frac{4}{\lambda}}{\mathbf{\Sigma}}_{*}^{1}+I}\right){\displaystyle \frac{1}{2}}{\mathbf{\Sigma}}_{*}$  
$={\displaystyle \frac{1}{\lambda}}I+{\displaystyle \frac{1}{2}}{\mathbf{\Sigma}}_{*}+\sqrt{{\displaystyle \frac{1}{\lambda}}{\mathbf{\Sigma}}_{*}+{\displaystyle \frac{1}{4}}{\mathbf{\Sigma}}_{*}^{2}}.$  (26) 
This concludes the proof. ∎
E.3.6 Bounding $\lambda $
We now attempt to characterize the shape of $\lambda $ as a function of $\epsilon $. First, we use the fact that $\mathbb{E}[{\parallel Xv\parallel}^{2}]=\text{tr}({X}^{2})$ for standard normallydrawn $v$. Thus, $\lambda $ is set such that $\text{tr}({\mathbf{\Sigma}}_{*}{M}^{2})=\epsilon $, i.e:
$$\sum _{i=0}\frac{{\mathbf{\Sigma}}_{ii}^{*}}{{(\lambda {\mathbf{\Sigma}}_{ii}1)}^{2}}=\epsilon $$  (27) 
Now, consider ${\epsilon}^{2}$ as a function of $\lambda $. Observe that for $\lambda \ge \frac{1}{{\sigma}_{min}(\mathbf{\Sigma})}$, we have that $M$ must be positive semidefinite, and thus ${\epsilon}^{2}$ decays smoothly from $\mathrm{\infty}$ (at $\lambda =\frac{1}{{\sigma}_{min}})$ to zero (at $\lambda =\mathrm{\infty}$). Similarly, for $\lambda \le \frac{1}{{\sigma}_{max}(\mathbf{\Sigma})}$, $\epsilon $ decays smoothly as $\lambda $ decreases. Note, however, that such values of $\lambda $ would necessarily make $M$ negative semidefinite, which would actually help the loglikelihood. Thus, we can exclude this case; in particular, for the remainder of the proofs, we can assume $\lambda \ge \frac{1}{{\sigma}_{max}(\mathbf{\Sigma})}$.
Also observe that the zeros of $\epsilon $ in terms of $\lambda $ are only at $\lambda =\pm \mathrm{\infty}$. Using this, we can show that there exists some ${\epsilon}_{0}$ for which, for all $$, the only corresponding possible valid value of $\lambda $ is where $\lambda \ge \frac{1}{{\sigma}_{min}}$. This idea is formalized in the following Lemma.
Lemma 5.
For every ${\mathrm{\Sigma}}_{\mathrm{*}}$, there exists some ${\epsilon}_{\mathrm{0}}\mathrm{>}\mathrm{0}$ for which, for all $\epsilon \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}{\epsilon}_{\mathrm{0}}\mathrm{)}$ the only admissible value of $\lambda $ is such that $\lambda \mathrm{\ge}\frac{\mathrm{1}}{{\sigma}_{m\mathit{}i\mathit{}n}\mathit{}\mathrm{(}\mathrm{\Sigma}\mathrm{)}}$, and thus such that $M$ is positive semidefinite.
Proof.
We prove the existence of such an ${\epsilon}_{0}$ by lower bounding $\epsilon $ (in terms of $\lambda $) for any finite $\lambda >0$ that does not make $M$ PSD. Providing such a lower bound shows that for small enough $\epsilon $ (in particular, less than this lower bound), the only corresponding values of $\lambda $ are as desired in the statement^{18}^{18} 18 Since our only goal is existence, we lose many factors from the analysis that would give a tighter bound on ${\epsilon}_{0}$..
In particular, if $M$ is not PSD, then there must exist at least one index $k$ such that $$, and thus ${(\lambda {\mathbf{\Sigma}}_{kk}1)}^{2}\le 1$ for all $\lambda >0$. We can thus lower bound (27) as:
$$\epsilon =\sum _{i=0}\frac{{\mathbf{\Sigma}}_{ii}^{*}}{{(\lambda {\mathbf{\Sigma}}_{ii}1)}^{2}}\ge \frac{{\mathbf{\Sigma}}_{kk}^{*}}{{(\lambda {\mathbf{\Sigma}}_{kk}1)}^{2}}\ge {\mathbf{\Sigma}}_{kk}^{*}\ge {\sigma}_{min}({\mathbf{\Sigma}}^{*})>0$$  (28) 
By contradiction, it follows that for any $$, the only admissible $\lambda $ is such that $M$ is PSD, i.e. according to the statement of the Lemma. ∎
In the regime $\epsilon \in [0,{\epsilon}_{0})$, note that $\lambda $ is inversely proportional to $\epsilon $ (i.e. as $\epsilon $ grows, $\lambda $ decreases). This allows us to get a qualitative view of (26): as the allowed perturbation value increases, the robust covariance $\mathbf{\Sigma}$ resembles the identity matrix more and more, and thus assigns more and more variance on initially lowvariance features. The $\sqrt{{\mathbf{\Sigma}}_{*}}$ term indicates that the robust model also adds uncertainty proportional to the square root of the initial variance—thus, lowvariance features will have (relatively) more uncertainty in the robust case. Indeed, our main result actually follows as a (somewhat loose) formalization of this intuition.
E.3.7 Proof of main theorems
First, we give a proof of Theorem 2, providing lower and upper bounds on the learned robust covariance $\mathbf{\Sigma}$ in the regime $\epsilon \in [0,{\epsilon}_{0})$.
See 2
Proof.
We have already shown that $\bm{\mu}={\bm{\mu}}_{*}$ in the robust case (c.f. (24)). We choose ${\epsilon}_{0}$ to be as described, i.e. the largest $\epsilon $ for which the set $\{\lambda :\text{tr}({\mathbf{\Sigma}}_{*}^{2}M)=\epsilon ,\lambda \ge 1/{\sigma}_{\mathrm{max}}(\mathbf{\Sigma})\}$ has only one element $\lambda $ (which, as we argued, must not be less than $1/{\sigma}_{min}(\mathbf{\Sigma})$). We have argued that such an ${\epsilon}_{0}$ must exist.
We prove the result by combining our early derivation (in particular, (25) and (26)) with upper and lower bound on $\lambda $, which we can compute based on properties of the trace operator. We begin by deriving a lower bound on $\lambda $. By linear algebraic manipulation (given in Appendix E.3.8), we get the following bound:
$$\lambda \ge \frac{d}{\text{tr}(\mathbf{\Sigma})}\left(1+\sqrt{\frac{d\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})}{\epsilon}}\right)$$  (29) 
Now, we can use (25) in order to remove the dependency of $\lambda $ on $\mathbf{\Sigma}$:
$\mathbf{\Sigma}$  $={\mathbf{\Sigma}}_{*}{(M+I)}^{2}$  
$\text{tr}(\mathbf{\Sigma})$  $=\text{tr}\left[{({\mathbf{\Sigma}}_{*}^{1/2}M+{\mathbf{\Sigma}}_{*}^{1/2})}^{2}\right]$  
$\le 2\cdot \text{tr}\left[{({\mathbf{\Sigma}}_{*}^{1/2}M)}^{2}+{({\mathbf{\Sigma}}_{*}^{1/2})}^{2}\right]$  
$\le 2\cdot \left(\epsilon +\text{tr}({\mathbf{\Sigma}}_{*})\right).$ 
Applying this to (29) yields:
$\lambda $  $\ge {\displaystyle \frac{d/2}{\epsilon +\text{tr}({\mathbf{\Sigma}}_{*})}}\left(1+\sqrt{{\displaystyle \frac{d\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})}{\epsilon}}}\right).$ 
Note that we can simplify this bound significantly by writing $\epsilon =d\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*}){\epsilon}^{\prime}\le \text{tr}({\mathbf{\Sigma}}_{*}){\epsilon}^{\prime}$, which does not affect the result (beyond rescaling the valid regime $(0,{\epsilon}_{0})$), and gives:
$\lambda $  $\ge {\displaystyle \frac{d/2}{(1+{\epsilon}^{\prime})\text{tr}({\mathbf{\Sigma}}_{*})}}\left(1+{\displaystyle \frac{1}{\sqrt{{\epsilon}^{\prime}}}}\right)\ge {\displaystyle \frac{d\cdot (1+\sqrt{{\epsilon}^{\prime}})}{2\sqrt{{\epsilon}^{\prime}}(1+{\epsilon}^{\prime})\text{tr}({\mathbf{\Sigma}}_{*})}}$ 
Next, we follow a similar methodology (Appendix E.3.8) in order to upper bound $\lambda $:
$$\lambda \le \frac{1}{{\sigma}_{min}(\mathbf{\Sigma})}\left(\sqrt{\frac{{\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}\cdot d}{\epsilon}}+1\right).$$ 
Note that by (25) and positive semidefiniteness of $M$, it must be that ${\sigma}_{min}(\mathbf{\Sigma})\ge {\sigma}_{min}({\mathbf{\Sigma}}_{*})$. Thus, we can simplify the previous expression, also substituting $\epsilon =d\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*}){\epsilon}^{\prime}$:
$$\lambda \le \frac{1}{{\sigma}_{min}({\mathbf{\Sigma}}_{*})}\left(\sqrt{\frac{{\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}}{{\sigma}_{min}({\mathbf{\Sigma}}_{*}){\epsilon}^{\prime}}}+1\right)=\frac{{\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}+\sqrt{\epsilon \cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})}}{{\sigma}_{min}{({\mathbf{\Sigma}}_{*})}^{3/2}\sqrt{\epsilon}}$$ 
These bounds can be straightforwardly combined with Lemma 4, which concludes the proof. ∎
Proof.
To prove this, we make use of the following Lemmas:
Lemma 6.
For two positive definite matrices $A$ and $B$ with $\kappa \mathit{}\mathrm{(}A\mathrm{)}\mathrm{>}\kappa \mathit{}\mathrm{(}B\mathrm{)}$, we have that $\kappa \mathit{}\mathrm{(}A\mathrm{+}B\mathrm{)}\mathrm{\le}\mathrm{max}\mathit{}\mathrm{\{}\kappa \mathit{}\mathrm{(}A\mathrm{)}\mathrm{,}\kappa \mathit{}\mathrm{(}B\mathrm{)}\mathrm{\}}$.
Proof.
We proceed by contradiction:
$\kappa (A+B)$  $={\displaystyle \frac{{\lambda}_{max}(A)+{\lambda}_{max}(B)}{{\lambda}_{min}(A)+{\lambda}_{min}(B)}}$  
$\kappa (A)$  $={\displaystyle \frac{{\lambda}_{max}(A)}{{\lambda}_{min}(A)}}$  
$\kappa (A)$  $\ge \kappa (A+B)$  
$\iff {\lambda}_{max}(A)\left({\lambda}_{min}(A)+{\lambda}_{min}(B)\right)$  $\ge {\lambda}_{min}(A)\left({\lambda}_{max}(A)+{\lambda}_{max}(B)\right)$  
$\iff {\lambda}_{max}(A){\lambda}_{min}(B)$  $\ge {\lambda}_{min}(A){\lambda}_{max}(B)$  
$\iff {\displaystyle \frac{{\lambda}_{max}(A)}{{\lambda}_{min}(A)}}$  $\ge {\displaystyle \frac{{\lambda}_{min}(A)}{{\lambda}_{max}(B)}},$ 
which is false by assumption. This concludes the proof. ∎
Lemma 7 (Straightforward).
For a positive definite matrix $A$ and $k\mathrm{>}\mathrm{0}$, we have that
$$ 
Lemma 8 (Angle induced by positive definite matrix; folklore).
^{19}^{19} 19 A proof can be found in https://bit.ly/2L6jdATFor a positive definite matrix $A\mathrm{\succ}\mathrm{0}$ with condition number $\kappa $, we have that
$$\underset{x}{\mathrm{min}}\frac{{x}^{\top}Ax}{{\parallel Ax\parallel}_{2}\cdot {\parallel x\parallel}_{2}}=\frac{2\sqrt{\kappa}}{1+\kappa}.$$  (30) 
These two results can be combined to prove the theorem. First, we show that $\kappa (\mathbf{\Sigma})\le \kappa ({\mathbf{\Sigma}}_{*})$:
$\kappa (\mathbf{\Sigma})$  $=\kappa \left({\displaystyle \frac{1}{\lambda}}I+{\displaystyle \frac{1}{2}}{\mathbf{\Sigma}}_{*}+\sqrt{{\displaystyle \frac{1}{\lambda}}{\mathbf{\Sigma}}_{*}+{\displaystyle \frac{1}{4}}{\mathbf{\Sigma}}_{*}^{2}}\right)$  
$$  
$$  
$=\mathrm{max}\{\kappa \left({\mathbf{\Sigma}}_{*}\right),\sqrt{\kappa \left({\displaystyle \frac{2}{\lambda}}\sqrt{{\displaystyle \frac{1}{4}}{\mathbf{\Sigma}}_{*}^{2}}+{\displaystyle \frac{1}{4}}{\mathbf{\Sigma}}_{*}^{2}\right)}\}$  
$\le \kappa \left({\mathbf{\Sigma}}_{*}\right).$ 
Finally, note that (30) is a strictly decreasing function in $\kappa $, and as such, we have shown the theorem. ∎
E.3.8 Bounds for $\lambda $
Lower bound.
$\epsilon $  $=\text{tr}({\mathbf{\Sigma}}_{*}{M}^{2})$  
$\ge {\sigma}_{min}({\mathbf{\Sigma}}_{*})\cdot \text{tr}({M}^{2})$  by the definition of $\text{tr}(\cdot )$  
$\ge {\displaystyle \frac{{\sigma}_{min}({\mathbf{\Sigma}}_{*})}{d}}\cdot \text{tr}{(M)}^{2}$  by CauchySchwarz  
$\ge {\displaystyle \frac{{\sigma}_{min}({\mathbf{\Sigma}}_{*})}{d}}\cdot {\left[\text{tr}\left({(\lambda \mathbf{\Sigma}\bm{I})}^{1}\right)\right]}^{2}$  Expanding $M$ (21)  
$\ge {\displaystyle \frac{{\sigma}_{min}({\mathbf{\Sigma}}_{*})}{d}}\cdot {\left[\text{tr}{\left(\lambda \mathbf{\Sigma}\bm{I}\right)}^{1}\cdot {d}^{2}\right]}^{2}$  AMHM inequality  
$\ge {d}^{3}\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})\cdot {\left[\lambda \cdot \text{tr}(\mathbf{\Sigma})d\right]}^{2}$  
${\left[\lambda \cdot \text{tr}(\mathbf{\Sigma})d\right]}^{2}$  $\ge {\displaystyle \frac{{d}^{3}\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})}{\epsilon}}$  
$\lambda \cdot \text{tr}(\mathbf{\Sigma})d$  $\ge {\displaystyle \frac{{d}^{3/2}\cdot \sqrt{{\sigma}_{min}({\mathbf{\Sigma}}_{*})}}{\sqrt{\epsilon}}}$  since $M$ is PSD  
$\lambda $  $\ge {\displaystyle \frac{d}{\text{tr}(\mathbf{\Sigma})}}\left(1+\sqrt{{\displaystyle \frac{d\cdot {\sigma}_{min}({\mathbf{\Sigma}}_{*})}{\epsilon}}}\right)$ 
Upper bound
$\epsilon $  $=\text{tr}({\mathbf{\Sigma}}_{*}{M}^{2})$  
$\le {\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}\cdot d\cdot {\sigma}_{max}{(M)}^{2}$  
$\le {\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}\cdot d\cdot {\sigma}_{min}{(M)}^{2}$  
$\lambda \cdot {\sigma}_{min}(\mathbf{\Sigma})1$  $\le \sqrt{{\displaystyle \frac{{\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}\cdot d}{\epsilon}}}$  
$\lambda $  $\le {\displaystyle \frac{1}{{\sigma}_{min}(\mathbf{\Sigma})}}\left(\sqrt{{\displaystyle \frac{{\parallel {\mathbf{\Sigma}}_{*}\parallel}_{F}\cdot d}{\epsilon}}}+1\right).$ 