Directional Adversarial Training for Cost Sensitive Deep Learning Classification Applications
${}^{1}$ Human Inspired Technology Center, University of Padova.
${}^{2}$ Department of Information Engineering, University of Padova.
${}^{3}$ University of Pennsylvania.
Email: [email protected],[email protected],
[email protected]
Abstract: In many real-world applications of Machine Learning it is of paramount importance not only to provide accurate predictions, but also to ensure certain levels of robustness. Adversarial Training is a training procedure aimed at providing models that are robust to worst-case perturbations around predefined points. Unfortunately, one of the main issues in adversarial training is that robustness w.r.t. gradient-based attackers is always achieved at the cost of prediction accuracy. In this paper, a new algorithm for adversarial training, called Wasserstein Projected Gradient Descent (WPGD), is proposed. WPGD provides a simple way to obtain cost-sensitive robustness, resulting in a finer control of the robustness-accuracy tradeoff. Moreover, WPGD solves an optimal transport problem on the output space of the network and can efficiently discover directions where robustness is required, which allows controlling the directional tradeoff between accuracy and robustness. The proposed WPGD is validated on image recognition tasks with different benchmark datasets and architectures. Moreover, real-world-like datasets are often unbalanced: this paper shows that, when dealing with this type of dataset, the performance of adversarial training is mainly affected in terms of standard accuracy.
Keywords: Adversarial training, Artificial Intelligence, Cost-sensitive, Deep Learning, Image Classification, Optimal Transport, Wasserstein
1. Introduction
Recent advancements in Deep Learning have led to several breakthrough applications in many fields, such as Computer Vision [24], Healthcare [12], Industry 4.0 [29, 39], Natural Language Processing [52], Speech Recognition [34] and Transportation [16]. A crucial requirement for many applications in these fields is to have models that do not exhibit unexpected behaviors. However, deep neural networks (DNNs) do not satisfy this property under some circumstances.
Probably the most alarming behavior of DNNs [5, 26] for classification tasks is that they are susceptible to adversarial perturbations: in the context of Computer Vision, for example, these are modifications to the input image that, although imperceptible to the human eye, cause the network to confidently misclassify the image [47]. These perturbations are easy to synthesize and they may even generalize across different networks [32]. This suggests surprising vulnerabilities in these state-of-the-art classifiers and has resulted in a flurry of activity towards understanding this phenomenon [14, 43], building robustness and defenses against it [18, 28], as well as discovering new attacks [3, 7, 35, 36]. Adversarial robustness is fundamental in many real-world applications. In important applications like autonomous driving [38] and predictive maintenance [46], errors and faults have different priorities and importance: for example, if the recognition system of an autonomous car misclassifies a cat as a dog there should reasonably be no damage, while misclassifying a human could lead to dramatic consequences.
Adversarial robustness is here defined as the accuracy of a given model evaluated on the worst-case input within a prescribed neighbourhood. More informally, it can be considered as the accuracy of the model in worst-case scenarios. In this context, the most common and effective approach to enable robustness to adversarial examples in DNNs is Adversarial Training [28], whose idea is to train a model with these worst-case examples, called adversarial examples, instead of using clean data, i.e., data measured either without error or with negligible error. Thus, it is a training procedure belonging to the class of minimax problems [40], in which an inner loop finds the worst-case data point ${x}^{\star}$ through gradient ascent and the outer loop minimizes the target loss on ${x}^{\star}$.
Unfortunately, adversarial robustness comes at the price of lower classification accuracy on clean data: this tradeoff has been demonstrated by various analyses [13, 49]. As argued above, an adversarially robust classifier with low accuracy is unlikely to be used in practical applications that require both. Although much effort has been devoted to theoretically understanding robustness, its practical consequences in industrial applications have received little attention in the literature [20].
The present work aims at addressing the aforementioned issues with the following contributions:

• it is shown that the quantitative and qualitative difference between robust and standard models correlates with the visual metric of classes, i.e., it is aligned with the human notion of distance between classes. Adversarially trained networks learn to (mostly) ignore fine-grained classification and confuse classes with samples that are close to the decision boundaries. This result is corroborated by [33], where it is shown that adversarial training leads to boundaries with low curvature;
• it is shown that robust models are less confident in their predictions than standard models;
• inspired by the previous observations, Wasserstein Projected Gradient Descent (WPGD), an algorithm for adversarial training of deep networks, is presented. WPGD improves the efficiency of the inner loop in gradient-based defenses such as Projected Gradient Descent (PGD). WPGD formulates an optimal transport problem on the label space, with the underlying metric given by the distances of the classification boundaries between classes. This metric guides the search for adversarial perturbations towards classes that are visually dissimilar. It is shown that training deep networks using WPGD is effective in shaping boundaries so as to maintain directional robustness where required while maintaining accuracy on similar classes.
Moreover, it is worth noting that, although the experiments in this work concern image recognition tasks, the WPGD framework can be easily extended to other types of data, such as time series.
The rest of this paper is organized as follows. In Sec. 2 the building blocks of the proposed approach, namely estimating the distance to the boundaries and optimal transport, are presented, while properties of adversarial training are discussed in Sec. 3. In Sec. 4 the WPGD algorithm is introduced, and experimental results on the MNIST [25], CIFAR-10 and TinyImagenet datasets for different deep networks are reported in Sec. 5. Related work and discussion are provided in Sec. 6 and Sec. 7, respectively.
2. Notation and building blocks
This section describes the notation and the main building blocks of the approach presented in this work.
Notation: Let $\theta \in \mathbb{R}^{d}$ denote the parameters of a neural network. Input images are denoted by $X=\{x_{i}: i\le N\}$, with pixel intensities normalized to lie in $[0,1]$. Given an image $x$, let $\kappa(x)\in \{1,\dots,K\}$ be its ground-truth label; the one-hot encoding of $\kappa(x)$ is denoted by $y(x)$. The normalized probability distribution over the classes as predicted by the network is denoted by $\widehat{y}(x)\in \mathbb{R}^{K}$, where $\widehat{y}(x)_{k}$ denotes its $k^{\mathrm{th}}$ entry and $\widehat{\kappa}(x)=\arg\max_{k}\widehat{y}(x)_{k}$ is the predicted class. The cross-entropy loss can then be written as
$$\ell_{\mathrm{CE}}(\theta; x)=-\log \widehat{y}(x)_{\kappa(x)}$$  (1) 
and training a network involves minimizing the average loss, i.e., $\arg\min_{\theta}\mathbb{E}_{x\sim X}\left[\ell_{\mathrm{CE}}(\theta; x)\right]$.
The training dataset is denoted by $\mathcal{D}=\{\mathbf{x},\mathbf{y}\}$, where $\mathbf{x}=\{x_{i}\}_{i=1}^{N}$ and $\mathbf{y}=\{y_{i}\}_{i=1}^{N}$ are, respectively, a set of randomly sampled data points and their corresponding labels, generated from an unknown distribution $p_{\psi}(x,y)$ parametrized by $\psi$. In lieu of minimizing the expected loss over the training data, adversarial training solves
$$\min_{\theta}\mathbb{E}_{X}\left[\max_{x'\in \mathcal{M}(x)}\ell_{\mathrm{CE}}(\theta; x')\right];$$  (2) 
this is a saddle point problem where, at each iteration, candidate images $x'$ are chosen from a set $\mathcal{M}(x)$ (or a manifold). This has been a successful approach to training neural networks robustly w.r.t. adversarial perturbations, see [28, 22, 44, 21]. In this paper only $\mathcal{M}(x)=\mathcal{M}_{\infty}(x)=\{x' : \|x'-x\|_{\infty}\le \epsilon\}$, the infinity-norm ball around $x$, is considered, so as to obtain an algorithm based on PGD [6, 28].
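As a rough illustration of the inner maximization in (2) over the infinity-norm ball, the following sketch runs a signed-gradient PGD attack on a linear softmax classifier. The linear model, the helper names (`softmax`, `ce_grad_x`, `pgd_attack`), the step size and the random start are illustrative assumptions rather than details taken from this paper; for a deep network the analytic gradient would be replaced by backpropagation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad_x(W, b, x, label):
    """Gradient of -log softmax(W x + b)[label] with respect to the input x."""
    p = softmax(W @ x + b)
    p[label] -= 1.0              # d loss / d logits = p - onehot(label)
    return W.T @ p

def pgd_attack(W, b, x, label, eps=0.1, step=0.02, n_steps=8, rng=None):
    """Approximate the worst-case x' with ||x' - x||_inf <= eps (assumed settings)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Assumption: uniform random start inside the ball.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(n_steps):
        g = ce_grad_x(W, b, x_adv, label)
        x_adv = x_adv + step * np.sign(g)          # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project onto the l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep pixels in [0, 1]
    return x_adv
```

In adversarial training, the outer loop would then take a gradient step on the parameters using the loss evaluated at the returned `x_adv` instead of at `x`.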
It is remarked that the theoretical properties described in the following apply to general settings and not only to Euclidean perturbations. This paper distinguishes between natural error (NE) and adversarial error (AE), i.e., the errors obtained with natural images and with adversarial images, respectively. In the following, $\ell_{\infty}$ perturbations are used in all the experiments regarding real datasets, while $\ell_{2}$ perturbations are used in the synthetic example of Sec. 3.4; the reason for using $\ell_{2}$ instead of $\ell_{\infty}$ there is simply to ease visualization of the impact of adversarial training.
3. Properties of adversarial training
In this section some effects and properties of adversarial training are reported, concerning the following aspects:
• the qualitative and quantitative description of classification errors, measured by the accuracy gap (Sec. 3.1);
• unbalanced classification problems (Sec. 3.2);
• the characterization of output confidence (Sec. 3.3);
• the characterization of boundaries (Sec. 3.4).
These effects are supported by the experiments reported in this section. Moreover, it is shown in Sec. 3.3 that an entropic regularization helps in obtaining robustness.
The properties and effects of adversarial training reported here have motivated WPGD, which is presented in the following section.
3.1. Accuracy gap
To ease the understanding of the results in this section, the notion of accuracy gap is defined as follows:
Definition 1.
Let $C_{\mathrm{pgd}}$ and $C_{\mathrm{ce}}$ be the confusion matrices of the robust and standard models, respectively. The accuracy gap $G$ is defined as the elementwise absolute difference between the confusion matrices:
$$G=\left|C_{\mathrm{pgd}}-C_{\mathrm{ce}}\right|$$ 
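A minimal sketch of how the accuracy gap of Definition 1 could be computed from predicted labels is given below. The helper names and the use of unnormalized count matrices are assumptions made for illustration; the paper does not state whether the confusion matrices are normalized.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(C, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return C

def accuracy_gap(C_pgd, C_ce):
    """Elementwise absolute difference between two confusion matrices."""
    return np.abs(C_pgd - C_ce)

# Toy example with two classes (hypothetical predictions).
y_true = [0, 0, 1, 1]
C_ce = confusion_matrix(y_true, [0, 0, 1, 1], 2)    # standard model
C_pgd = confusion_matrix(y_true, [0, 1, 1, 1], 2)   # robust model
G = accuracy_gap(C_pgd, C_ce)
```

Large off-diagonal entries of `G` indicate class pairs on which the robust model makes errors that the standard model does not.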
Although it is known that robustness is obtained at the cost of accuracy [28, 49], it is still not clear in the literature whether this gap can be mitigated. (On the MNIST dataset, high-capacity networks reduce the accuracy gap to near zero; however, on more complex datasets, such as CIFAR-10, this gap exists even with very large networks.) In this work a first step towards tackling this problem is taken by studying how errors are distributed between images and classes: it is shown in the following that misclassification errors are distributed according to the visual metric, meaning that robust networks tend to destroy fine-grained classification. Qualitatively, the visual metric is a distance between classes that can be easily interpreted by humans. One approach for defining such a visual metric is to employ the distance from the boundaries of a deep neural network: in fact, [42] showed that NNs learn representations that are well aligned with our idea of visual similarity.
Due to the high dimensionality of the input, obtaining a good approximation of the visual metric is not easily feasible. However, it can be replaced by the semantic metric provided by WordNet [30], which is a good proxy for the visual metric, as also shown by [9]. For MNIST, a linear classifier on the input pixels is used, whereby the boundaries can be computed accurately.
Fig. 1 illustrates the results for CIFAR-10. In particular, Fig. 1(b) shows the accuracy gap between a Wide Residual Network [53] trained using PGD and one trained with the standard cross-entropy loss. From this figure, it is easy to see a visual correlation between the metric and the accuracy gap. Interestingly, the errors that are explained by such metrics correspond to classes which are visually similar. For instance, Fig. 1 shows a gap on the pair bird-airplane, two classes which are visually similar but semantically different. Analogously, Figs. 2, 3 and 4 show the WordNet metrics and the corresponding accuracy gaps for MNIST, TinyImagenet and CIFAR-100, respectively. Similar results can be identified for these datasets as well. Regarding MNIST, not surprisingly, digits "$0$" and "$1$" hardly fool each other. The most similar digits are "$4$" and "$9$": a small manipulation of such digits can be sufficient to make them indistinguishable. Regarding CIFAR-100, as an example, at indices 8-11 there is an evident cluster composed of the classes man, boy, woman and girl. Other strongly connected classes are bridge, skyscraper, house, castle and road. Moreover, there are animals that are semantically different but visually similar, such as the pair 32-90, seal and otter respectively. The bottom-right cluster represents flowers and plants.
Table 1 provides a quantitative measure (supporting the aforementioned visual results) of the correlation between the accuracy gap and the corresponding metric. The minus sign is due to the fact that confusion matrices and distances are inversely correlated: when the values of the diagonal of the confusion matrices increase, the distance between classes decreases, on average. For MNIST the correlation is higher since an approximation of the actual visual metric has been used, while for CIFAR-10 and CIFAR-100 the correlation is lower because some pairs, for example bird-airplane, are semantically different. Moreover, it is remarked that with a high output dimension the correlation decreases even when there are well-correlated structures. The correlation between two random matrices in $\mathbb{R}^{200\times 200}$ is almost zero in expectation.
| | MNIST | CIFAR-10 | CIFAR-100 | TinyImagenet |
|---|---|---|---|---|
| Correlation | -0.88 | -0.65 | -0.35 | -0.22 |
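The paper does not specify how the correlation between the accuracy gap and the metric is computed. One plausible reading, sketched here purely as an assumption, is the Pearson correlation between the off-diagonal entries of the gap matrix `G` and of the class-distance matrix `D` (both names are hypothetical).

```python
import numpy as np

def gap_metric_correlation(G, D):
    """Pearson correlation between off-diagonal entries of G and D (assumed protocol)."""
    mask = ~np.eye(G.shape[0], dtype=bool)   # ignore the diagonal
    return float(np.corrcoef(G[mask], D[mask])[0, 1])

# Toy example: the gap is large exactly where the distance is small.
G = np.array([[0.0, 3.0, 1.0], [3.0, 0.0, 2.0], [1.0, 2.0, 0.0]])
D = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
r = gap_metric_correlation(G, D)

# Two unrelated random 200 x 200 matrices give a correlation close to zero.
rng = np.random.default_rng(0)
r_random = gap_metric_correlation(rng.uniform(size=(200, 200)),
                                  rng.uniform(size=(200, 200)))
```

In the toy example the two matrices are perfectly anti-correlated, which matches the sign convention discussed in the text.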
Given these premises and observations, the following conjecture can be made: when the number of classes is high, the boundaries among similar classes become more complex. Thus, as an ablation study, two two-class problems on the CIFAR-10 dataset are reported in the following: the first problem is to distinguish the classes airplane (id: 0) and horse (id: 7), while the second is cat (id: 3) vs dog (id: 5). Fig. 5 shows that, even in simple settings, adversarial training dramatically affects fine-grained classification.
3.2. Unbalanced classification
Although real-world datasets are long-tailed [11], most of the experiments and theoretical findings on the accuracy-robustness tradeoff in the literature were obtained with balanced datasets [50].
Through an experimental analysis, it is shown that when classes are unbalanced, adversarial training can have dramatic effects on clean accuracy. For this analysis, the same two-class problems of the ablation study reported in Sec. 3.1 are selected: cat-vs-dog and airplane-vs-horse. Classes are artificially and randomly unbalanced such that their ratio is 0.3.
The cat-vs-dog classification problem is intrinsically difficult since the two classes have many features in common. Moreover, CIFAR-10 has low-resolution images, which (sometimes) makes this classification task non-trivial even for human classifiers. On the contrary, airplane-vs-horse is a simple task and thus one should expect that adversarial training does not decrease clean accuracy much.
The results of these two experiments are shown in Fig. 5, and two considerations are reported here. The first is that when classes are similar, as mentioned above, PGD heavily impacts performance with respect to standard training, whereas for dissimilar classes the effect is much less pronounced. This is a solid argument for supposing that using a single $\epsilon$ may not be optimal. The second consideration is that when the dataset is unbalanced, PGD further amplifies the difficulty of the classification task. For example, for cat-vs-dog (Fig. 5(d)), in the presence of unbalance, the model cannot be fit at all.
3.3. Entropy of softmax outputs
One of the issues of 'standardly' trained networks is that they are overconfident, that is, they tend to predict classes with high probability even when images are not clear [19]. Adversarial training can be seen as an implicit regularization, and thus it is legitimate to analyze the confidence of the predictions of robust models. Indeed, Fig. 7 shows that another characteristic of adversarial training is that it reduces the confidence of predictions: the entropy of the softmax outputs of the robustly trained network is much higher. This suggests that confidence scores obtained by thresholding the softmax predictions should be changed. Thus, it may seem that robust representations are less discriminative than standard ones. (For those who are not used to deep learning language, in this context a representation is the vector, output by the feature extractor, that is fed to the last layer, which is a linear classifier.) It turns out that this intuition is true and is supported by Fig. 6. In order to assess the structure of the representations, t-SNE [27] has been employed, a technique that allows visualizing high-dimensional data in 2 or 3 dimensions. From Fig. 6, it is clear that robust representations are less clustered than natural ones. Each coloured cluster corresponds to one particular class.
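The confidence analysis above relies on the entropy of the predicted class distribution. A minimal sketch of that quantity is given below; the function names and the small constant `eps` used to avoid `log(0)` are assumptions for illustration.

```python
import numpy as np

def softmax(logits):
    """Softmax along the last axis, shifted for numerical stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each predicted distribution."""
    return -(probs * np.log(probs + eps)).sum(axis=-1)

# A confident prediction has lower entropy than a hesitant one.
confident = prediction_entropy(softmax(np.array([10.0, 0.0, 0.0, 0.0])))
hesitant = prediction_entropy(softmax(np.array([1.0, 0.0, 0.0, 0.0])))
```

Averaging `prediction_entropy` over a test set for a standard and a robust model would give the kind of comparison described in this section.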
3.4. PGD flattens boundaries
In order to better understand the behavior of PGD and to compare it with WPGD, defined in Sec. 4, a simple classification problem with three classes is considered. Fig. 8 and Fig. 9 show the boundaries for PGD and WPGD (for different $\epsilon$), respectively. Fig. 8(a) represents standard training, which achieves almost zero error. As $\epsilon$ increases, the boundaries become flatter and as orthogonal as possible to the gradient direction. The adopted cost matrix is $C=\begin{bmatrix}0 & 10 & 0.01\\ 10 & 0 & 1\\ 0.01 & 1 & 0\end{bmatrix}$. Related to these results, [33] showed experimentally that the main effect of PGD is to reduce the curvature of boundaries. However, it can be easily shown that even when the curvature is, robust training still has an effect. Moreover, it is noticed that gradients are more aligned with the vector connecting two classes. This is due to the "isotropic" effect of PGD, which tends to estimate more isotropic distributions. This is in accordance with [50], whose authors observed that the gradients of robust models are more meaningful. This argument is also in accordance with the results on fine-grained classification presented in this work, suggesting that visually similar classes are separated by more complex boundaries. Instead, WPGD controls the regularization of boundaries through the cost matrix: boundaries between pairs of classes considered more similar are mostly preserved.
Remark 2.
One may argue that, since visually similar classes are separated by more complex boundaries, this obviously hurts robustness. However, the values of $\epsilon$ used for robust training are much smaller than the minimal distance between two images in the dataset. Thus, at least in principle, it is still not clear why it is not possible to obtain robustness and accuracy at the same time.
4. Wasserstein Projected Gradient Descent
Sec. 4.1 briefly reviews the necessary background on discrete optimal transport tools, while Sec. 4.2 introduces a new formulation of directional adversarial training.
4.1. Wasserstein metric and optimal transportation
The cost between classes, referred to as label metric, is defined in the following:
Definition 3 (Label metric ${C}^{\odot p}$).
A symmetric positive semidefinite matrix $C\in \mathbb{R}_{+}^{K\times K}$ defines a pseudo-Riemannian metric on the domain; an entry $C_{k,k'}$ is the cost of transporting unit probability mass from class $k$ to class $k'$. Note that $C_{k,k}=0$. The notation $C^{\odot p}$ denotes the elementwise $p^{\mathrm{th}}$ power of $C$.
The other building blocks are the optimal transportation problem [41] and the Wasserstein metric over probability distributions. Given two probability distributions $q,q'$ supported on $K$ classes, the $p$-Wasserstein distance between $q$ and $q'$ for $p\in [1,\infty)$ is defined to be
$$\mathrm{W}_{p}^{p}(q,q')=\inf_{\pi \in \Pi(q,q')}\langle \pi, C^{\odot p}\rangle$$  (3) 
where $\Pi(q,q')=\{\pi \in \mathbb{R}_{+}^{K\times K}: q=\pi \mathbb{1},\ q'=\pi^{\top}\mathbb{1}\}$ is the set of joint probability distributions with $q$ as the right marginal and $q'$ as the left marginal; $\mathbb{1}$ denotes the all-ones vector and $\langle \cdot ,\cdot \rangle$ is the Frobenius inner product on matrices. The Wasserstein distance is the optimal cost of transporting probability mass from an initial distribution $q$ to a final distribution $q'$. For $0<p<1$, the Wasserstein distance in (3) is defined to be $\mathrm{W}_{p}(q,q')=\inf_{\pi \in \Pi(q,q')}\langle \pi, C^{\odot p}\rangle$; note the absence of the $p^{\mathrm{th}}$ power on the left-hand side. For any separable complete metric space $(\mathcal{X},d)$ and $p>0$, the metric space $(\mathcal{P}_{p},\mathrm{W}_{p})$ is complete and separable, $\mathcal{P}_{p}$ being the set of probability distributions supported on $\mathcal{X}$ [2].
Problem (3) is called the Kantorovich relaxation [41] of the original optimal transport problem with $\Pi=\mathbb{R}_{+}^{K\times K}$ [31], and it takes $\mathcal{O}(K^{3})$ operations to solve it using linear programming or interior point methods. [8] proposed a smoothed alternative to (3) by adding a convex negative entropic term
$${}^{\lambda}\mathrm{W}_{p}^{p}(q,q')=\inf_{\pi \in \Pi(q,q')}\langle \pi, C^{\odot p}\rangle-\lambda^{-1}\mathrm{H}(\pi),$$  (4) 
where $\mathrm{H}(\pi)=-\sum_{k,k'=1}^{K}\pi_{k,k'}\log \pi_{k,k'}$ is the entropy of $\pi$; this enables an efficient algorithm based on the Sinkhorn-Knopp iteration [45] to approximate $\pi^{*}$. Large values of $\lambda$ give a better approximation to the exact distance $\mathrm{W}_{p}^{p}$, and it can be shown that ${}^{\lambda}\mathrm{W}_{p}^{p}$ converges to $\mathrm{W}_{p}^{p}$ as $\lambda \to \infty$ [37].
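To make the entropic smoothing in (4) concrete, the following is a minimal NumPy sketch of the Sinkhorn-Knopp scaling iteration in its commonly used form, with Gibbs kernel `exp(-lam * C**p)`. The function name, default parameters and fixed iteration count are assumptions for illustration; this is not the implementation used in the paper.

```python
import numpy as np

def sinkhorn(q, q_prime, C, p=1.0, lam=2.0, n_iters=500):
    """Approximate the entropy-regularized transport plan between q and q_prime."""
    Cp = C ** p                      # elementwise p-th power of the cost
    K = np.exp(-lam * Cp)            # Gibbs kernel
    u = np.ones_like(q)
    for _ in range(n_iters):
        v = q_prime / (K.T @ u)      # match the column sums to q_prime
        u = q / (K @ v)              # match the row sums to q
    pi = u[:, None] * K * v[None, :]
    return pi, float((pi * Cp).sum())

# Toy example on three classes with cost |k - k'|.
q = np.array([0.5, 0.3, 0.2])
q_prime = np.array([0.2, 0.2, 0.6])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
pi, cost = sinkhorn(q, q_prime, C)
```

The rows of `pi` sum to `q` and its columns to `q_prime`, following the marginal convention of (3). Larger values of `lam` bring the transport cost closer to the unregularized distance, at the price of slower and less stable convergence.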
The Sinkhorn-Knopp iteration is costly if the number of classes $K$ is large or the metric $C^{\odot p}$ is complex. However, as the following lemma shows, if one of the probability distributions is a one-hot vector, the optimal transport $\pi^{*}$ can be computed in closed form. Indeed, in this paper, the $p$-Wasserstein distance is computed between the ground truth $y(x)$ and the network predictions $\widehat{y}(x)$, the former being a one-hot vector.
Lemma 4 (Closed-form Wasserstein distance).
For any normalized $q$, if the target probability distribution $q'$ is a one-hot vector, the Wasserstein distance $\mathrm{W}_{p}^{p}$ can be computed in closed form and is given by
$$\mathrm{W}_{p}^{p}(q,q')=C_{\kappa^{*}}^{\odot p}\,q$$ 
where $\kappa^{*}=\arg\max_{k} q'_{k}$. The optimal transport is such that its $(\kappa^{*})^{\mathrm{th}}$ column is $q$.
The proof of this lemma follows from the observation that the set $\Pi(q,q')$ is degenerate for one-hot $q'$: the constraints $\pi^{\top}\mathbb{1}=q'$ and $\pi \mathbb{1}=q$ force the $(\kappa^{*})^{\mathrm{th}}$ column of $\pi$ to be simply $q$. Note that the Wasserstein distance is symmetric and therefore the same statement holds for $\mathrm{W}_{p}^{p}(q',q)$. Finally, the regularized [8] Wasserstein Loss is defined as follows:
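The closed form of Lemma 4 can be checked numerically on a toy example. In the sketch below (function names are hypothetical), the only feasible plan for a one-hot target places $q$ in column $\kappa^{*}$, and its transport cost coincides with the inner product between row $\kappa^{*}$ of $C^{\odot p}$ and $q$, since $C$ is symmetric.

```python
import numpy as np

def closed_form_wasserstein(q, kappa_star, C, p=1.0):
    """W_p^p between q and the one-hot vector at index kappa_star."""
    return float((C[kappa_star] ** p) @ q)

def one_hot_plan(q, kappa_star):
    """The unique feasible plan: all mass of q is sent to class kappa_star."""
    pi = np.zeros((q.size, q.size))
    pi[:, kappa_star] = q
    return pi

q = np.array([0.1, 0.7, 0.2])
kappa_star = 2
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)

w = closed_form_wasserstein(q, kappa_star, C)
pi = one_hot_plan(q, kappa_star)
plan_cost = float((pi * C).sum())
```

Here both `w` and `plan_cost` evaluate the same sum, $0.1\cdot 2+0.7\cdot 1+0.2\cdot 0$, so no iterative solver is needed in the one-hot case.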
Definition 5 (Wasserstein Loss).
The Wasserstein Loss can now be defined as
$$\ell_{\mathrm{W}}(\theta; x)=C_{\kappa(x)}^{\odot p}\,\widehat{y}(x)-\frac{\lambda^{-1}}{\log K}\mathrm{H}(\widehat{y}(x));$$  (5) 
here $C_{\kappa(x)}^{\odot p}$ denotes the $\kappa(x)^{\mathrm{th}}$ row of the matrix $C^{\odot p}\in \mathbb{R}_{+}^{K\times K}$. Note that computing $\ell_{\mathrm{W}}(\theta; x)$ and backpropagating through it has the same computational complexity as standard cross-entropy.
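A minimal sketch of the loss in (5) for a single example follows. It assumes that the entropy term enters with a minus sign, as in the regularized problem (4), and that `y_hat` is already a normalized probability vector; the function name and the constant `tiny` are illustrative.

```python
import numpy as np

def wasserstein_loss(y_hat, label, C, p=1.0, lam=1.0, tiny=1e-12):
    """Transport cost to the true class minus a scaled entropy term (assumed sign)."""
    K = y_hat.size
    transport = (C[label] ** p) @ y_hat                 # row of C^p times prediction
    entropy = -(y_hat * np.log(y_hat + tiny)).sum()     # H(y_hat) in nats
    return float(transport - entropy / (lam * np.log(K)))

# Toy example: 4 classes, unit cost between any two distinct classes.
C = 1.0 - np.eye(4)
loss_correct = wasserstein_loss(np.array([1.0, 0.0, 0.0, 0.0]), 0, C)
loss_uniform = wasserstein_loss(np.full(4, 0.25), 0, C)
```

With this uniform cost matrix, a perfectly confident correct prediction gives a loss of about zero, while the uniform prediction gives $0.75-1=-0.25$, since dividing by $\log K$ normalizes the maximal entropy to one.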
4.2. WPGD
The saddle point formulation can be modified using the Wasserstein loss (5), leading to the following definition.
Definition 6 (Robust Wasserstein loss).
The Robust Wasserstein loss is defined as
$$\min_{\theta}\mathbb{E}_{X}\,\ell_{\mathrm{CE}}(\theta; x^{*}),\qquad x^{*}=\arg\max_{\|x'-x\|_{\infty}\le \epsilon}\ell_{\mathrm{W}}(\theta; x')$$  (6) 
The outer loop remains the same, while the inner loop is responsible for finding the adversarial example which maximizes the Wasserstein loss $\ell_{\mathrm{W}}$. This implies that at the beginning of training WPGD will prefer directions connecting visually distant classes, such as cat and truck, which prevents the flattening of regions between similar classes. It is important to note that during training there is an implicit tradeoff between choosing the directions suggested by the metric and the gradient directions. In fact, the loss gradient is nothing else than an inner product of the gradients of the $K$ logits and the $k^{\mathrm{th}}$ row of $C$. Imposing an approximation of the real visual metric helps to efficiently explore the $\ell_{\infty}$-ball, which can be hard to explore, especially for high-dimensional inputs, leading to better results. For the WPGD experiments, the metrics previously described are used.
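The inner loop of (6) can be sketched on a linear softmax classifier, where the gradient of the Wasserstein loss with respect to the input is available analytically. The linear model, the function names, the step size, the start at the clean image and the minus sign of the entropy term are all assumptions made for illustration; the paper applies the same idea to deep networks.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def wasserstein_loss_and_grad(W, b, x, label, C, p=1.0, lam=1.0, tiny=1e-12):
    """Loss (5) for the model softmax(W x + b) and its gradient w.r.t. x."""
    y_hat = softmax(W @ x + b)
    a = 1.0 / (lam * np.log(b.size))
    cost = C[label] ** p
    loss = cost @ y_hat + a * (y_hat * np.log(y_hat + tiny)).sum()
    g = cost + a * (np.log(y_hat + tiny) + 1.0)   # d loss / d y_hat
    dz = y_hat * (g - y_hat @ g)                  # chain rule through softmax
    return float(loss), W.T @ dz                  # chain rule through W x + b

def wpgd_attack(W, b, x, label, C, eps=0.1, step=0.02, n_steps=8, p=1.0, lam=1.0):
    """Signed gradient ascent on the Wasserstein loss inside the l_inf ball."""
    x_adv = x.copy()                              # assumption: start at the clean image
    for _ in range(n_steps):
        _, g = wasserstein_loss_and_grad(W, b, x_adv, label, C, p, lam)
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

Because the cost row weights the class probabilities, the ascent direction favours perturbations that move probability mass towards classes that are expensive under $C$, which is the directional behaviour described above.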
5. Experiments
This Section provides the experimental findings of the WPGD approach.
5.1. Datasets and networks
In this paper, the MNIST [25], CIFAR-10 and CIFAR-100 [23], and TinyImagenet [1] datasets are used for the experiments. For all datasets, images are normalized to have pixel intensities in $[0,1]$. The adversarial vulnerability of neural networks increases with the number of output classes [13]. In this context, it is worth emphasizing that the TinyImagenet dataset, with $200$ classes, is a viable dataset for benchmarking adversarial learning algorithms; this dataset is however less popular in the literature, which primarily focuses on MNIST and CIFAR-10. For the CIFAR datasets, standard data augmentation is used, which involves mirror flipping with probability $0.5$ and random crops of size $32\times 32$ after padding images by $4$ pixels on each side. The following networks are used in all the experiments:

(1) W16-10: Wide Residual Network of [53] with $16$ layers, a widening factor of $10$, weight decay of $5\times 10^{-4}$ and zero dropout.
(2) W40-10: Wide Residual Network of [53] with $40$ layers, a widening factor of $10$, weight decay of $5\times 10^{-4}$ and zero dropout.
(3) W28-10: Wide Residual Network of [53] with $28$ layers, a widening factor of $10$, weight decay of $5\times 10^{-4}$ and zero dropout.
All networks are trained with stochastic gradient descent (SGD), Nesterov’s momentum of $0.9$ and minibatch size of $128$.
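The CIFAR data augmentation described in this section (mirror flip with probability $0.5$, then a random $32\times 32$ crop after padding by $4$ pixels) can be sketched as follows. The function name and the use of zero padding are assumptions; the padding mode is not specified in the text.

```python
import numpy as np

def augment(image, rng, pad=4, crop=32, flip_prob=0.5):
    """Random horizontal flip and random crop of an H x W x C image in [0, 1]."""
    if rng.random() < flip_prob:
        image = image[:, ::-1, :]                 # mirror along the width axis
    # Assumption: zero padding on each side of the two spatial dimensions.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    return padded[top:top + crop, left:left + crop, :]

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))             # stand-in for a CIFAR image
augmented = augment(image, rng)
```

The output keeps the original $32\times 32$ size, so augmented images can be batched together with unmodified ones.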
5.2. Algorithms
The following algorithms are compared:
(1) CE: This is the standard cross-entropy loss $\ell_{\mathrm{CE}}$ defined in (1).
(2) PGD: This is the algorithm of [28]; the saddle-point problem (2) is solved with $8$ steps in the inner loop to compute the adversarial image.
(3) WPGD: This is the robust Wasserstein loss described in (6), where the inner loop of PGD searches for the adversarial image that maximizes the Wasserstein transport cost. The computational complexity of WPGD is the same as that of PGD. WPGD is evaluated with three different values of $p$: $1$, $2.5$ and $10$.
W$s$-10 represents the Wide ResNet architecture with $s$ layers. In order to test robustness, 20-step PGD attacks are performed, starting from a random (uniformly sampled) position inside the $\ell_{\infty}$ ball around the test image $x$. All the WPGD experiments are run with the cost matrix provided by the WordNet metric [30].
5.3. Directional robustness of WPGD
Table 2 reports the main results of natural training (CE) and robust training (PGD) for CIFAR-10 and TinyImagenet. Sec. 5.4 reports a summary table with quantitative results on directional robustness, while Fig. 10 shows the tradeoff arising from WPGD training: as $p$ increases, fine-grained classification is better preserved. In addition to standard accuracy, the adversarial robustness of PGD and WPGD is compared. Fig. 11 shows that WPGD-trained networks with a strong metric tend to be more robust between visually distant classes, which supports our claims. For the sake of clarity, only results for CIFAR-10 and TinyImagenet with W16-10 are reported. Interestingly, WPGD is less robust than PGD for the classes bird and airplane: thus, imposing a metric, even if it is only approximately correct, seems to help obtain more visually meaningful errors.
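| CIFAR-10 | CE, W16-10 | CE, W28-10 | PGD, W16-10 | PGD, W28-10 |
| --- | --- | --- | --- | --- |
| NE [%] | 4.4 | 3.9 | 14.11 | 13.9 |
| AE [%] | 100 | 100 | 34.5 | 31.25 |

| TinyImageNet | CE, W16-10 | CE, W28-10 | PGD, W16-10 | PGD, W28-10 |
| --- | --- | --- | --- | --- |
| NE [%] | 37.7 | 36.9 | 55.3 | 36.9 |
| AE [%] | 99.9 | 100 | 70.4 | 70.5 |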
5.4. Supplementary comparisons for CE, PGD and WPGD
Fig. 12 reports the curves for PGD and WPGD on CIFAR-10 and TinyImageNet. Moreover, Table 3 summarizes the weighted robustness score $S$, defined as:
$$S=\sum_{i,j=1}^{K} c_{ij}\, m_{ij}$$  (7)
where $M=\{m_{ij}\}_{i,j=1}^{K}$ is the adversarial confusion matrix and $C=\{c_{ij}\}_{i,j=1}^{K}$ is the metric of the given dataset. Attacks are computed by maximizing the loss in Eq. (6), i.e., considering the worst-case scenario in which the attacker knows the metric. This score gives more weight to errors that incur a high cost. To make the results legible, the zero reference is set to the PGD-trained model. As can be seen, increasing $p$ reduces the score $S$, which means that, on average, adversarial errors land on more similar classes.
| Model | AE [%] | $p$ | Dataset | $S$ |
| --- | --- | --- | --- | --- |
| W16-10 | 34.53 | 0.0 | CIFAR-10 | 0.14 |
| W16-10 | 34.62 | 1.0 | CIFAR-10 | 0.26 |
| W16-10 | 34.98 | 2.5 | CIFAR-10 | 0.34 |
| W16-10 | 39.76 | 10.0 | CIFAR-10 | 0.53 |
| W28-10 | 31.24 | 0.0 | CIFAR-10 | 0.00 |
| W16-10 | 70.23 | 1.0 | TinyImageNet | 6.33 |
| W16-10 | 73.61 | 2.5 | TinyImageNet | 12.45 |
| W16-10 | 92.62 | 10.0 | TinyImageNet | 55.17 |
| W28-10 | 69.84 | 1.0 | TinyImageNet | 9.62 |
| W28-10 | 69.69 | 0.0 | TinyImageNet | 9.48 |
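The weighted robustness score of Eq. (7) is an element-wise product of the cost matrix and the adversarial confusion matrix, summed over all class pairs. A minimal sketch is given below. The function names are hypothetical, the normalization of the confusion matrix is left to the caller, and reading the "zero reference" as subtracting the score of a reference model is an assumption made here for illustration.

```python
import numpy as np

def weighted_robustness_score(cost, confusion):
    """S = sum_{i,j} c_ij * m_ij for a K x K cost matrix and confusion matrix."""
    cost = np.asarray(cost, dtype=float)
    confusion = np.asarray(confusion, dtype=float)
    if cost.shape != confusion.shape:
        raise ValueError("cost and confusion matrices must have the same shape")
    return float(np.sum(cost * confusion))

def relative_score(cost, confusion, confusion_ref):
    """Score relative to a reference model (assumed interpretation of the zero reference)."""
    return (weighted_robustness_score(cost, confusion)
            - weighted_robustness_score(cost, confusion_ref))

# Toy 3-class example with made-up numbers (not taken from the experiments).
C = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
M = [[0.8, 0.2, 0.0],
     [0.1, 0.7, 0.2],
     [0.3, 0.0, 0.7]]
S = weighted_robustness_score(C, M)
```

Because the diagonal of the metric is zero, correct predictions do not contribute to $S$; only errors do, each weighted by how distant the confused classes are.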
6. Related work
This work is related to [28, 48]. Although they give theoretical and practical results on the connection between robustness and accuracy for adversarial training, they do not analyze how the accuracy gap is distributed. They also argue that adversarial training requires extra capacity in order to build complex boundaries [22]. In contrast, [33] have recently argued that adversarial training leads to flatter decision boundaries and that, in fact, explicitly penalizing the curvature of the decision boundary is a good technique to train robust classifiers. The results in this paper corroborate these findings. As the experiments show, the accuracy gap of adversarially trained networks with respect to standard cross-entropy trained networks is explained very well by the network misclassifying pairs of similar classes. Semantic metrics, e.g., those derived from WordNet [30], have been a popular way to introduce a new data modality into standard supervised learning and aid visual classification [9, 10]. This paper identifies the inherent visual metric that the network induces while being trained with the cross-entropy loss or the adversarial loss. Lastly, using an optimal transport formulation to impose a metric on the label space of deep networks bears close resemblance to the work of [15]. That work uses the Wasserstein loss, computed with the Sinkhorn-Knopp iteration, to predict multi-label images. The present paper is the first to use the optimal transport formulation to induce cost-sensitive adversarial training of deep networks. Further, for single-label images, it is shown that the optimal transport problem has a closed-form solution, which makes it computationally equivalent to the cross-entropy loss; this simple but powerful property may be of independent interest for problems like hierarchical classification [17, 51, 4].
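The closed-form property for single-label targets can be illustrated with a general fact about optimal transport: when the target distribution is one-hot on class $y$, the only transport plan with the required marginals moves all predicted mass to $y$, so the transport cost reduces to $\sum_k p_k\, c_{k,y}$ and no iterative solver is needed. The sketch below shows this under stated assumptions. The function names are hypothetical and it is not claimed to reproduce the exact loss of Eq. (6), in particular how the exponent $p$ enters.

```python
import numpy as np

def single_label_transport_cost(probs, y, cost):
    """Transport cost between predicted probabilities and a one-hot target on class y."""
    probs = np.asarray(probs, dtype=float)
    cost = np.asarray(cost, dtype=float)
    return float(probs @ cost[:, y])     # sum_k p_k * c_{k,y}

def single_label_plan(probs, y):
    """The unique feasible plan: every class k sends its whole mass p_k to class y."""
    probs = np.asarray(probs, dtype=float)
    plan = np.zeros((probs.size, probs.size))
    plan[:, y] = probs
    return plan

# Toy example with made-up numbers.
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
probs = np.array([0.7, 0.2, 0.1])
loss = single_label_transport_cost(probs, 0, cost)
plan = single_label_plan(probs, 0)
```

The cost is a single dot product over the $K$ classes, which is the same order of work as evaluating the cross-entropy loss on the same softmax output.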
7. Conclusions and future work
While the literature on adversarial training is flourishing, in-depth studies of its implications and sensitivity in common real-world applications are still lacking. In particular, this paper focused on applications that are cost-sensitive or whose datasets are unbalanced. Moreover, due to an intrinsic trade-off between robustness and accuracy, it is of paramount importance to be able to govern this trade-off when designing and implementing machine learning and deep learning-based applications where a certain amount of accuracy is required. In light of this, the present paper made several advances toward a better understanding of robustness on the one hand, and toward the ability to control it semantically on the other.
In particular, this paper identified that the accuracy gap in adversarial training comes from the loss of fine-grained classification capabilities in neural networks. This observation motivates the optimal transport formulation: a metric on the label space that measures the distance to the boundary for standard cross-entropy training or, often equivalently, a semantic metric obtained from external data modalities such as WordNet, reduces the search space and makes it easier to discover, and fix, these classes during adversarial training, resulting in an improvement of accuracy at the cost of (directional) robustness. It is conceivable that, although a high-dimensional classifier may always remain vulnerable to adversarial perturbations, it is possible to build robust, real-world systems by incorporating such diverse data. Thus, this work is a first step toward principled robust training for real-world applications involving artificial intelligence and deep learning.
Future work will study methodologies or heuristics to systematically control the robustness-accuracy trade-off without the need for hyperparameter tuning of $\epsilon$. Another direction for future research is the application of the WPGD approach to other problems, such as fraud detection and predictive maintenance.
Acknowledgments
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research and Amazon Web Services for donating research credits.
References
 [1] Tinyimagenet. https://tinyimagenet.herokuapp.com/.
 [2] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
 [3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv:1802.00420, 2018.
 [4] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving imagenet classification through label progression. arXiv:1805.02641, 2018.
 [5] Aykut Beke and Tufan Kumbasar. Learning with type-2 fuzzy activation functions to improve the performance of deep neural networks. Engineering Applications of Artificial Intelligence, 85:372–384, 2019.
 [6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [7] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
 [8] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [9] Jia Deng, Alexander C. Berg, Kai Li, and Li Fei-Fei. What does classifying more than 10,000 image categories tell us? Lecture Notes in Computer Science, pages 71–84, 2010.
 [10] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. Lecture Notes in Computer Science, pages 48–64, 2014.
 [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [12] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24, 2019.
 [13] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv:1802.08686, 2018.
 [14] Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers robustness to adversarial perturbations. Machine Learning, 107(3):481–508, 2018.
 [15] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
 [16] Zehai Gao, Cunbao Ma, Yige Luo, and Zhiyue Liu. Ima health state evaluation using deep feature learning with quantum neural network. Engineering Applications of Artificial Intelligence, 76:119–129, 2018.
 [17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. November 2013.
 [18] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572, 2014.
 [19] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321–1330. JMLR.org, 2017.
 [20] Olakunle Ibitoye, Omair Shafiq, and Ashraf Matrawy. Analyzing adversarial attacks against deep learning for intrusion detection in IoT networks. arXiv preprint arXiv:1905.05137, 2019.
 [21] Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
 [22] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
 [23] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Computer Science, University of Toronto, 2009.
 [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [26] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [28] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083, 2017.
 [29] Marco Maggipinto, Matteo Terzi, Chiara Masiero, Alessandro Beghi, and Gian Antonio Susto. A computer vision-inspired deep learning architecture for virtual metrology modeling with 2-dimensional data. IEEE Transactions on Semiconductor Manufacturing, 31(3):376–384, 2018.
 [30] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
 [31] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
 [32] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. arXiv:1610.08401, 2017.
 [33] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. November 2018.
 [34] Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019.
 [35] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv:1605.07277, 2016.
 [36] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv:1602.02697, 2016.
 [37] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv:1803.00567, 2018.
 [38] Adnan Qayyum, Muhammad Usama, Junaid Qadir, and Ala Al-Fuqaha. Securing connected & autonomous vehicles: Challenges posed by adversarial machine learning and the way forward. arXiv preprint arXiv:1905.12762, 2019.
 [39] Xing Qi. Rotor resistance and excitation inductance estimation of an induction motor using deep-Q-learning algorithm. Engineering Applications of Artificial Intelligence, 72:67–79, 2018.
 [40] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning, 2018.
 [41] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 2015.
 [42] Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537–11546, 2019.
 [43] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. arXiv:1804.11285, 2018.
 [44] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. 2018.
 [45] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
 [46] Gian Antonio Susto, Andrea Schirru, Simone Pampuri, Seán McLoone, and Alessandro Beghi. Machine learning for predictive maintenance: A multiple classifier approach. IEEE Transactions on Industrial Informatics, 11(3):812–820, 2014.
 [47] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
 [48] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. May 2018.
 [49] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits). arXiv:1805.12152, 2018.
 [50] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
 [51] Cinna Wu, Mark Tygert, and Yann LeCun. Hierarchical loss for classification. arXiv:1709.01062, 2017.
 [52] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
 [53] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.