Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks

  • 2019-04-12 16:05:21
  • David J. Miller, Zhen Xiang, George Kesidis
  • 2


With the wide deployment of machine learning (ML) based systems for a varietyof applications including medical, military, automotive, genomic, as well asmultimedia and social networking, there is great potential for damage fromadversarial learning (AL) attacks. In this paper, we provide a contemporarysurvey of AL, focused particularly on defenses against attacks on statisticalclassifiers. After introducing relevant terminology and the goals and range ofpossible knowledge of both attackers and defenders, we survey recent work ontest-time evasion (TTE), data poisoning (DP), and reverse engineering (RE)attacks and particularly defenses against same. In so doing, we distinguishrobust classification from anomaly detection (AD), unsupervised fromsupervised, and statistical hypothesis-based defenses from ones that do nothave an explicit null (no attack) hypothesis; we identify the hyperparameters aparticular method requires, its computational complexity, as well as theperformance measures on which it was evaluated and the obtained quality. Wethen dig deeper, providing novel insights that challenge conventional AL wisdomand that target unresolved issues, including: 1) robust classification versusAD as a defense strategy; 2) the belief that attack success increases withattack strength, which ignores susceptibility to AD; 3) small perturbations fortest-time evasion attacks: a fallacy or a requirement?; 4) validity of theuniversal assumption that a TTE attacker knows the ground-truth class for theexample to be attacked; 5) black, grey, or white box attacks as the standardfor defense evaluation; 6) susceptibility of query-based RE to an AD defense.We then present benchmark comparisons of several defenses against TTE, RE, andbackdoor DP attacks on images. The paper concludes with a discussion of futurework.


Quick Read (beta)

Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacksthanks: The authors would like to acknowledge research contributions of their student Yujia Wang, which were helpful in the development of this paper. This research was supported in part by an AFOSR DDDAS grant, a Cisco Systems URP gift, and an AWS credits gift.

David J. Miller, Zhen Xiang and George Kesidis The authors are with the School of EECS, Pennsylvania State University, University Park, PA, 16803, USA email: {djm25,gik2}

With the wide deployment of machine learning (ML) based systems for a variety of applications including medical, military, automotive, genomic, as well as multimedia and social networking, there is great potential for damage from adversarial learning (AL) attacks. In this paper, we provide a contemporary survey of AL, focused particularly on defenses against attacks on statistical classifiers. After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same. In so doing, we distinguish robust classification from anomaly detection (AD), unsupervised from supervised, and statistical hypothesis-based defenses from ones that do not have an explicit null (no attack) hypothesis; we identify the hyperparameters a particular method requires, its computational complexity, as well as the performance measures on which it was evaluated and the obtained quality. We then dig deeper, providing novel insights that challenge conventional AL wisdom and that target unresolved issues, including: 1) robust classification versus AD as a defense strategy; 2) the belief that attack success increases with attack strength, which ignores susceptibility to AD; 3) small perturbations for test-time evasion attacks: a fallacy or a requirement?; 4) validity of the universal assumption that a TTE attacker knows the ground-truth class for the example to be attacked; 5) black, grey, or white box attacks as the standard for defense evaluation; 6) susceptibility of query-based RE to an AD defense. We then present benchmark comparisons of several defenses against TTE, RE, and backdoor DP attacks on images. The paper concludes with a discussion of continuing research directions, including the supreme challenge of detecting attacks whose goal is not to alter classification decisions, but rather simply to embed, without detection, “fake news” or other false content.

test-time-evasion, data poisoning, reverse engineering, deep neural networks, anomaly detection, robust classification, black box, white box, targeted attacks, transferability

I Introduction

Machine learning (ML) based systems have found broad applications ranging from military, industrial, medical, multimedia/Web, and scientific (including genomics) to even the political, social science, and legal arenas. As their integration into the modern world’s infrastructure continues to grow, they become ever more enticing targets for adversaries, including individual hackers, criminal organizations, as well as government intelligence services, which may seek to “break” them. Thus, adversarial learning (AL), the problem of devising attacks against ML systems as well as defenses against such attacks [3],[9], has become a popular topic over the past decade. We exclude from consideration here methods such as Generative Adversarial Networks (GANs) [24], where the “adversary” is just a component of a game-theoretic machine learning framework, under complete control of the ML designer – here the focus is on adversaries which seek to disrupt operation of the ML system in some way. Researchers have devised attacks against various types of ML systems, including supervised classifiers and regression/prediction [68], as well as unsupervised clustering and density estimation [8]. The main focus of this paper is on attacks against supervised statistical classifiers.

Brief Review of Statistical Classification: A classifier is a function C(x¯)𝒦{ω1,,ωK} that acts on an input, fixed-dimensional feature vector x¯, whose entries could be numerical, categorical, or ordinal, and maps the feature vector to one of K categories defined for the given application domain. For example, for the MNIST image data set [40], the ten classes are the digits ω1=‘0’ through ω10=‘9’. The feature vector (sometimes also called the “pattern”) could be raw (e.g. a speech waveform or a scanned digital image) or it could be derived, based on some front-end feature extraction (e.g., linear prediction or cepstral coefficients for speech [62], a bag-of-words representation for a text document, e.g. [57], or principal component representation of an image). Often, the classification mapping is implemented by a “winner-take-all” (WTA) rule [19] applied to a collection of (learned) discriminant functions, one per class. WTAs are used both by neural network based classifiers, including deep neural networks (DNNs), as well as linear and generalized linear classifiers (e.g., support vector machines (SVMs) [72]). Another widely used classifier structure is a decision tree [10]. For a review of various classification methods, see [19]. The classification rule divides the feature space into mutually exclusive and collectively exhaustive regions, one per class. Neighboring classes are those which share a decision boundary (the boundary between two classes, if it exists, is the locus of points for which the two classes have the same (largest) discriminant function value). The classifier is trained, based on an available labeled training set


to minimize a (supervised) loss function measured over the training samples. The goal is to learn a decision rule that generalizes to make accurate decisions on test samples, unseen during training, drawn according to the true joint density function fX¯,C(), which is generally unknown. The ultimate attainable performance for a given classification problem is the Bayes error rate, which depends on this (unknown) joint density [19]. The classifier’s generalization error is often estimated using a separate, finite labeled test set (held out from training). Often, the labeled data resource is split into exclusive training and test subsets. Cross validation methods are also used to estimate generalization error. One may also take some of the training data and use it as a validation set for choosing hyperparameters of the classifier that cannot be easily chosen as part of the standard classifier learning algorithm. For example, for an SVM, parameters of the kernel (even the choice of kernel) and the degree of slackness on margin violations are hyperparameters that may need to be chosen. For a deep neural network, the number of layers, the sizes of the layers, the choice of activation functions, the stopping rule for learning, and the neuron “dropout” rate [66] could all be considered hyperparameters.

Terminology for and Types of Adversarial Learning Attacks: Early seminal work introduced useful terminology to distinguish various types of AL attacks and their distinct objectives [3]. Causative attacks alter the learned classification model. Exploratory attacks do not alter the learned model but seek to learn information either about the classifier/decision rule or about the data samples on which it was trained (which may contain private or sensitive information). Targeted attacks focus on ensuring the classifier assigns either a particular subset of data samples (even a single sample) or a particular region of feature space to a chosen (target) class. Indiscriminate attacks, on the other hand, simply seek to induce decision changes without ensuring a particular target class is assigned. Moreover, availability attacks seek to make a classifier unusable by degrading its accuracy to an unacceptably low level. There are three fundamental types of AL attacks on classifiers, which we next describe.

Data Poisoning (DP): These are causative attacks wherein the attacker has the ability to introduce “poisoned” samples into the training (and/or validation and/or test) sets. The poisoned samples could either be joint feature density “typical” examples from the domain but which are mislabeled (either in a targeted or indiscriminate fashion) or examples that are not typical of the domain. For example, if the image domain is categories of vehicles, a bird image would be an atypical example.

Until recently, most DP attacks were availability attacks, seeking to degrade the learned classifier’s accuracy, e.g. [31],[75]. However, there is also strong recent interest in backdoor DP attacks [16],[70],[43], i.e., targeted attacks which seek to have the classifier learn a “backdoor” pattern embedded into the poisoned samples. The backdoor pattern could be an imperceptible (random) watermark-like pattern or something perceptible but innocuous – e.g., the presence of glasses on a face [16]. Backdoor attacks seek for the classifier to learn to classify to the target class when the backdoor pattern is present but otherwise not to alter its decisionmaking – degradation in accuracy in the absence of the backdoor pattern could destroy the classifier’s usability and could also be the basis for detecting that the classifier has been DP-attacked.

In some works, it is assumed that initially there is a clean (free of poisoned samples) training set, but that it is subsequently altered by additional data collection, by on-line learning, or by the actions of an adversarial insider. Under this scenario, for availability DP attacks, one can detect poisoned samples by discerning that their use in learning degrades classification accuracy [56] relative to just use of the clean data. Also, for backdoor attacks, one may be able to detect poisoned samples as outliers relative to the class’s density learned just using known clean samples. A more challenging DP scenario for attack detection is the embedded scenario, where one cannot assume the training data set is initially clean and where there is no available means (time stamps, data provenance, etc.) for identifying a subset of samples guaranteed to be free of poisoning.

Reverse Engineering: These are exploratory attacks that involve querying (probing) a classifier either to learn its decision rule or to learn something about the data set on which it was trained (an attack on data privacy). Querying creates a training set for the attacker, allowing him to learn a surrogate of the true classifier. Several motivations have been given for reverse-engineering a classifier’s decision rule. In [69], the goal was to avoid having to pay for use of an on-line ML service that makes decisions, given a (paying) user’s queries. A more concerning attacker objective is to learn a good surrogate classifier so that the attacker can identify the classifier’s “vulnerabilities” – in particular, knowledge of the classifier allows the attacker to craft test-time evasion attacks, described next.

Test-Time Evasion: This fundamental attack on classifiers is one wherein test (operational) examples are perturbed, either human-imperceptibly and/or so that they are not easily machine-detected, but in such a way that the classifier’s decision is changed (and now disagrees with a consensus human decision), e.g. [5, 68, 59, 23, 13]. TTE attacks may be either targeted or indiscriminate, although, as with backdoor attacks, they are most strategic if they are targeted, perturbing samples from a source category cs so that they are classified to a particular target (destination) category, cd. In order to ensure such an attack is successful, the attacker needs to know the true class label of the pattern he will attack. Crafting TTE attacks involves perturbing a source pattern until it moves from the decision region for one category across the decision boundary into the region for another (e.g., targeted) category. Such attacks, creating adversarial examples [68], may e.g. cause an autonomous vehicle to fail to recognize a road sign, cause an automated system to falsely target a civilian vehicle, or grant image or audio authentication access to a building, machine, or to restricted information. TTE attacks may be on physical objects in the real world (e.g., altering a road sign, camouflaging a vehicle, or darkening a room to make it more difficult to identify objects of interest). Alternatively, they may be alterations of data objects that have either already been digitally captured (digital images, voice files) or those which are natively digital, e.g., emails, documents, or computer programs. If a TTE attack is enabled by an RE attack, which learns a surrogate classifier (rather than assuming the TTE attacker possesses perfect knowledge of the actual classifier), an important property is transferability of the TTE attack [58]: does a perturbation of a pattern that induces a targeted or indiscriminate change in the surrogate classifier’s decision also induce such a change in the actual classifier’s decision?

Knowledge and Assumptions of the Attacker and the Defender: In publications involving the aforementioned attack types, various assumptions, either explicit or implicit, are made about the knowledge and resources of the attacker as well as those of the defender, i.e. the system consisting of the classifier plus any defense which it mounts to defeat potential attacks. [6] has advocated making the assumptions and goals of the attacker explicit in AL work. We also advocate making explicit the defender’s goals and assumptions. Again, some relatively standard, useful terminology has been introduced [3]. First, an attacker or defender may be either proactive or reactive. A proactive defense is one mounted without specific knowledge of a potential attacker’s strategies, objectives, or knowledge. The defender’s challenge here is formidable, since he is seeking to protect the classifier against any and all potential attacks, but not necessarily insurmountable. Against TTE attacks, proactive defenses include robust classification strategies, which seek to correctly classify attacked patterns, as well as anomaly detectors, relying on attacked patterns exhibiting detectable atypicality relative to clean patterns. Similarly, robust classification and detection have been proposed as proactive strategies against DP attacks. A reactive attack is one which exploits knowledge of a mounted defense. Quite a few papers have shown that reactive attacks can defeat proactive defenses by specifically targeting them. For example, after the robust classification approach known as “distillation” was published in [60], [13] devised a TTE attacker which, with full knowledge of this classifier, crafts successful TTE attack examples, ones which cause targeted changes to decisions made by the classifier in [60]. [77] likewise devised an attack which defeats the robust classification defense from [71]. In both these examples, the setting is implicitly game-theoretic, with the (sequential) advantage to the attacker. Presumably, with knowledge of the attack, the defender could alter his/her strategy to defeat it. For example, [48] developed a supervised detector of TTE attacks. Unfortunately, while successful in detecting assumed known attacks, this method fares poorly if the attacker uses an unknown strategy [48, 12].

A wide range of assumptions about the attacker’s knowledge have been made in the AL literature. Early DP work [45] targeting naive Bayes email-spam filters showed that: i) TTE attacks could be mounted by adding “good” words, typically appearing in ham (non-spam) emails, to spam emails; ii) likewise, effective DP attacks are created by sending “ham-like” emails from a black-listed IP address, whose emails are treated as labeled spam instances and used to on-line refine (in this case, corrupt) the naive Bayes spam filter. Alternatively, the attacker can send emails that get the spam model to focus on specific “bad” words. Subsequent spam emails then omit these words, which causes the spam filter to misclassify them as ham. This is known as a red herring attack.

These attacks merely assumed some knowledge of “good” and “bad” words. Subsequent works have assumed much greater attacker knowledge, as well as capabilities and resources. In particular, for both DPs and TTEs, the attacker may possess knowledge of: i) the feature space; ii) the classifier type (e.g., SVM or DNN); iii) the classifier’s learning algorithm; iv) any hyperparameters associated with classifier training; v) the training set used by the classifier. If the attacker knows all of the above, in the TTE case he has perfect knowledge of the actual classifier and a rich resource of legitimate supervised examples from the domain. In the DP case, the attacker can formulate an optimization problem seeking to maximally degrade accuracy with the fewest possible poisoned samples. In [44], it is even assumed that classifier learning is outsourced to a third party who is also the DP attacker. In this highly vulnerable scenario, the attacker has great facility either to degrade accuracy in a controlled way or to embed backdoors into the learned classifier.

Full knowledge of the classifier has been referred to as a white box assumption. At the other extreme, no knowledge of the classifier (but some ability to query it) is a black box assumption. A grey box attack is somewhere in between – in [12], in their grey box scenario, the authors make assumptions i) - iv) above but, rather than possessing the full training set, the TTE attacker’s surrogate classifier is trained using other data (still representative of the domain, but perhaps smaller than the training set possessed by the learner/defender). Arguing that security by obscurity is unreliable [9], it is plausible an attacker may possess full (white box) knowledge of the classifier. However, we will argue in Section IV that it seems less plausible the attacker will also possess full knowledge of any detector deployed along with the classifier.

Some attack works also make implicit assumptions. [59] generates TTE attacks both with noticeable artifacts (on the MNIST image digits domain) and which are in some cases class-ambiguous to a human being (see Figure 4). These attacks are deemed successful in [59] but [54] has shown that a very simple detector has strong capability to detect this attack – [59] thus essentially assumes no attack detector is in play. Moreover, if there is substantial decision uncertainty, the classifier can use a rejection (or “don’t know”) option. This is not allowed or considered in [59]. Rejection is an essential option in contexts where the problem is not classification per se, but rather authentication (e.g., for accessing a building or restricted information). Many TTE attack works also do not consider existing, conventional defenses that may be deployed. For example, for a “man-in-the-middle” scenario where the pattern is transmitted to the classifier but intercepted on the way by the TTE attacker, use of standard encryption techniques would defeat the attack [38]. Likewise, in [11], a TTE attack on voice systems could potentially be defeated by SIRI’s existing built-in speaker-recognition system11 1 Though a speaker-recognition system can be overcome by existing voice cloning technology (especially under a white box scenario where the attacker has samples of the authorized party’s voice), standard techniques of limited privilege can be applied. For example, Siri cannot enter data into Safari, and even if it could, there is standard two-factor authentication to prevent unauthorized access to a private web site. Operationally, such security measures may require significant overhead and could be instituted only in response to detected attacks, cf. Section III-B where defenses against TTE attacks are discussed in detail..

Another important characterization of an attack is its strength. For TTEs, attack strength is quantified by some measure of the amount of perturbation from the original, clean pattern. For general DP attacks, attack strength could be the fraction of the training set that is poisoned. For targeted backdoor attacks, it could be the fraction of samples labeled to the target class that are poisoned. Moreover, if a watermark-like pattern is embedded, the signal-to-noise ratio (treating the backdoor pattern as noise) is a second attack strength measure. While one might expect that the attacker’s success rate will increase with its strength, a stronger attack may also be more detectable.

Performance Measures: For general DP attacks, the attack success can be quantified by the amount of induced degradation in the classifier’s (test set) correct decision rate. For backdoor attacks it can be quantified by the rate of classifying backdoor patterns to the target class and by any increased degradation in the classifier’s accuracy. However, if a DP attack detector is in play, then the true positive rate (TPR) and false positive rate (FPR) of the detector are of great interest. All of these quantities could vary as a function of attack strength.

For TTE attacks, there are multiple, vital performance criteria, some of which have not been consistently considered in past works. First, there is the attack success rate in inducing (possibly targeted) decision changes and in avoiding detection, if a detector is deployed. If a robust classifier is used, there is the correct decision rate of the classifier in the presence of TTE attacks. There is also the correct decision rate of the classifier in the absence of TTE attacks (A robust classifier may degrade accuracy in the absence of an attack [71, 42]. Also, a detector may reject some attack samples, as well as some non-attacked test samples, which can also affect the classifier’s accuracy.). If a detector is deployed, there is the true attack detection rate (TPR) and false positive rate (FPR), with the receiver-operating characteristic area-under-the-curve (ROC AUC) a summary measure. All of these quantities may vary as a function of attack strength.

Finally, there is the computational workload of the attacker in creating DP and TTE attacks, and of the robust classifier and/or detector, in defending against them.

Classification Domains: Most AL work has focused on image classification. Two comments are in order here. First, even though they have been applied to images, some attacks, as well as some defenses, are actually more general-purpose, applicable e.g. whenever data samples are represented as vectors of continuous-valued features. For example, methods applicable to images may also be applicable to speech waveforms (or associated derived feature vectors), to gene microarray expression profiles, and even to text documents (e.g., if documents are represented as continuous-valued feature vectors via latent semantic indexing or principal component analysis). Thus, to emphasize the potential general applicability of a particular attack or defense, we will refer to its operation on a pattern (feature vector), rather than on an image. Second, we will also discuss some works that focus on domains other than images.

The rest of the paper is organized as follows. In Section II, we review AL attacks. In Section III, we review recent work on AL defenses. In Section IV, we offer some deeper perspectives, particularly regarding TTE attacks. In Section V, we provide some experimental studies. Finally, in Section VI, we discuss future work.

II Review of AL Attacks

II-A TTE Attacks

Early work considered the spam email domain and relatively simple classifiers such as naive Bayes and maximum-entropy classifiers, using “good word” attacks [45]. Subsequently, considering a two-class problem, [5] proposed a constrained objective function for the TTE attacker, seeking (in the two-class case) to minimize the discriminant function output while constraining the amount of perturbation from the original, clean pattern. They demonstrated their approach for SVM classification of images and PDF files and also showed how it could be extended to neural network classifiers. [68] considered deep neural networks, posed a related attacker optimization objective, seeking imperceptible image perturbations that induce misclassifications, showed that this approach is highly successful in creating adversarial examples starting from any (or at any rate, most) legitimate images, and also showed that these adversarial examples seem to transfer well (remaining as adversarial examples for other networks with different architectures and for networks trained on disjoint training sets). There are much older optimization approaches related to [5] and [68] for finding decision boundaries in neural networks, e.g. [17], but which were not specifically applied to create adversarial test-time attacks.

A related method, proposed in [23], the fast gradient sign method (FGSM), was motivated primarily by the goal of creating adversarial examples that could be added to the learner’s training set (data augmentation) and thus used to increase robustness of the learned classifier. However, FGSM has also been frequently used as a TTE attack in works assessing TTE defenses.

While [5, 68, 23], are “global” methods, potentially altering all of a pattern’s features (e.g., all image pixels’ values) but while constraining the overall distortion of the pattern perturbation, [59] proposed a TTE attack that restricts the number of modified features (pixels). However, this requires large changes on the modified features in order to induce a change to the classifier’s decision. [59] cycles over features (pixels), one at a time, at each step modifying the pixel estimated to effect the most progress toward the classifier’s decision boundary. This procedure is applied either until a successful decision change is induced or until the modified pixel budget is reached.

An improvement over prior TTE attack works is achieved by “CW” [13, 12]. Here, the attacker’s objective function, to be minimized in crafting perturbations, focuses in a discriminative fashion [33] on two classes: the target class (t) and the class other than the target class with the largest discriminant function value, which could be the true class of the original, clean pattern. Let Fj(x¯) denote class j’s discriminant function value for the original pattern, with the classifier using the WTA rule C(x¯)=argmaxjFj(x¯). Then, [13] essentially forms the attacker’s optimization problem as:

minx¯(||x¯-x¯||p+cmax{maxjtFj(x¯)-Ft(x¯),-k}). (1)

We make the following comments:

  1. 1.

    Minimizing (1) seeks the perturbed pattern that, while constrained in p-norm distance to x¯, maximizes the discriminant function difference between that of the target class and that of the nearest competitor class, i.e. it maximizes confidence in decisionmaking to class t. However, very high confidence misclassifications are not necessarily sought. Thus, the hyperparameter k controls the maximum allowed confidence in the decision. While the authors show in [13] that increasing k increases attack transferability, they do not suggest how to choose k in the case where there is a detector in play which is (black box) unknown to the attacker – it is unclear what level of decision “confidence” will be consistent with being detection-evasive.

  2. 2.

    The hyperparameter c controls the cost tradeoff between inducing a misclassification and the amount of pattern perturbation. If c is too small, the relative penalty on the size of perturbations is too large to allow decision alteration. If c is too large, misclassifications will be induced but the perturbations may be large (and hence perceptible and/or machine-detectable). This hyperparameter is chosen via a binary search procedure, seeking the smallest value inducing misclassifications.

  3. 3.

    The strongest version of the CW attack, observed experimentally [12],[54], uses the p=2 norm.

  4. 4.

    [54] also extends the attack to defeat both a classifier and an accompanying detector, i.e. they proposed a “fully” white box attack, exploiting full knowledge of both the classifier and detector. However, as will be discussed later, [12] showed that even with just full knowledge of the classifier (white box with respect to the classifier, black box with respect to the detector), their attack defeats many TTE defense schemes, including some AD defenses.

  5. 5.

    While [12] gives an extensive benchmark experimental study, they do not consistently report FPRs for the detection methods they evaluated – e.g., consider the evaluation of [20] in the ”black box” case in [12].

  6. 6.

    [12] also asserts that as a defender “one must also show that a adversary aware of the defense can not generate attacks that evade detection”. We refer to this as the strong white box requirement, and discuss this further in Section IV.

There is also intriguing work on universal TTE attacks on DNNs, i.e. a single perturbation vector that induces classification errors when added to most test patterns for a given domain [55]. This work is important because it is suggestive of the fragility of DNN-based classification. However, a single perturbation vector required to induce errors on nearly all test patterns must be “larger” than a customized perturbation designed to successfully misclassify a single (particular) test pattern. Thus, images perturbed by a universal attack should be more easily detected than those perturbed by the pattern-customized attacks in, e.g., [23, 59, 13, 54].

Although the previous works could in principle be applied to any classification domain working in some (pattern) space, they all focused experimentally on images. There has been work on other domains, including voice [11, 14] and music [36, 35]. [11] developed an iterative procedure, with a human attacker in the loop, to generate voice commands that are recognized by a speech recognition system and yet are not human-understandable. [36, 35] considered music genre classification. In [36], the authors showed that changes to the tempo degrade the accuracy of a DNN but do not affect a human being’s capability to classify the musical genre. This suggests that the DNN is not using the same “reasoning” as a human being in performing music classification. In [35], the authors modify the TTE attack approach in [68] to be suitable for the music domain. This would be straightforward, except that the classification is performed on individual “frames” that are temporally overlapping. Thus, in constructing a TTE attack for a given frame, one should ensure consistency with all frames with which the given frame overlaps. This problem is addressed in [35] using an iterative projection algorithm.

II-B DP Attacks

As noted previously, early work on DP attacks addressed spam email and did not require the attacker to know very much – just a dictionary of tokens (words), e.g. [3]. More recent availability DP attacks require greater knowledge of the system under attack. [7] showed that significant degradation in an SVM’s accuracy could be achieved with the addition of just one poisoned training sample – on MNIST the error rate increased from the 2-5% range to the 15-20% range. To achieve this, the attacker exploits knowledge of the training set, a validation set, the SVM learning algorithm, and its hyperparameters. The authors define the attacker’s objective function as classifier error rate on the validation set, as a function of the poisoned sample’s location (and class label). This loss is maximized under the constraint that the support vector, non-support vector subsets of the training set are not altered by addition of the poisoned sample. Thus, [7] adds a single, new (adversarially labeled) support vector to the SVM solution. The constraint is met by performing gradient descent carefully, with a small step size. One could expect that even more classifier degradation could be achieved if the support vectors were allowed to change through the addition of the poisoned sample. However, this would also entail a more complex optimization procedure. An example of a DP attack on SVMs is illustrated in Figure 1.

Fig. 1: Linear SVM classifier decision boundary for a two-class dataset with support vectors and classification margins indicated (left). Decision boundary is significantly impacted if just one training sample is changed, even when that sample’s class label does not change (right).

While SVMs (which rely on a “support vector” subset of the training set to define the linear discriminant function) are unsurprisingly fragile in the presence of availability DP attacks, there is little prior work investigating such attacks against DNNs. One reason may be computational complexity – e.g., one could in some way alternate gradient descent optimization in weight space ( minimizing the loss function, i.e. the defender/learner’s problem) and gradient ascent in pattern space (maximizing the loss function, i.e. the attacker’s problem), to find a set of poison patterns that maximally degrade the learned DNN’s accuracy. However, such a procedure would be quite computationally heavy. Moreover, DNNs should be less inherently fragile to data poisoning than SVMs –to degrade accuracy sufficiently a larger attack strength (fraction of poisoned samples) may be needed. For “big data domains”, with e.g. one million training samples, even 10% data poisoning means optimizing 100,000 poisoned sample locations and labels.

On the other hand, DNNs appear to be quite vulnerable to backdoor DP attacks, as demonstrated in several recent works [16, 43]. The backdoor pattern could be an imperceptible (e.g. random-looking) watermark-like pattern or something perceptible but innocuous – e.g., the presence of glasses on a face [16], a plausible object in the background in an image scene (such as a tree or a bird in the sky), or a noise-like audio background pattern in the case of speech classification. One very attractive aspect of these attacks is that they may require no knowledge of the classifier – the attacker simply needs i) legitimate examples from the domain, into which it embeds the backdoor pattern; ii) the ability to poison the training set, with these samples labeled to the target (backdoor) class. On the other hand, if the attacker possesses knowledge of the classifier, its training set, and its learning algorithm, he can optimize the backdoor pattern in order to ensure: 1) the backdoor is well-learned; 2) classifier accuracy is not compromised; and 3) 1) and 2) are accomplished with the least amount of attack strength (and hence the least attack-detectability). An attempt at such an approach is given in [43].

II-C RE Attacks

In [69], the authors consider black box ML services, offered by companies such as Google, where, for a given (presumably big data, big model) domain, a user pays for class decisions on individual samples (queries) submitted to the ML service. [69] demonstrates that, with a relatively modest number of queries (perhaps as many as ten thousand or more), one can learn a classifier that closely mimics the black box ML service decisions. Once the black box has been reverse-engineered, the attacker need no longer subscribe to the ML service. Perhaps more importantly, such reverse-engineering enables TTE attacks by providing knowledge of the classifier when it is not initially known. One weakness of [69] is that it neither considers very large (feature space) domains nor very large neural networks – orders of magnitude more queries may be needed to reverse-engineer a DNN on a large-scale domain. However, a more critical weakness of [69] which we will discuss in Section IV is that the queries in [69] are easily detected because they are random, i.e. they do not use any knowledge of training samples for the domain.

Recently, RE attacks based on more realistic queries have been proposed [58]. First, the adversary collects a small set of representative labeled samples from the domain as an initial training set, S0, and uses this to train an initial surrogate classifier. Then, there is stagewise data collection and retraining, over a sequence of stages. In each, the adversary augments the current training set by querying the classifier with the stage’s newly generated samples, i.e.,

Sk+1={x¯+λsgn((maxcPs(k)[C=c|x¯])):x¯Sk}Sk (2)

where k is the current stage index and Ps(k)[C=c|x¯] is the current surrogate class posterior model. The surrogate classifier is then retrained using the labeled Sk+1. Based on (2), each successive stage crafts samples closer to the classifier’s true boundary, which is helpful for RE learning (but which also makes these samples less class-representative and thus more detectable, as seen in Section V). Once a sufficiently accurate surrogate classifier is learned, a TTE attack can be launched using this classifier.

III Review of Recent Work on AL Defenses

In this section we review recent work on defenses against TTE, DP, and RE AL attacks.

III-A Robust Classification Approaches for TTEs

The early, dominant approach considered for defending against TTE attacks is robust classification, i.e. with the classifier trained in such a way that a pattern from class c that is perturbed by the attacker is still assigned to c by the classifier. Several different robust classification strategies and learning approaches have been developed. [18] modifies the support vector machine (SVM) training objective to ensure the learned weight vector is not sparse, i.e. it makes use of all the features. Thus, if the attacker corrupts some features, other (unperturbed) features will still contribute to decisionmaking. However, [18] may fail if only a few features are strongly class-discriminating. [18] also did not evaluate the accuracy of their approach in the absence of attacks. [42] performs blurring of test patterns (images) in order to destroy a possible attacker’s perturbations. This approach has some effectiveness, but it may reduce classifier accuracy in the absence of attacks. This method will be experimentally evaluated in Section V. It was also shown in [25] that some attacks are quite fragile and that even the process of image capture (involving cropping, etc.) may defeat the attack – image capture itself, assuming the attack is on a physical object, not on an already digital image, may give some robustness against adversaries.

[73] considers DNNs for digit recognition and malware detection. Quite different from [18], they randomly nullify (zero) input features, both during training and testing/inference. This may “eliminate” perturbed features in a test-time pattern. There are, however, several shortcomings here. First, for the malware domain, the features as defined in [73] are binary {0,1}. Thus, nullifying (zeroing) does not necessarily alter a feature’s original value (if it is already zero). By first recoding the binary features to {-1,1}, nullifying (zeroing) will always change the feature value. This may improve the performance of [73]. Second, there is a significant tradeoff in [73] between accuracy in correctly classifying attacked examples and accuracy of the classifier in the absence of attacks. As the nullification rate is increased, the frequency of defeating the attack increases, but accuracy in the absence of attack decreases. [60] on the other hand reported relatively small loss in accuracy in the absence of attacks, for their “distillation” defense strategy.

As another more recent example, [71] has proposed to learn the classifier to minimize an expected training loss function, where the expectation is over a distribution of bounded adversarial perturbations of the training patterns. This can also be seen as a type of training with “data augmentation” often used in learning neural networks to achieve solutions robust to common pattern variations (e.g., variation in object position, scale, and orientation for digital images). However, [77] has shown that even though the robust training in [71] can give guarantees of classifier robustness in the vicinity of the training patterns, this robustness is not guaranteed to generalize to the test set. While good robustness is achieved by [71] for simple image domains such as MNIST, the robustness to TTE attacks on the CIFAR-10 image domain [1] is much weaker, with accuracy degraded from 87% for clean patterns to less than 50% for perturbed patterns. [77] further demonstrates that applying a simple affine transformation to the image gray-scale values (which alters the image background and contrast while not affecting human labeling) coupled with standard TTE attacks is highly successful in defeating the robust classifier based on [71]. This fragility is reminiscent of the effect of altered tempo on a DNN trained for music genre classification [36], discussed earlier.

Some key limitations of the robust classification defense against TTEs are discussed in Section IV. Here we will just mention one important one: many TTE attacks (whether realistic or not) assume full knowledge of the classifier. Thus, even if a robust classification method can defeat TTE attacks that target a conventional (non-robust) classifier, TTE attacks on the robust classifier itself will often be successful, e.g. the distillation defense in [60] was not effective against an attack designed to defeat it [13]. Likewise, as mentioned above, [77] is a reactive defense tailored to defeat the classifier from [71]. These are instances of the attack-defense arms race mentioned earlier, where the (reactive) player with the most knowledge (in this case the attacker) has the advantage over the proactive player, with less knowledge (the defender).

III-B Anomaly Detection (AD) of TTEs

Over the past several years, there has been great interest in an alternative defense strategy against TTEs – anomaly detection of these attacks, with a number of published works. One motivation given for this paradigm shift, e.g. in [12], is that robust classification of attacked images is difficult, while detection is “easier”. An argument in support of this is that if one designs a good robust classifier, then one gets a good detector essentially “for free” – one may make detections when the robust classifier’s decision disagrees with a conventional (non-robust) classifier’s decision. Likewise, considering methods like [42] which “correct” an image X containing an attack, e.g. by blurring it, producing a new image X, one can build a single classifier and make an attack detection when C^(X)C^(X). However, such detection architectures, based on robust classification, are obviously not necessarily optimal. In particular, considering the two classifier approach, if there is significant classifier error rate, then there are two significant sources for classifier non-consensus: i) a successful TTE attack on the conventional classifier and its correction by the robust classifier, and ii) misclassification of a non-attacked pattern by one (or both) of the classifiers. In Section V, we will show that direct design of an AD can yield much better detection accuracy than a detector built around robust classification. Moreover, irrespective of whether detection is truly “easier”, this argument (that detection is essentially a subset of robust classification) is not made in many of the robust classification defense papers – it is simply assumed that the only objective of interest is to defeat the attack by correctly classifying in the face of it. Attack detection is not even considered in [18],[73],[60],[42],[76]. By contrast, in Section IV, we will give much stronger arguments for the intrinsic value in making detection inferences and will point out that, in some scenarios, when an attack is present, making robust classification inferences in fact has no utility. Moreover, we also argue that the attacker has a natural mechanism for learning a robust classifier (and then targeting it) – querying – whereas he may have no ability to query so as to learn an AD that may be in play. We defer more detailed arguments until Section IV.

Irrespective of the motivation given, there has been substantial recent work on detection of TTEs. Various approaches have been proposed, with a recent benchmark comparison study given in [12] evaluating a number of methods against the CW attack [13], which was demonstrated in [13],[12] to be more difficult to detect than earlier attack methods such as [23] and [59].

III-B1 Supervised Detection of TTEs

One strategy is to treat the detection problem as a supervised one, using labeled examples of “known” attacks. The resulting binary classifier (attack vs. no attack) can then be experimentally evaluated both on the known attacks and on unknown attacks. Examples of such systems include [27] and the supervised approach taken in [20]. However, [27] failed to detect the CW attack on the CIFAR-10 image domain [12]. [20] similarly proved unsuccessful in detecting CW on CIFAR-10 [12]. [42] also treated the problem as supervised, applying a multi-stage classifier, with each stage working on features derived from a deep layer of the trained DNN classifier. A detection is made unless all the stages decide the image is attack-free. [12] demonstrates this detector performs very poorly on CW applied both to the MNIST and CIFAR-10 domains. [48], which feeds a DNN classifier’s deep layers as features to a supervised detector, is more successful. The best results reported for this supervised method in detecting CW on CIFAR-10 were 81% true positive rate (TPR) at a false positive rate (FPR) of 28% [12]22 2 The FPR is the number of non-attack test patterns that are detected as attack divided by the number of non-attack test patterns. The TPR is the number of attack test patterns detected as attack divided by the number of attack test patterns.. Later we will demonstrate an unsupervised AD which significantly exceeds these results. Moreover, [48] reported that training on some attacks did not generalize well to other attacks (treated as unknown). This is an important limitation – in general the defender may be proactive, without detailed knowledge of an attacker’s perturbation objective/strategy.

A related supervised approach is [46], which quantizes deep layer DNN features, feeding the resulting codes as input to an SVM. The authors argue that this quantization makes their approach resilient even to fully white box attacks (where the attacker has full knowledge of both the classifier and the detector) since gradients of their discriminant function (needed to efficiently search for successful TTE attacks) are difficult to compute (or even zero almost everywhere). Again, this approach, being supervised, may not generalize well to unknown attacks.

III-B2 Unsupervised AD for TTEs Without an Explicit Null Model

Other approaches are unsupervised ADs, some of these based on explicit null hypothesis (no attack) statistical models for patterns, and others not forming an explicit null hypothesis. One crude approach is simply to reject if the maximum a posteriori class probability (produced by the classifier) is less than a given threshold. Use of such “confidence” was shown to be effective for detection of a classifier’s misclassified samples in [29], although effective detection of adversarial attacks may require more powerful detectors. Another non-parametric approach is based on principal component analysis (PCA) [30]. However, [12] found that, while successful on MNIST, this approach has essentially no power to distinguish attacks from non-attacks on CIFAR-10.

A more sophisticated, interesting non-parametric detector is [47]. Extending beyond PCA’s linear representation, [47] extracts nonlinear components via an auto-encoding neural network. [47] uses somewhat unconventional decisionmaking, combining classification and detection – images whose (auto-encoding based) reconstruction error is large (far from the image manifold) are detected as attacks. Images whose reconstruction error is small are sought to be correctly classified. However, it may be desirable to make detections even when the attack is subtle (see reasoning given in Section IV), with the attacked image lying close to the image manifold. A primary concern with this detector is that it requires setting a number of hyperparameters (specifying the auto-encoding network architecture, which in fact required different settings for MNIST and CIFAR-10, and softmax “temperature” variables). Setting hyperparameters in an unsupervised AD setting is extremely difficult. If labeled examples of an attack are available, one can set hyperparameters to maximize a labeled validation set measure. However, the attack is then no longer “unknown” and the detection method is actually supervised. [47] also incurs some degradation in classification accuracy in the absence of an attack – from 90.6% to 86.8% on CIFAR-10 [47].

III-B3 Unsupervised AD of TTEs based on an Explicit Null Hypothesis

A generic DNN-based detection plus classification system is shown in Figure 2, with classification performed if no attack is detected.

Fig. 2: Anomaly detection based on decision statistics derived from internal layers of a DNN. Classification of the image is performed if no anomaly (attack) is detected.

One such detection approach with an explicit null hypothesis is [4]. This approach computes the distance between a DNN layer’s class-conditional mean feature vector and the test image’s feature vector and then evaluates this, under the null hypothesis of no attack, using a Weibull distribution. A few limitations of [4] are that: i) it does not model the joint density of a deep layer – such a model would exploit more information than just the (scalar) distance; ii) [4] is not truly unsupervised – several hyperparameters are set by maximizing a validation measure that requires labeled examples of the attack. The experiments of Section V will show that a purely unsupervised AD [54] outperforms [4] even though we (optimistically) allow [4]’s detector design to use more than one hundred labeled examples of the attack to be detected.

More recently, [20] did propose a method based on a null hypothesis joint density model of a DNN’s deep layer feature vector. To our knowledge, theirs is the first such approach, upon which the detection methodology in [54] builds. [20] used a kernel density estimator to model the penultimate layer of a DNN. However, ultimately, they put forward a supervised method, learning to discriminate ‘attack’ from ‘no attack’, leveraging their density model to create the classifier’s input features. They ultimately settled on a supervised method because their unsupervised detector did not achieve very good results. In [12], it was found that the unsupervised AD in [20] grossly fails on CW attacks on CIFAR-10 – 80% of the time, attacked images had even higher likelihood under the null model in [20] than the original (unattacked) images. The unsupervised ‘anomaly detection of attacks’ (ADA) [54] method that builds on [20] is a much more effective detector, one which currently achieves state-of-the-art results. We next give a brief description of this approach.

Summary of the ADA Method: Consider a successful, targeted TTE attack example – one which was obtained by starting from a “clean” example x¯ from some source class cs (unknown to the defender) and then perturbing it until the DNN’s decision on this perturbed example (now x¯) is no longer cs, but is now cdcs (the “destination” class). The premise behind the detector in [20] is that feeding an attacked version of x¯, not x¯ itself, into the DNN and extracting, as the derived feature vector z¯, the vector of outputs from some internal layer of the DNN (e.g., the penultimate layer, right before the decision layer) will result in a feature vector z¯ that has atypically low likelihood under a learned null density model for z¯ (i.e., the estimated distribution for z¯ when no attack is present), conditioned on the DNN-predicted class cd=c. Denote these null densities, conditioned on each class c, by fZ¯|c(z¯|c). If the perturbation of x¯ is not very large (so that the attack is human-imperceptible and/or detection-evasive), one might also expect that z¯ will exhibit too much typicality (too high a likelihood) under some class other than c, i.e. under the source category, cs. This additional information is exploited in [54].

It does not matter that the source category is unknown to the defender. He can simply determine the best estimate of this category, e.g. as:


Accordingly, it was hypothesized that attack patterns will be both “too atypical” under c and “too typical” under c^s. This is illustrated in Figure 3.

Fig. 3: Illustrative attack examples in the plane with a linear decision boundary – such examples may have relatively high likelihood under the source class and low likelihood under the attack’s destination class.

While this may seem to entail an unwieldy detection strategy that requires use of two detection thresholds, a single, theoretically-grounded decision statistic that captures both requirements (jointly assessing the “atypicality” with respect to c and the “typicality” with respect to c^s) was proposed. Specifically, define a two-class posterior evaluated with respect to the (density-based) null model:


where p0 gives the proper normalization:


Likewise, define the corresponding two-class posterior evaluated via the DNN:

Q [Qc,Qc^s]
= [q0PDNN[C=c|x¯],q0PDNN[C=c^s|x¯]],



Both deviations (“too atypical” and “too typical”) are captured in the Kullback-Leibler divergence decision statistic:


i.e. a detection is declared when this statistic exceeds a preset threshold value, chosen to meet a targeted FPR.

We emphasize that the hypothesis that an attacked pattern will exhibit both “too high atypicality” (with respect to c) and “too high typicality” (with respect to c^s) is not based on any strong assumptions about the attacker. It merely assumes the perturbation is constrained to be small to be human-imperceptible and/or machine-detection evasive. This constraint, in conjunction with the attacked example’s misclassification by the classifier, may necessitate that the test pattern will exhibit unusually high likelihood for a category (cs) other than that decided by the DNN. It may also necessitate that the test pattern will exhibit unusually low likelihood under c (The perturbation is not large enough to make the perturbed image a typical pattern of c.).

The baseline ADA method was further improved by: i) modeling the joint density of a deep layer using highly suitable mixture density-based null hypothesis models (in particular, through log-normal mixture density modeling to match to the non-negative support for RELU layers); ii) exploiting multiple DNN layers by taking the maximum of the ADA statistic over each of the layers being evaluated; iii) accounting for uncertainty about the source class by computing an expected ADA statistic, based on the estimated probabilities for each of the classes being the source class; iv) exploiting the class confusion matrix, comprehensive low-order density modelling [61], and DNN weight information in constructing an ultimate, best-performing, ADA statistic [54] which demonstrated state of the art results against published attacks, including CW [13]. [54] directly compared against several benchmark detection methods and also against the results reported in [12]. Also evaluated were some performance measures of great interest that are sometimes not assessed in prior works, including multiple performance measures as a function of attack strength and classification accuracy (in the absence of attacks) versus the false positive rate of attack detection. Section V gives an evaluation of ADA against other detectors for several classification domains.

III-C Defenses Against Classifier-Degrading DP Attacks

The dominant objective in defending against embedded DP attacks on classifiers is some form of robust learning, aiming to preserve (generalization) accuracy in the presence of training set poisoning. Two common such training strategies are as follows: i) performing some type of training set sanitization, followed by standard classifier learning. Sanitization could consist of detection and removal of outlier samples. Alternatively, it could involve feature selection or compaction (e.g., principal component analysis, followed by removal of low energy coefficients), intended to neuter data poisoning by removing minor signal components/features, presumed to be primarily used by the attacker; ii) robustly modifying the classifier training objective function, accounting both for outliers as well as inliers that are adversarially mislabeled. As an example of ii), [75] accounts for both random as well as adversarial mislabeling for kernel-based SVMs by replacing the dual SVM objective function, which depends on the training labels, with an expected dual function, based on the probability of mislabeling. While this provides robustness benefits, it is not surprising that similar benefits are also obtained in [75] by reducing the SVM (slackness) penalty on margin violations. This allows more data points to contribute to the estimation of the SVM’s weight vector as support vectors, which provides inherent robustness. Another issue is that the solution in [75] relies on knowledge of the mislabeling probability, i.e. a (generally unknown) hyperparameter. While one could choose this parameter in a conservative, worst-case fashion (assuming a high mislabeling probability), this will compromise accuracy in the absence of a DP attack.

[67] considers the first strategy, performing (nearest neighbor based) outlier detection followed by SVM optimization. Under certain assumptions (in particular, that removal of non-attack outliers will not have a strong impact on estimation of the classifier), they are able to obtain a (domain-dependent) upper bound on the worst case loss in accuracy from a DP attack. Subsequently, in [37], the same authors considered reactive attacks against the robust classifier defense considered in their earlier work [67]. This latter work pessimistically reports that DP attacks with knowledge of the sanitization defense can evade the defense and significantly degrade classification accuracy. In particular, the attack may relatively densely concentrate poisoned samples in close proximity, so that they do not resemble outliers.

One crucial assumption in [37] is that the attacker has full knowledge not only of the training set but also the test set, i.e. the set of samples on which the authors evaluate generalization loss. The attacker’s objective function, to be maximized with respect to the chosen poisoned training samples, is the classification loss measured on this test set. Thus, the same test set is used by the attacker in crafting his attack and by the authors to evaluate the degradation in accuracy induced by this attack. This appears to be highly biased – an independent test set, not used by the attacker, is needed to evaluate the true loss in accuracy associated with this attack. This is true in general in machine learning – any ML design should be evaluated on an independent test set, one not used for training. In this case, the “design” is that of the attacker. Note that merely knowing the training set, the attacker already possesses an advantage over the defender (since he knows the clean training set, without the poisoned samples). We accordingly suggest that the approach in [37] could be investigated with the adversary’s attack objective function based on a surrogate for the test set – e.g., the clean training set – rather than the actual (test) set of samples used for assessing the accuracy degradation induced by the attack. Introducing poisoned samples that induce large training set loss (even a reduction in the classifier’s margin) should also induce an increase in (unseen) generalization error (estimated based on a held-out test set). We also note that the attacker could also/instead poison the data set used by the defender to estimate the classifier’s accuracy (i.e., a finite validation or test set), seeking to make this estimate inaccurate. This is yet another DP attacker objective.

There is also prior work on DP defense specific to email spam filtering and to the (less challenging) on-line DP scenario, rather than the embedded DP scenario. Naive Bayes spam filters are highly susceptible to DP attacks. Here, known spam sources/blacklisted IP addresses exploit the fact that their received emails will be treated as (ground truth) labeled spam examples, and used for classifier training (or re-training). The attacking source thus generates emails that will skew the spam model, potentially resulting in great degradation in classifier accuracy. One approach for defending against this is proposed in [56], where it is assumed that: i) one starts with a clean spam-ham training set; ii) there is a clean spam-ham validation set; and iii) poisoned spam is added to the training set subsequently, e.g. in an on-line setting. Under these assumptions, it is not difficult to build a detector based on degradation in the (on-line) learned spam classifier’s accuracy on the validation set – if adding a labeled spam example (or a small batch of such samples) to the training set results in degradation in validation set accuracy, one can reject these samples as poisoned. However, more generally, if the DP attack samples are embedded as part of the original training set and if there are no time stamps, data provenance, or other means for narrowing to a particular subset of the training set as possibly anomalous, the problem is much more challenging. For the approach in [56], under this embedded attack scenario, there would be a combinatoric explosion in the set of candidate poisoned subsets of the training set to assess, which makes [56] quite impractical for addressing embedded attacks.

Recently, a method that achieves promising results for defeating embedded DP attacks focused on the spam-email domain was proposed in [51]. It was noted that such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) spam filter model, with only a single (component) density to represent spam (plus a possible attack). [51] proposed a defense based on the use of a mixture of NB models and demonstrated that the learned mixture almost completely isolates the DP attack samples in a second NB component, with the original spam component essentially unchanged by the attack. Thus, one can discard this second component, leaving a spam filter that is unaffected by the DP attack. Even for weak attack strengths (a small number of attack samples), Bayesian Information Criterion (BIC) based model order selection [63] chose a two-component solution, which invoked the mixture-based defense. Promising results were presented on the TREC 2005 spam corpus.

Defense against DP attacks applied to active learning has also been considered [50]. Active learning systems efficiently grow the number of labeled training samples by selecting, for labeling by an “oracle” (e.g., human expert), the samples from an unlabeled batch whose labelings are expected to improve the classifier’s accuracy the most. The classifier is retrained with each new labeled sample (or new mini-batch of labeled samples, if samples are labeled more than one at a time). [49] has shown that active learning of classifiers can be compromised by adding poisoned samples to the unlabeled batch on the current classifier’s decision boundary if the oracle provides somewhat noisy (adversarial) labels for these samples33 3 The oracle is usually assumed to provide accurate, ground-truth labels for samples actively selected for labeling. However, it is possible that the oracle/human analyst is unreliable or is an insider. In such settings, the assumption made in [49] is reasonable.. Active learning methods typically select for oracle (human analyst) labeling the sample with highest decision uncertainty (i.e., a sample close to the current boundary). This plays right into the adversary’s hands, as the poisoned samples will often be selected for labeling. [50] proposed a defense against this type of attack which exploits randomness to good effect. Note that, anyway, there are multiple active sample selection criteria of operational interest: i) highest class-entropy (decision uncertainty); ii) most outlying sample (potentially an instance of an unknown class, i.e. with no labeled examples [61]); iii) random selection; iv) selecting the sample most likely to belong to a particular category of interest, e.g. the most actionable category, with confirmation by the oracle. Thus, [50] proposed a mixed sample selection strategy, randomly selecting from multiple criteria at each oracle labeling step. This approach reduces the frequency with which an adversarial sample is selected for labeling and thus mitigates classifier accuracy degradation associated with the data poisoning. It also “services” multiple active learning objectives. As seen later (and also earlier, in [73]), randomization plays an important role more generally in some defensive schemes against adversarial attacks.

Finally, a related problem in adversarial learning is where the focus of adversaries is on compromising the ground-truth labeling mechanism. In particular, a problem of great contemporary interest is adversarial crowdsourcing. In crowdsourcing systems for obtaining ground-truth labels, there are a set of (e.g. Amazon Mturk) workers who each make decisions on a given example (news article, image, etc.), with the decisions across the worker ensemble aggregated to produce estimated “ground truth”. Such labeling could be used to create a supervised training set for learning a classifier. Alternatively, it could also be used as an alternative to an automated classifier for making decisions (and taking actions) on a given set of (operational) patterns. Adversarial crowdsourcing means that some of the workers may be giving unreliable answers, either because they are lazy/unskilled or because, in an intentional way, they want to alter the aggregate decision that is reached. One can accordingly pose robust aggregation schemes that attempt to defeat the adversaries and perhaps even identify them [34], [39]. In [39], a stochastic model for workers is developed and an associated EM algorithm learns, in a transductive setting44 4 In transductive learning, the learner does not build a classifier model from training data with the future goal of applying this classifier to make decisions on operational (test) data. Instead, in transductive learning, the learner’s objective function is over the (batch of) test samples, with the learner’s optimization directly yielding the (hard or soft (class posterior)) decisions on these samples., from a batch of workers’ answers, aggregate posterior probabilities for the possible answers for each of the given tasks in the batch. Moreover, the algorithm learns parameter values that characterize each individual worker’s skill level, the relative difficulty of each of the tasks, and each worker’s intention (honest or adversarial). It was demonstrated that this approach is quite robust to adversarial workers amongst the crowd (it can defeat the adversary’s goal of altering the aggregate decision) and it can also reveal the adversarial workers (who could then be blocked from participating in future crowdsourcing). Moreover, the method was found to be effective both in the semisupervised setting, where some “probe” tasks have ground truth answers that help to reveal adversaries, as well as in a purely unsupervised learning setting, where there is no known labeled ground truth for any of the tasks in the batch of answers.

III-D Defenses Against Backdoor DP Attacks

While backdoor attacks on classifiers were only proposed fairly recently [16], there is already nascent activity aiming to detect and mitigate them [44],[15]. [44] considers a scenario where a user outsources classifier training to a third party, who may be adversarial. The user provides a (presumably) clean training set to the third party as well as a specification of the hyperparameters of the DNN (the number of layers and the sizes of these layers). The adversarial third party adds backdoor training patterns, with target class labels, to the provided training set. The adversary will seek to learn the classifier to achieve high accuracy on “normal” patterns, but while classifying to the targeted class when the backdoor pattern is present ([44] also considers the case where backdoor patterns just induce (untargeted) misclassifications, although this should be of less strategic interest). Crucial to [44] is that the user/defender also possesses a clean validation set, not available to the attacker, which can be exploited for defense purposes. The premise behind this work is that backdoor patterns will activate neurons that are not triggered by clean patterns. Thus, the defender can prune neurons in increasing order of their average activations over the (clean) validation set, doing so up until the point where there is an unacceptable loss in classification accuracy on the validation set. This is likely to remove the neurons which trigger on backdoor patterns. Presumably, the authors’ assumption that the user specifies the DNN hyperparameters to the third party ensures that the network is made large enough to allow subsequent pruning without loss in accuracy – if the third party had control of the network architecture, he could make it compact enough that any pruning would necessarily result in loss in classification accuracy. The scenario of this work is somewhat artificial – the choice of the network (its size) is a fundamental aspect of model learning. Pre-specification of this by the user (who is not the learner in this case) may not ensure good accuracy is achievable (the user could underspecify the model size). Moreover, while in this scenario the user is supplying the data and could “hold back” a clean validation set, the embedded data scenario is of great interest not only for general DP attacks but also for backdoor attacks. In this challenging scenario, any held out data may also contain backdoor attack patterns.

[15] does address the embedded backdoor scenario, where there is no clean validation set available. The authors consider an internal layer (e.g. the penultimate layer) of a DNN, extracting this as a feature vector for an input training pattern. They first perform dimensionality reduction on this vector (using independent component analysis) and then perform K-means clustering on the resulting training set feature vector patterns all labeled to the same category, seeking to separate each class’s samples into two clusters. They then propose three different “inference” strategies to determine whether one of the two clusters is a “backdoor” cluster. The first involves retraining a new classifier while excluding all patterns from one of the two clusters. One then classifies the excluded cluster’s patterns using the new classifier. The expectation is that if the cluster has valid patterns, classification should primarily be to the given class label for these patterns, while if the patterns are backdoors, they will likely be classified to the initial (source) class of the patterns, i.e. a class different from the given (target) label. This idea is related to the premise behind [54]. However, this strategy requires 2K classifier retrainings, with K the number of classes. For CIFAR-100 or Imagenet (with more than 1000 categories), this approach is computationally exorbitant and infeasible. The second inference strategy identifies putative backdoor clusters as those that are relatively small in mass. A third approach does require use of a clean validation set, used to assess whether the data is better fit by one cluster than two (in which case no poisoning is hypothesized). If a backdoor cluster is detected, its training samples are removed. After considering detection and removal for all classes, the classifier can be retrained. The experiments in [15] consider the relatively simple MNIST data set as well as traffic signage, i.e. CIFAR datasets were not considered. The authors do suggest that their approach is successful on MNIST even when clean classes are multi-modal (requiring multiple clusters to represent them) – this was experimentally validated by pooling several digits and treating them as one class. It is unclear whether these results would hold on more complex data sets. Some limitations of [15] are that: i) it is assumed backdoor patterns manifest in principal components – if a pattern manifests in minor components, the front-end dimensionality reduction may destroy the backdoor pattern prior to the clustering step; ii) it relies on the effectiveness of K-means clustering, which is susceptible to finding poor local optima; iii) it may not be so effective in general when classes are multi-modal (in such a case, a two-clustering may not isolate the backdoor attack in one cluster).

[70] does not rely on clustering per se to detect back doors. Rather, the authors show that if a back door pattern appreciably alters the mean feature vector pattern of a class (with feature vectors again obtained from the penultimate layer of the DNN, when the image is applied as input to the DNN), then with high probability clean and backdoor patterns labeled to this class can be separated by projecting onto the principal eigenvector of the feature vector’s covariance matrix (This amounts to clustering in one dimension). Again, classifier retraining follows removal of backdoor training samples. The authors demonstrate promising results for this approach applied to the CIFAR-10 data set. We will further investigate this method experimentally in section V.

One issue with [70] is that the detection threshold is based on knowledge of a hyperparameter that essentially bounds the number of poisoned training patterns. In practice, this number will be unknown. While the authors select this guided by test set accuracy, backdoor patterns could also be embedded in the test set (to confound use of the test set for setting this hyperparameter). Thus, one cannot in general use (good) test set accuracy as a guide for setting this hyperparameter. Second, while the authors do demonstrate both that their approach identifies most of the backdoor patterns and, in removing them, retains high classification accuracy, they do not report how many clean patterns are falsely rejected by their method. For CIFAR-10 there is a sufficient amount of training examples (5000 per class) that accuracy should be unaffected even if many clean patterns are falsely removed. However, this could be more problematic if fewer training samples are available. Moreover, one can envision backdoor pattern mechanisms that may not induce large changes in mean pattern vectors. In such a case, one might need to scrutinize minor components, as well as the principal components.

Finally, we mention [21]. This work does not detect backdoors in the training set (which may be unavailable). Instead, the authors aim to detect use of the backdoor at test time. The proposed method mildly assumes availability of a small batch of clean patterns (from multiple classes) from the domain, {x¯i}. For a given test pattern x¯, they form x¯+x¯i and classify it by the trained DNN. They then measure the class decision entropy over this small batch. If a backdoor is present, entropy should be low (the classifier decision is mainly made based on presence of the backdoor pattern). If x¯ is clean, the decision entropy is expected to be higher. [21] demonstrates good detection results on both MNIST and CIFAR-10 when the attack strength is high (a visible black pattern superimposed on the image). It is unclear how effective this method will be as the attack strength is reduced.

III-E Defenses Against RE Attacks

There is a paucity of prior art on defending against RE attacks. One interesting idea, developed for SVMs, is [2], which makes the classifier evasive (a moving target), rather than the adversary. That is, the authors propose a randomized classifier, adding multivariate Gaussian noise to the classifier’s weight vector, with the covariance matrix chosen to ensure both high classifier accuracy and as much “spread” in weight vector realizations as possible. This is nicely formulated as a convex optimization problem. One limitation of [2] is that this approach may not naturally extend to DNNs. A more serious concern is that the authors do not show that randomization prevents an RE attacker from learning a good classifier for the domain – they only show that RE-learned weight vectors deviate significantly from the mean SVM weight vector. Moreover, the authors evaluate the success rate of TTE attacks following the RE classifier learning phase. While TTE success rates are reduced by classifier randomization compared with a fixed classifier, the TTE success rate is still quite high – e.g. dropping from 80% to 65% on some data sets. The problem here is that because the randomization retains high classification accuracy, RE querying, even of random classifier realizations, acquires generally accurate labels for its queries.

An alternative approach for RE defense, proposed recently in [74], is to detect RE querying, much in the way that one can detect TTE querying. That is, because the attacker does not know that much about the classifier (and may have limited legitimate examples from the domain), he will have to be “exploratory” with his queries. However, one can then hypothesize that query patterns may be atypical of legitimate examples from the domain, much as are TTE attack examples. In fact, [74] directly leverages the ADA detector, designed to detect TTE attacks, to detect RE querying. In [74], it is shown that this approach achieves high ROC AUC (0.97 or higher) in detecting the RE querying proposed in [58] for MNIST, and at a “stage” of querying before the attacker has learned an accurate classifier, suitable for mounting a TTE attack. In Section V, we will extend these results to also demonstrate success in RE detection on the CIFAR-10 domain.

IV Digging Deeper: Novel Insights and Broad Perspectives, Particularly About TTE Attacks

IV-A Defenses Against TTEs: Robust Classification versus Attack Detection

As reviewed earlier, some TTE defenses robustify the classifier so that a test pattern from class A that is perturbed by an attacker is still assigned to A by the classifier. Robust classification is an important goal, especially in the presence of “natural” adversaries – additive noise, transformations (image rotation, translation, and scaling, variable lighting), etc. A common approach in training DNNs is data augmentation, with pattern variants added to the training set so that the trained classifier is as robust as possible. However, the works [18],[73],[60],[42],[76] do not focus on (even consider) achieving robustness to natural adversaries, but rather solely to TTEs.

One limitation of robust classification to combat TTEs concerns semantics of inferences. Consider digit recognition. Evasion-resistance means a perturbed version of ‘5’ is still classified as a ‘5’. This may make sense if the perturbed digit is still objectively recognizable (e.g., by a human) as a ‘5’. (Even in this case, the decision may have no legitimate utility. This will be explained in the sequel.) However, the perturbed example may not be unambiguously recognizable as a ‘5’ – consider the perturbed digit examples based on the attack from [59] shown in Figure 4, with significant salt and pepper noise and other artifacts. Some attacked examples from classes ‘5’, ‘3’, and ‘8’ are noticeably ambiguous. This suggests, at a minimum, that there should be a “don’t know” category for domains susceptible to TTE attacks.

Fig. 4: JSMA [59] adversarial image matrix, with the row the true class and with the column the decided class. The diagonal consists of unattacked digits.

However, irrespective of assigning the pattern to the correct class (which might in fact be “don’t know”), it may be crucial operationally to infer that the classifier is being subjected to an evasion attack. Robust classification does not make such inference. Once an attack is detected, measures to defeat the attack may be taken – e.g., blocking the access of the attacker to the classifier through the use of multifactor (challenge-response) authentication involving a human supervisor. Moreover, actions typically made based on the classifier’s decisions may either be preempted or conservatively modified. For example, for an autonomous vehicle, once an attack on its image recognition system is detected, the vehicle may take the following action sequence: 1) slow down, move to the side of the road, and stop; 2) await further instructions. Similarly, a device that is actuated based on recognized voice commands might require multifactor authentication, invoke “least privilege” rules, or simply be put into a “sleep” mode. In a medical diagnostic setting, with an automated classifier used e.g. for pre-screening numerous scans, a radiologist should obviously not try to make a diagnosis based upon an attacked image – if the attack is detected the patient can be re-scanned. In [73],[18],[60] and other papers, it is presumed that correctly classifying attacked test patterns is the right objective, without considering attack detection as a separate, important inference objective.

Analysis of Test-Time Attack Mechanisms: Beyond the above arguments, we can recognize that there are only two digital attack mechanisms on a pattern to be classified (see Figure 5)55 5 Physical attacks, e.g. defacing a road sign, are not considered in this analysis..

Fig. 5: The two possible test-time evasion attack mechanisms applied to digital patterns, with the attacker either an adversarial source of patterns directly or a man-in-the-middle, intercepting a pattern on its way to the classifier. In the former mechanism, there is no legitimate utility obtained from the classifier’s decision. In the latter, the decision recipient could either be the intended recipient or the adversary.

In one mechanism, there is an honest generator of patterns, but the pattern is then intercepted and perturbed by an adversary (a “man-in-the-middle” attack) before being forwarded to the classifier66 6 Again, in response to a detected adversary, one could employ classical cryptographic means to defeat a man-in-the-middle. Alternatively, a legitimate source of data could post cryptographic hashes of samples, created using public key infrastructure or encryption based on pre-assigned symmetric private keys, to allow for their verification upon receipt by an intermediary; this technique would work even if the samples are themselves secret as their hashes would not betray their content. Sample provenance and integrity can also be ensured through block-chain (nested encryption) technology. However, these techniques may be computationally costly.. However, there is a second mechanism, with the adversary the originator of the pattern. Here, the adversary may have his own cache of labeled patterns from the domain. He selects one (even arbitrarily), perturbs it to induce a desired misclassification, and then sends it to the classifier77 7 In both mechanisms, if the attacker knows the classifier’s decision rule, he can make a perturbation that is guaranteed to induce a change in the classifier’s decision, to a target class.. Only for the first mechanism, where there is a legitimate pattern, forwarded by an honest party who is interested in the classifier’s decision, is there utility in correctly classifying the attacked pattern. For the latter (adversarial source) mechanism, the classifier’s decisions are only being made for the possible benefit of the adversary. Moreover, even for the man-in-the-middle mechanism, it is pointless to make correct decisions if they are only going to be intercepted and modified by the adversary on their way back to the intended decision recipient. (see Figure 5). We thus conclude that, if the pattern has been attacked, it is only meaningful to correctly classify when there is an honest generator and when the intended recipient receives the classifier’s decision. While this may be a common scenario, an adversarial source is also a common scenario. However, even under the man-in-the-middle scenario, for the reasons articulated previously, attack detection in high stakes security settings is important, irrespective of making correct decisions. If no attack is detected, one can still make a best effort to correctly classify the pattern. Of course, for a given application one must weigh the importance of detection against that of (unimpeded) classification – patterns that are falsely detected as attacks may be rejected (they may not be classified). Thus, if, to achieve a desirable TPR, the FPR (equivalently, the classifier’s rejection rate) is too high, attack detection may be impractical for the given domain.

IV-B Validity of the TTE Requirement that the Attacker Knows the Ground Truth Label

A key assumption made in all TTE attack works [68],[23],[59],[12] is that, in addition to knowing the classifier, the attacker knows the ground-truth label of the pattern to be attacked. This is required so that an attack perturbation can be crafted that pushes the pattern across the classifier’s decision boundary to the decision region for another class; recall Figure 3. Without knowledge of the true label, an attacker could still perturb a pattern to ensure a change in the classifier’s decision, but such change might actually alter an incorrect decision to a correct one. Such failed attacks, which correct the classifier, could be relatively common for domains with high classifier error rates, i.e. where there is significant confusion between some classes. Thus, in all TTE attack work it is assumed that the true label of the pattern to be attacked is known. Keeping this discussion in mind, consider the man-in-the-middle attack scenario of Figure 5, where the true label of the image to be attacked is actually unknown to the attacker! That is, a universal assumption in TTE attacks is in fact not valid in an important (man-in-the-middle) scenario.

IV-C Underestimating Susceptibility to AD in AL Attacks

A critical review of AL research is [50]. One concern raised is that some attacks are highly susceptible to an AD defense; yet this was not considered in performance-evaluating the attack’s “success”. Consider the JSMA TTE attack from [59]. While the perturbed patterns in [59] do alter the classifier’s decision, they also manifest artifacts (e.g. salt and pepper noise) quite visible in their published figures (see e.g. Figure 4), i.e., they are human perceptible, at least for (MNIST) images of digits. While the professed requirement in [59] that the attack be human-imperceptible may not be so important in practice (this is discussed further, shortly), the attack should still not be readily machine-detectable, i.e. by an AD. The artifacts in [59] are easy to (automatically) detect in practice. ADA achieves an ROC AUC greater than 0.99 for this attack on MNIST digits and an impressive 0.97 AUC is even achieved by a simple region-counting based AD [54]88 8 Clean MNIST digits typically consist of a single white region (where a region is defined as a collection of pixels that are “connected”, with two white pixels connected if they are in the same first-order (8-pixel) neighborhood, and with the region defined by applying transitive closure over the whole image). By contrast, nearly all JSMA attack images have extra isolated white regions (associated with salt and pepper noise). Simply using the number of white regions in the image as a decision statistic yields 0.97 AUC on MNIST – this strong result for this very simple detector indicates the susceptibility of the JSMA attack to a (simple) AD strategy.. Thus, at least for MNIST, the JSMA attack is highly susceptible to anomaly detection.

As a second example, consider the RE attack in [69]. The authors demonstrate that, with a relatively modest number of queries to the true classifier (perhaps as many as ten thousand or more, used to create labeled training examples), one can learn a surrogate classifier for the given domain that closely mimics the true classifier. Significantly, RE attacks may be critically needed to enable TTEs, by providing the required knowledge of the classifier to the TTE attacker if this is unknown a priori. One weakness of [69] mentioned earlier is that it does not consider large classification domains or DNNs. However, a much more critical weakness stems from one of its (purported) greatest advantages – the reverse-engineering in [69] does not require any real training samples from the domain99 9 For certain sensitive domains, or ones where obtaining real examples is expensive, the attacker may in fact have no realistic means of obtaining a significant number of real data examples from the domain.. In fact, in [69], the attacker’s queries to the classifier are randomly drawn, e.g. uniformly, over the given feature space. What was not recognized in [69] is that this makes the attack easily detectable – randomly selected query patterns are very likely to be extreme outliers, of all the classes. Each such query is thus individually highly suspicious. Multiple queries are thus easily detected as jointly improbable under a null distribution (estimable from the training set defined over all the classes from the domain). Even if the attacker employed bots, each making a small number of queries, each bot’s queries should also easily be detected as anomalous. Again, this work does not consider the potential of an AD defense. Later, we will evaluate a variant of the ADA TTE defense [74] against the more evasive RE attack proposed in [58] and show that it is quite effective (and, thus, effective at thwarting RE-enabled TTE attacks). We further note that even more recent work on black box TTE attacks [32], which is also based on querying to estimate gradients, ignores the possibility that queries may be detected as anomalous by an AD (a possibility verified by [74]).

Finally, consider evaluation of classifier performance versus attack strength in [9]. The authors give illustrative (fictitious) curves, showing how the rate at which TTE attacks successfully cause misclassifications should increase monotonically with the strength of the attack (i.e., the maximum allowed size of the perturbation). While we certainly agree with this basic characteristic in the absence of a defense, both this curve (and [9] in general) ignores the effect of an AD defense on such a curve – as the attack perturbation size is increased, it is indeed more likely that an attack pushes an example (say from class cs) over the classifier’s decision boundary (to the decision region of another class, cd). However, as the perturbation size increases, the attacked example is also more easily detected (by an AD) as an outlier, with respect to (null density models for) all classes. This has been experimentally validated in [54]. Moreover, the effective attacker success rate (the rate of crafting of samples that are both misclassified and not detected) may have a peak at a finite attack strength. This is shown in Figure 6 for the ADA detector and the CW attack on the CIFAR-10 domain. Note also that the peak effective attack success rate is about 13%.

Fig. 6: Effective attack success rate versus attack strength for the CW attack and the ADA detector defense on the CIFAR-10 domain.

Thus, if an AD is in play, the attacker’s ultimate success rate does not generally strictly increase with the size of the attack. This may also be the case for DP attacks, unlike the picture presented in [9] – as the amount of injected poisoned training data is increased, the learned classifier’s generalization accuracy should indeed degrade. However, too much data poisoning could make the attack more easily detected. For example, in the case of a backdoor attack, one injects patterns mislabeled to class cd so that the classifier learns a “back door”. As the number of injected samples increases, there may be very distinct clusters of samples labeled to cd, one corresponding to genuine samples from cd and the other to the attack. A clustering algorithm may identify these two clusters as the attack strength (number of injections) is increased. Such a clustering-based defense against availability DP attacks on email spam is [51]. Experiments involving defenses against backdoor DP attacks will be given in Section V.

IV-D Small Perturbations for TTE Attacks: Fallacy or Requirement?

Another fundamentally important issue is the “proper” attacker formulation and assumptions for TTEs. The early approaches from [68], [23], and [59] all involve constrained optimization, with the attacker seeking to minimally perturb a test pattern so as to induce a misclassification. Some nuance is required here in discussing the motivation for “small” perturbations. [59] and even more recently [22] specifically motivated small perturbations so that the attack is not human-perceptible. However, in some application scenarios, human monitoring of signals (images, audio) may be infrequent, or even non-existent. In such cases, one might imagine that larger perturbations, with greater success rate in causing misclassifications, could be invoked. However, as just noted, larger perturbations increase the detectability of the attack. Thus, so long as an AD is deployed along with the classifier, “small” perturbations should be needed in general, to defeat machine detection (AD), irrespective of human perception. In [9], on the other hand, the authors argue that it is a misconception that “adversarial examples should be minimally perturbed”. Interestingly, the reason given by the authors is that “the attacker will aim to maximize the classifier’s confidence on the desired output class, rather than only minimally perturbing the attack samples.” Clearly, [9] is assuming that, even if a detector is deployed, its decisionmaking is solely based on the classifier’s “confidence” in its decisionmaking (with attacks detected only on low confidence examples). However, this is a weak detector. A well-designed AD, one which looks for signatures of an attacked image (in [54], such signatures are anomalies, with respect to class-conditional null densities, evaluated on the deep hidden layer activations of the DNN), should exhibit greater detection power as the attack strength is increased. Indeed, this is what is shown in [54]. Thus, in contrast to [9], we argue (and show in Figure 6) that in many application domains “small perturbations” are indeed a requirement, not a fallacy.

IV-E Transferability in TTE Attacks: Targeted vs. Non-targeted cases

[58] considers the case where the TTE attacker does not have knowledge of the classifier but has the ability to query it. Such querying is used to create a labeled training set for learning a surrogate classifier. The authors demonstrate that TTE attack patterns crafted to induce misclassification by the surrogate classifier with high success rate also induce misclassifications on the true classifier. Moreover, the authors demonstrate that such transferability of the TTE attack does not even require that the true classifier and its surrogate have the same structure – if the true classifier is a DNN, the surrogate could be an SVM or a decision tree. While these results are quite intriguing, it is also important to note that the results in [58] assume that the TTE attack is untargetedi.e., successful transferability only requires that a misclassified TTE pattern for the surrogate is also a misclassified pattern for the true classifier. However, for TTE attacks to be strategic and cause the most damage, the attack should be targeted, inducing assignment to a particular target class ct when the original pattern comes from class cs. In the targeted case, a TTE attack only successfully transfers if the perturbed pattern induces the true classifier to misclassify to class ct. Considering this targeted attack case, we have experimentally evaluated the RE approach in [58]. For CIFAR-100, using this approach to learn an 8-layer ResNet classifier that is a surrogate of the ResNet-16 true classifier resulted in a targeted transferability success rate for the CW attack of 56%. By contrast, the non-targeted transferability success rate was found to be over 80%. This result indicates at any rate that the targeted case (of great practical interest) is more challenging, with a significantly lower success rate, than the non-targeted case.

IV-F Black or White Box TTE Attacks for Evaluation of a Defense

Very different viewpoints have been put forth on what is required in evaluating a defense (even to meet the standard of publication). Discussing TTE attacks, [12] has asserted (without elaboration) the strong white box requirement that “one must also show that an adversary aware of the defense [white box] can not generate attacks that evade detection.” There are several points to make about this statement. First, it does not mention false positives – there is always a true detection/false detection tradeoff, with the FPR controlled by setting the detection threshold. Clearly, the FPR must be sufficiently low for the deployed system to be practicable (not generating too many false alarms). Thus, we can safely construe the requirement from [12] to be that the true detection rate should be nearly perfect, with the FPR acceptably low. One might speciously argue in support of such a requirement by appeal to worst case engineering analysis – i.e., any successful attack could have worst-case consequences, which should be avoided at all cost. However, it is also possible, in this highly asymmetric, fully white box setting, where the attacker knows everything (the classifier and the detector, including even the detector’s threshold value) and the (proactive) defender knows nothing about the attacker, that meeting the requirements of [12] is theoretically impossible. In fact, [47] is not very sanguine about the prospects for defense against a fully white box attack.

To put this into further perspective, consider a robust classification defense against a white box attack. Since the attacker has full knowledge of the robust classifier, he can always successfully craft a perturbed pattern that will induce a desired misclassification. The only defense “safety net” for a robust classifier in this setting is human monitoring – if, to defeat the robust classifier, the attack perturbation is large, the attack may be human-perceptible. However, as noted before, one cannot rely on human monitoring, as it may be infrequent. In this case, robust classification offers no protection against white box attacks – it is deterministically defeated by a white box attacker.

Two fundamental questions need to be asked at this point: 1) In what scenarios is a white box attacker actually realistic?; 2) What should be realistically expected, performance-wise, for a proactive (AD) defense against a white box attack?

Addressing the first question, we note that [12] and other works do not justify or motivate the white box attacker assumption. We believe that, at least in security-sensitive settings where the deployed defense will not be common knowledge, “omnipotent” white box knowledge of both the classifier and the AD defense would have to come from a well-placed “insider”, working with the attacker. It is not reasonable to expect an AD system to defeat an insider attack. There are defenses specifically designed to prevent or to detect insider threats, with the former based on system safeguards (e.g., “need to know” knowledge compartmentalization) and with the latter exploiting information well beyond just the attacked data – e.g. monitoring suspicious behavior of users with sensitive access to the defense system. Thus, we believe a fully white box attack, consistent with an insider threat, should be evaluated against a “holistic” defense that includes both insider threat detection/prevention and AD-based attack detection. Moreover, even assuming insider knowledge, true white box knowledge is in fact not in general possible, as one can always introduce randomness into the defense that even an insider cannot predict. For example, suppose several distinct TTE defenses are mounted, and for each presented test pattern the invoked defense is randomly chosen from the mounted set (e.g., using the CPU clock for random number generation). In this case, the attacker will not know which defense is being used against the crafted TTE attack pattern. He/she can mount multiple attacked patterns, to increase the possibility of matching an attack to a defense realization, but this will increase the work factor in creating a successful attack.

Addressing the second question, we expect a good defense to fundamentally alter the picture given in [9], where the attack success rate grows monotonically with the attack strength – whether the attack is black box or white box with respect to the deployed AD defense, increasing the size of a pattern perturbation increases both the probability that the pattern is pushed across the classifier’s decision boundary and the probability that the attack is detected. A successful attack requires inducing a misclassification and evading detection. Thus, it is quite possible, with an AD defense in play, that the effective attack success rate (the joint probability of these two events) does not grow monotonically with attack strength. This was demonstrated in Figure 6 for black box attacks (ones that know the classifier but not the detector). In [54], this phenomenon was also demonstrated for a fully white box attack – there was a noticeable, single peak effective attack success rate at an intermediate attack strength. In particular, [54] devised a white box version of the CW attack [13] on the ADA defense applied to CIFAR-10 image classification, and found, at its peak, 25% of crafted attack images induce misclassifications and evade detection. That is, since the induced misclassification rate is close to 100%, roughly 75% of successfully crafted attacks (those causing misclassifications) are detected, at the attack strength yielding highest effective success rate. Clearly, at the optimal attack strength, the proactive ADA defense is not “defeating” the white box attack. However, it is increasing the work factor of the attacker in producing successfully attacked images. In fact, as reported in [54], the attacker’s computational effort (to induce misclassifications) is quite high even in the absence of an AD defense – the effort to craft CW attacks is hundreds of times heavier than that required to make ADA-based detections [54], cf. Section V. To defeat both the classifier and the detector, the white box attacker’s computational effort is made even greater.

While [12] demands strong detectability of white box attacks, other opinions on proper defense evaluation have been offered. In [22], the authors quote [31] which states that “if the model gives the adversary an unrealistic degree of information… an omnipotent adversary who knows [everything] can…design optimal attacks… it is necessary to carefully consider what a realistic adversary can know.”

One such realistic scenario is where the attacker has full knowledge of the classifier but no knowledge of a possible AD defense (white box w.r.t. the classifier and black box w.r.t. the detector). This scenario is not only preferred because it is far less asymmetric (i.e., fairer) than the strong white box attack scenario. It is also arguably quite realistic. In particular, for many application settings, the best-performing type of classifier (e.g., a DNN or a support vector machine) may in fact be common knowledge. Moreover, even if the classifier’s parameters are unknown, if the attacker has good labeled data from the domain, he can train a surrogate classifier with similar performance as the target classifier. Again, it has been shown that using surrogate classifiers as the basis for TTE attack crafting may yield a high rate of success transferability to the target classifier [58]. Moreover, even for domains where transferability is not so high, an RE attack, involving relatively numerous queries to the target classifier, can be used to learn a good surrogate of the target classifier. Thus, it is plausible to start with good knowledge of the classifier in play, and even if one does not start with such knowledge, there is a direct mechanism (querying) that allows the attacker to learn the classifier. However, the same cannot be said for the detector – there may be no natural mechanism for “querying” the detector1010 10 If the attacker might receive the classifier’s decision, one should not output a “don’t know” decision when attacks are detected, as this would in fact allow detector querying by the attacker.. Moreover, with numerous potential defense approaches and no definitive standard, it seems less realistic for the attacker to possess substantial knowledge of the deployed AD defense. Also, no one has shown that there is high transferability of a surrogate defense, i.e. a surrogate AD deployed by the attacker may not closely mirror detections made by the true deployed AD. Based on the above arguments, a reasonable framework may involve a proactive detector and an attacker that is white box with respect to the classifier and black box with respect to the detector.

V Attack-Defense Experiments

V-A Test-Time Evasion Experiments

We experimented on the CIFAR-10 and CIFAR-100 [1] data sets. CIFAR-10 is a ten-class data set with 60,000 color images, consisting of various animal and vehicle categories. CIFAR-100 is a 100-class data set with 60,000 color images. Both data sets consist of 50,000 training images and 10,000 test images, with all classes equally represented in both the training and test sets. For AD purposes, the data batch under consideration in our experiments consists of the test images plus the crafted attack images. For training deep neural networks, we used mini-batch gradient descent with a cross entropy loss function and a mini-batch size of 256 samples. For CIFAR-10, we used the 16-layer deep neural network architecture in [28], [13]. This neural net, once trained, reaches an accuracy of 89.47% on the CIFAR-10 test set. For CIFAR-100, we used the ResNet-18 architecture, which reaches an accuracy of 81.73% on the test set. During the phase of crafting adversarial samples, we only perturbed test set samples that were correctly classified. We implemented the fast gradient step method (FGSM) attack [23] and used the authors’ supplied code for the CW-L2 attack [13]. For each test sample, from a particular class, we randomly selected one of the other classes as target and generated an attack instance starting from the test image, such that the classifier will assign the perturbed image to the target class. In this way, on CIFAR-10, for the FGSM attack, we successfully crafted 9243 adversarial images (step size for gradient descent of 0.001) and for the CW attack, we successfully crafted 9920 images (Lagrange multiplier c set to 4). Below in Figure 7, for visual assessment, we show successful CW attack images for the MNIST domain. Note that while CW is generally thought of as an “imperceptible” attack, there are some ghost artifacts, especially horizontally oriented, that are noticeable (with similar ghost artifacts produced by FGSM). For CIFAR-10, though, we did indeed find CW and FGSM attacks to be visually imperceptible.

Fig. 7: CWL2 attack adversarial images [13] on MNIST. Below each image the digit is the classifier’s decision and the number in parentheses is the posterior probability under this winning class, indicating the decision confidence.

We implemented the unsupervised AD from [54] (referred to as L-AWA-ADA) and the blurring “correction” approach from [42]. We used the code provided by the authors for the Openmax detector [4]. As mentioned earlier, Openmax requires setting two hyperparameters using labeled attack examples. Thus, this method is actually supervised.

Figure 8 shows the performance of the three detectors against the FGSM attack on CIFAR-10 [54]. In this case, we used 137 adversarial images to determine the hyperparameters for the Openmax method. We can see that although the blurring method’s detection performance is somewhat competitive with ADA [54], this is achieved with some compromise in the (conditional) classification accuracy as the attack FPR is varied. Openmax’s detection accuracy is the worst, despite its use of (quite a few) labeled attack examples for setting its hyperparameters.

We also assessed the execution time required by methods to make detections. Currently we are using two NIVIDA GTX1080 GPUs, an Intel Core i7-5930K Processor and a 256GB DDR4. For this platform, detection time for a CIFAR-10 image is 0.00114s using ADA and 0.00063s for the blurring method, i.e. it is both quite modest and comparable for the two methods. For comparison’s sake, the average time required by the attacker to craft a successful CW attack example on CIFAR-10 (averaged over the whole test set) is 0.72s – about 700 times that required for ADA detection. Thus, detector computational complexity is much lower than that of the attacker.

Fig. 8: Detector comparison for the FGSM attack on CIFAR-10. The conditional correct decision rate (bottom) is the correct decision rate of the classifier on the non-attacked test patterns that are not detected as attacks.

For the CW L2 attack (Figure 9), we used 152 adversarial samples to determine the hyper-parameters for Openmax. ADA greatly outperforms the other detectors in this experiment. Again, although not shown, the blurring method compromises some classification accuracy as the FPR is varied.

Fig. 9: Detector comparison for the CW L2 attack on CIFAR-10.

We also evaluated the methods on CIFAR-100 [54]. The only modification of the ADA method for this data set was not to perform inverse weighting by the confusion matrix, as the confusion matrix is not reliably estimated in the 100-class (limited training set) case. Again, OpenMax used 150 labeled attack examples to choose its hyperparameters. Figure 10 shows that ADA significantly outperforms the other detectors. Moreover, while ADA’s ROC AUC is lower than for CIFAR-10 (as might be expected), it is still above 0.9.

Fig. 10: CW attack detection results on CIFAR-100.

In [54] ADA was also demonstrated to substantially outperform the unsupervised detection method from [20]; moreover, ADA’s results on the CW attack on CIFAR-10 were compared with the results for a variety of detection methods reported in [12] and found to be better than the best results reported in [12]. While the ADA approach does give relatively promising results in detection of TTE attacks, a ROC AUC of 0.9 is far from perfect – there is potentially room for improvement from a more powerful detector, as well as the need to theoretically characterize what is attainable. In particular, while it is known that imperfect classification accuracy impacts detection accuracy [54] (misclassified patterns are more difficult to distinguish from attacked patterns than correctly classified patterns), there is no theoretical characterization of how the classifier’s error rate bounds the achievable accuracy of a detector.

V-B Reverse Engineering Experiments

We investigated detection of the RE attack from [58], which involves stage-wise querying and surrogate classifier re-training, over a sequence of stages, as discussed in Section II.C. The premise of the RE detection method in [74] is that, in order to learn about the classifier, the attacker must generate “exploratory” queries that may be atypical of legitimate patterns from the domain, just as are TTE attack patterns. Thus, [74] leverages the ADA detector (designed to detect TTE attacks) to jointly exploit batches of images from a common querying stage, seeking to detect RE attacks. To aggregate ADA decision statistics over a batch of (query) images from a given querying stage, [74] performs the following: i) Divide the batch into mini-batches, for example a batch of 50 images could be divided into mini-batches of size 5; ii) For each mini-batch, maximize the ADA decision statistic (produced for each image) over all images in the mini-batch; iii) Make a detection if any of the mini-batches yields a detection statistic greater than the threshold.

For CIFAR-10, as a DNN classifier, we used ResNet-18. We also used the same structure for training the RE attacker’s surrogate network. For stage 0, we used 280 CIFAR-10 samples (28 from each class) to train the initial surrogate classifier. We applied 6 stages of retraining (7 training stages) of the surrogate DNN, using gradient descent with λ=0.37. The number of queries generated by the 6 stages were: 280, 560, 1120, 2240, and 4480. We used mini-batches of size 5 in experiments. Two maxpooling layers and the penultimate layer were used in generating the ADA detection statistics. Figures 11 and 12 show the detector’s and attacker’s performances for different query stages. For CIFAR-10, the surrogate classifier’s accuracy and the TTE attack success rate are not very high, even after stage 7. On the other hand, the ADA based RE detector achieves high ROC AUC, above 0.97, as the batch size is increased, for query stages 6 and 7. Thus, the approach in [74] is highly successful in detecting RE attacks, and at a stage before the attacker has learned a surrogate classifier sufficiently accurate to launch an effective TTE attack. ADA thus provides two ways to defeat a TTE attack: i) at the precursor RE stage; ii) during an actual TTE attack.

Fig. 11: RE detection ROC AUC on CIFAR-10 at different query stages versus batch size.
Fig. 12: TTE attack success rate and surrogate classifier accuracy on CIFAR-10 versus RE query stage.

V-C Embedded Backdoor Data Poisoning Experiments

We experimented on the CIFAR-10 data set with the same train-test split as in the TTE experiments. The training set is poisoned by inserting 1000 backdoor images crafted starting from clean images from the ‘airplane’ class and then labeled as ‘bird’. This (source, target) class pair is one of the eight pairs used in [70]. The backdoor signature is a positive perturbation (whose size is pre-selected by the attacker) on a randomly chosen pixel that is fixed through all the experiments. An example of a backdoor image (on the left) and its original clean version (on the right) are shown in Figure 13. It is not visually perceptible that a perturbation of size1111 11 The pixel values are normalized to [0, 1] from [0, 255] for each of the R, G, B colors. 0.25 was added to each of the R, G, B values of the pixel at position (9, 18) (the image size is 32×32). Scenarios where adversarial images are crafted using other methods, including fixing values of some pixels to form a backdoor shape (pixel, ‘X’, or ‘L’ shapes) with e.g. (R, G, B) = (0.1, 0.1, 0.9)), are tested in [70]. However, compared with our pixel perturbation backdoor, the backdoor signatures in [70] are visually more perceptible and easier to detect in experiments. Thus, in this sub-section, we focus on perturbation attacks on a single pixel, varying the perturbation size, and compare the performance of three defenses. Note that we control the attack strength by varying the perturbation size while fixing the number of backdoor training patterns (at 1000).

Fig. 13: Backdoor-attacked low-resolution image of a plane (left) with a single pixel changed from the clean image (right).

For training DNNs, we used ResNet-20 [28] and trained for 200 epochs with mini-batch size of 32, which achieved an accuracy of 91.18% on the clean test set. Under attacks with different perturbation sizes, the accuracies on the clean test patterns and the attack success rates are shown in Figure 14. Here, the attack success rate is the fraction of backdoor test patterns (those not used during training) that are classified to the target class. In our experiments, we created the backdoor test patterns by adding the same perturbation to the 1000 clean test patterns from a given class. Again, the attacker’s goal is for the classifier to assign these patterns, originating e.g. from the ‘airplane’ category, to the target category, e.g. ‘bird’. For all the perturbation sizes tested, the classification accuracies on the clean test patterns are not appreciably degraded, and the attack success rates are all quite high.

Fig. 14: Attack success rate and accuracy on the clean test set for a range of perturbation sizes.

For defenses against this attack, we assess the approach based on a spectral signature (SS) proposed in [70], the activation clustering (AC) approach proposed in [15], and a method proposed here using cluster impurity (CI) which takes advantage of several existing methods in the broader AL defense literature. All three methods extract, when the training patterns are applied as input, the penultimate layer features from the trained DNN. SS projects a feature vector onto the principal eigenvector of the feature vector’s covariance matrix. AC projects the feature vector onto 10 independent components. CI operates on the whole feature vector, without dimensionality reduction.

SS, as pointed out in Section III.D, relies on knowledge of the number of backdoor training patterns, unlikely to be known in practice. Moreover, [70] did not provide a practical way to explicitly infer whether a class is backdoor-poisoned or not. For convenience of evaluation, we fixed the FPR of SS to 0.5 by applying the same detection threshold to all classes and then evaluated the TPR. Note that FPR=0.5 means half of the unpoisoned training patterns will be falsely detected and removed. This very high FPR was necessary in order for SS to achieve significant TPR (detecting the backdoor).

Regarding AC, even though this method is successful in backdoor detection and mitigation on simpler data sets (MNIST and LISA), it does not perform very effectively for more difficult data sets (e.g. CIFAR-10) with evasive adversarial patterns (e.g. perturbation of a single pixel). The ‘retraining’ method for deciding which of the two clusters obtained by K-means should be discarded is an effective way to look for the poisoned classes and clusters. However, as noted earlier, this method is impractical when the number of classes is large. Without explicitly implementing the retraining method, we present essentially “best possible” results for AC by only applying it to the class that has actually been attacked and by choosing for removal the cluster from this class that actually possesses the most poisoned patterns. That is, we assume there are zero false positive detections associated with unattacked classes (even though this is highly optimistic) and that AC unfailingly picks the correct cluster to detect and remove.

CI first models the feature vectors extracted from the penultimate layer for each class using Gaussian mixture models with full covariance matrices. The number of clusters for each class is selected using BIC model order selection [63]. For the relatively simple MNIST data set, BIC selects one cluster if the class is not poisoned and two clusters otherwise, which agrees with the observations reported in [15]. However, for the more difficult CIFAR-10 data set, the BIC-optimal number of clusters is not necessarily one for the classes that are not poisoned. For example, with perturbation size 0.25, BIC selects 18 clusters in aggregate for the 10 classes, even though only one class is poisoned. Thus, we use a measured ‘class decision impurity fraction’ for each cluster to decide whether it should be removed or not. We first apply to all training patterns the 2×2 averaging filter proposed in [42] for TTE defense. Then each training pattern and its ‘blurred’ version are classified by the trained DNN. The class decision impurity fraction of each cluster is then evaluated as the fraction of patterns in the cluster whose predicted labels disagree for the original and blurred versions. The backdoor signatures in poisoned patterns, once blurred, are likely to become less ‘recognizable’ by the trained neural network, i.e. those blurred patterns are likely to be classified by the DNN to each pattern’s originating class, rather than to the attacker’s target class. If a cluster contains many backdoor-poisoned patterns, the class decision impurity fraction will be abnormally high. As an example, Figure 15 gives the histogram for the class decision impurity fraction for the 18 clusters (across all originating classes) selected using BIC, with perturbation size 0.25 (similar histograms are obtained for other perturbation sizes). Note the clear, large gap between low and high impurity clusters (with the single, high impurity cluster from the attacked class). Thus, we set the impurity threshold at 0.1 for the CI approach in all our experiments.

Fig. 15: Histogram of the class decision impurity fraction of the 18 clusters estimated using BIC for perturbation size 0.25.
TABLE I: (TPR, FPR) tradeoffs for the three detection approaches over a range of perturbation sizes.
Pert Size SS AC CI
0.15: (0.443, 0.5) (0.622, 0.057) (0.962, 0.002)
0.20: (0.483, 0.5) (0.691, 0.040) (0.980, 0.003)
0.25: (0.474, 0.5) (0.763, 0.040) (0.979, 0.003)
0.30: (0.558, 0.5) (0.741, 0.037) (0.976, 0.003)
0.35: (0.666, 0.5) (0.851, 0.034) (0.989, 0.002)
0.40: (0.616, 0.5) (0.847, 0.036) (0.985, 0.004)

Table I shows (TPR, FPR) pairs of the three detection approaches for each perturbation size. In general, CI achieves much higher TPR (greater than 0.96) than the other two approaches and at the same time very low FPR (lower than 0.005). Implemented in the most preferential way, as discussed previously, AC achieves relatively good TPR for large perturbation size. However, TPR clearly diminishes with the perturbation size. The TPR of the SS approach shares the same trend, but it is apparently not comparable to the other two approaches. Regarding FPR, we only evaluated for the AC and CI approaches since it is fixed for the SS approach. Although the FPRs of the AC approach are not very high, note that these numbers are quite optimistic since we assumed AC has zero false positives from non-attacked classes. Another way of saying this is that, in fact, AC’s false positive rate for (conditioned on) the attacked class is in fact quite high – for AC, 0.036 FPR means that 1800 out of the 5000 clean patterns from the target class are falsely detected as backdoor patterns and discarded from the training set (since zero counts are assumed for all non-attacked classes, all the false positives come from the target class) – the FPR rate for the target class is actually 0.36. By contrast, the FPR of the CI approach is truly low, with no more than 5% of clean patterns falsely detected for any of the classes, including the attacked class.

As discussed earlier, the ultimate performance assessment for a detector-based defense (where training patterns are removed and then the classifier is retrained) involves the retrained classifier’s clean test set accuracy and the backdoor attack success rate on the retrained classifier. In Figure 16 we see that the CI approach reduces the attack success rate to a negligible level, and the clean test set accuracy is not degraded at all. The AC approach is relatively successful, but its performance at low perturbation size is poor (with the attacker’s success rate in the absence of defense already at greater than 80% at this attack size, as seen from Figure 14) – an attack success rate above 0.5 should be quite harmful. The SS approach clearly fails to defeat the attack. Moreover, there is some degradation in clean test set accuracy even though training patterns are abundant for CIFAR-10 (5000 training patterns per class). If we allow an even higher FPR to achieve a better TPR for SS, the retrained neural network will suffer even more test set accuracy degradation.

Fig. 16: Attack success rate and accuracy on clean test set of the retrained neural networks for the three defenses.

Anecdotally, we note that, different from the attacks on the simpler MNIST data set in [15] and the attacks crafted using fixed abnormal pixel values in [70], the attacks used in our experiments are clearly less detectable. For the AC approach, the feature vectors corresponding to the adversarial patterns cluster together but are surrounded by clean patterns, when projected onto (a visualizable) lower-dimensional space. For the SS approach, the non-separability between adversarial and clean patterns is quite visible after principal component projection. One reason for CI’s advantage may be its use of all the penultimate layer features – some “minor component” features may contain important information for discriminating clean from poisoned patterns. Another likely reason is use of the blurring heuristic in conjunction with clustering.

Apart from the experiments above, we also applied the three defenses to the more detectable attacks in [70]. In these experimental results (not shown), all three methods performed quite well, obtaining good (TPR, FPR) tradeoffs.

VI Some Continuing Directions and Challenges

VI-A Defenses with a human-in-the-loop

Even if AD-based defenses achieved nearly perfect accuracy in detecting TTE and DP attacks (true positive rate of one, false positive rate of zero), having a human analyst/expert in the loop would be extremely useful for characterizing the nature of a detected attack and for determining suitable actions in response (e.g., measures to block an attacker’s future access, potential retaliation, termination of actions in progress that may have been based on TTE attack or a DP-compromised classifier’s decisions). Human categorization of detected attacks would also enable active learning of an automated classifier designed to mimic the human analyst, i.e., to categorize anomalies, beyond merely detecting them. Such an automated classifier could be very useful, considering that the volume of detected anomalies may be quite large, whereas an analyst may only be sparingly available to inspect them, given the required effort and cost/expertise.

Beyond these benefits, consider that the results here show that good, but far from perfect, TTE attack detection is currently achieved. Sparing human-in-the-loop labeling of detections as “attack” or “no attack” would enable partially supervised (semisupervised [53, 61]) learning of the anomaly detector. Such learning could bridge the existing performance gap in detecting the most challenging attacks, such as CW [12, 13]. For example, the ADA approach [54] could be modified so as to learn a supervised classifier that takes as derived feature input ADA’s rich set of KL decision statistics – the classifier could be a logistic regression model, a support vector machine, or even (ironically) a deep neural network1212 12 In the last case, a DNN is being trained to decide when a different DNN is being “fooled”.. In this case, the classifier would be learning how to weight individual feature anomalies and which patterns of anomalies are most discriminating between TTE attack and non-attack instances.

VI-B The Challenge of Fake News and More General Injected Content Attacks

VI-B1 Sophisticated data poisoning attacks on classifiers

We have demonstrated some potential of clustering and AD to detect DP attacks. However, consider recent advances in creating realistic simulated data, such as generative adversarial networks (GANs) [24]. As an example, suppose one hand-crafts a small set of realistic-looking images of fighter aircraft used by country A, albeit with key but subtle aircraft characteristics altered so as to introduce ambiguity with fighters used by another country, B. GANs [24] could be subsequently used to mass-produce many such instances, starting from the small hand-crafted set. If these synthetic examples (labeled by A) are used in a DP attack, a learned classifier may mistake aircraft from B with those of A, with potentially dire consequences. These sophisticated DP attacks may be difficult to detect using clustering or outlier detection techniques. There are several potential ways to defeat such attacks. First, evaluation of the learned classifier on a validation set or test set may reveal more class confusion than expected (or more than is tolerable) between classes A and B. Second, filtering (rejection) of training data based on its provenance may block “outsider” DP attacks. Third, good system monitoring to detect insider threats is needed to defeat insider DP attacks.

VI-B2 Data Poisoning in an Unsupervised Learning Context

Consider the problem of unsupervised clustering/mixture modeling [19], albeit wherein the data batch to be analyzed has been poisoned to include samples from inauthentic clusters/classes. A good clustering algorithm should identify all the clusters in the data batch, including the inauthentic ones. Moreover, unless the inauthentic clusters are highly unusual1313 13 For example, they may have much greater within-cluster scatter, much greater between-cluster distance to all other clusters than do authentic clusters, or they may have highly atypical feature distributions (e.g., if features are binary, an inauthentic cluster may have an unusually large number of features that are frequently “on” (or “off”), compared with authentic clusters)., there may be no objective basis for detecting them as suspicious. Even if a cluster is unusual, its distinctiveness can, in fact, alternatively be reasonably interpreted as a validation of the cluster, at least compared with other clusters that are too similar to each other. The point here is that unsupervised detection of a DP attack on unsupervised learning is extremely challenging, and possibly ill-posed.

There are two promising approaches to detect such attacks. One assumes a scenario where there are two data batches, one whose content is known to be “normal” (representative of a null hypothesis), 𝒳null, and the other, 𝒳test, which may contain some anomalous (i.e., poisoned) content. In this setting, one can formulate a group (cluster) anomaly detection problem, with the null hypothesis that 𝒳test does not possess content different from 𝒳null and with the alternative hypothesis that 𝒳test contains one (or multiple) anomalous clusters relative to 𝒳null. This problem has been addressed in a quite successful fashion in several works, even considering the challenging case of huge feature spaces (e.g., tens of thousands of features per sample, as seen for document clustering domains), with anomalous clusters manifesting on only small, a priori unknown, subsets of the features [65, 52]. These works exploit previous parsimonious mixture modeling techniques [26, 64], and treat this hypothesis testing problem as one of model selection, with the null hypothesis model (learned on 𝒳null) and the alternative model (which consists of the null model plus at least one additional cluster, learned on 𝒳test) competing as the best descriptors for 𝒳test in the sense of a model-complexity penalized log-likelihood function such as the theoretically grounded Bayesian Information Criterion (BIC) [63]. The alternative hypothesis is chosen when the cost of describing an additional cluster’s parameters is more than offset by the associated gain in log-likelihood on the data points that are estimated to belong to this (putative anomalous) cluster. These works addressed the case of anomalous clusters in general, i.e. they did not address DP attacks per se, but they are quite suitable to detect them in 𝒳test when there is also an available (reference) attack-free batch, 𝒳null.

When no such reference batch is available, the most promising approach to detect data poisoning is by human operator/analyst inspection, and perhaps by leveraging information beyond just the observed data1414 14 Other information sources and knowledge bases may be used by the analyst to suitably “label” the clusters to categories pertinent to the application domain. In this labeling process, anomalous/attack clusters may be discovered..

VI-B3 “Alternative Facts” Attacks

In this paper, we have focused on attacks which induce an ML system to make classification errors. Even DP attacks on unsupervised clustering roughly fit this description, as unsupervised clustering is in fact unsupervised classification. As shown here, all of these attacks are vulnerable to AD (or clustering based) defenses. However, there are DP attacks with a different objective which are in general much more formidable to detect. To illustrate, consider a highly contemporary example in the news. The Mueller investigation shared document evidence with a Russian plaintiff’s lawyers. Apparently these lawyers then shared these documents with fake news agents, who altered the documents and then publicly shared them, claiming them as the “genuine articles”. This is not an attack on machine learning. It is a data integrity attack [3], whose goal is to obfuscate what is genuine and factual – in writ, to create “alternative facts”. Such attacks may be for political gain, to alter a criminal trial’s outcome, to alter a medical diagnosis or decision (by altering medical images or test results), and they may even alter or drive a government’s policies (image doctoring could lead policy analysts to believe a rogue nation is building WMDs). Such attacks appear to be extraordinarily difficult to detect using the statistical defenses advocated in this paper – the facts of a news article can be crucially modified simply by the addition of one word. For example: “Putin did [not] deny that the Russians colluded with Trump…”. Statistical detection is useless here, especially since once such fake news is widely promulgated its version of the facts may be just as prevalent (in the news, on social networks, and in print) as the genuine facts. Related attacks on images or audio files, if artfully made, may be similarly resistant to statistical AD. Again, the key to detecting such attacks may heavily rely on data provenance and e.g. on rigorous use of cryptographic techniques (as discussed above), including block-chain technology.

Beyond data integrity, there are “entity integrity” attacks which essentially seek to pass the Turing test, see e.g. work on neural dialogue systems [41]. These attacks are more amenable to statistical AD (or other detection strategies), since increasing the duration and range of topics covered in a dialogue will eventually test the (common sense skill) limits of an artificial conversant, likely producing responses that are statistically anomalous, logically inconsistent, or which betray large gaps in knowledge of the world.


  • [1] The CIFAR-10 dataset. kriz/cifar.html, 2010.
  • [2] I. Alabdulmohsin, X. Gao, and X. Zhang. Adding robustness to support vector machines against adversarial reverse engineering. In Proc. CIKM, 2014.
  • [3] M. Barreno, B. Nelson, R. Sears, A.D. Joseph, and J.D. Tygar. Can machine learning be secure? In Proc. ACM Symposium on Information, Computer and Communications Security, 2006.
  • [4] A. Bendale and T.E. Boult. Towards open set deep networks. In Proc. CVPR, 2015.
  • [5] B. Biggio, I. Corona, D. Majorca, B. Nelson, N. Srndic, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Proc. ECMLPKDD, 2013.
  • [6] B. Biggio, G. Fumera, and F. Roli. Security evaluation of pattern classifiers under attack. IEEE Trans. on Knowledge and Data Engineering, pages 984–996, 2014.
  • [7] B. Biggio, B. Nelson, and P. Laskov. Support vector machines under adversarial label noise. In Asian Conf. on Machine Learning, 2011.
  • [8] B. Biggio, I. Pillai, S.R. Bulo, D. Ariu, M. Pelillo, and F. Roli. Is data clustering in adversarial settings secure? In Proc. AISec, 2013.
  • [9] B. Biggio and F. Roli. Wild patterns: Ten years after the rise of adversarial machine learning. In Proc. ACM CCS, 2018.
  • [10] L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and regression trees. Wadsworth, 1984.
  • [11] N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou. Hidden voice commands. In Proc. USENIX Security Symposium, Austin, TX, August 2016.
  • [12] N. Carlini and D. Wagner. Adversarial examples are not easily detected: bypassing ten detection methods. In Proc. AISec, 2017.
  • [13] N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. In Proc. IEEE Symposium on Security and Privacy, 2017.
  • [14] N. Carlini and D. Wagner. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. In Proc. IEEE Security and Privacy Workshops, May 2018.
  • [15] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Malloy, and B. Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering., 2018.
  • [16] X. Chen, C. Liu, B. Li, K. Lu, and D. Song. Targeted backdoor attacks on deep learning systems using data poisoning., 2017.
  • [17] D.T. Davis and J.-N. Hwang. Solving Inverse Problems by Bayesian Neural Network Iterative Inversion with Ground Truth Incorporation. IEEE Trans. Sig. Proc., 45(11), 1997.
  • [18] A. Demontis, M. Melis, B. Biggio, D. Maiorca, D. Arp, K. Rieck, I. Corona, G. Giacinto, and F. Roli. Yes, Machine Learning Can Be More Secure! A Case Study on Android Malware Detection. Transactions on Dependable and Secure Computing, 2017.
  • [19] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification. Wiley, 2001.
  • [20] R. Feinman, R. Curtin, S. Shintre, and A. Gardner. Detecting adversarial samples from artifacts., 2017.
  • [21] Y. Gao, C. Xu, D. Wang, S. Chen, D.C. Ranasinghe, and S. Nepal. STRIP: A Defence Against Trojan Attacks on Deep Neural Networks.
  • [22] J. Gilmer, R.P. Adams, I Goodfellow, D. Anderson, and G.E. Dahl. Motivating the rules of the game for adversarial example research., 2018.
  • [23] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.
  • [24] J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In Proc. Neural Information Processing Systems (NIPS), 2014.
  • [25] A. Graese, A. Rosza, and T.E. Boult. Assessing threat of adversarial examples on deep neural networks. In Proc. ICMLA, 2016.
  • [26] M.W. Graham and D.J. Miller. Unsupervised learning of parsimonious mixtures on large spaces with integrated feature and component selection. IEEE Transactions on Signal Processing, pages 1289–1303, 2006.
  • [27] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (statistical) detection of adversarial examples., 2017.
  • [28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
  • [29] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proc. ICLR, 2017.
  • [30] D. Hendrycks and K. Gimpel. Early methods for detecting adversarial images. In Proc. ICLR - Workshop track, 2017.
  • [31] L. Huang, A.D. Joseph, B. Nelson, B.I.P. Rubinstein, and J.D. Tygar. Adversarial machine learning. In Proc. 4th ACM Workshop on Artificial Intelligence and Security (AISec), 2011.
  • [32] A. Ilyas, L. Engstrom, and A. Madry. Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors. In Proc. ICLR, 2019.
  • [33] B.-H. Juang and S. Katigiri. Discriminative learning for minimum error classification. IEEE Trans. on Signal Processing, 1992.
  • [34] D. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In NIPS, 2011.
  • [35] C. Kereliuk, B. Sturm, and J. Larsen. Deep learning and music adversaries. IEEE Trans. on Multimedia, 2015.
  • [36] C. Kereliuk, B. Sturm, and J. Larsen. Deep learning, audio adversaries, and music content analysis. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 2015.
  • [37] P.W. Koh, J. Steinhardt, and P. Liang. Stronger data poisoning attacks break sanitization defenses., 2018.
  • [38] J. F. Kurose and K. W. Ross. Computer Networking - A Top Down Approach Featuring the Internet. Addison-Wesley, 3rd edition, 2004.
  • [39] A. Kurve, D.J. Miller, and G. Kesidis. Multicategory Crowdsourcing Accounting for Plurality in Worker Skill and Intention and Task Difficulty. IEEE Trans. Knowledge and Data Engineering (TKDE), March 2015.
  • [40] Y. LeCun, C. Cortes, and C.J.C. Burges. The MNIST database of handwritten digits., 1998.
  • [41] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky. Adversarial learning for neural dialogue generation. In Proc. Conf. on Empirical Methods in Natural Language Processing, 2017.
  • [42] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. In Proc. ICCV, 2017.
  • [43] C. Liao, H. Zhong, A. Squicciarini, S. Zhu, and D.J. Miller. Backdoor embedding in convolutional neural network models via invisible perturbation. CoRR, 2018.
  • [44] K. Liu, B. Doan-Gavitt, and S. Garg. Fine-Pruning: Defending Against Backdoor Attacks on Deep Neural Networks. In Proc. RAID, 2018.
  • [45] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In CSEAS, 2005.
  • [46] J. Lu, T. Issaranon, and D. Forsyth. Safetynet: detecting and rejecting adversarial examples robustly. In Proc. ICCV, 2017.
  • [47] D. Meng and H. Chen. Magnet: a two-pronged defense against adversarial examples. In Proc. CCS, 2017.
  • [48] J. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In Proc. ICLR, 2017.
  • [49] B. Miller, A. Kantchelian, S. Afroz, R. Bachwani, E. Dauber, L. Huang, M.C. Tschantz, A.D. Joseph, and J.D. Tygar. Adversarial active learning. In Proc. Workshop on Artificial Intelligence and Security (AISec), 2014.
  • [50] D. Miller, X. Hu, Z. Qiu, and G. Kesidis. Adversarial learning: a critical review and active learning study. In Proc. IEEE MLSP, 2017.
  • [51] D.J. Miller, X. Hu, Z. Xiang, and G. Kesidis. A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters., Oct. 2018.
  • [52] D.J. Miller, Z. Qiu, and G. Kesidis. Parsimonious Cluster-based Anomaly Detection (PCAD). In Proc. IEEE MLSP, Aalborg, Denmark, Sept. 2018.
  • [53] D.J. Miller and H.S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. Advances in Neural Information Processing Systems, pages 571–577, 1997.
  • [54] D.J. Miller, Y. Wang, and G. Kesidis. Anomaly Detection of Attacks (ADA) on DNN Classifiers at Test Time. Neural Computation (to appear), shorter version in Proc. IEEE MLSP, Aalborg, Denmark, Sept. 2018,
  • [55] S. Mooosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2017.
  • [56] B. Nelson, M. Barreno, F.J. Chi, A.D. Joseph, B.I.P. Rubinstein, U. Saini, C. Sutton, J.D. Tygar, and K. Xia. Misleading learners: Co-opting your spam filter. In Machine Learning in Cyber Trust: Security, Privacy, and Reliability, 2009.
  • [57] Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3):103–134, 2000.
  • [58] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. Celik, and A. Swami. Practical black box attacks against machine learning. In Proc. Asia CCS, 2017.
  • [59] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z.B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In Proc. 1st IEEE European Symp. on Security and Privacy, 2016.
  • [60] N. Papernot, P. McDaniel, Xi Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016.
  • [61] Z. Qiu, D.J. Miller, and G. Kesidis. Semisupervised and active learning with unknown or label-scarce categories. IEEE Trans. on Neural Networks and Learning Systems (TNNLS), 2016.
  • [62] L. Rabiner and B.H. Juang. Fundamentals of speech recognition. Prentice Hall, 1993.
  • [63] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
  • [64] H. Soleimani and D.J. Miller. Parsimonious topic models with salient word discovery. IEEE Transactions on Knowledge and Data Engineering, 2015.
  • [65] H Soleimani and DJ Miller. ATD: Anomalous topic discovery in high dimensional discrete data. IEEE Transactions on Knowledge and Data Engineering, 2016.
  • [66] N. Srivastavanitish, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
  • [67] J. Steinhardt, P.W. Koh, and P. Liang. Certified defenses for data poisoning. In Proc. NIPS, 2017.
  • [68] C. Szegedy, W. Zaremba, I Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proc. ICLR, 2014.
  • [69] F. Tamer, F. Zhang, A. Juels, M. Reiter, and T. Ristenpart. Stealing machine learning models via prediction apis. In Proc. USENIX Security Symposium, 2016.
  • [70] B. Tran, J. LI, and A. Madry. Spectral signatures in backdoor attacks. In Proc. NIPS, 2018.
  • [71] D. Tsipras, S. Santurkar, L. Engstrom, and A. Turner or A. Madry. Robustness may be at odds with accuracy. In Proc. ICLR, 2019.
  • [72] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
  • [73] Q. Wang, W. Guo, K. Zhang, A. Ororbia, X. Xing, L. Giles, and X. Liu. Adversary resistant deep neural networks with an application to malware detection. In Proc. KDD, 2017.
  • [74] Y. Wang, D.J. Miller, and G. Kesidis. When Not to Classify: Detection of Reverse Engineering Attacks on DNN Image Classifiers. In Proc. IEEE ICASSP, Brighton, UK, May 2019,
  • [75] H. Xiao, B. Biggio, B. Nelson, H. Xiao, C. Eckert, and F. Roli. Support vector machines under adversarial label contamination. Neurocomputing, 160(C):53–62, July 2015.
  • [76] V. Zantedeschi, M. Nicolae, and A. Rawat. Efficient defenses against adversarial attacks. In Proc. AISec, 2017.
  • [77] H. Zhang, H. Chen, Z. Song, D. Boning, I. Dhillon, and C.-J. Hsieh. The limitations of adversarial training and the blind-spot attack., 2019.