Adversarial or test time robustness measures the susceptibility of a machinelearning system to small perturbations made to the input at test time. This hasattracted much interest on the empirical side, since many existing ML systemsperform poorly under imperceptible adversarial perturbations to the testinputs. On the other hand, our theoretical understanding of this phenomenon islimited, and has mostly focused on supervised learning tasks. In this work we study the problem of computing adversarially robustrepresentations of data. We formulate a natural extension of PrincipalComponent Analysis (PCA) where the goal is to find a low dimensional subspaceto represent the given data with minimum projection error, and that is inaddition robust to small perturbations measured in $\ell_q$ norm (say$q=\infty$). Unlike PCA which is solvable in polynomial time, our formulationis computationally intractable to optimize as it captures the well-studiedsparse PCA objective. We show the following algorithmic and statistical results. - Polynomial time algorithms in the worst-case that achieve constant factorapproximations to the objective while only violating the robustness constraintby a constant factor. - We prove that our formulation (and algorithms) also enjoy significantstatistical benefits in terms of sample complexity over standard PCA on accountof a "regularization effect", that is formalized using the well-studied spikedcovariance model. - Surprisingly, we show that our algorithmic techniques can also be maderobust to corruptions in the training data, in addition to yieldingrepresentations that are robust at test time! Here an adversary is allowed tocorrupt potentially every data point up to a specified amount in the $\ell_q$norm. We further apply these techniques for mean estimation and clusteringunder adversarial corruptions to the training data.
Quick Read (beta)
Adversarially Robust Low Dimensional Representations
Adversarial or test time robustness measures the susceptibility of a machine learning system to small perturbations made to the input at test time. It is now well known that even when trained on high quality training data, many machine learning systems perform poorly under imperceptible adversarial perturbations to the test inputs [szegedy2013intriguing]. There is a large body of work proposing practical methods to make classifiers adversarially robust. On the other hand, our theoretical understanding of the phenomenon of adversarial robustness is limited, and has mostly focused on supervised learning tasks such as binary classification.
In this work we study the problem of computing adversarially robust representations of data. We formulate a natural extension of Principal Component Analysis (PCA) where the goal is to find a low dimensional linear subspace to represent the given data with minimum projection error (measured in Frobenius norm or spectral norm) and that is in addition robust. When adversarial perturbations are measured in norm with (say ), the robustness constraint naturally corresponds to controlling the operator norm of the orthogonal projection matrix onto the subspace. Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize; even when the subspace is one-dimensional this captures the well-studied sparse PCA objective.
We show the following algorithmic and statistical results.
Polynomial time algorithms in the worst-case that achieve constant factor approximations to the objective while only violating the robustness constraint by a constant factor. This holds for both the Frobenius norm error, and the spectral norm error objectives.
We prove that our formulation (and algorithms) also enjoy significant statistical benefits in terms of sample complexity over standard PCA on account of a “regularization effect” (analogous to sparsity), that is formalized using the well-studied spiked covariance model.
Surprisingly, we show that our algorithmic techniques can also be made robust to corruptions in the training data, in addition to yielding representations that are robust at test time! Here an adversary is allowed to corrupt potentially every data point in (in general, norm) up to a specified amount. We further use these techniques to solve two other fundamental unsupervised learning problems, namely mean estimation and clustering, under adversarial corruptions to the training data.
The above results further validate our model and indicate that the study of adversarially robust unsupervised learning could lead to insights that have wide applicability.
Reliability and trustworthiness of machine learning systems is a key requirement for their secure adoption in day to day life. While robustness to errors of various forms like training data corruptions or data poisoning, drifting data distributions and partial observability are desirable, recently the phenomenon of adversarial robustness or test time robustness has posed a significant hurdle to the design of reliable learning systems. Szegedy et al. [szegedy2013intriguing] identified that learning algorithms like deep neural networks trained on even carefully curated, high quality datasets such as ImageNet [deng2009imagenet] are susceptible at test time to small adversarial corruptions to the test example. These adversarial corruptions are imperceptible to the human eye, yet cause the trained networks to produce an incorrect classification. This represents a potential security vulnerability for critical applications like self driving cars, and secure facial recognition systems.
On the theoretical front, our understanding of adversarial robustness is limited. A series of recent works have started to explore fundamental questions such as the sample complexity of adversarial learning and the inherent computational and statistical tradeoffs needed for achieving adversarial robustness [khim2018adversarial, yin2018rademacher, bubeck2018adversarial, bubeck2018adversarial2, attias2018improved, montasser2019vc], focusing almost exclusively on supervised learning. In contrast there is a substantial body of work formalizing and studying training data corruptions in computer science and statistics, spanning both supervised and unsupervised learning (see e.g., [huber2011robust, tukey1975mathematics, angluin1988learning, kearns1994toward, candes2011robust, diakonikolas2019robust, lai2016agnostic]).
The primary motivation of this work is to formulate and study adversarial robustness in the context of learning data representations and other basic unsupervised learning tasks. A representation given by a function is considered adversarially robust if for any point and its perturbation , it holds that if , then . Downstream learning tasks like classification and regression trained on such robust representations are more likely to have built in robustness. As we will see later, algorithms for learning adversarially robust representations may also have surprising implications for robustness to training-time perturbations. Furthermore, requiring a data representation to be adversarially robust is a natural notion of individual fairness [zemel2013learning, madras2018learning, oneto2019learning]. This leads to the following natural question: can we efficiently learn succinct representations of data that are robust to adversarial perturbations? We will formulate and study this question in the context of principal component analysis.
Principal component analysis (PCA) or low-rank approximations is the predominant tool for obtaining succinct data representations, and is used as a preprocessing primitive in almost every machine learning pipeline. Given data in a high-dimensional space represented by the columns of a matrix , the goal is to find a subspace of dimension at most to represent the points, that minimizes the projection error (or reconstruction error) onto the subspace, formalized as
where the matrix norm is either the Frobenius norm or the spectral norm. The representation of each point corresponds to the projection onto the -dimensional subspace given by (one can also represent the point as an -dimensional vector in terms of a basis for ). Following existing literature on adversarial robustness, we model an imperceptible adversarial perturbation of a point as one for which the norm of the difference is small, i.e., , for some fixed (more generally, we also consider norm where ). Given an -dimensional subspace of with projection matrix , the adversarial robustness of to -perturbations in the norm is then exactly captured by
Observe that characterizes the robustness of the projection to perturbations in norm around every point . The distance between the projections of and a -perturbation of (in norm) is upper bounded by . On the other hand, around each point one can also realize a perturbation with such that . We will call a -robust rank- projection when is an orthogonal projection matrix of rank at most with ; when the robustness parameter and norm are understood, we will just call it a robust rank- projection. Hence, our algorithmic goal given a data matrix composed of points in , a robustness parameter and the norm , is to find a robust low-rank projection with low error formalized by
We will be interested in two versions of the problem, depending on whether we measure the error in Frobenius norm or spectral norm. Recall that the top- terms of the Singular Value Decomposition of simultaneously solves both of these problems in polynomial time, when there is no additional robustness constraint. We also remark that just as for the PCA objective (1), the above objective (3) can be equivalently rephrased as finding the best approximation among low-rank matrices, but among those with a “robust column space” (see Lemma 4.5).
Our optimization problem (3) while motivated by adversarial robustness, has rich connections to (and implications for) well studied problems like the sparse PCA problem [johnstone2009consistency]. Consider the setting when the perturbations are measured in norm and rank . The robustness constraint on the projection imposes an upper bound of on the “analytic sparsity” of the direction (measured in terms of ratio of and norms). Hence in the special case of ,
The complementary objective is the version of the maximization sparse PCA objective,11 1 It is within a factor of the version where the constraint is replaced by (see Section 10.3.3 of [Vershynin]). which is notoriously hard in the worst-case [chan2016approximability].
For general , requiring robustness places a constraint on the dual norm of the direction . Moreover for general rank , gives an upper bound on the norm of every unit vector in the subspace given by (see Lemma 4.3 for the statement, and Lemma 4.4 for an approximate converse). As we will see in Section 2.1, we will design algorithms for finding approximately optimal low-dimensional representations that are indeed robust to test-time adversarial perturbations.
From a practical perspective it is desirable to have robustness to adversarial perturbations at both training-time and test-time. Surprisingly, we show that our algorithmic techniques can be used to achieve this! We study a strong model of training-time corruptions, where every coordinate of every point in the training data set can be adversarially perturbed up to a specified amount (in general the perturbation of every point is measured in norm with ). Apart from the natural motivation of being secure against poisoned data, this model of corruption naturally arises in emerging paradigms such as low precision computation [de2017understanding, de2018high], where only the few most significant bits of each data point is stored in memory, thereby allowing for processing of more training examples. While there have many exciting recent developments in designing efficient robust algorithms for high-dimensional statistical problems, they deal with a fundamentally different model of corruption where a small fraction of the points (outliers) are corrupted arbitrarily [diakonikolas2019robust, lai2016agnostic, charikar2017learning, diakonikolas2018sever]. Our algorithms form a new versatile primitive for achieving robustness. We demonstrate this by applying them to get new training-time robustness guarantees for fundamental unsupervised learning tasks such as mean estimation and clustering.
2 Our Results
In all the results that follow, it will be instructive to think of the setting when i.e., each point can be perturbed adversarially up to an amount in every coordinate. However, our results will apply for all generally. Intuitively, the larger the choice of , the more unrestricted the adversarial perturbations can be (since when ).
2.1 Worst-case Approximations for Robust Low-Rank Projections.
We first consider the two variants of problem (3) i.e., the Frobenius norm error and spectral norm error versions in the worst-case setting. We give efficient algorithms that attain small constant factor approximation algorithms (approaching for the Frobenius norm error, and for the spectral norm error) when we are also allowed to relax the robustness constraint on the projection by a constant factor.
(Informal) Theorem 2.1.
There exist polynomial time algorithms that given a data matrix , and a robustness parameter , find a projection matrix of rank at most s.t.
where is the error obtained by the optimal -robust projection of rank at most ( for the Frobenius norm error objective and for the spectral norm error objective). Moreover, for any , there exist polynomial time algorithms that find an -dimensional projection that gets an -approximation to the objective, and relaxes the robustness parameter by factor.
The algorithms for both objectives – Frobenius norm and spectral norm, both use convex relaxations and use similar ideas. However the algorithms for these two different objectives are different (starting with the relaxations) unlike standard PCA. Please see Theorem 5.1 (Frobenius norm objective) and Theorem 5.7 (spectral norm objective) for the formal statements.
Observe that the approximation factor guarantee in Theorem 2.1 is independent of the desired rank . We remark that our algorithmic guarantees are non-trivial even if we do not want to restrict the rank ; in particular when (unrestricted) our algorithm finds among all subspaces that are -robust, the one with approximately optimal error. The constant factor loss in the robustness parameter depends on the choice of the norm , and remains a constant for all . When , the constant factor that arises is related to the constant in a variant of the Grothendieck problem [alon2004approximating, nesterov1998semidefinite] (the constant is at most ). Hence the above theorem gives a -factor bicriteria approximation when we need an approximation of rank at most ; however we can also achieve an by incurring an extra loss in the rank.
Our result also has new implications for approximating the minimization objective for -sparse PCA. Sparse PCA has been studied under average case models [BerthetRigollet], as well as worst case models [chan2016approximability]. In particular the special case of our result for provides a small constant factor bicriteria approximation to the minimization version of the sparse PCA problem under constraint on the sparsity. This is in stark contrast to the approximability of the maximization version of the problem. Even in the special case when , the best known polynomial time algorithm gives a factor approximation in the worst-case (for both the and versions). This is true even when the sparsity can be relaxed by a factor; moreover no constant factor approximation is possible assuming the Small Set Expansion conjecture [chan2016approximability]. See also Appendix B for computational hardness of (3) based on this observation. Furthermore, as we will see in a bit, the minimization version that we study and our constant factor approximations for this problem will also be crucial in various downstream applications in unsupervised learning.
2.2 Statistical Benefits of Robust Projections.
While the operator norm constraint arises naturally when we need robustness to small adversarial perturbations, it also induces a regularization effect due to the “sparsity” of vectors in the robust subspace given by (see Lemma 4.4). This brings with it additional statistical benefits, which we demonstrate by studying the classic high-dimensional statistical problem of covariance estimation and recovery in the well-studied spiked covariance model [johnstone2001distribution, aminiwainwright, berthet2013]. The spiked covariance model SCM has two sets of parameters:the signal strength specified by the pair with , and the unknown covariance matrix of rank with its eigenvalues in the interview . The top- eigenspace of is given by the projection matrix of rank that is -robust for . Each sample from the distribution is drawn independently from -variate Gaussian .
We will be interested in two different goals – detection and the stronger goal of recovery. The goal in detection will be to distinguish w.h.p. whether the sample data was generated by the spiked covariance model SCM() or from a standard Gaussian . Our goal in recovery is to output an estimate for , with error measured in terms of the Frobenius norm of . We first state the guarantee for and discuss the case of general at the end of the section.
(Informal) Theorem 2.2.
Fix and let be a -robust projection of rank for subspace spanned by the non-trivial eigenvectors of the rank- covariance matrix . For any , samples from SCM suffice to recover a covariance matrix having its top- eigenspace given by a -robust rank- projection such that w.h.p., . At the same time, samples suffice for detection.
Please see Theorem 8.3 and Theorem 8.5 for the formal statements. We also prove our upper bounds are optimal up to factors for (see Theorem 8.6 for the lower bound). Note that for the task of detection, we actually save the additional factor in the sample complexity compared to recovery.
The above bound shows that when , we get significant statistical benefits compared to traditional covariance estimation in high-dimensions, which requires samples even for recovering the principal component () up to good accuracy. This sharply contrasts with our current understanding of adversarial robustness in classification problems where current evidence suggests that the sample complexity for learning robust classifiers may be significantly higher compared to the non-robust case [schmidt2018adversarially, yin2018rademacher, montasser2019vc].
We remark that sample complexity upper bounds are known in the spiked covariance model, when (or the subspace that corresponds to it) satisfies some sparsity constraints [aminiwainwright, VuLei13, Liuetal]. Our results are generally incomparable since we parameterize by the operator norm of which is naturally motivated from robustness considerations. However somewhat surprisingly, our analysis often matches or in fact improves upon the sample complexity upper bound even in the other parameterization, in certain regimes. See Remark 8.4 for details.
The estimation algorithm in Theorem 2.2 is computationally inefficient. We now show how we can simultaneously achieve computational efficiency and statistical efficiency in finding an estimate of ; moreover the top- eigenspace of has a robust projection matrix that is also close to .
(Informal) Theorem 2.3.
There is a polynomial time algorithm that given samples in drawn i.i.d. from the spiked covariance model SCM() with the top- eigenspace of given by a -robust rank- projection , finds w.h.p. a rank- covariance matrix with its top- eigenspace being an -robust rank- projection such that , as long as for some universal constant . Moreover suffices for the detection problem w.h.p.
Please see Theorems 8.8 and 8.9 for the formal statements. We remark that even for , the dependence of is necessary assuming the Planted Clique assumption, due to known lower bounds from sparse PCA [BerthetRigollet]. This points to an interesting statistical vs computational tradeoff that would be interesting to explore in the other regimes of parameters.
We also generalize our results to . We extend the information upper bound in Theorem 2.2 to show random samples suffice for recovery statistically. On the other hand, we provide an information-theoretic lower bound showing that this dependence in terms of is almost tight for (in Theorem 8.7). In particular, we prove that for , the sample complexity incurs a polynomial dependence of . Finally, we remark that our polynomial time algorithm in Theorem 2.3 also works for with an extra factor of (instead of ) in the sample complexity.
2.3 Robustness to Adversarial Errors during Training
Surprisingly, the notion of test-time robustness and the algorithms that we have developed (Theorem 2.1) also allow us to handle robustness to adversarial perturbations in the training data set (this is often referred to as data poisoning). We propose a corruption model under which every training sample can potentially be corrupted adversarially up to a amount, as measured in norm (more generally norm for ). Our input instance is where every column satisfies ; we will refer to such an as a -corrupted instance of . Our goal now is to recover a robust low-rank projection for the uncorrupted matrix . We will show that we can in fact output a robust low-rank projection that is competitive with the best robust low-rank projection of , even though is not known to us!
(Informal) Theorem 2.4.
Suppose and is the unknown uncorrupted data matrix, with a -robust projection matrix of rank at most satisfying for some . There exists a polynomial time algorithm that given as input a -corrupted instance of outputs an -dimensional projection that is approximately optimal
In particular this gives an approximation when .
To interpret the results, let and consider an uncorrupted dataset where every column (sample) is a unit vector in , and let be a small polynomial of , say . When i.e., every coordinate can be perturbed up to , the total corruption to each point is at most in Euclidean norm; in this case one would expect that standard PCA applied to will approximately recovery a good low-rank approximation. The above Theorem 2.4 on the other hand guarantees to find a good (robust) low-rank approximation for the unknown matrix even when . Note that in this setting every point can be completely overwhelmed by the adversarial noise (in Euclidean norm). Furthermore, the above guarantees are optimal up to constant factors; in particular, Proposition 6.7 shows that the additive factor of is unavoidable for every . These results together suggest that our notion of robust projections (measured in operator norm) is key in understanding the robustness to small adversarial perturbations of every point during training as well.
Our guarantees for spectral norm error in the presence of training-time adversarial perturbations are somewhat similar to Theorem 5.1. However, there is a qualitative difference: we will either find a robust low-dimensional projection of the unknown dataset , or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a low-dimensional representation that is bad for the unknown data matrix . In what follows will refer to the spectral norm.
(Informal) Theorem 2.5.
Fix and . Let be the unknown uncorrupted data matrix with a -robust projection of rank at most with . There exists a polynomial time algorithm that given as input a -perturbed instance , outputs either a robust projection of rank at most with
or certifies that the data has been poisoned i.e., .
Please see Theorem 6.4 for a formal statement. Here again, the additive term of is unavoidable as shown in Proposition 6.7. We remark that information-theoretically we can design an estimator (that is computationally inefficient) that achieves the stronger qualitative guarantees as in Theorem 2.4. Designing a computationally efficient algorithm to do the same is a tantalising open question that we describe in more detail in Section 2.5.2. Finally, we would like to point out that our algorithms output -robust projections, and as a result we also get robustness to test-time perturbations.
Comparison to Robust PCA. The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix is a sum of two matrices, the true matrix that is low-rank and a sparse corruption matrix with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of is spread out – else recovery of is impossible here. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of ) could be corrupted up to some specified magnitude . Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of ). Hence we believe that our algorithms could give a new robust primitive that is fundamentally different from existing techniques like robust PCA, for downstream applications in ML.
2.4 Applications to Unsupervised Learning.
As a concrete application of our guarantees from the previous section we study mean estimation and clustering under adversarial perturbations to potentially every data point in the training set. For mean estimation, we assume that the uncorrupted data matrix has mean vector denoted by . The guarantee below shows that in order to recover a good approximation to , the amount of corruption (the parameter) that can be tolerated is directly related to the inherent robustness of the one dimensional subspace spanned by . Please see Theorem 7.1 for a formal statement.
(Informal) Theorem 2.6.
[Robust Mean Estimation] Fix and consider with . Let represent an matrix with copies of , and satisfy . Let be the one dimensional subspace denoting the projection onto such that . There exists a polynomial time algorithm that given as input a -corrupted instance of , either certifies that the data has been poisoned, i.e., , or outputs an estimate such that .
Equivalently, the above implies a relative error guarantee of . Here captures the noise-to-signal ratio. We make a few remarks about the above claim. When , we get a relative error guarantee of . Consider the case of , and being a -sparse (in the sense) unit length vector. Such estimation problems concerning sparse mean vectors arise naturally in many applications and have been widely studied in statistics [donoho1992maximum, donoho1994minimax, johnstone1994minimax]. When every point could be corrupted, our guarantee allows for an adversary to add perturbation per coordinate, as opposed to . Notice that our mean estimation guarantee has a dependence on . In Theorem 7.3 we show that in our model of corruption, this dependence is unavoidable and hence the error in our mean estimation bound is information theoretically optimal upto constant factors.
Next, we show how the robust mean estimation procedure, along with our guarantees from the previous section, can be used to perform robust clustering of well clustered instances. In particular, we modify the popular Lloyd’s algorithm [lloyd1982least] to get robustness to perturbations of every data point. As is done in the standard Lloyd’s analysis, we study the spectral stability condition proposed by Kumar and Kannan [kumar2010clustering], and studied later in [awasthi2012improved, cohen2017local]. This condition captures, as special cases, mixtures of well separated Gaussians and other distributions, and helps in developing a unified analysis of Lloyd’s updates. Below we state our results assuming equal cluster sizes. Our results hold for the more general case and we defer to Section 7.2 for more details. Let be clustered into clusters of equal sizes with means . Furthermore, let be the matrix of corresponding centers for each column of and let be such that . Then satisfies -spectral stability if for each pair of optimal clusters, say, cluster and with means and , any point in cluster , when projected onto the line joining and is closer to than by an additive amount of . Here is a quantity that captures the signal-to-noise ratio and the relative perturbation magnitude, i.e. , as in robust mean estimation22 2 Note that unlike clustering without corruptions, the dependence on is unavoidable in our model even for , i.e., robust mean estimation.. In the special case when is a set of points drawn i.i.d. from a mixture of Gaussians with the variance of each Gaussian being bounded by , and with uniform mixture weight each, the separation condition becomes . In the theorem below we denote to be the robustness, as measured in , of the subspace spanned by the true means .
(Informal) Theorem 2.7.
[Robust Clustering] Fix , and let be a constant that depends on . Let satisfy -spectral stability, for . Then given as input a -corrupted instance of , there is a Lloyd’s style algorithm that either certifies that the dataset is poisoned, i.e, , or recovers each mean up to error . Using the computed centers to cluster , we obtain a clustering of such that the corresponding induced clustering on that misclassifies - fraction of the points.
In the special case of a mixture of Gaussians with equal mixing weights we get the means upto error , where we hide a factor in the notation. This implies -fraction clustering error.
Finally, as in Section 2.3 we can also prove that information theoretically there is an algorithm (though computationally inefficient) that performs robust mean estimation up to the (near optimal) error of on all instances (it will estimate the uncorrupted mean even for instances that are certified to have been poisoned). See Theorem 7.4 for the formal statement. As a result, we can also cluster well-clustered instances information-theoretically up to the claimed error above, without the need for certification (See Section 7 for details). Whether this can be achieved in polynomial time is an open question.
2.5 Overview of Challenges and Techniques
We now give a flavor of the technical ideas involved in obtaining our results. For the sake of exposition, we will mainly restrict our attention to the case when the adversarial perturbations are measured in norm.
2.5.1 Worst-case Constant Factor Approximations
Let us first consider the version of problem (3) of finding a robust rank- projection that has small error measured in Frobenius norm. A natural mathematical programming relaxation is the following:
This is a valid relaxation for the problem since the constraints are all satisfied by any rank- projection matrix that is robust i.e, . Moreover this relaxation is convex; the last constraint in particular is an upper bound on a valid norm. Let be the optimal solution of this program. The first challenge however is that the operator norm constraint (10) is NP-hard to verify efficiently. In general, these operator norm computation problems are APX-hard for most values of . They form a rich class of problems related to the Grothendieck problem [alon2004approximating, nesterov1998semidefinite], and we can use polynomial time factor approximations that are known for general norms with (see Section 4.1).
The bigger challenge here is in producing a projection matrix from that (a) achieves a good objective value (b) has rank at most , and (c) is -robust (bounded operator norm). A natural approach for producing a good low-rank solution is to output a rank- projection matrix , by considering the large singular values of (e.g., the top- singular vector space of ). However we have no control on the robustness of the subspace . In fact, problem (3) is challenging even without the rank constraint on the subspace (i.e., for ). The main issue is to relate the operator norm of the projection matrix we output to that of the relaxation solution which is upper bounded by .
Our crucial insight is that we can indeed design such a rounding scheme that achieves all three goals if the norm in the constraint (10) is a monotone norm! A general matrix norm is monotone iff
(See Definition 4.6 for details.) This monotonicity property allows us to truncate terms in the eigendecomposition without any loss in robustness , and get fine control on the robustness when we rescale different PSD terms slightly. Unfortunately however, the operator norm is not monotone (see Claim A.2; see also Claim A.1 for counterexamples for various other matrix norms that are not unitarily-invariant).
Our next important observation is that we can replace the constraint (10) by a similar constraint in terms of the norm (more generally norm in terms of the norm). This is because for any matrix , we have that where is the dual norm for and satisfies . The main advantage of this reformulation is that the operator norms are indeed monotone (see Lemma 5.6 for a simple proof). Moreover these norms can be efficiently separated (within an slack factor) when since .
Our final algorithm approximately solves the mathematical program (9) with constraint (10) replaced by . We produce a low-rank projection matrix by first truncating the solution to the program to the large singular values, and then picking the terms that contribute the most towards the objective. Our analysis leverages the monotonicity property to prove -robustness, while also achieving an approximation to the objective. See Theorem 5.3 for an analysis with a general monotone norm constraint, and Theorem 5.1 for our desired problem.
Similar ideas can be used when the error is measured in terms of the spectral norm instead of the Frobenius norm. The objective function in the mathematical relaxation (9) is instead rephrased as where is the spectral norm. Theorem 5.7 gives a slightly different rounding and analysis that again leverages the monotonicity property of operator norms.
2.5.2 Robustness to Training-Time Adversarial Perturbations (Data Poisoning)
Let . Recall that our input instance is a -corrupted instance obtained from by potentially corrupting every entry of it. Our goal is to output a robust low-rank projection matrix of rank at most for the uncorrupted matrix , that is not known to us. This question is interesting even from a purely statistical standpoint; but additionally, we would also like our algorithm to run in polynomial time.
Why should this be possible? Suppose the uncorrupted matrix has a robust low-rank projection of small error i.e., (where is either the Frobenius norm or spectral norm). Also assume for just this discussion that the average column (Euclidean) length of is 1 (or even that each column is of unit length), say and . For any -robust projection , for each data point . So one could apply the worst-case algorithm to find a robust projection for the corrupted input , and hope to also get a robust subspace of low-error for the unknown matrix .
However, there are two major challenges in implementing this strategy. (1) Solution value of : the robust projection may not achieve low error on the input matrix ; in fact, may not have any good robust low-rank approximation – in this case the algorithm output may be useless. This is because the entry-wise perturbations could make and far away in aggregate e.g., the spectral norm of could be (whereas, even . (2) Identifiability issue: perhaps more importantly, even if the perturbation has a robust low-rank robust projection of small error, we need to argue that this subspace indeed attains small error on ! The second issue is crucial in resolving the purely statistical aspect of the question; it involves ruling out the scenario where has good robust low-rank approximation that is very different from any for .
Issue (2): Identifiability. To address the second issue, we prove that if the projection gives a small error on , it necessarily gives a low-error on . Roughly speaking, if there are two data-matrices and that are both -perturbations of each other i.e., , then for
where . This statement is particularly tricky to show for the spectral norm (See Lemma 6.5 for a formal statement). We know that and are small since are robust (see Lemma 4.3); however this does not give a handle on . A natural approach is to argue that the actions of and on any unit vector are similar. The difficulty is in handling scenarios where some directions in are close to the subspace of , yet their orthogonal component is small, but spread out (even when ). Our proof is somewhat indirect; we show that for every direction , (1) the lengths and are similar and (2) the difference in the lengths is (approximately) lower bounded by . This will allow us to conclude that is small in spectral norm.
Issue (1): Solution value of . To tackle the first question, we do know that there is a data matrix (in particular the uncorrupted matrix ) that can be obtained from by perturbing each entry by at most (i.e., it is in the neighborhood of ) that has a robust low-rank approximation of low value . Let us suppose that we can solve the following optimization problem:
We know that is a feasible solution, and hence has a robust low-rank approximation of even smaller error. Moreover . This reduces the first issue to a purely computational question of solving the above optimization problem.
We show that when the error is measured in Frobenius norm, we can solve this problem approximately by instead solving a simpler optimization problem of finding that over all matrices that are valid -perturbations of i.e., . The minimizer here just reduces the magnitude of each entry by or until it is . For more general , this corresponds to a simple convex minimization problem. For the spectral norm problem, we do not have an efficient algorithm. However by running the constant factor approximation algorithm for robust low-rank approximations (in spectral norm) for , we will either find a good solution that also works for , or we will certify that has no good robust low-rank approximation; this certifies that is too large i.e., the data was poisoned significantly.
Finally we get the stronger computationally efficient guarantee for the spectral norm error (as for the Frobenius norm error) if we can resolve the following open question.
Open Question. Given , is there a polynomial time bicriteria approximation algorithm for
2.5.3 Applications to Unsupervised Learning
Our low rank approximation guarantees that tolerate corruptions to training data have interesting consequences for unsupervised learning, even in the special case when the subspace rank equals one (for mean estimation). More importantly, these guarantees when used carefully, lead to a provably robust clustering algorithm. Establishing this is quite non-trivial and is technically involved. Here we will mention the key insights.
At a high level we need to ensure (both algorithhmically and in the analysis) that running an iterative procedure on a dataset where every point could be corrupted does not compound the errors. For mean estimation, Theorem 2.5 gives an algorithm that either certifies that the data is poisoned, or outputs a one dimensional subspace that is good for both and . It is easy to see then that will be a good estimate of . However, this yields a suboptimal bound. We instead prove a stronger version for one dimensional subspaces (Lemma 7.2) that gives the claimed upper bound on the estimation error. We complement this with a matching lower bound in Theorem 7.3 by constructing two -close datasets, both having low projection errors ( error) onto the -robust subspaces, but the means separated by , where is the maximum length among the two mean vectors.
Next we will sketch how we achieve the clustering guarantee in Theorem 2.7. We describe the techniques for the special case of equal cluster sizes that captures the main ideas. See Section 7.2 for the general analysis. We analyze a robust variant of the Lloyd’s algorithm. Our initialization algorithm is presented in Figure 4 and the iterative Lloyd’s updates are presented in Figure 5. We proceed in three stages: a) an initialization stage, b) a center improvement stage, and c) analyzing the robust Lloyd’s updates. Each stage poses unique challenges arising from working with where each data point is potentially corrupted. The standard way to initialize Lloyd’s algorithm is to project the data onto the best rank- subspace and run a -means approximation algorithm33 3 This initialization is needed for theoretical bounds. In practice, the initial centers are chosen as random data points.. However, when every data point is corrupted, this could be arbitrarily bad. We instead run the algorithm from Figure 2 to compute a robust -dimensional for with small error, or certify that the dataset has been poisoned. We then project onto and run a constant factor -means approximation algorithm on the projected points to compute centers . Using the guarantee of Theorem 2.5, the key is to establish that when projected onto has a low cost solution when using the true means as centers. Since the true centers are well separated, the estimated centers will be somewhat close to the true centers. This shows that for each , the center is close to upto .
In the second stage we improve the initial center estimates, by a factor, to get that are -close to the corresponding true means. This is sketched in step of the algorithm in Figure 5. While such a step is not needed in standard analysis of Lloyd’s, it is crucial for our robust version. See Section 7.2 for a discussion on this. To argue about the center improvement stage, we use a trick from [awasthi2012improved]. The main technical contribution then is to establish Lemma 7.9 that bounds the clustering error in terms of how close the current centers are to true ones. The lemma is a stronger version of similar statements that appear in [kumar2010clustering, awasthi2012improved]. It simultaneously helps us argue about the clustering error, and also the variance of each current cluster around its mean, a quantity crucial to bound in order to analyze the iterative updates. Here we need a novel “charging” argument to relate the mistakes made by the current centers on corrupted points, to the mistakes they would have made on uncorrupted points. This crucially relies on the fact that and are close to each other in the projected space, since the subspace is robust. Using the center improvement step and Lemma 7.9 we get Theorem 2.7.
To establish the stronger guarantee for mixtures of Gaussians we argue about the iterative updates in two steps. First we analyze the “ideal” updates, as if we had access to the uncorrupted data. This largely follows the analysis in [kumar2010clustering] and helps us argue that if the current center estimates are close to the corresponding (where ), then in the next step the ideal updates will lead to an estimate of . Next, the key technical contribution of this stage is to show, using Lemma 7.9, that when performing ideal updates, the variance of the formed clusters around their means is bounded even though the clusters themselves are impure! Next, we analyze the actual updates and use the guarantee of the robust mean procedure to argue that when given perturbed set as input, the RobustMean procedure will either certify that the set is poisoned, or will output an estimate that is within . This in turn means that the new estimate output by the RobustMean procedure will be within of the true mean . Hence, the updates will keep improving until the unavoidable error of (Theorem 7.6)
2.5.4 Statistical Model: The Spiked Covariance Model
Let us consider the special case when (hence ) for the purposes of this discussion. Let us first consider the simpler problem of detection. Observe that if we look at the expectation (population average), then in the Yes case (when the distribution is ), , whereas in the No case (the distribution is ), for any rank projection matrix. One can distinguish between the Yes and No case if we can establish concentration bound for
Here the high-dimensional Gaussian need not be spherical. It is tricky to directly analyze this quantity for the set of rank- robust projection matrices ; this often introduces extra factors involving or (and gave suboptimal bounds in some earlier results). We instead analyze the deviation as in (12) for each vector in an orthonormal basis for separately. Using Lemma 4.4, we show this reduces to analyzing deviation bounds for over all “analytically” sparse vectors . We can then use tools in high dimensional probability and empirical process theory [LTbook, Vershynin, Mendelson2010] to obtain almost tight upper bounds. Similar ideas also work for the problem of recovering , when the data distribution is . Our (inefficient) algorithm just outputs the empirical minimizer of Frobenius norm error among projections in as in (12). We use concentration bounds along with Davis-Kahan theorem for perturbations of singular spaces to give the required recovery guarantees. Our analysis is simple since we only deal with (sparse) vectors, as opposed to (robust) rank- projection matrices. This simpler analysis also gives us a tight upper bound, and allows us to improve upon the upper bound in [VuLei13] for by a factor of .
Our statistical lower bounds shows the tightness of results for , but also interestingly shows that a polynomial dependence of (on the dimension ) is necessary when . For this statistical lower bound, the insight is that the unit ball over vectors in has a large packing set (in terms of Euclidean distance) because it contains an ball of radius inside it (note that when , the radius is ; this is consistent with the ball having a small cover in distance). Based on this packing of vectors along with a clever trick of [VuLei13], we construct a good enough packing set of projection matrices of dimension to apply Fano’s inequality.
Our computationally efficient algorithm for recovering the unknown robust subspace uses the same mathematical program (9) that we used for the worst-case algorithm along with an additional constraint on where is the dual norm for (this is a valid constraint because of Lemma 4.3). This allows us to get a better deviation bound for a quantity of the form (12) where is instead the set of feasible SDP solutions. The rest of the analysis is similar to the worst-case algorithm, along with the Davis-Kahan perturbation bounds for singular-spaces. Finally, we remark that the upper bounds arguments can be easily extended to the case of more general covariance structure for the principal components; once we get an estimate for the robust subspace projection , we can use the empirical covariance matrix from the projected samples to approximately recover .
Note that the entry-wise norm in the constraint on is not a monotone norm for any (see Claim A.1). However on account of the monotonicity property of the operator norm (and its associated constraint), we get the additional advantage that our algorithm always outputs a robust (or “analytically” sparse) subspace. To the best of our knowledge, previous polynomial time algorithms for spiked covariance model with large (see e.g., [Liuetal]) do not necessarily output a sparse subspace (they only argue about estimation error).
3 Related Work
Adversarial Robustness. Existing theoretical work on adversarial robustness has almost exclusively focused on supervised learning, and in particular on binary classification. These works include the study of adversarial counterparts of notions such as VC dimension and Rademacher complexity [cullina2018pac, khim2018adversarial, yin2018rademacher], evidence of computational barriers [bubeck2018adversarial, bubeck2018adversarial2, nakkiran2019adversarial, degwekar2019computational] and statistical barriers towards ensuring both low test error and low adversarial test error [tsipras2018robustness], and computationally efficient algorithms for adversarially robust learning of restricted classes such as degree- and degree- polynomial threshold functions [awasthi2019robustness]. Furthermore, recent works also provide evidence that adversarially robust supervised learning requires more training data than its non-robust counterpart [schmidt2018adversarially, montasser2019vc]. As discussed in Section 2.2, in our setting, it is possible to be robust to test time perturbations and simultaneously enjoy a statistical edge over the non-robust scenario!
The closest to our work is the result of [garg2018spectral] that studies a particular formulation of adversarially robust features. The authors consider computing, given i.i.d. samples from a distribution, a map such that, with high probability over a new example drawn from the same distribution, points close to get a nearby mapping (in distance) under . While similar in motivation to our work, the results in [garg2018spectral] do not aim to minimize the projection error and simply require the projection to be mean zero and variance one to avoid trivial solutions. Furthermore, the authors look at a specific type of spectral embedding given by the top eigenvectors of the Laplacian of an appropriate graph constructed on the training data. The bounds presented for this embedding depend on the eigenvalue gap present in the Laplacian matrix. Finally, it is not clear how to efficiently use the embedding on new test points, as it involves recomputing the Laplacian by incorporating the new point into the training set.
Low Rank Approximations.
There is a large body of work in randomized numerical linear algebra [kannan2017randomized] on methods such as column subset selection [boutsidis2009improved, deshpande2010efficient, boutsidis2014near] and CUR decompositions [drineas2008relative, boutsidis2017optimal] that aim to approximate a given matrix via a low dimensional subspace spanned by a small number of rows/columns of the matrix. However these algorithms do not necessarily yield robust representations; in particular the subspace that is spanned may not be robust in our sense ( operator norm). There is also a large body of work on fast algorithms for computing low rank approximations [clarkson2017low, price2017fast, musco2017sublinear, song2017low, ban2019ptas]. Some of these works study the problem when the approximation error is measured in a more robust metric such as the norm as opposed to the Frobenius norm [song2017low]. Again these results are not directly related to the notion of subspaces robustness that we study in this paper.
Sparse PCA. In the high dimensional regime where the number of samples is much less than the dimensionality, several works have pointed out inconsistent behavior of PCA [paul2007asymptotics, nadler2008finite, johnstone2009consistency]. As a result this led to the study of the sparse PCA problem where it is assumed that the leading eigenvector is sparse. This problem is typically studied under under an average case model known as the spiked covariance model [johnstone2001distribution]. In this model the data is assumed to be generated from a Gaussian with covariance matrix , where the leading eigenvector is assumed to be a sparse vector and is a parameter characterizing the signal strength. There have been several works that study minimax rates of estimating the leading eigenvector and more general subspaces under the spiked covariance model and various notions of sparsity [aminiwainwright, ma2013sparse, cai2013sparse, shen2013consistency, VuLei12, VuLei13]. In particular, the work of [VuLei12, VuLei13] studies estimation of principal subspaces in the high dimensional regime under two different notions of row/column sparsity of the orthonormal basis, and under the spiked covariance model and its generalizations. There have also been practical methods proposed to perform the above estimation under computational considerations. In particular, the work of [d2005direct] proposes an SDP based approach (without provable guarantees) and has resemblance to our SDP (we impose an additional constraint on ) studied in Section 8.3. The work of [BerthetRigollet] studies statistical and computational tradeoffs for the detection problem, i.e., given i.i.d. samples from a distribution that is either a standard Gaussian or a spiked model (with one leading sparse eigenvector). They design minimax optimal methods and show that the SDP of [d2005direct] achieves the best possible rate for a polynomial time algorithm assuming the planted clique conjecture. However, their lower bound does not apply to the spiked covariance model. The work of [gao2017sparse] was the first to provide computational lower bounds for sparse PCA under the spiked covariance model. These bounds have been further improved in the work of [brennan2018reducibility, brennan2019optimal]. The work of [Liuetal] studies computationally efficient estimation of subspaces under the spiked covariance and more general models.
Robustness to Corruptions in the Training Data.
There is large body of work, spanning both the theoretical computer science and the statistical communities, that formulates and studies robustness to training data corruptions in the context of both supervised and unsupervised learning. Here, we survey works that are most relevant in the context of our results. For classification problems, the commonly studied models are the random classification noise model [angluin1988learning] and the agnostic learning model [kearns1994toward], that capture corruptions to the labels of training points. More generally, there are also works studying the malicious noise [valiant1985learning, kearns1993learning] and the nasty noise [bshouty2002pac] models that allow for corruptions in both the training points and the corresponding labels. These models led to exciting developments in the design of robust algorithms for classification problems (see e.g., [kalai2008agnostically, klivans2009learning, kalai2012reliable, feldman2009distribution, awasthi2014power, diakonikolas2018learning, daniely2015ptas]). There is also a large body of work on Robust Optimization [ben2009robust], where the input is uncertain and is assumed to belong to a structured uncertainty set. In robust optimization one looks for a single solution that is simultaneously good for all inputs in the uncertainty set, leading to a max-min formulation of the problem. In our model of corruption, we are interested in instance wise guarantees - for every input and its corruption , the algorithm is required to output a solution that is good for . Moreover we are not aware of any results related to PCA in this context.
The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix is a sum of two matrices, the true matrix that is low-rank and a sparse corruption matrix with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of is spread out – recovery of is impossible under unbounded sparse corruptions when the signal is localized or sparse. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of ) could be corrupted up to some specified magnitude . Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of ).
In statistics, Huber’s -contamination model [huber2011robust] is the most widely studied. In this model the dataset is assumed to be generated i.i.d. from a mixture namely, . Here is the true distribution and is assumed to be well behaved, for example the Gaussian distribution, and is an arbitrary distribution modeling the noise. The study of this model has led to insightful results for a variety of problems. The work of Yatracos [yatracos1985rates] and more recently of [chen2016general] studies general estimation in Huber’s model. More relevant to us are works on mean estimation and clustering in Huber’s model.
Mean Estimation. The classical work of Tukey [tukey1975mathematics] proposed a robust estimator, now known as Tukey’s median, for robust mean estimation of Gaussian data. The more recent work of [chen2018robust] showed that Tukey’s estimator is minimax optimal and also proposed a minimax optimal estimator for robust covariance estimation. Recently, there have been many exciting developments in designing robust estimators of mean and covariance that are also computationally efficient. We discuss a few here. The works of [diakonikolas2019robust, lai2016agnostic] were the first to propose polynomial time algorithms for robust mean and covariance estimation of Gaussian data in Huber’s model, with dimension independent error bounds. This was later extended to more general distributions and the list-decodable setting [charikar2017learning, steinhardt2017resilience], optimal bounds for Gaussian data [diakonikolas2018robustly] and the study of computational/statistical tradeoffs [diakonikolas2017statistical, hopkins2019hard] and robust method-of-moments [kothari2017outlier]. There have also been works providing better sample complexity bounds in the Huber model if the mean vector is sparse [balakrishnan2017computationally, klivans2018efficient]. These works also study estimation in the spiked covariance model under corruptions. More recent developments include a linear time estimator for robust mean estimation [cheng2019high] and fast algorithms for robust covariance estimation [cheng2019faster]. There are also recent works studying computationally efficient robust optimization of more general objectives [diakonikolas2018sever, prasad2018robust]. We would like to point out that in these works (and several other recent works), the model of corruption is different than ours. In particular, rather than assuming that the data contains a few outliers (Huber’s model), in our model an adversary can potentially corrupt every data point up to magnitude (measured in norm for ). Hence the guarantees from these works do not translate into our setting.
Clustering. From the computational point of view, the work of Dasgupta [dasgupta1999learning] formulated the goal of clustering data generated from a mixture of well-separated Gaussians. There is a large body of work on designing efficient algorithms for clustering in this setting, both for Gaussians and more general distributions [arora2005learning, vempala2004spectral, achlioptas2005spectral]. See recent works [regev2017learning, hopkinsli2018, diakonikolas2018list, kothari2017better] for a detailed discussion. The work of Kumar and Kannan [kumar2010clustering] abstracted out a common property of datasets (spectral stability as defined in (66)) that captures mixtures of well separated Gaussians, the planted partitioning model, and other well clustered instances. They showed that a single algorithm, namely the popular Lloyd’s algorithm, with the right initialization, provably computes optimal solution for such stable instances. The separation factor needed for Lloyd’s to work in [kumar2010clustering] was later improved by [awasthi2012improved]. The recent work of [cohen2017local] shows that local search obtains a polynomial time approximation scheme (PTAS) on such spectrally stable instances. However, this in general does not guarantee closeness of the clustering obtained to the optimal one. The works of [kalai2010efficiently, moitra2010settling, belkin2010polynomial] provided algorithms for clustering Gaussian mixtures with no separation requirement. These algorithms, inherently have an exponential dependence on the number of clusters, , in the running time. Building on robust algorithms for mean estimation, there have also been works to perform robust clustering of well separated instances under Huber’s contamination model and its variants [brubaker2009robust, diakonikolas2018list, kothari2017better, kothari2017outlier, hopkinsli2018]. There have also been works in analyzing the EM algorithm for Gaussian mixtures. See [balakrishnan2017statistical] for a detailed discussion. While motivated by the study of the phenomenon of robustness, the above results do not provide guarantees in our model of corruption. As in the case of mean estimation, these results are designed to be robust to a small number of outliers (e.g., a small constant fraction) in the training set. In our corruption model on the other hand, every data point could be potentially corrupted up to magnitude (measured in norm for ).
4 Notation and Preliminaries
For every and , we will use to denote the norm of the vector i.e., . The dual norm of is where . We will heavily use Holder’s inequality which states that
When not specified, will denote the Euclidean norm of . Further will represent the unit sphere for the Euclidean norm. For convenience, we will use to denote the sparsity i.e., the size of the support of (note that is not a valid norm on vectors).
Operator Norms of Matrices.
We will use the following matrix norms. For any and any matrix , we will denote by . By duality of vector norms, we have
When , this corresponds to the spectral norm of the matrix i.e., the maximum singular value of . When unspecified, we will use to denote the spectral norm of . (Note that the above equalities from duality also show that the i.e., the maximum right singular value is the same as the maximum left singular value).
Entry-wise Norms of Matrices.
We will also consider various matrix norms obtained by considering a matrix as a vector of size . In particular, for any we will use to denote the norm of the “flattened” vector corresponding to i.e., . The Frobenius norm . Moreover for matrices , we use to represent the trace inner product.
High probability bounds.
We will say that an event holds with high probability (w.h.p.) if the probability of failure on a given instance is less than any polynomial of the input parameters e.g., the dimension , and the number of data points . We remark that for our randomized algorithms, we can amplify the success probability to for any by repeating the algorithm times (hence these guarantees will in fact hold with exponentially small failure probability).
4.1 Approximation Algorithms for Operator Norms.
Here we briefly describe some known positive and negative results for approximating the operator norm of a matrix (sometimes referred to as the -Grothendieck problem. We will say that a randomized algorithm gives an -factor approximation for the operator norm (for some ) iff for any input matrix the algorithm outputs with probability at least a vector such that . There are three simple cases (, and , and , where the norm can be computed in polynomial time. Some notable cases that are known to be intractable are the and norm (hence also by duality). The norm is the well-known Grothendieck’s problem [Gro56] (that is related to the cut-norm of a matrix [alon2004approximating] and has a rich history [krivine1978constantes, braverman2013grothendieck, alon2006quadratic, khot2011grothendieck, pisier2012grothendieck, reeds1991, khot2006sdp]).
More generally, if (non-hypercontractive norms), computing the norm is NP-hard [steinberg2005computation] and in fact the problem exhibits a dichotomy in terms of approximation. Specifically, whenever , the problem admits constant factor approximation algorithms, whereas whenever , the problem is hard to approximate within almost polynomial factors [bhaskara2011approximating]. Regarding approximation algorithms, the work of [steinberg2005computation] builds upon Nesterov’s theorems and extensions [nesterov1998semidefinite, wolkowicz2012handbook] and provides a approximation for when , and for the special case or the factor becomes . Recently, improved upper and (almost matching) lower bounds were proved for many settings of in [bhattiprolu2018approximating, bhattiprolu2018inapproximability].
In what follows let be the norm of a standard normal random variable i.e., the -th root of the -th Gaussian moment, and let be the dual norm for . Specifically, for with , it is NP-hard to approximate the norm within a factor smaller than . The hardness factor matches the polynomial time algorithms from [steinberg2005computation] for cases when or equals 2, and this is encapsulated in Theorem 4.1. These algorithms are based on semi-definite programming (SDP) relaxations. The algorithm for the norm for example first solves in polynomial time an SDP relaxation and produces a solution of value ; then the algorithm uses to produce with high probability a vector with , where is the approximation factor.
Theorem 4.1 (Combining [nesterov1998semidefinite, steinberg2005computation]).
For computing the norm, there is a randomized polynomial time algorithm that gives a -approximation algorithm, and for the norm there is a randomized polynomial time algorithm that gives a -factor approximation. Furthermore, when the input matrices are positive semidefinite matrices, there exists randomized polynomial time algorithms that give a approximation for the norm, and a factor approximation for the operator norm respectively. These algorithms are SDP-based randomized algorithms that succeed with high probability for any given instance.
The approximation algorithms from Theorem 4.1 will serve as separation oracles for solving the convex relaxations that we will use for our algorithms. As noted above these approximability results are tight assuming [bhattiprolu2018inapproximability].
4.2 Properties of Robust Projections.
Throughout the paper we will use the term projections and projection matrices to always refer to orthogonal projection matrices on to linear subspaces of . Next we list and prove some simple properties of subspaces with robust projection matrices i.e., subspaces with (or more generally norm for some ) that is upper bounded.
For any , the ratio of the vs corresponds to an analytic notion of sparsity. The following claim gives an upper bound on the norm in terms of the sparsity.
Lemma 4.2 (Analytic Sparsity).
Consider any vector of support size . For any , we have
In particular, for vectors with support size at most .
On the other hand, it is easy to see that the bound given here is tight for any vector that is equally spread out among its support of size .
Without loss of generality suppose (if it holds trivially). Let have support of size . Set , and let be the vector such that for each . By Holder’s inequality
hence establishing the lemma. ∎
Recall that corresponds to the dual norm for , and when . The following simple lemma proves two useful properties of robust subspaces i.e., subspaces having projection matrices with bounded norm (or more generally norm for ). The first property shows that any two vectors that are close in norm will have nearby projections onto any subspace that is robust. The second property shows that a subspace is robust (i.e., has a robust projection matrix) exactly when every vector in the subspace is analytically sparse.
[Properties of Robust Subspaces and Projections] Consider any subspace of with projection matrix satisfying . We have the following two properties:
Closeness of projections of pertubations: For any vector and its perturbation
Analytic sparsity: For any , we have , where . Moreover, if every vector in has , then . In particular if and only if for all unit vectors .
We first show property (I). Let . Then
To show property (II), note that by duality of matrix operator norms we have .
For the converse, if there exists s.t. , then by duality .
Observe that the robustness condition on the subspace as captured by the operator norm bound of its projection matrix is basis independent.The following gives a simple sufficient condition on the basis of the subspace that implies robustness of the subspace spanned by them. This relates our robustness of the subspace to alternate notions of sparsity of subspaces that have been studied in the literature on sparse PCA [VuLei12, VuLei13].
Given any orthonormal basis for a subspace such that for each , we have .
Firstly, , and . We have
For a given matrix , let us denote by to the projection matrix onto the column space of . The following lemma shows that the best low-rank -robust projection objective (3) also finds the low-rank approximation that has smallest error among ones with a robust column space.
Let be the set of all rank- projection matrices. Given a data matrix and a given parameter , , we have
where here stands for the spectral norm. The above statement is also true for the Frobenius norm.
Let be the minimizer for the right minimization problem and let be its projection matrix, and let be the minimizer for the left optimization problem. It is easy to see that , since is also a feasible choice for in the right minimization problem. To prove the other direction, if are the th columns of respectively then,
as required. The first inequality above follows since the column space of and the column space of are orthogonal. An identical proof also follows for the Frobenius norm.
Monotonicity of Matrix Norms.
The following property of certain matrix norms will be crucial in designing constant factor approximation algorithms for the low-rank approximations.
Definition 4.6 (Monotone matrix norm).
A matrix norm is said to be monotone if and only if
Observe that it suffices to check the above condition for all rank- PSD matrices i.e., for . It is well known that all unitarily invariant matrix norms44 4 A matrix norm is unitarily invariant iff for all matrices and all unitary matrices . are monotone (this is because unitarily invariant norms are just norms on the singular values). On the other hand, many other matrix norms including other entry-wise norms or general operator norms are not necessarily monotone (see Claim A.1 and Claim A.2 for some counterexamples). Perhaps surprisingly, the matrix operator norms are monotone (see Lemma 5.6 for a simple proof of this fact)!
5 Worst-case Approximation Guarantees
5.1 Approximations in Frobenius norm error
We will aim to obtain a bicriteria approximation for the robust low-rank approximation problem given in (3) for the Frobenius norm objective. In what follows .
We prove the following theorem.
Suppose the data matrix has an (orthogonal) projection of rank at most such that and the approximation error . There exists an algorithm that w.h.p. runs in polynomial time and finds a projection matrix of rank at most such that for every ,
where is a constant that only depends on as given in Theorem 4.1 (for this value is at most ). Moreover, for any , there exists an algorithm that w.h.p. runs in polynomial time and finds an -dimensional orthogonal projection such that
The theorem above will be established by proving a statement about the more general problem of finding a low-rank projection that satisfies any monotone norm constraint that can be approximately separated. While the norm is not monotone, as discussed in Section 4, we will show that applying the more general guarantee on an appropriate monotone norm helps prove Theorem 5.1 above. Let be a monotone matrix norm. Consider the following generalization of problem (19) that given a data matrix , and a parameter , finds a projection
[-approximately separable matrix norm] A matrix norm over matrices is -approximately separable for some iff there exists a (potentially) randomized algorithm that w.h.p. runs in time , and when given a PSD matrix and a parameter as input will either certify that , or finds a such that (1) , and (2) for all s.t. .
Note that in the above definition, the only potential randomness is from the potential random choices of the algorithm. As we will see later the operator norms that we will consider (e.g., norm and the norm for ) will be -approximately separable. This will allow us to construct an appropriate separation oracle for using the Ellipsoid algorithm.
The following general theorem gives an bicriteria approximation for the problem assuming the monotone matrix norm can be optimized approximately in polynomial time.
Let be any matrix norm that is monotone and -approximately separable for some . Suppose we are given as input a data matrix that has an (orthogonal) projection of rank at most such that and the approximation error . There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection matrix of rank at most such that for every ,
Moreover, for any , there exists an algorithm that w.h.p. runs in polynomial time and finds an -dimensional orthogonal projection such that
We remark that the only randomization in the algorithm is in the construction of the separation oracle for the matrix norm constraint . In particular the algorithm from Theorem 5.3 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (in fact with exponentially small failure probability), and hence in expectation.
We consider the following mathematical programming relaxation for the problem.
First we observe that this is a valid relaxation to the problem. In fact any feasible projection matrix of rank at most for (19) is a feasible solution to the above program (22)-(25) with the same value. The intended solution here is just . All the eigenvalues of are or , since is a projection matrix; hence (23), (24) are satisfied. Moreover (25) is satisfied just because of the same constraint as in (19). Finally, the objective value is preserved since
In the above program, the objective (22) and constraints (23)-(24) define a semi-definite program (SDP). Moreover, for (25), we see that for any , by triangle inequality . Hence the set of all that satisfies constraints (23) - (25) is convex. In general, constraint (25) may be NP-hard to verify for a given PSD matrix . However, we can use the fact that is approximately separable to get a approximately feasible solution to the program in polynomial time, using the Ellipsoid method.
Let be the eigendecomposition of (note that since is p.s.d.), and let . We have
Substituting the above inequality in , we get
Proof of Theorem 5.3.
Let for some . Without loss of generality, in the rest of the proof we will assume that . We will use the Ellipsoid algorithm to approximately solve the relaxation in (22)-(25). As we have explained before, the feasible set is convex. We now show how to design an approximate hyperplane separation oracle for (25); the rest of the constraints just correspond to a simple SDP. Since is -approximately separable, we have a randomized polynomial time algorithm that given a matrix , either certifies that (e.g., when the SDP value is at most ), and otherwise produces a separating hyperplane of the form that is not satisfied by . Let be the number of iterations taken by the Ellipsoid algorithm to produce a solution of value at most (up to exponentially small additive error), assuming access to a separation oracle. Hence from a union bound over all the iterations, we can use the above randomized polynomial time algorithm for separating (25), and run the Ellipsoid algorithm to find a solution that satisfies , and has objective value that is arbitrarily close to . This algorithm runs in polynomial time with high probability (with an exponentially small probability it may not terminate when the separation oracle does not terminate in polynomial time). For the rest of the analysis, we condition on the event that the algorithm terminates in polynomial time.
Set . Let and let . Define for each , , where . We sort the elements of based on , and pick greedily the first of them to form . Our projection matrix will be .
We first argue that the operator norm constraint is approximately satisfied. By the monotonicity of the we have
Also, from Lemma 5.4, we have
for our choice of . By our greedy choice of , we have , as , with each . Thus .
Proof of Theorem 5.1.
Our goal will be to apply Theorem 5.3 to obtain our required guarantee. However the operator norm is not monotone when ; see Claim A.2 for a counter-example. Our crucial insight in this result is we can write down the following mathematical programming relaxation for the problem, where we instead use the norm which we show is indeed monotone!
In the above program, the objective (28) and constraints (29)-(30) define a semi-definite program (SDP). We will show later that we can get approximate separation oracle for the problem. The following lemma shows that the above SDP is a valid relaxation to the problem.
Any feasible projection matrix of rank satisfying (19) is a feasible SDP solution with the same value.
Lemma 5.6 (Monotonicity of operator norm).
For any , the operator norm is monotone.
Let . It suffices to show that for any , .
In other words, the quadratic form is maximized by . Moreover for every , . Hence, .
Proof of Theorem 5.1.
Crucially, there is an efficient approximate separation oracle for (31) using an approximation algorithm for the operator norm (see Theorem 4.1). The -factor SDP-based approximation algorithm immediately gives a -factor approximate separation oracle. This SDP-based algorithm either certifies that the given matrix has (when the SDP value is at most ); otherwise (when the SDP value is larger than ) the algorithm, w.h.p. runs in polynomial time and produces a solution with that gives a separating hyperplane of the form (that does not satisfy)55 5 The -factor SDP-based algorithm has the property that when the SDP value is larger than , there is a rounding algorithm that runs in time and with high probability produces the desired solution . To get the Las Vegas guarantee, we simply run the rounding algorithm repeatedly until it outputs the desired solution.. Hence is approximately separable (in particular when , we have ). Hence Theorem 5.3 can be applied to finish the proof. ∎
5.2 Approximations in the Spectral norm
We now show how techniques similar to those in Section 5.2 can also be extended to robust low-rank approximations, when the error is measured in spectral norm as opposed to the Frobenius norm. In this section we will use to denote the spectral norm of matrix . For convenience of exposition, we will measure the projection error in (3) using the spectral norm as opposed to the squared spectral norm.
Suppose the data matrix has an (orthogonal) projection of rank at most such that and the approximation error . There exists an algorithm that w.h.p. runs in polynomial time, and finds an (orthogonal) projection matrix of rank at most such that for every ,
where is a constant that only depends on as given in Theorem 4.1. For this value is known to be at most .
Moreover, for any , there exists an algorithm that w.h.p. runs in polynomial time and finds an dimensional orthogonal projection such that
We remark that as in Theorem 5.1, the only randomization in the algorithm from Theorem 5.7 is in the construction of the separation oracle for the matrix norm constraint. In particular the algorithm from Theorem 5.7 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (and hence in expectation).
Proof of Theorem 5.7.
We will use the following mathematical relaxation for the problem.
Finally to see that the objective value is preserved, note that for any projection matrix ,
as required. Lemma 5.8 shows that the above program can be solved approximately in polynomial time.
We will use the Ellipsoid algorithm to approximately solve the relaxation in (34)-(37). Note that the feasible set is convex (including (36), which is just a upper bound constraint on a matrix norm). We now show how to design an approximate hyperplane separation oracle for (36) and a separation oracle for the rest of the constraints. The constraint (36) can be rewritten as for all such that . Crucially, there is an efficient approximate separation oracle for (36) using an approximation algorithm for the operator norm problem (see Theorem 4.1). As in Theorem 5.1, the -factor SDP-based approximation algorithm immediately gives a -factor approximate separation oracle. This SDP-based algorithm either certifies that the given matrix has (when the SDP value is at most ); otherwise (when the SDP value is larger than ) the algorithm, w.h.p. runs in polynomial time, and produces a solution with that gives a separating hyperplane of the form (that does not satisfy). Hence by a union bound, we can assume that (36) can be separated up to a factor over all the iterations of the Ellipsoid algorithm by an algorithm that w.h.p. runs in polynomial time.
The constraints (37) can be efficiently separated using an algorithm for eigenvalue computations. Finally, given , (35) can also be efficiently separated by computing the maximum eigenvalue of . Let be the corresponding eigenvector. If the constraint is violated, the hyperplane separator is of the form
since . This completes the proof. ∎
The proof of Theorem 5.7 also crucially uses the monotonicity of matrix operator norm.
Proof of Theorem 5.7.
Let for some . Set . Lemma 5.8 shows that in polynomial time we obtain a solution satisfying (37), (35) with , and (with exponentially small probability the algorithm does not terminate in polynomial time). The rest of the algorithm (and analysis) conditions on the success of this event. Let and let .
For the rest of the analysis we will assume without loss of generality that . We first show the guarantee in (33). The projection output is just . Observe that from (37). Since the projector we output is just , each of its associated eigenvalues as at least . Hence, the operator norm bounds follows using the monotonicity of the norm since . To verify the objective value we see that
as required. We now show the guarantee in (32) where we output a projection of rank at most (with no slack). Let . Let be the projection matrix for the subspace corresponding to the best rank projection of . The algorithm outputs .
Note that , hence by monotonicity, the operator norm constraint is satisfied up to a factor. Also note that if is the projection that gives the optimal solution to the problem,
But is a valid approximation of of rank at most . Hence, we have that
As before the same ideas also give the following more general theorem for any monotone matrix norm that is approximately separable.
Let be any matrix norm that is monotone and -approximately separable for some . Suppose the data matrix has a projection of rank at most such that and the approximation error . There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection of dimension at most such that for every ,
Moreover, for any , there exists an algorithm that w.h.p. runs in polynomial time and finds an orthogonal projection of rank such that
We omit the proof, since the ideas are identical to Theorem 5.7.
5.3 Recovering the Optimal Projection Matrix
We now show that if the optimal robust low-rank projection has very small error compared to the th smallest singular value of , then we can in fact approximately recover the subspace itself up to small error measured in terms of the principal angles. For two subspaces with projection matrices , the Sin of the canonical angles matrix is given by . These techniques will also be helpful for recovery in the Spiked Covariance model. The following simple corollary will work for both Frobenius norm error and spectral norm error. For this purpose, we will just use to denote the norm of , where the unspecified matrix norm is norm in which we are measuring the error – either Frobenius norm or spectral norm.
Suppose the data matrix has an -dimensional projection such that , the approximation error and . There exists an algorithm that w.h.p. runs in polynomial time, and finds a projection of rank at most such that
Note that the above bound holds for the spectral norm error and the Frobenius norm error.
The algorithm is exactly the same algorithm used in Theorem 5.1. Let denote the best robust low-rank subspace for . We will then use the Davis-Kahan theorem about perturbations of singular vectors to show that the subspaces given by and are close. Note that the Davis-Kahan theorem states that if is the projection matrix onto eigenspaces of respectively () with the least singular values of being at least more than the singular values of , then for any unitarily invariant norm ,
where we used the fact that gives an -factor approximation to the objective. Moreover, in our case is itself of rank- and . Under the stronger assumption in (40), we have . Hence we see that (40) holds since
6 Data Poisoning and Robustness to Adversarial Perturbations at Training time
6.1 Training-time robustness: Approximations in Frobenius norm error
Suppose and is the (unknown) uncorrupted data matrix, with an (orthogonal) projection matrix of rank at most that is robust i.e., satisfying for some . There exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix s.t. for each column , , outputs an orthogonal projection of rank at most such that for any
To get a multiplicative approximation we will set , and get an extra additive term of . Here think of . Further we remark that the above guarantees are optimal up to constant factors; in particular, the additive factor of is unavoidable (see Proposition 6.7).
The main challenge here is that while has a good low-rank projection (in fact a robust one), may be very far from a rank- matrix (let alone having a robust rank- approximation). Further, the best robust low-rank approximation of could be very different from the best robust low-rank projection of . This is because the entry-wise perturbations of could be too large in aggregate; for instance, it could be the case that . Suppose is the best robust low-rank projection of . We will run the algorithm in the previous section not on the given matrix , but on a suitably modified matrix .
There is a polynomial time algorithm that given any matrix , can find
up to arbitrary accuracy.
First we note that since , the optimization problem is separable across each of the samples i.e.,
We now describe how to solve each of the subinstances corresponding to the column , which for a given is of the form
Note that the least-squares objective is convex. Moreover the constraint is also convex; moreover there is a simple separation oracle for this constraint since by duality
Hence by using the Ellipsoid algorithm, this problem can be solved in polynomial time. 66 6 In fact Projected Gradient Descent Algorithm can also be used here; see [sra2012fast]. ∎
Note that when , it is easy to find the matrix , by just setting
We will argue that also gives a good low-rank approximation to . This crucially uses the fact that has bounded norm, which implies the following useful lemma.
Suppose are two matrices such that for each column , , and let be any rank- projection matrix such that . Then for any ,
For each , let be the th columns of and respectively. Then . Using this along with the triangle inequality we get,
This proves the first inequality. A similar argument also shows the other inequality. ∎
We now prove that Algorithm 1 finds an approximately optimal robust low-rank projection for unknown, uncorrupted data matrix .
Proof of Theorem 6.1.
The first step of the algorithm finds the matrix given by
Note that since is also a feasible solution for the above minimization. Moreover since for each , we get from Lemma 6.3,
6.2 Training-time robustness: Approximations in Spectral norm error
We now show guarantees for low-rank approximations in spectral norm error that are similar to Theorem 6.1. However, there is a qualitative difference: we will either find a robust low-dimensional projection of the unknown dataset , or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a low-dimensional representation that is bad for the unknown data matrix . We will later see how these guarantees also imply training-time robustness for downstream unsupervised learning applications like spectral clustering, robust mean estimation and learning mixture models. In what follows will refer to the spectral norm.
Suppose and is the (unknown) uncorrupted data matrix, and let have the smallest spectral norm error among (orthogonal) projections of rank at most that are robust i.e., . There exists an algorithm (Alg. 2) that given as input any (adversarially perturbed) data matrix s.t. for each column , and a parameter , w.h.p. runs in polynomial time and outputs either a projection matrix of rank at most or outputs Bad Input s.t.
if the algorithm outputs a projection of rank at most , then it is a near-optimal robust low-rank approximation for the unknown matrix i.e., for some small universal constant ,
if the algorithm outputs Bad Input , then either the data was poisoned i.e., , or there is no good robust spectral norm approximation for i.e., for all rank- projection matrices s.t. .
In particular, if we are promised that has a good robust projection of value , then the algorithm either finds an approximately optimal robust projection of rank at most for with
or certifies that the data has been poisoned i.e., .
Our algorithm just runs the worst-case approximation algorithm from Theorem 5.7 on to find a projection . If the error is less than , it outputs ; else it certifies that the data is corrupt.
The main feature of the above algorithm is that it is always correct. The algorithm certifies that the input is Bad only when the data has been poisoned i.e., is substantially far from , or did not have a good robust low-rank approximation to begin with. More crucially, when it does output a projection matrix , it is guaranteed to be a valid robust projection77 7 In particular it rules out the scenario where the algorithm finds a solution that it thinks is good (on ), but is in fact bad for the unknown, uncorrupted matrix . for the unknown matrix (with an exponentially small probability, the algorithm may not terminate in polynomial time; this happens exactly when the algorithm from Theorem 5.7 fails to terminate in polynomial time). We remark that the additive error term of is unavoidable here information-theoretically; see Proposition 6.7 for an example.
The following is the key lemma that argues that if the projection gives a small error on , it necessarily gives a low-error on .
Let and such that . Let be projection matrices such that , and = and . Then we have that for any ,
The projection matrices are both robust. For
Let . We also know that .
But . We have for any
Combining the two, we get that
as required. A similar statement also follows for using a symmetric proof. ∎
Proof of Theorem 6.4.
Firstly the algorithm from Theorem 5.7 runs on and with high probability produced a robust projection matrix (with some exponentially small probability, the algorithm fails to terminate in polynomial time). We now condition on the event that algorithm from Theorem 5.7 terminates in polynomial time. The proof consists of two parts. We first argue that if the algorithm outputs any robust rank- projection matrix, then it has to be robust for . Any such satisfies . Applying Lemma 6.5 with () and , we have
On the other hand, if the input is not “Bad” i.e., (a) for the unknown matrix , , and (b) , we now show that the algorithm outputs a good solution for . In this case we have that ; hence,
Hence, by Lemma 6.5 applied with and (), we have that
This proves the theorem. The moreover part follows by setting .
In fact, Lemma 6.5 implies a stronger information-theoretic statement about finding a robust low-rank approximation of the unknown, uncorrupted matrix with low spectral norm (just like Theorem 6.1 for Frobenius norm error). In fact we get a polynomial time algorithm assuming access to a polynomial time algorithm approximation algorithm for solving the following problem: given a matrix , find88 8 This problem is reminiscent of the concept of -rank [AlonVempala], that corresponds to the smallest rank attainable by changing every entry of the given matrix by at most .
where stands for the spectral norm.
Suppose and is the (unknown) uncorrupted data matrix, and let have the smallest spectral norm error among rank- projections that are robust i.e., . Suppose further that there is an efficient algorithm for finding an -factor approximation algorithm for (49). Then there exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix s.t. for each column , and a parameter , outputs a robust projection matrix of rank at most that is near optimal in approximation error for the unknown matrix i.e., for some small universal constant ,
Moreover, the above bound is achieved information-theoretically by an algorithm (that potentially does not have polynomial running time), by using an inefficient algorithm for problem (49).
We remark that the main difference between the above proposition and Theorem 6.4 is that Proposition 6.6 will always output a good robust projection for (just like Theorem 6.1 for Frobenius norm error), but the algorithm is not computationally efficient unless (49) can be solved efficiently.
Given , the algorithm first runs the -factor approximation algorithm for solving (49) on . The uncorrupted matrix is itself a feasible solution; hence the solution output by the algorithm has a robust low-rank approximation of error . Such a robust low-rank projection for i.e., a projection for rank at most with and can be found by running Theorem 5.7 on . Moreover and are valid adversarial perturbations of each other. Now applying Lemma 6.5 with and completes the proof. ∎
6.3 Lower Bound for the Additive error in Training with Adversarial Perturbations
We now show that the additive terms of in Theorem 6.1 is unavoidable.
For any data matrix with the following two properties:
Each column ,
There exists of rank and (which is at least 2) satisfying ,
for any small enough, there exist as a -perturbation of (i.e., ) and a projection matrix of rank 1 satisfying
is more robust than , i.e., .
We still have but . Since is of rank , this also implies a similar lower bound for the spectral norm.
Let be the unit eigenvector of such that . Without loss of generality, we assume and . Notice that by this definition such that . At the same time, because of the Cauchy-Schwartz inequality, we have such that .
Then we consider any and perturb to another “sparser” vector whose point-wise absolute value is
However, since may be negative or positive, we define according to the sign function:
We have and
Since , this is at least . So let with unit norm and . So .
Next we consider . Since , we assume with coefficient . We set such that and .
Finally we lower bound . Notice that
Thus for . So we lower bound the distance between by counting the entries from to :
Since , each term in the summation is at least . So . ∎
7 Robustness to Training Time Corruptions in Unsupervised Learning
7.1 Mean Estimation
Recall that the uncorrupted data matrix has mean vector denoted by . In our setting, every point can potentially be corrupted adversarially up to a amount, as measured in norm (more generally norm for ). Our input instance is where every column satisfies . In this section we will show how to use the guarantee from Theorem 6.4 to either compute an accurate estimate of the mean of the unperturbed dataset from the given set of perturbed points, or certify that the data has been poisoned. In particular, our estimate of the mean of the unperturbed dataset will simply be the mean of the perturbed data points when projected onto a robust subspace. We will use the algorithm from Figure 2 to compute such a projection, or certify that the dataset is poisoned.
How much can one handle?
Consider, for instance the case when . Let be the mean of the data points in the matrix with representing the matrix with each column being . Furthermore denote to be the data variance, i.e., . Notice that is the bound on the norm of the perturbation. If is the corrupted data points, in the worst case the Mean() and Mean() could differ by , thereby needing to be inversely proportional to . However, if the one dimensional subspace spanned by A is -robust we can do better. As an extreme case, assume that both and exactly lie in a -robust subspace. Then Mean() and Mean() can only differ by . Notice that in this best case scenario we can handle much larger perturbations. In particular, if the mean is a -sparse unit length vector, then and we can potentially handle as large as . Our main result of this section is that we can indeed achieve this best case guarantee or certify that the data has been poisoned, i.e., is large.
Why natural certification approaches fail?
A natural approach for robust mean estimation in our model is to simply compute as an estimate, and if the variance of around is not much larger than then output , otherwise say that the data has been poisoned. This would work for perturbations of the order of . However, in the regime of interest to us, i.e., perturbations of the order of , this could fail miserably. Consider a data set of points generated as where and is a standard normal vector orthogonal to and with variance in every direction orthogonal to . Hence, the data lies close to a robust ( sparse in sense) one dimensional subspace with variance . If perturbations of magnitude are allowed in each dimension we could construct by taking every point in and zeroing out the first coordinates and adding to the rest of the coordinates. This new data set has also has variance around the mean. However, Mean() and Mean(A) are quite far. In this case we would like to certify that the data set has been poisoned. In particular, this would follow from certifying that the new data set has no small projection (in spectral norm) onto any robust one dimensional subspace. On the other hand, if the adversary hides the perturbations such that has a good projection on to a robust one dimensional subspace (not necessarily the one spanned by ), we require to output a good approximation to . The guarantee of Theorem 6.4 will be used to achieve this.
The robust mean estimation guarantee from this section will also serve as a subroutine in the application to clustering a mixture of well separated Gaussians and more general data sets, even when every data point is poisoned. We discuss the clustering application in Section 7.2. In what follows, for a vector , will denote , and for a matrix we will denote by the spectral norm of matrix . The goal of this section is to prove the following theorem.
Let be an matrix representing data points in dimensions and let be the mean of the data points in the matrix with representing the matrix with each column being . Let be the one dimensional subspace denoting the projection onto and assume that , for some . Let be the given input such that for every column we have . Furthermore, let be a given upper bound on the variance of the data around the mean, i.e., . Then the algorithm from Figure 3 when run on , w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., , or outputs an estimate of the true mean such that
where is a constant that depends on . In particular, the above implies a relative error guarantee of
By assumption we know that . This implies that
Since we set , from the guarantee of Theorem 6.4, we know that if the algorithm outputs Bad Input , the data must be poisoned, i.e., . Next suppose that the algorithm outputs a projection matrix . Setting , and to be the all-ones vector we have that
Next we make a crucial observation that if is good for then it is also good for and hence is small. This is formally established in Lemma 7.2. Applying the lemma on and with , , , , and we get that
Substituting into (51) we get that
Next, notice that by triangle inequality,
Hence we get that
Substituting into (52) we get that
From the above we get the relative error guarantee of
Note: We would like to point out that for robust mean estimation, our analysis also shows that in step of the algorithm above, we can replace with . This is because if the algorithm did not output Bad Input then and hence mean of and that of will be close. However, in this case, the subspace spanned by the output vector, i.e., Mean() might not be robust and hence susceptible to test time perturbations.
Fix . Let and be two matrices, each representing data points in dimensions such that for every , columns and are close, i.e., . Furthermore, assume that there exist projection matrices, and such that and and that and . Then, letting and , it also holds that
We will show the desired bound on and by symmetry the same bound will also apply to . Notice that both and are projections onto one dimensional subspaces and a bound on norm of the projection matrices implies that and , where is such that . Next, let be the projection matrix onto the subspace spanned by and . By triangle inequality we have that
Recall the standard fact that if and are projection matrices on to subspaces and such that , then for any matrix , . Using this we to get that
From the closeness of and and the robustness of we also know that
Note that if , then we have the desired bound on . We now look at the case when . Notice that is the union of robust subspaces and has columns bounded in norm. Hence, the only way can be very large is if the norm of the projection matrix of a union of two subspaces () is much larger than the norm of the projection matrices of individual subspaces ( and ). For this to happen the two subspaces must be very close to each other and then we can bound in a different way. Formally, we have that
Next we establish an upper bound on in terms of the distance between subspaces and . Suppose and that (otherwise we work with ). We also know that and . Now, is the maximum norm of any unit vector in the span of and . We can write any such vector as
where and . Next we have that
Hence we get that for any in the span of and
The above also establishes that
Substituting into (63) we get that
Hence, if we must have that
In this case we can bound as
Tightness of the Guarantee in Theorem 7.1. We close out this section by showing that the dependence on in our bound on mean estimation is necessary even information theoretically. In what follows will be an matrix with such that has small norm, i.e., . Furthermore, let and define . We will prove the following.
Fix . Let be the set of matrices with mean that satisfies , variance around the mean that satisfies , and the subspace spanned by being -robust. Call a perturbation of of be valid if . Then, any algorithm that takes as input a valid perturbation of a matrix and either certifies that the data is poisoned, i.e., or outputs an estimate of must incur an error of
We will establish the lower bound by constructing two matrices and , both of which lie in the set , satisfy for , but have means separated by , where is the maximum norm among Mean() and . In this case, given either or as input, any provably robust certification procedure cannot output Bad Input and must output an estimate , thereby making error on either or . We next describe our construction.
For a to be determined later, let . Hence, is a unit length sparse vector with . We define the set of points in by generating i.i.d. points of the form , where is a mean zero Gaussian with variance in the first coordinates and variance in the other coordinates. Then it is a standard fact that with high probability and that Mean(A) will be -close to . Next we define the set of points in to be , where is a vector representing the sign of the corresponding coordinate of . Here we can arbitrarily set to be . It is easy to see that Mean() will be -close to , and that . Next notice that
By setting and we ensure that and . Hence we get that for the matrix , . Hence, both and lie in the set with sparsity bound . Furthermore, the fact that , ensures that . Finally, notice that the difference between two means is
We end this section by showing that via an (inefficient) algorithm one can get the same guarantee for mean estimation as in Theorem 7.1 without the need for certification.
Theorem 7.4 (Information Theoretic Upper Bound).
Let be an matrix representing data points in dimensions and let be the mean of the data points in the matrix with representing the matrix with each column being . Let be the one dimensional subspace denoting the projection onto and assume that , for some . Let be the given input such that for every column we have . Furthermore, let be a given upper bound on the variance of the data around the mean, i.e., . Then there is an (inefficient) exponential time algorithm that takes as input and outputs an estimate of the true mean such that
where is a constant that depends on . In particular, the above implies a relative error guarantee of
Furthermore, we also have that
Hence both and have good projections onto rank- subspaces. Plugging into Lemma 7.2 we get that
where . Hence, letting we get from (51) that
In this section we will use the guarantee from Theorem 6.4 to show how to perform clustering of a well-clustered instance when every data point in the instance could be corrupted. Our guarantees will apply to clustering a mixture of well separated Gaussians and more general data distributions. In particular, we will show that a robust modification of the popular Lloyd’s algorithm [lloyd1982least] (also known as the -means algorithm) can be used to perform clustering in our model, thereby providing further evidence towards the widespread applicability of the algorithm. Existing guarantees for using Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] for clustering a mixture of Gaussians and general datasets assume that every pair of means are separated by , where is the maximum variance of the dataset around the mean and is the number of clusters (see (66) for the formal condition). However, our lower bound example from the previous section shows that such a separation condition is too weak for robust clustering in our noise model. If we view the datasets and in our lower bound instance as two target clusters, then they have the property that the means are separated by , where , and yet via perturbations we can reach one cluster from the other, making the clustering task meaningless. Hence, we need a separation condition that rules out this scenario. Furthermore, even if one is given the original optimal clustering of the given perturbed dataset, we know from previous section that one must incur a loss of in simply estimating the true mean of a cluster. This suggests a separation condition of the type , where depends on and the guarantee to aim for is to estimate means upto error error. Here is the maximum norm of the any of the mean vectors. In this section we will show that a modified Lloyd’s combined with our certification procedure can indeed achieve this guarantee or certify that the dataset has been poisoned.
More formally, we will assume that there is a set of points in with ground truth clustering , and means for and . Let be the data matrix and be the matrix of corresponding centers. We will assume that we have an upper bound on the maximum variance of the data points around their mean, i.e. and define . We will enforce the spectral stability condition studied in [kumar2010clustering] on our clustering instance. This condition implies that for each pair of clusters with means and each point , is closer to than to by a margin . Here is the projection of onto the line joining and . For a constant , the -spectral stability condition requires that for each ,
Notice that the above also implies that every pair of means are separated i.e.,
It is worth mentioning that in the typical analysis of Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] the dependence on in the separation condition is not needed. However, as discussed before, in our noise model, some dependence on is unavoidable to get a meaningful clustering guarantee.
Assumptions I: Fix . We will assume that we are given access to such that for every , . Furthermore, define to be the robustness, of the subspace spanned by the means . Formally, let be the projection matrix for the orthogonal projection onto the space of the means. Then is such that . Under Assumptions I, we prove the following theorem that applies to any stable dataset as defined in (66).
Fix , and let be a constant that depends on . Let be a dataset that satisfies -spectral stability for . Under Assumptions I, there is a Lloyd’s style algorithm that takes as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, , or outputs a clustering and means such that
for an appropriately chosen bijection .
While the above theorem works for any data set that satisfies spectral stability, notice that it leads to a sub optimal mean estimation error of for each cluster . For example, when the clusters are balanced, this will lead to a guarantee of . Next, we show that for data sets that additionally satisfy Gaussian type concentration, we can indeed get estimation error even when each data point is corrupted.
Assumptions II: Let be a given dataset with optimal clustering . We will assume that we are given that satisfies Assumptions I. Furthermore, we will assume that for each and that for any subset of points such that , we have that . Here are the matrices and restricted to the columns of the points in . Additionally, we require a pointwise guarantee that for each , and , . It is easy to see that samples generated from a mixture of Gaussians with maximum variance and minimum mixture weight will, with high probability, satisfy the above assumptions. Under Assumptions II, we prove the following theorem that applies to any stable dataset as defined in (66).
Fix , and let be a constant that depends on . Let be a dataset that satisfies -spectral stability for . Under Assumptions II, there is a Lloyd’s style algorithm that takes as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, , or outputs a clustering and means such that
for an appropriately chosen bijection , where we hide a polylogarithmic (in ) factor in the notation.
As a corollary we get the following statement about robustly clustering a mixture of Gaussians.
Fix , and let be a constant that depends on . Define to be a distribution that is a mixture of Gaussians, i.e., . Furthermore, let and define and , . Let be a set samples generated i.i.d. from the mixture. If the mixture if well separated, i.e, for , and the means span a robust subspace, then given access to such that , there is a Lloyd’s style algorithm that, w.h.p. runs in polynomial time, and either certifies that the data is poisoned, i.e., , or outputs a clustering and means such that
for an appropriately chosen bijection .
Computing Good Initial Centers. The first step in establishing the above theorems is to compute centers/means that are close to the true ones. A common approach for this step is to use PCA to project the data onto the top- subspace of the input data matrix, and run any constant factor approximation algorithm for -means clustering [kumar2010clustering]. However this can be arbitrarily bad if the data is corrupted as in our model. We instead show that by projecting the data onto a robust subspace as output by our guarantee from Theorem 6.4 and then using a -means approximation algorithm, we do indeed recover good centers. Our initialization algorithm is shown in Figure 4. We next provide proofs for our claims.
Assume that the clustering instance is -stable for . If Assumptions I hold, then the algorithm in Figure 4 w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., , or the algorithm outputs centers such that for all ,
for an appropriately chosen bijection .
The proof will follow the general outline of Lemma 5.1 of [kumar2010clustering], except that we need to argue following two stronger conditions. Firstly, we need to establish that projected on to has cost comparable to that of . This will ensure that the approximation algorithm will output a clustering of low cost. Secondly, we also simultaneously need to establish that when projected on to has low cost clustering when true means are used to cluster it. Together with the fact that and are pointwise close in the projected space, we can then claim that missing out on a good approximation for even a single cluster center of will incur a cost of , thereby contradicting the approximation guarantee of the -means algorithm used in step 2.
Establishing that has low cost with respect to boils down to showing that is good for given that it is good for , a perturbation of . This statement, established in Theorem 6.4 is the key in analyzing the initialization phase, and is a generalization of Lemma 7.2 to higher dimensional subspaces. Let’s first establish that projected on to has a low cost clustering. We have
Here the third inequality follows from the fact that for any matrix , , where is the maximum norm of a column of . Furthermore, from the robustness of we know that for any ,
Next, let’s establish that is small. By triangle inequality we know that
Furthermore, from the guarantee of Theorem 6.4 we have that for any ,
Setting we get that
The last inequality above follows from the fact that
Next notice that
Now we are ready to establish the claim of the theorem. From (67) we get that the centers will have -means cost at most on . Furthermore, suppose that there exists a center such that every is far from it. For any point , let be the center in the set that is closest to the projection of on to . Then we have that
Noticing that , we get a contradiction to the fact that is far from every . This combined with the fact that the clustering instance is -stable for implies that one can find a bijection between and such that each is close to a unique . ∎
7.3 Analyzing Lloyd’s Updates
Next we will use the obtained initial centers and run the robust Lloyd’s algorithm starting with these centers as shown in Figure 5. Our goal in this section is to analyze the updates and establish Theorem 7.5 and Theorem 7.6.
Overview of Analysis and Challenges. Our analysis of the modified Lloyd’s updates proceeds in two stages: a) a center improvement step, and b) analyzing robust Lloyd’s updates. In (a), we first improve the initial center estimates obtained form the initialization phase to get estimates such that each is -close to the corresponding , where . In other words, we get a factor improvement over the initial estimates. This is sketched in step of the algorithm in Figure 5. First, we motivate the need for this intermediate step, since it is not necessary in the analysis of Lloyd’s algorithm for uncorrupted data.
Just as in standard analysis of Lloyd’s updates, we would like to argue that if we have non-trivial estimates of the centers, as obtained from the initialization stage, forming clusters using these points and moving to the means of these clusters will improve the center estimates. To argue this we will crucially rely on the fact that when projected onto , and are close pointwise. Hence, we can come up with a charging argument to assign mistakes made by the current centers on the uncorrupted points to the mistakes made by the centers on the corrupted points. We can then bound the number of such mistakes by observing that on , the true means have a small -means cost. This forces us to work in the projected space , but as a result inherently limits the accuracy to which we can obtain center estimates. Notice that if the initialization algorithm outputs , then is guaranteed to be good overall for , in the sense that . However, has no per cluster guarantee, and in general when restricted to a cluster could be as large as . Hence, to achieve our goal of estimating the centers upto accuracy, we also need to work outside of the projection at the same time. Due to these conflicting demands, notice that the Lloyd’s updates we analyze in step of the algorithm in Figure 5 perform clustering using current centers in the projected space, but perform robust mean estimation on the original input data.
Furthermore, from our guarantee on robust mean estimation in Theorem 7.1, we know that in the RobustMean step of the algorithm the centers will be accurate upto , where is the standard deviation of the uncorrupted data points in around the uncorrupted mean of . As a result we need a stronger argument that not only shows that we have low clustering error given the current estimates, but also helps us argue about the variance of the formed clusters at each step. Such an argument (Lemma 7.9) is a main technical contribution in the analysis.
Unfortunately, the argument (Lemma 7.9) only kicks in when we have much better center estimates than the one provided by the initialization stage, thereby requiring an additional center improvement stage. To argue about the center improvement stage, we use a trick from [awasthi2012improved] and form sets that correspond to points in that are significantly close to one of the centers than any other center . Notice that these sets do not form a partitioning of the data. We then argue that any mistake made by this assignment must have also been made if one had used the true centers , to cluster . Using the fact that the true means have small -means cost on we can bound the number of such mistakes and hence get sets that have low error, thereby helping us show that the means of these sets will be much closer to the true centers. This is established in Theorem 7.10. The above arguments help us establish Theorem 7.5. We next state the key technical lemma for our analysis.
Let be the robust subspace computed in step (1) of the algorithm in Figure 5. For each cluster in the optimal clustering of , define . Suppose we have center estimates such that , , and some . When using s to cluster , define to be the set of points that are misclassified, w.r.t. the induced clustering on , i.e., . There exists a constant depending on such that if the clustering instance is -stable for then we have that .
Fix and let be the subspace spanned by with being the projection matrix for the orthogonal projection on to the subspace. Define to be the projection of onto the line joining and . Since contains , this is also the same as the projection of on to the line joining and . Similarly, define to be the projection of on to the line joining and , and again this is the same as the projection of on to the line joining and . We will crucially make use of the fact that
The above holds since from -stability we know that