Adversarially Robust Low Dimensional Representations

  • 2019-11-29 18:06:29
  • Pranjal Awasthi, Vaggos Chatziafratis, Xue Chen, Aravindan Vijayaraghavan
  • 3

Abstract

Adversarial or test time robustness measures the susceptibility of a machinelearning system to small perturbations made to the input at test time. This hasattracted much interest on the empirical side, since many existing ML systemsperform poorly under imperceptible adversarial perturbations to the testinputs. On the other hand, our theoretical understanding of this phenomenon islimited, and has mostly focused on supervised learning tasks. In this work we study the problem of computing adversarially robustrepresentations of data. We formulate a natural extension of PrincipalComponent Analysis (PCA) where the goal is to find a low dimensional subspaceto represent the given data with minimum projection error, and that is inaddition robust to small perturbations measured in $\ell_q$ norm (say$q=\infty$). Unlike PCA which is solvable in polynomial time, our formulationis computationally intractable to optimize as it captures the well-studiedsparse PCA objective. We show the following algorithmic and statistical results. - Polynomial time algorithms in the worst-case that achieve constant factorapproximations to the objective while only violating the robustness constraintby a constant factor. - We prove that our formulation (and algorithms) also enjoy significantstatistical benefits in terms of sample complexity over standard PCA on accountof a "regularization effect", that is formalized using the well-studied spikedcovariance model. - Surprisingly, we show that our algorithmic techniques can also be maderobust to corruptions in the training data, in addition to yieldingrepresentations that are robust at test time! Here an adversary is allowed tocorrupt potentially every data point up to a specified amount in the $\ell_q$norm. We further apply these techniques for mean estimation and clusteringunder adversarial corruptions to the training data.

 

Quick Read (beta)

Adversarially Robust Low Dimensional Representations

Pranjal Awasthi
Google and Rutgers University
[email protected]
   Vaggos Chatziafratis
Stanford University
[email protected]
The second author is partially supported by an Onassis Foundation Scholarship.
   Xue Chen
Northwestern University
[email protected]
   Aravindan Vijayaraghavan
Northwestern University
[email protected]
The last author is supported by the National Science Foundation (NSF) under Grant No. CCF-1652491 and CCF-1637585.
Abstract

Adversarial or test time robustness measures the susceptibility of a machine learning system to small perturbations made to the input at test time. It is now well known that even when trained on high quality training data, many machine learning systems perform poorly under imperceptible adversarial perturbations to the test inputs [szegedy2013intriguing]. There is a large body of work proposing practical methods to make classifiers adversarially robust. On the other hand, our theoretical understanding of the phenomenon of adversarial robustness is limited, and has mostly focused on supervised learning tasks such as binary classification.

In this work we study the problem of computing adversarially robust representations of data. We formulate a natural extension of Principal Component Analysis (PCA) where the goal is to find a low dimensional linear subspace to represent the given data with minimum projection error (measured in Frobenius norm or spectral norm) and that is in addition robust. When adversarial perturbations are measured in q norm with q2 (say q=), the robustness constraint naturally corresponds to controlling the q2 operator norm of the orthogonal projection matrix onto the subspace. Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize; even when the subspace is one-dimensional this captures the well-studied sparse PCA objective.

We show the following algorithmic and statistical results.

  • Polynomial time algorithms in the worst-case that achieve constant factor approximations to the objective while only violating the robustness constraint by a constant factor. This holds for both the Frobenius norm error, and the spectral norm error objectives.

  • We prove that our formulation (and algorithms) also enjoy significant statistical benefits in terms of sample complexity over standard PCA on account of a “regularization effect” (analogous to sparsity), that is formalized using the well-studied spiked covariance model.

  • Surprisingly, we show that our algorithmic techniques can also be made robust to corruptions in the training data, in addition to yielding representations that are robust at test time! Here an adversary is allowed to corrupt potentially every data point in (in general, q norm) up to a specified amount. We further use these techniques to solve two other fundamental unsupervised learning problems, namely mean estimation and clustering, under adversarial corruptions to the training data.

The above results further validate our model and indicate that the study of adversarially robust unsupervised learning could lead to insights that have wide applicability.

1 Introduction

Reliability and trustworthiness of machine learning systems is a key requirement for their secure adoption in day to day life. While robustness to errors of various forms like training data corruptions or data poisoning, drifting data distributions and partial observability are desirable, recently the phenomenon of adversarial robustness or test time robustness has posed a significant hurdle to the design of reliable learning systems. Szegedy et al. [szegedy2013intriguing] identified that learning algorithms like deep neural networks trained on even carefully curated, high quality datasets such as ImageNet [deng2009imagenet] are susceptible at test time to small adversarial corruptions to the test example. These adversarial corruptions are imperceptible to the human eye, yet cause the trained networks to produce an incorrect classification. This represents a potential security vulnerability for critical applications like self driving cars, and secure facial recognition systems.

On the theoretical front, our understanding of adversarial robustness is limited. A series of recent works have started to explore fundamental questions such as the sample complexity of adversarial learning and the inherent computational and statistical tradeoffs needed for achieving adversarial robustness [khim2018adversarial, yin2018rademacher, bubeck2018adversarial, bubeck2018adversarial2, attias2018improved, montasser2019vc], focusing almost exclusively on supervised learning. In contrast there is a substantial body of work formalizing and studying training data corruptions in computer science and statistics, spanning both supervised and unsupervised learning (see e.g.,  [huber2011robust, tukey1975mathematics, angluin1988learning, kearns1994toward, candes2011robust, diakonikolas2019robust, lai2016agnostic]).

The primary motivation of this work is to formulate and study adversarial robustness in the context of learning data representations and other basic unsupervised learning tasks. A representation given by a function xg(x) is considered adversarially robust if for any point xn and its perturbation x, it holds that if xx, then g(x)g(x). Downstream learning tasks like classification and regression trained on such robust representations are more likely to have built in robustness. As we will see later, algorithms for learning adversarially robust representations may also have surprising implications for robustness to training-time perturbations. Furthermore, requiring a data representation to be adversarially robust is a natural notion of individual fairness [zemel2013learning, madras2018learning, oneto2019learning]. This leads to the following natural question: can we efficiently learn succinct representations of data that are robust to adversarial perturbations? We will formulate and study this question in the context of principal component analysis.

Principal component analysis (PCA) or low-rank approximations is the predominant tool for obtaining succinct data representations, and is used as a preprocessing primitive in almost every machine learning pipeline. Given data in a high-dimensional space n represented by the columns of a matrix A, the goal is to find a subspace of dimension at most rn to represent the points, that minimizes the projection error (or reconstruction error) onto the subspace, formalized as

minΠ𝒫ΠA2=minΠ𝒫A-ΠA2, where 𝒫 is the set of all orthogonal projection matrices of rankr, (1)

where the matrix norm is either the Frobenius norm or the spectral norm. The representation of each point xn corresponds to the projection Πx onto the r-dimensional subspace given by Π (one can also represent the point as an r-dimensional vector in terms of a basis for Π). Following existing literature on adversarial robustness, we model an imperceptible adversarial perturbation x of a point x as one for which the norm of the difference is small, i.e., x-xδ, for some fixed δ>0 (more generally, we also consider q norm where q2). Given an r-dimensional subspace of n with projection matrix Π, the adversarial robustness of Π to δ-perturbations in the q norm is then exactly captured by

supx,x:x-xqδΠ(x-x)2=δΠq2. (2)

Observe that κ=Πq2 characterizes the robustness of the projection Π to perturbations in q norm around every point xn. The distance between the projections of x and a δ-perturbation x of x (in q norm) is upper bounded by κδ. On the other hand, around each point x one can also realize a perturbation x=x+z with zqδ such that Πx-Πx2=κδ. We will call Π a (κ,q)-robust rank-r projection when Π is an orthogonal projection matrix of rank at most r with Πq2κ; when the robustness parameter κ and norm q are understood, we will just call it a robust rank-r projection. Hence, our algorithmic goal given a data matrix An×m composed of m points in n, a robustness parameter κ1 and the norm q[2,], is to find a robust low-rank projection with low error formalized by

minΠ ΠA2=minΠA-ΠA2 (3)
s.t. Π is an (orthogonal) projection matrix of rank at most r, and Πq2κ. (4)

We will be interested in two versions of the problem, depending on whether we measure the error in Frobenius norm or spectral norm. Recall that the top-r terms of the Singular Value Decomposition of A simultaneously solves both of these problems in polynomial time, when there is no additional robustness constraint. We also remark that just as for the PCA objective (1), the above objective (3) can be equivalently rephrased as finding the best approximation among low-rank matrices, but among those with a “robust column space” (see Lemma 4.5).

Our optimization problem (3) while motivated by adversarial robustness, has rich connections to (and implications for) well studied problems like the sparse PCA problem [johnstone2009consistency]. Consider the setting when the perturbations are measured in norm and rank r=1. The robustness constraint on the projection Π=vv imposes an upper bound of κ on the “analytic sparsity” of the direction vn (measured in terms of ratio of 1 and 2 norms). Hence in the special case of r=1,

minA-vvAF2=tr(AA)-maxvAAv subject to v1κ, and v2=1. (5)

The complementary objective is the 1 version of the maximization sparse PCA objective,11 1 It is within a factor 2 of the 0 version where the constraint v1κ is replaced by v0κ2 (see Section 10.3.3 of [Vershynin]). which is notoriously hard in the worst-case [chan2016approximability].

For general q2, requiring robustness places a constraint on the dual q* norm of the direction v. Moreover for general rank r1, Πq2 gives an upper bound on the q* norm of every unit vector in the subspace given by Π (see Lemma 4.3 for the statement, and Lemma 4.4 for an approximate converse). As we will see in Section 2.1, we will design algorithms for finding approximately optimal low-dimensional representations that are indeed robust to test-time adversarial perturbations.

From a practical perspective it is desirable to have robustness to adversarial perturbations at both training-time and test-time. Surprisingly, we show that our algorithmic techniques can be used to achieve this! We study a strong model of training-time corruptions, where every coordinate of every point in the training data set can be adversarially perturbed up to a specified amount (in general the perturbation of every point is measured in q norm with q2). Apart from the natural motivation of being secure against poisoned data, this model of corruption naturally arises in emerging paradigms such as low precision computation [de2017understanding, de2018high], where only the few most significant bits of each data point is stored in memory, thereby allowing for processing of more training examples. While there have many exciting recent developments in designing efficient robust algorithms for high-dimensional statistical problems, they deal with a fundamentally different model of corruption where a small fraction of the points (outliers) are corrupted arbitrarily [diakonikolas2019robust, lai2016agnostic, charikar2017learning, diakonikolas2018sever]. Our algorithms form a new versatile primitive for achieving robustness. We demonstrate this by applying them to get new training-time robustness guarantees for fundamental unsupervised learning tasks such as mean estimation and clustering.

2 Our Results

In all the results that follow, it will be instructive to think of the setting when q= i.e., each point can be perturbed adversarially up to an amount δ in every coordinate. However, our results will apply for all q2 generally. Intuitively, the larger the choice of q, the more unrestricted the adversarial perturbations can be (since xpxq when pq).

2.1 Worst-case Approximations for Robust Low-Rank Projections.

We first consider the two variants of problem (3) i.e., the Frobenius norm error and spectral norm error versions in the worst-case setting. We give efficient algorithms that attain small constant factor approximation algorithms (approaching 2 for the Frobenius norm error, and 3 for the spectral norm error) when we are also allowed to relax the robustness constraint on the projection by a constant factor.

(Informal) Theorem 2.1.

There exist polynomial time algorithms that given a data matrix A, q2 and a robustness parameter κ, find a projection matrix Π^ of rank at most r s.t.

γ(0,1],Π^q2 O(1/γ)κ, and A-Π^A2(α+γ)OPT, (6)

where OPT is the error obtained by the optimal (κ,q)-robust projection of rank at most r (α=2 for the Frobenius norm error objective and α=3 for the spectral norm error objective). Moreover, for any γ(0,1], there exist polynomial time algorithms that find an rr(1+O(γ-1))-dimensional projection Π^ that gets an (1+γ)-approximation to the objective, and relaxes the robustness parameter by O(1/γ) factor.

The algorithms for both objectives – Frobenius norm and spectral norm, both use convex relaxations and use similar ideas. However the algorithms for these two different objectives are different (starting with the relaxations) unlike standard PCA. Please see Theorem 5.1 (Frobenius norm objective) and Theorem 5.7 (spectral norm objective) for the formal statements.

Observe that the approximation factor guarantee in Theorem 2.1 is independent of the desired rank r. We remark that our algorithmic guarantees are non-trivial even if we do not want to restrict the rank r; in particular when r=n (unrestricted) our algorithm finds among all subspaces that are O(κ)-robust, the one with approximately optimal error. The constant factor loss in the robustness parameter depends on the choice of the norm q, and remains a constant for all q2. When q=, the constant factor that arises is related to the constant in a variant of the Grothendieck problem [alon2004approximating, nesterov1998semidefinite] (the constant is at most π/2). Hence the above theorem gives a (O(1),O(1))-factor bicriteria approximation when we need an approximation of rank at most r; however we can also achieve an ((1+γ),O(1/γ)) by incurring an extra loss in the rank.

Our result also has new implications for approximating the minimization objective for 1-sparse PCA. Sparse PCA has been studied under average case models [BerthetRigollet], as well as worst case models [chan2016approximability]. In particular the special case of our result for r=1 provides a small constant factor bicriteria approximation to the minimization version of the sparse PCA problem under 1 constraint on the sparsity. This is in stark contrast to the approximability of the maximization version of the problem. Even in the special case when r=1, the best known polynomial time algorithm gives a O(n1/3) factor approximation in the worst-case (for both the 1 and 0 versions). This is true even when the sparsity can be relaxed by a O(1) factor; moreover no constant factor approximation is possible assuming the Small Set Expansion conjecture [chan2016approximability]. See also Appendix B for computational hardness of (3) based on this observation. Furthermore, as we will see in a bit, the minimization version that we study and our constant factor approximations for this problem will also be crucial in various downstream applications in unsupervised learning.

2.2 Statistical Benefits of Robust Projections.

While the operator norm constraint arises naturally when we need robustness to small adversarial perturbations, it also induces a regularization effect due to the “sparsity” of vectors in the robust subspace given by Π (see Lemma 4.4). This brings with it additional statistical benefits, which we demonstrate by studying the classic high-dimensional statistical problem of covariance estimation and recovery in the well-studied spiked covariance model  [johnstone2001distribution, aminiwainwright, berthet2013]. The spiked covariance model SCM(θ=(θmin,θmax),Σ*) has two sets of parameters:the signal strength specified by the pair θmin,θmax with θmaxθmin0, and the unknown covariance matrix Σ* of rank r with its eigenvalues in the interview [θmin,θmax]. The top-r eigenspace of Σ* is given by the projection matrix Π*n×n of rank r that is (κ,q)-robust for q2. Each sample from the distribution is drawn independently from n-variate Gaussian N(0,I+Σ*).

We will be interested in two different goals – detection and the stronger goal of recovery. The goal in detection will be to distinguish w.h.p. whether the sample data was generated by the spiked covariance model SCM(θ,Σ*) or from a standard Gaussian N(0,I). Our goal in recovery is to output an estimate Σ for Σ*, with error measured in terms of the Frobenius norm of Σ-Σ*. We first state the guarantee for q= and discuss the case of general q at the end of the section.

(Informal) Theorem 2.2.

Fix q= and let Π* be a (κ,)-robust projection of rank r for subspace spanned by the non-trivial eigenvectors of the rank-r covariance matrix Σ*. For any ε>0, O((1+θmax)2θmin2r2κ2logn/ε2) samples from SCM(θmin,θmax,Σ*) suffice to recover a covariance matrix Σ having its top-r eigenspace given by a (κ,)-robust rank-r projection Π such that w.h.p., Σ-Σ*F2ε. At the same time, O((1+θmax)2θmin2κ2logn/ε2) samples suffice for detection.

Please see Theorem 8.3 and Theorem 8.5 for the formal statements. We also prove our upper bounds are optimal up to O(logn) factors for q= (see Theorem 8.6 for the lower bound). Note that for the task of detection, we actually save the additional r2 factor in the sample complexity compared to recovery.

The above bound shows that when rκn, we get significant statistical benefits compared to traditional covariance estimation in high-dimensions, which requires Ω(n) samples even for recovering the principal component (r=1) up to good accuracy. This sharply contrasts with our current understanding of adversarial robustness in classification problems where current evidence suggests that the sample complexity for learning robust classifiers may be significantly higher compared to the non-robust case [schmidt2018adversarially, yin2018rademacher, montasser2019vc].

We remark that sample complexity upper bounds are known in the spiked covariance model, when Σ* (or the subspace that corresponds to it) satisfies some sparsity constraints [aminiwainwright, VuLei13, Liuetal]. Our results are generally incomparable since we parameterize by the 2 operator norm of Π* which is naturally motivated from robustness considerations. However somewhat surprisingly, our analysis often matches or in fact improves upon the sample complexity upper bound even in the other parameterization, in certain regimes. See Remark 8.4 for details.

The estimation algorithm in Theorem 2.2 is computationally inefficient. We now show how we can simultaneously achieve computational efficiency and statistical efficiency in finding an estimate Σ^ of Σ*; moreover the top-r eigenspace of Σ^ has a robust projection matrix Π^ that is also close to Π*.

(Informal) Theorem 2.3.

There is a polynomial time algorithm that given m samples in Rn drawn i.i.d. from the spiked covariance model SCM(θmin,θmax,Σ*) with the top-r eigenspace of Σ* given by a (κ,q)-robust rank-r projection Π*, finds w.h.p. a rank-r covariance matrix Σ^ with its top-r eigenspace being an (O(κ),q)-robust rank-r projection Π^ such that Σ^-Σ*F2ε, as long as mc(1+θmax)2θmin2r2κ4logn/ε2 for some universal constant c>0. Moreover mc(1+θmax)2θmin2κ4logn/ε2 suffices for the detection problem w.h.p.

Please see Theorems 8.8 and 8.9 for the formal statements. We remark that even for r=1, the dependence of κ4 is necessary assuming the Planted Clique assumption, due to known lower bounds from sparse PCA [BerthetRigollet]. This points to an interesting statistical vs computational tradeoff that would be interesting to explore in the other regimes of parameters.

We also generalize our results to q<. We extend the information upper bound in Theorem 2.2 to show O((1+θmax)2θmin2r2κ2qn2/qlogn/ε2) random samples suffice for recovery statistically. On the other hand, we provide an information-theoretic lower bound showing that this dependence in terms of n,κ,r is almost tight for q[2,) (in Theorem 8.7). In particular, we prove that for q[2,), the sample complexity incurs a polynomial dependence of n2/q. Finally, we remark that our polynomial time algorithm in Theorem 2.3 also works for q[2,) with an extra factor of n4/q (instead of n2/q) in the sample complexity.

2.3 Robustness to Adversarial Errors during Training

Surprisingly, the notion of test-time robustness and the algorithms that we have developed (Theorem 2.1) also allow us to handle robustness to adversarial perturbations in the training data set (this is often referred to as data poisoning). We propose a corruption model under which every training sample Ain can potentially be corrupted adversarially up to a δ amount, as measured in norm (more generally q norm for q2). Our input instance is A~n×m where every column i[m] satisfies A~i-Aiqδ; we will refer to such an A~ as a δ-corrupted instance of A. Our goal now is to recover a robust low-rank projection for the uncorrupted matrix A. We will show that we can in fact output a robust low-rank projection Π^ that is competitive with the best robust low-rank projection of A, even though A is not known to us!

(Informal) Theorem 2.4.

Suppose q2 and ARn×m is the unknown uncorrupted data matrix, with a (κ,q)-robust projection matrix Π* of rank at most r satisfying A-Π*AF2εAF2 for some ε[0,1]. There exists a polynomial time algorithm that given as input a δ-corrupted instance A~ of A outputs an r-dimensional projection Π^ that is approximately optimal

η>0,Π^q2 O(κ), and A-Π^AF2O(ε+η)AF2+O(1η)δ2κ2m. (7)

In particular this gives an O(1) approximation when δ2<(ε2/κ2)1mAF2.

To interpret the results, let q= and consider an uncorrupted dataset A where every column (sample) is a unit vector in n, and let κn be a small polynomial of n, say κ=n0.1. When δ=o(n-1/2) i.e., every coordinate can be perturbed up to δ1/n, the total corruption to each point is at most o(1) in Euclidean norm; in this case one would expect that standard PCA applied to A~ will approximately recovery a good low-rank approximation. The above Theorem 2.4 on the other hand guarantees to find a good (robust) low-rank approximation for the unknown matrix A even when δ=o(1/κ)=o(n-0.1). Note that in this setting every point can be completely overwhelmed by the adversarial noise (in Euclidean norm). Furthermore, the above guarantees are optimal up to constant factors; in particular, Proposition 6.7 shows that the additive factor of O(mδ2κ2) is unavoidable for every κ,δ=O(1/κ). These results together suggest that our notion of robust projections (measured in q2 operator norm) is key in understanding the robustness to small adversarial perturbations of every point during training as well.

Our guarantees for spectral norm error in the presence of training-time adversarial perturbations are somewhat similar to Theorem 5.1. However, there is a qualitative difference: we will either find a robust low-dimensional projection of the unknown dataset A, or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a low-dimensional representation that is bad for the unknown data matrix A. In what follows will refer to the spectral norm.

(Informal) Theorem 2.5.

Fix q2 and ε(0,1). Let ARn×m be the unknown uncorrupted data matrix with a (κ,q)-robust projection Π* of rank at most r with A-Π*AεA. There exists a polynomial time algorithm that given as input a δ-perturbed instance A~, outputs either a robust projection Π^ of rank at most r with

Π^q2=O(κ), and A-Π^AO(εA+mδκ/ε), (8)

or certifies that the data has been poisoned i.e., A~-A>εA.

Please see Theorem 6.4 for a formal statement. Here again, the additive term of Ω(mδκ) is unavoidable as shown in Proposition 6.7. We remark that information-theoretically we can design an estimator (that is computationally inefficient) that achieves the stronger qualitative guarantees as in Theorem 2.4. Designing a computationally efficient algorithm to do the same is a tantalising open question that we describe in more detail in Section 2.5.2. Finally, we would like to point out that our algorithms output (O(κ),q)-robust projections, and as a result we also get robustness to test-time perturbations.

Comparison to Robust PCA. The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix A~ is a sum of two matrices, the true matrix A that is low-rank and a sparse corruption matrix S with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of A is spread out – else recovery of A is impossible here. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of A) could be corrupted up to some specified magnitude δ. Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of A). Hence we believe that our algorithms could give a new robust primitive that is fundamentally different from existing techniques like robust PCA, for downstream applications in ML.

2.4 Applications to Unsupervised Learning.

As a concrete application of our guarantees from the previous section we study mean estimation and clustering under adversarial perturbations to potentially every data point in the training set. For mean estimation, we assume that the uncorrupted data matrix An×m has mean vector denoted by μ:=Mean(A). The guarantee below shows that in order to recover a good approximation to μ, the amount of corruption (the δ parameter) that can be tolerated is directly related to the inherent robustness of the one dimensional subspace spanned by μ. Please see Theorem 7.1 for a formal statement.

(Informal) Theorem 2.6.

[Robust Mean Estimation] Fix q2 and consider ARn×m with μ=Mean(A). Let C=μ1 represent an n×m matrix with copies of μ, and σ satisfy A-Cσm. Let Π*=μμ/μ2 be the one dimensional subspace denoting the projection onto μ such that Π*q2κ. There exists a polynomial time algorithm that given as input a δ-corrupted instance A~ of A, either certifies that the data has been poisoned, i.e., A~-A=Ω(σm), or outputs an estimate μ^ such that μ^-μ2O((1+κδσ)max{σ,σμmissing}).

Equivalently, the above implies a relative error guarantee of O((1+κδσ)max{σ/μ,σ/μ}). Here σ/μ captures the noise-to-signal ratio. We make a few remarks about the above claim. When κδ=O(σ), we get a relative error guarantee of O(max{σ/μ,σ/μ}). Consider the case of q=, and μ being a k=κ2-sparse (in the 0 sense) unit length vector. Such estimation problems concerning sparse mean vectors arise naturally in many applications and have been widely studied in statistics [donoho1992maximum, donoho1994minimax, johnstone1994minimax]. When every point could be corrupted, our guarantee allows for an adversary to add σ/k perturbation per coordinate, as opposed to σ/n. Notice that our mean estimation guarantee has a dependence on μ. In Theorem 7.3 we show that in our model of corruption, this dependence is unavoidable and hence the error in our mean estimation bound is information theoretically optimal upto constant factors.

Next, we show how the robust mean estimation procedure, along with our guarantees from the previous section, can be used to perform robust clustering of well clustered instances. In particular, we modify the popular Lloyd’s algorithm [lloyd1982least] to get robustness to perturbations of every data point. As is done in the standard Lloyd’s analysis, we study the spectral stability condition proposed by Kumar and Kannan [kumar2010clustering], and studied later in [awasthi2012improved, cohen2017local]. This condition captures, as special cases, mixtures of well separated Gaussians and other distributions, and helps in developing a unified analysis of Lloyd’s updates. Below we state our results assuming equal cluster sizes. Our results hold for the more general case and we defer to Section 7.2 for more details. Let An×m be clustered into k clusters of equal sizes with means μ1,μ2,,μk. Furthermore, let Cn×m be the matrix of corresponding centers for each column of A and let σ be such that A-Cσm. Then A satisfies c-spectral stability if for each pair of optimal clusters, say, cluster r and s with means μr and μs, any point in cluster r, when projected onto the line joining μr and μs is closer to μr than μs by an additive amount of Δr,s:=cαkσ. Here α is a quantity that captures the signal-to-noise ratio and the relative perturbation magnitude, i.e. (1+κδσ), as in robust mean estimation22 2 Note that unlike clustering without corruptions, the dependence on α is unavoidable in our model even for k=1, i.e., robust mean estimation.. In the special case when A is a set of m=poly(n,k) points drawn i.i.d. from a mixture of Gaussians with the variance of each Gaussian being bounded by σ2, and with uniform mixture weight 1/k each, the separation condition becomes Δr,s=cαkpolylog(nk)σ. In the theorem below we denote κ to be the robustness, as measured in q2, of the subspace spanned by the true means {μ1,μ2,,μk}.

(Informal) Theorem 2.7.

[Robust Clustering] Fix q2, and let cq be a constant that depends on q. Let ARn×m satisfy c-spectral stability, for c>200cq. Then given as input a δ-corrupted instance A~ of A, there is a Lloyd’s style algorithm that either certifies that the dataset is poisoned, i.e, A-A~=Ω(σm), or recovers each mean μr up to error O(αkσ). Using the computed centers to cluster A~, we obtain a clustering of A~ such that the corresponding induced clustering on A that misclassifies O(1/k)- fraction of the points.

In the special case of a mixture of Gaussians with equal mixing weights we get the means upto error O~(ασ), where we hide a 𝑝𝑜𝑙𝑦𝑙𝑜𝑔(m,n) factor in the O~ notation. This implies O(1/k2)-fraction clustering error.

See Theorem 7.5 and Theorem 7.7 for formal statements that also handle more general cluster sizes and mixing weights.

Finally, as in Section 2.3 we can also prove that information theoretically there is an algorithm (though computationally inefficient) that performs robust mean estimation up to the (near optimal) error of O((1+κδσ)max(σ,σμ)) on all instances (it will estimate the uncorrupted mean even for instances that are certified to have been poisoned). See Theorem 7.4 for the formal statement. As a result, we can also cluster well-clustered instances information-theoretically up to the claimed error above, without the need for certification (See Section 7 for details). Whether this can be achieved in polynomial time is an open question.

2.5 Overview of Challenges and Techniques

We now give a flavor of the technical ideas involved in obtaining our results. For the sake of exposition, we will mainly restrict our attention to the case when the adversarial perturbations are measured in norm.

2.5.1 Worst-case Constant Factor Approximations

Let us first consider the version of problem (3) of finding a robust rank-r projection that has small error measured in Frobenius norm. A natural mathematical programming relaxation is the following:

minXAF2-AA,X (9)
subject to tr(X)r,  0XI
 and X2κ (10)

This is a valid relaxation for the problem since the constraints are all satisfied by any rank-r projection matrix that is robust i.e, Π2κ. Moreover this relaxation is convex; the last constraint in particular is an upper bound on a valid norm. Let X* be the optimal solution of this program. The first challenge however is that the operator norm constraint (10) is NP-hard to verify efficiently. In general, these operator norm qp computation problems are APX-hard for most values of p,q. They form a rich class of problems related to the Grothendieck problem [alon2004approximating, nesterov1998semidefinite], and we can use polynomial time O(1) factor approximations that are known for general q2 norms with q2 (see Section 4.1).

The bigger challenge here is in producing a projection matrix from X* that (a) achieves a good objective value (b) has rank at most r, and (c) is O(κ)-robust (bounded 2 operator norm). A natural approach for producing a good low-rank solution is to output a rank-r projection matrix Πr, by considering the large singular values of X* (e.g., the top-r singular vector space of X*). However we have no control on the robustness of the subspace Πr2. In fact, problem (3) is challenging even without the rank constraint on the subspace (i.e., for r=n). The main issue is to relate the 2 operator norm of the projection matrix we output to that of the relaxation solution X*2 which is upper bounded by κ.

Our crucial insight is that we can indeed design such a rounding scheme that achieves all three goals if the norm in the constraint (10) is a monotone norm! A general matrix norm is monotone iff

A+BA, for any pair of PSD matrices A,B.

(See Definition 4.6 for details.) This monotonicity property allows us to truncate terms in the eigendecomposition without any loss in robustness κ, and get fine control on the robustness κ when we rescale different PSD terms slightly. Unfortunately however, the 2 operator norm is not monotone (see Claim A.2; see also Claim A.1 for counterexamples for various other matrix norms that are not unitarily-invariant).

Our next important observation is that we can replace the constraint (10) by a similar constraint in terms of the 1 norm (more generally q2 norm in terms of the qq* norm). This is because for any matrix B, we have that Bq22=BBqq* where q* is the dual norm for q and satisfies 1q*+1q=1. The main advantage of this reformulation is that the qq* operator norms are indeed monotone (see Lemma 5.6 for a simple proof). Moreover these norms can be efficiently separated (within an O(1) slack factor) when q2 since M1=maxx,y{±1}nM,xy.

Our final algorithm approximately solves the mathematical program (9) with constraint (10) replaced by X1κ2. We produce a low-rank projection matrix by first truncating the solution to the program X^ to the large singular values, and then picking the r terms that contribute the most towards the objective. Our analysis leverages the monotonicity property to prove O(κ)-robustness, while also achieving an O(1) approximation to the objective. See Theorem 5.3 for an analysis with a general monotone norm constraint, and Theorem 5.1 for our desired problem.

Similar ideas can be used when the error is measured in terms of the spectral norm instead of the Frobenius norm. The objective function in the mathematical relaxation (9) is instead rephrased as min(A(I-X)A where is the spectral norm. Theorem 5.7 gives a slightly different rounding and analysis that again leverages the monotonicity property of qq* operator norms.

2.5.2 Robustness to Training-Time Adversarial Perturbations (Data Poisoning)

Let q=. Recall that our input instance A~ is a δ-corrupted instance obtained from A by potentially corrupting every entry of it. Our goal is to output a robust low-rank projection matrix Π of rank at most r for the uncorrupted matrix A, that is not known to us. This question is interesting even from a purely statistical standpoint; but additionally, we would also like our algorithm to run in polynomial time.

Why should this be possible? Suppose the uncorrupted matrix A has a robust low-rank projection Π* of small error i.e., A-Π*A<εA (where A is either the Frobenius norm or spectral norm). Also assume for just this discussion that the average column (Euclidean) length of A is 1 (or even that each column is of unit length), κ=n0.1 say and δ=o(n-0.1). For any κ-robust projection Π, ΠAjΠA~j for each data point j[m] . So one could apply the worst-case algorithm to find a robust projection for the corrupted input A~, and hope to also get a robust subspace of low-error for the unknown matrix A.

However, there are two major challenges in implementing this strategy. (1) Solution value of A~: the robust projection Π* may not achieve low error on the input matrix A~; in fact, A~ may not have any good robust low-rank approximation – in this case the algorithm output may be useless. This is because the entry-wise perturbations could make A and A~ far away in aggregate e.g., the spectral norm of A-A~ could be δnmm (whereas, even AFm). (2) Identifiability issue: perhaps more importantly, even if the perturbation A~ has a robust low-rank robust projection of small error, we need to argue that this subspace indeed attains small error on A! The second issue is crucial in resolving the purely statistical aspect of the question; it involves ruling out the scenario where A~ has good robust low-rank approximation that is very different from any for A.

Issue (2): Identifiability. To address the second issue, we prove that if the projection Π^ gives a small error on A~, it necessarily gives a low-error on A. Roughly speaking, if there are two data-matrices A and B that are both δ-perturbations of each other i.e., A-Bδ, then for γ(0,1)

A-Π1A,B-Π2B<γA   A-Π2Aγ1A+1γ2mδκ, (and similarly for B),

where γ1=γ1(γ),γ2=γ2(γ)(0,1). This statement is particularly tricky to show for the spectral norm (See Lemma 6.5 for a formal statement). We know that Π1A-Π1B and Π2A-Π2B are small since Π1,Π2 are robust (see Lemma 4.3); however this does not give a handle on A-Π2A. A natural approach is to argue that the actions of Π1A and Π2B on any unit vector are similar. The difficulty is in handling scenarios where some directions in Π2 are close to the subspace of Π1, yet their orthogonal component is small, but spread out (even when r=1). Our proof is somewhat indirect; we show that for every direction v𝕊n-1, (1) the lengths Av2 and Bv2 are similar and (2) the difference in the lengths |Av2-Bv2| is (approximately) lower bounded by Π2Av2. This will allow us to conclude that Π2A is small in spectral norm.

Issue (1): Solution value of A~. To tackle the first question, we do know that there is a data matrix (in particular the uncorrupted matrix A) that can be obtained from A~ by perturbing each entry by at most δ (i.e., it is in the neighborhood of A~) that has a robust low-rank approximation of low value εA2. Let us suppose that we can solve the following optimization problem:

A=argminB:B-A~δ   minΠ:rank(Π)=r,Π2κB-ΠB2. (11)

We know that A is a feasible solution, and hence A has a robust low-rank approximation of even smaller error. Moreover A-A2δ. This reduces the first issue to a purely computational question of solving the above optimization problem.

We show that when the error is measured in Frobenius norm, we can solve this problem approximately by instead solving a simpler optimization problem of finding A that minBBF2 over all matrices B that are valid δ-perturbations of A~ i.e., A-Bδ. The minimizer here just reduces the magnitude of each entry by δ or until it is 0. For more general q, this corresponds to a simple convex minimization problem. For the spectral norm problem, we do not have an efficient algorithm. However by running the constant factor approximation algorithm for robust low-rank approximations (in spectral norm) for A~, we will either find a good solution that also works for A, or we will certify that A~ has no good robust low-rank approximation; this certifies that A-A~ is too large i.e., the data was poisoned significantly.

Finally we get the stronger computationally efficient guarantee for the spectral norm error (as for the Frobenius norm error) if we can resolve the following open question.

Open Question. Given A~, is there a polynomial time (O(1),O(1)) bicriteria approximation algorithm for

minB:B-A~δ   minΠ:rank(Π)=r,Π2κB-ΠB2, where  stands for the spectral norm.

2.5.3 Applications to Unsupervised Learning

Our low rank approximation guarantees that tolerate corruptions to training data have interesting consequences for unsupervised learning, even in the special case when the subspace rank r equals one (for mean estimation). More importantly, these guarantees when used carefully, lead to a provably robust clustering algorithm. Establishing this is quite non-trivial and is technically involved. Here we will mention the key insights.

At a high level we need to ensure (both algorithhmically and in the analysis) that running an iterative procedure on a dataset where every point could be corrupted does not compound the errors. For mean estimation, Theorem 2.5 gives an algorithm that either certifies that the data is poisoned, or outputs a one dimensional subspace Π that is good for both A and A~. It is easy to see then that μ^=Mean(ΠA~) will be a good estimate of Mean(A). However, this yields a suboptimal bound. We instead prove a stronger version for one dimensional subspaces (Lemma 7.2) that gives the claimed upper bound on the estimation error. We complement this with a matching lower bound in Theorem 7.3 by constructing two δ-close datasets, both having low projection errors (σ2m error) onto the κ-robust subspaces, κδ=O(σ) but the means separated by Ω(max(σ,σμmax)), where μmax is the maximum 2 length among the two mean vectors.

Next we will sketch how we achieve the clustering guarantee in Theorem 2.7. We describe the techniques for the special case of equal cluster sizes that captures the main ideas. See Section 7.2 for the general analysis. We analyze a robust variant of the Lloyd’s algorithm. Our initialization algorithm is presented in Figure 4 and the iterative Lloyd’s updates are presented in Figure 5. We proceed in three stages: a) an initialization stage, b) a center improvement stage, and c) analyzing the robust Lloyd’s updates. Each stage poses unique challenges arising from working with A~ where each data point is potentially corrupted. The standard way to initialize Lloyd’s algorithm is to project the data onto the best rank-k subspace and run a k-means approximation algorithm33 3 This initialization is needed for theoretical bounds. In practice, the initial centers are chosen as random data points.. However, when every data point is corrupted, this could be arbitrarily bad. We instead run the algorithm from Figure 2 to compute a robust k-dimensional Π for A~ with small error, or certify that the dataset has been poisoned. We then project A~ onto Π and run a constant factor k-means approximation algorithm on the projected points to compute centers ν1(0),,νk(0). Using the guarantee of Theorem 2.5, the key is to establish that A~ when projected onto Π has a low cost solution when using the true means as centers. Since the true centers are well separated, the estimated centers will be somewhat close to the true centers. This shows that for each r[k], the center νr(0) is close to μr upto O(αkσ).

In the second stage we improve the initial center estimates, by a k factor, to get ν1(1),,νk(1) that are αkσ/2-close to the corresponding true means. This is sketched in step 3 of the algorithm in Figure 5. While such a step is not needed in standard analysis of Lloyd’s, it is crucial for our robust version. See Section 7.2 for a discussion on this. To argue about the center improvement stage, we use a trick from [awasthi2012improved]. The main technical contribution then is to establish Lemma 7.9 that bounds the clustering error in terms of how close the current centers are to true ones. The lemma is a stronger version of similar statements that appear in [kumar2010clustering, awasthi2012improved]. It simultaneously helps us argue about the clustering error, and also the variance of each current cluster around its mean, a quantity crucial to bound in order to analyze the iterative updates. Here we need a novel “charging” argument to relate the mistakes made by the current centers on corrupted points, to the mistakes they would have made on uncorrupted points. This crucially relies on the fact that A and A~ are close to each other in the projected space, since the subspace is robust. Using the center improvement step and Lemma 7.9 we get Theorem 2.7.

To establish the stronger guarantee for mixtures of Gaussians we argue about the iterative updates in two steps. First we analyze the “ideal” updates, as if we had access to the uncorrupted data. This largely follows the analysis in [kumar2010clustering] and helps us argue that if the current center estimates νr(t) are βαkσ close to the corresponding μr (where β<1), then in the next step the ideal updates will lead to an estimate of β3αkσ. Next, the key technical contribution of this stage is to show, using Lemma 7.9, that when performing ideal updates, the variance of the formed clusters around their means is bounded even though the clusters themselves are impure! Next, we analyze the actual updates and use the guarantee of the robust mean procedure to argue that when given perturbed set as input, the RobustMean procedure will either certify that the set is poisoned, or will output an estimate that is within O~(ασ)+β4αkσ. This in turn means that the new estimate output by the RobustMean procedure will be within O~(ασ)+β2αkσ of the true mean μr. Hence, the updates will keep improving until the unavoidable error of O~(ασ) (Theorem 7.6)

2.5.4 Statistical Model: The Spiked Covariance Model

Let us consider the special case when Σ*=θΠ* (hence θmin=θmax=θ) for the purposes of this discussion. Let us first consider the simpler problem of detection. Observe that if we look at the expectation 𝔼[AA] (population average), then in the Yes case (when the distribution is N(0,I+θΠ*)), 𝔼[AA],Π*=I+θΠ*,Π*=r+θr, whereas in the No case (the distribution is N(0,I)), 𝔼[AA],Π=I,Π=r for any rank r projection matrix. One can distinguish between the Yes and No case if we can establish concentration bound for

supΠ𝒫|AA,Π-𝔼[AA],Π| where 𝒫={Π:Πq2κ,Π2=Π,Π=Π,rank(Π)=r}. (12)

Here the high-dimensional Gaussian need not be spherical. It is tricky to directly analyze this quantity for the set of rank-r robust projection matrices 𝒫; this often introduces extra factors involving r or κ (and gave suboptimal bounds in some earlier results). We instead analyze the deviation as in (12) for each vector in an orthonormal basis for Π separately. Using Lemma 4.4, we show this reduces to analyzing deviation bounds for Av22 over all “analytically” sparse vectors v. We can then use tools in high dimensional probability and empirical process theory [LTbook, Vershynin, Mendelson2010] to obtain almost tight upper bounds. Similar ideas also work for the problem of recovering Π*, when the data distribution is N(0,I+θΠ*). Our (inefficient) algorithm just outputs the empirical minimizer of Frobenius norm error among projections in 𝒫 as in (12). We use concentration bounds along with Davis-Kahan theorem for perturbations of singular spaces to give the required recovery guarantees. Our analysis is simple since we only deal with (sparse) vectors, as opposed to (robust) rank-r projection matrices. This simpler analysis also gives us a tight upper bound, and allows us to improve upon the upper bound in [VuLei13] for q*=1 by a factor of r.

Our statistical lower bounds shows the tightness of results for q=, but also interestingly shows that a polynomial dependence of n2/q (on the dimension n) is necessary when q(2,). For this statistical lower bound, the insight is that the unit q* ball over vectors in n has a large packing set (in terms of Euclidean distance) because it contains an 2 ball of radius n1/2-1/q*n-1/2 inside it (note that when q=, the radius is n-1/2; this is consistent with the 1 ball having a small cover in 2 distance). Based on this packing of vectors along with a clever trick of [VuLei13], we construct a good enough packing set of projection matrices of dimension r to apply Fano’s inequality.

Our computationally efficient algorithm for recovering the unknown robust subspace uses the same mathematical program (9) that we used for the worst-case algorithm along with an additional constraint on Xq*q*rq*κ2q* where q* is the dual norm for q (this is a valid constraint because of Lemma 4.3). This allows us to get a better deviation bound for a quantity of the form (12) where 𝒫 is instead the set of feasible SDP solutions. The rest of the analysis is similar to the worst-case algorithm, along with the Davis-Kahan perturbation bounds for singular-spaces. Finally, we remark that the upper bounds arguments can be easily extended to the case of more general covariance structure for the principal components; once we get an estimate for the robust subspace projection Π*, we can use the empirical covariance matrix from the projected samples to approximately recover Σ*.

Note that the entry-wise q* norm in the constraint on Xq*q* is not a monotone norm for any q>2 (see Claim A.1). However on account of the monotonicity property of the qq* operator norm (and its associated constraint), we get the additional advantage that our algorithm always outputs a robust (or “analytically” sparse) subspace. To the best of our knowledge, previous polynomial time algorithms for spiked covariance model with large r (see e.g., [Liuetal]) do not necessarily output a sparse subspace (they only argue about estimation error).

3 Related Work

Adversarial Robustness. Existing theoretical work on adversarial robustness has almost exclusively focused on supervised learning, and in particular on binary classification. These works include the study of adversarial counterparts of notions such as VC dimension and Rademacher complexity [cullina2018pac, khim2018adversarial, yin2018rademacher], evidence of computational barriers [bubeck2018adversarial, bubeck2018adversarial2, nakkiran2019adversarial, degwekar2019computational] and statistical barriers towards ensuring both low test error and low adversarial test error [tsipras2018robustness], and computationally efficient algorithms for adversarially robust learning of restricted classes such as degree-1 and degree-2 polynomial threshold functions [awasthi2019robustness]. Furthermore, recent works also provide evidence that adversarially robust supervised learning requires more training data than its non-robust counterpart [schmidt2018adversarially, montasser2019vc]. As discussed in Section 2.2, in our setting, it is possible to be robust to test time perturbations and simultaneously enjoy a statistical edge over the non-robust scenario!

The closest to our work is the result of [garg2018spectral] that studies a particular formulation of adversarially robust features. The authors consider computing, given i.i.d. samples from a distribution, a map f such that, with high probability over a new example x drawn from the same distribution, points close to x get a nearby mapping (in 2 distance) under f. While similar in motivation to our work, the results in [garg2018spectral] do not aim to minimize the projection error and simply require the projection f to be mean zero and variance one to avoid trivial solutions. Furthermore, the authors look at a specific type of spectral embedding given by the top eigenvectors of the Laplacian of an appropriate graph constructed on the training data. The bounds presented for this embedding depend on the eigenvalue gap present in the Laplacian matrix. Finally, it is not clear how to efficiently use the embedding on new test points, as it involves recomputing the Laplacian by incorporating the new point into the training set.

Low Rank Approximations.

There is a large body of work in randomized numerical linear algebra [kannan2017randomized] on methods such as column subset selection [boutsidis2009improved, deshpande2010efficient, boutsidis2014near] and CUR decompositions [drineas2008relative, boutsidis2017optimal] that aim to approximate a given matrix via a low dimensional subspace spanned by a small number of rows/columns of the matrix. However these algorithms do not necessarily yield robust representations; in particular the subspace that is spanned may not be robust in our sense (q2 operator norm). There is also a large body of work on fast algorithms for computing low rank approximations [clarkson2017low, price2017fast, musco2017sublinear, song2017low, ban2019ptas]. Some of these works study the problem when the approximation error is measured in a more robust metric such as the 1 norm as opposed to the Frobenius norm [song2017low]. Again these results are not directly related to the notion of subspaces robustness that we study in this paper.

Sparse PCA. In the high dimensional regime where the number of samples is much less than the dimensionality, several works have pointed out inconsistent behavior of PCA [paul2007asymptotics, nadler2008finite, johnstone2009consistency]. As a result this led to the study of the sparse PCA problem where it is assumed that the leading eigenvector is sparse. This problem is typically studied under under an average case model known as the spiked covariance model [johnstone2001distribution]. In this model the data is assumed to be generated from a Gaussian with covariance matrix I+θvv, where the leading eigenvector v is assumed to be a sparse vector and θ is a parameter characterizing the signal strength. There have been several works that study minimax rates of estimating the leading eigenvector and more general subspaces under the spiked covariance model and various notions of sparsity [aminiwainwright, ma2013sparse, cai2013sparse, shen2013consistency, VuLei12, VuLei13]. In particular, the work of [VuLei12, VuLei13] studies estimation of principal subspaces in the high dimensional regime under two different notions of row/column sparsity of the orthonormal basis, and under the spiked covariance model and its generalizations. There have also been practical methods proposed to perform the above estimation under computational considerations. In particular, the work of [d2005direct] proposes an SDP based approach (without provable guarantees) and has resemblance to our SDP (we impose an additional constraint on qq*) studied in Section 8.3. The work of [BerthetRigollet] studies statistical and computational tradeoffs for the detection problem, i.e., given i.i.d. samples from a distribution that is either a standard Gaussian or a spiked model (with one leading sparse eigenvector). They design minimax optimal methods and show that the SDP of [d2005direct] achieves the best possible rate for a polynomial time algorithm assuming the planted clique conjecture. However, their lower bound does not apply to the spiked covariance model. The work of [gao2017sparse] was the first to provide computational lower bounds for sparse PCA under the spiked covariance model. These bounds have been further improved in the work of [brennan2018reducibility, brennan2019optimal]. The work of [Liuetal] studies computationally efficient estimation of subspaces under the spiked covariance and more general models.

Robustness to Corruptions in the Training Data.

There is large body of work, spanning both the theoretical computer science and the statistical communities, that formulates and studies robustness to training data corruptions in the context of both supervised and unsupervised learning. Here, we survey works that are most relevant in the context of our results. For classification problems, the commonly studied models are the random classification noise model [angluin1988learning] and the agnostic learning model [kearns1994toward], that capture corruptions to the labels of training points. More generally, there are also works studying the malicious noise [valiant1985learning, kearns1993learning] and the nasty noise [bshouty2002pac] models that allow for corruptions in both the training points and the corresponding labels. These models led to exciting developments in the design of robust algorithms for classification problems (see e.g., [kalai2008agnostically, klivans2009learning, kalai2012reliable, feldman2009distribution, awasthi2014power, diakonikolas2018learning, daniely2015ptas]). There is also a large body of work on Robust Optimization [ben2009robust], where the input is uncertain and is assumed to belong to a structured uncertainty set. In robust optimization one looks for a single solution that is simultaneously good for all inputs in the uncertainty set, leading to a max-min formulation of the problem. In our model of corruption, we are interested in instance wise guarantees - for every input A and its corruption A~, the algorithm is required to output a solution that is good for A. Moreover we are not aware of any results related to PCA in this context.

The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix A~ is a sum of two matrices, the true matrix A that is low-rank and a sparse corruption matrix S with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of A is spread out – recovery of A is impossible under unbounded sparse corruptions when the signal is localized or sparse. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of A) could be corrupted up to some specified magnitude δ. Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of A).

In statistics, Huber’s ε-contamination model [huber2011robust] is the most widely studied. In this model the dataset is assumed to be generated i.i.d. from a mixture namely, (1-ε)P+εQ. Here P is the true distribution and is assumed to be well behaved, for example the Gaussian distribution, and Q is an arbitrary distribution modeling the noise. The study of this model has led to insightful results for a variety of problems. The work of Yatracos [yatracos1985rates] and more recently of [chen2016general] studies general estimation in Huber’s model. More relevant to us are works on mean estimation and clustering in Huber’s model.

Mean Estimation. The classical work of Tukey [tukey1975mathematics] proposed a robust estimator, now known as Tukey’s median, for robust mean estimation of Gaussian data. The more recent work of [chen2018robust] showed that Tukey’s estimator is minimax optimal and also proposed a minimax optimal estimator for robust covariance estimation. Recently, there have been many exciting developments in designing robust estimators of mean and covariance that are also computationally efficient. We discuss a few here. The works of [diakonikolas2019robust, lai2016agnostic] were the first to propose polynomial time algorithms for robust mean and covariance estimation of Gaussian data in Huber’s model, with dimension independent error bounds. This was later extended to more general distributions and the list-decodable setting [charikar2017learning, steinhardt2017resilience], optimal bounds for Gaussian data [diakonikolas2018robustly] and the study of computational/statistical tradeoffs [diakonikolas2017statistical, hopkins2019hard] and robust method-of-moments [kothari2017outlier]. There have also been works providing better sample complexity bounds in the Huber model if the mean vector is sparse [balakrishnan2017computationally, klivans2018efficient]. These works also study estimation in the spiked covariance model under corruptions. More recent developments include a linear time estimator for robust mean estimation [cheng2019high] and fast algorithms for robust covariance estimation [cheng2019faster]. There are also recent works studying computationally efficient robust optimization of more general objectives [diakonikolas2018sever, prasad2018robust]. We would like to point out that in these works (and several other recent works), the model of corruption is different than ours. In particular, rather than assuming that the data contains a few outliers (Huber’s model), in our model an adversary can potentially corrupt every data point up to magnitude δ (measured in q norm for q2). Hence the guarantees from these works do not translate into our setting.

Clustering. From the computational point of view, the work of Dasgupta [dasgupta1999learning] formulated the goal of clustering data generated from a mixture of well-separated Gaussians. There is a large body of work on designing efficient algorithms for clustering in this setting, both for Gaussians and more general distributions [arora2005learning, vempala2004spectral, achlioptas2005spectral]. See recent works [regev2017learning, hopkinsli2018, diakonikolas2018list, kothari2017better] for a detailed discussion. The work of Kumar and Kannan [kumar2010clustering] abstracted out a common property of datasets (spectral stability as defined in (66)) that captures mixtures of well separated Gaussians, the planted partitioning model, and other well clustered instances. They showed that a single algorithm, namely the popular Lloyd’s algorithm, with the right initialization, provably computes optimal solution for such stable instances. The separation factor needed for Lloyd’s to work in [kumar2010clustering] was later improved by [awasthi2012improved]. The recent work of [cohen2017local] shows that local search obtains a polynomial time approximation scheme (PTAS) on such spectrally stable instances. However, this in general does not guarantee closeness of the clustering obtained to the optimal one. The works of [kalai2010efficiently, moitra2010settling, belkin2010polynomial] provided algorithms for clustering Gaussian mixtures with no separation requirement. These algorithms, inherently have an exponential dependence on the number of clusters, k, in the running time. Building on robust algorithms for mean estimation, there have also been works to perform robust clustering of well separated instances under Huber’s contamination model and its variants [brubaker2009robust, diakonikolas2018list, kothari2017better, kothari2017outlier, hopkinsli2018]. There have also been works in analyzing the EM algorithm for Gaussian mixtures. See [balakrishnan2017statistical] for a detailed discussion. While motivated by the study of the phenomenon of robustness, the above results do not provide guarantees in our model of corruption. As in the case of mean estimation, these results are designed to be robust to a small number of outliers (e.g., a small constant fraction) in the training set. In our corruption model on the other hand, every data point could be potentially corrupted up to magnitude δ (measured in q norm for q2).

4 Notation and Preliminaries

Norms.

For every q1 and xn, we will use xq to denote the q norm of the vector x i.e., xqq=i[n]|xi|q. The dual norm of q is q* where 1/q*+1/q=1. We will heavily use Holder’s inequality which states that

(Hölder’s inequality)|u,v|uqvq*u,vn. (13)

When not specified, x will denote the Euclidean norm of x. Further 𝕊n-1 will represent the unit sphere for the Euclidean norm. For convenience, we will use x0 to denote the sparsity i.e., the size of the support of x (note that 0 is not a valid norm on vectors).

Operator Norms of Matrices.

We will use the following matrix norms. For any q,p1 and any matrix Mn×m, we will denote by Mqp=maxym,yq1Myp. By duality of vector norms, we have

Mqp=maxym,yq1maxzn,zp*1zMy=maxzn,zp*1maxym,yq1yMz=Mp*q*.

When p=q=2, this corresponds to the spectral norm of the matrix M i.e., the maximum singular value of M. When unspecified, we will use M to denote the spectral norm of M. (Note that the above equalities from duality also show that the A=A i.e., the maximum right singular value is the same as the maximum left singular value).

Entry-wise Norms of Matrices.

We will also consider various matrix norms obtained by considering a matrix Mm×n as a vector of size mn. In particular, for any q1 we will use Mq to denote the q norm of the “flattened” vector corresponding to M i.e., Mqq=i=1,j=1m,n|M(i,j)|q. The Frobenius norm MF=M2. Moreover for matrices A,B, we use A,B:=tr(AB) to represent the trace inner product.

High probability bounds.

We will say that an event holds with high probability (w.h.p.) if the probability of failure on a given instance is less than any polynomial of the input parameters e.g., the dimension n, and the number of data points m. We remark that for our randomized algorithms, we can amplify the success probability to 1-η for any η>0 by repeating the algorithm log(1/η) times (hence these guarantees will in fact hold with exponentially small failure probability).

4.1 Approximation Algorithms for Operator Norms.

Here we briefly describe some known positive and negative results for approximating the qp operator norm of a matrix (sometimes referred to as the (q,p)-Grothendieck problem. We will say that a randomized algorithm gives an α-factor approximation for the qp operator norm (for some α1) iff for any input matrix M the algorithm outputs with probability at least (1-n-ω(1)) a vector x^0 such that Mx^p/x^q1αMqp. There are three simple cases (q=1,1p, and p=,1q, and q=p=2), where the norm can be computed in polynomial time. Some notable cases that are known to be intractable are the 1 and 2 norm (hence also 21 by duality). The 1 norm is the well-known Grothendieck’s problem [Gro56] (that is related to the cut-norm of a matrix [alon2004approximating] and has a rich history [krivine1978constantes, braverman2013grothendieck, alon2006quadratic, khot2011grothendieck, pisier2012grothendieck, reeds1991, khot2006sdp]).

More generally, if 1p<q (non-hypercontractive norms), computing the qp norm is NP-hard [steinberg2005computation] and in fact the problem exhibits a dichotomy in terms of approximation. Specifically, whenever 2[p,q], the problem admits constant factor approximation algorithms, whereas whenever 2[p,q], the problem is hard to approximate within almost polynomial factors [bhaskara2011approximating]. Regarding approximation algorithms, the work of [steinberg2005computation] builds upon Nesterov’s theorems and extensions [nesterov1998semidefinite, wolkowicz2012handbook] and provides a 1/(23π-23)2.29 approximation for when 1p2q, and for the special case p=2 or q=2 the factor becomes π/21.25. Recently, improved upper and (almost matching) lower bounds were proved for many settings of q,p in [bhattiprolu2018approximating, bhattiprolu2018inapproximability].

In what follows let γr:=𝔼g[gr] be the r norm of a standard normal random variable gN(0,1) i.e., the r-th root of the r-th Gaussian moment, and let q* be the dual norm for q. Specifically, for qp with 2[p,q], it is NP-hard to approximate the qp norm within a factor smaller than 1γq*γp. The hardness factor matches the polynomial time algorithms from [steinberg2005computation] for cases when q or p equals 2, and this is encapsulated in Theorem 4.1. These algorithms are based on semi-definite programming (SDP) relaxations. The algorithm for the qq* norm for example first solves in polynomial time an SDP relaxation and produces a solution X0 of value SDPvalueMqq*; then the algorithm uses X to produce with high probability a vector y with Myq*/yq1αSDPvalue1αMqq*, where α1 is the approximation factor.

Theorem 4.1 (Combining [nesterov1998semidefinite, steinberg2005computation]).

For computing the 2 norm, there is a randomized polynomial time algorithm that gives a π/21.25-approximation algorithm, and for the q2 norm there is a randomized polynomial time algorithm that gives a 1/γq*-factor approximation. Furthermore, when the input matrices are positive semidefinite matrices, there exists randomized polynomial time algorithms that give a π/2 approximation for the 1 norm, and a 1/γq*2 factor approximation for the qq* operator norm respectively. These algorithms are SDP-based randomized algorithms that succeed with high probability for any given instance.

The approximation algorithms from Theorem 4.1 will serve as separation oracles for solving the convex relaxations that we will use for our algorithms. As noted above these approximability results are tight assuming PNP [bhattiprolu2018inapproximability].

4.2 Properties of Robust Projections.

Throughout the paper we will use the term projections and projection matrices to always refer to orthogonal projection matrices on to linear subspaces of n. Next we list and prove some simple properties of subspaces with robust projection matrices i.e., subspaces with Π2 (or more generally q2 norm for some q2) that is upper bounded.

For any q*[1,2], the ratio of the q* vs 2 corresponds to an analytic notion of sparsity. The following claim gives an upper bound on the q* norm in terms of the sparsity.

Lemma 4.2 (Analytic Sparsity).

Consider any vector vRn of support size k. For any q*[1,2], we have

vq*k1q*-12v2.

In particular, v1kv2 for vectors with support size at most k.

On the other hand, it is easy to see that the bound given here is tight for any vector that is equally spread out among its support of size k.

Proof.

Without loss of generality suppose v2=1 (if v=0 it holds trivially). Let v have support S of size k. Set p:=2/q*, and let u be the vector such that ui=|vi|q* for each i[n]. By Holder’s inequality

vq*q*=iS1ui𝟏Sp*upk1/p*(i|vi|pq*)1/pk1-q*/2v22/p=k1-q*/2,

hence establishing the lemma. ∎

Recall that q* corresponds to the dual norm for q, and q*[1,2] when q2. The following simple lemma proves two useful properties of robust subspaces i.e., subspaces having projection matrices with bounded 2 norm (or more generally q2 norm for q>2). The first property shows that any two vectors that are close in norm will have nearby projections onto any subspace that is robust. The second property shows that a subspace is robust (i.e., has a robust projection matrix) exactly when every vector in the subspace is analytically sparse.

Lemma 4.3.

[Properties of Robust Subspaces and Projections] Consider any subspace of VRn with projection matrix ΠRn×n satisfying Πq2κ. We have the following two properties:

  1. I.

    Closeness of projections of pertubations: For any vector v and its perturbation v~

    v-v~qδΠv~-Πv2κδ.
  2. II.

    Analytic sparsity: For any v𝒱, we have vq*κv2, where q*=q/(q-1). Moreover, if every vector in 𝒱 has vq*κv2, then Πq2κ. In particular Π2κ if and only if v1κ for all unit vectors v𝕊n-1𝒱.

Proof.

We first show property (I). Let u:=v-v~. Then

Πv~-Πv2=ΠuΠq2uqκδ.

To show property (II), note that by duality of matrix operator norms we have Πq2=Π2q*=Π2q*.

Hence    v𝕊n-1𝒱,vq*=Πvq*Π2q*v2κ.

For the converse, if there exists v𝕊n-1𝒱 s.t. vq*>κ, then by duality Π2q*=Πq2>κ.

Observe that the robustness condition on the subspace as captured by the q2 operator norm bound of its projection matrix Π is basis independent.The following gives a simple sufficient condition on the basis of the subspace that implies robustness of the subspace spanned by them. This relates our robustness of the subspace to alternate notions of sparsity of subspaces that have been studied in the literature on sparse PCA [VuLei12, VuLei13].

Lemma 4.4.

Given any orthonormal basis v1,v2,,vr for a subspace V such that viq*κ for each i[r], we have Πq2rκ.

Proof.

Firstly, Π=i=1rvivi, and Πq2=Π2q*. We have

Πq2=maxu:uq1i=1ru,vivi2i=1ru,vi2rmaxu:uq1maxv:vq*κ|u,v|rκ.

For a given matrix Bn×m, let us denote by Π(B) to the projection matrix onto the column space of B. The following lemma shows that the best low-rank (κ,q)-robust projection objective (3) also finds the low-rank approximation that has smallest error among ones with a (κ,q) robust column space.

Lemma 4.5.

Let Pr be the set of all rank-r projection matrices. Given a data matrix ARn×m and a given parameter κ1, q>0, we have

minΠ𝒫rΠq2κA-ΠA=minB:𝑟𝑎𝑛𝑘(B)r,Π(B)q2κA-B,

where M here stands for the spectral norm. The above statement is also true for the Frobenius norm.

Proof.

Let B* be the minimizer for the right minimization problem and let Π2=Π(B*) be its projection matrix, and let Π1 be the minimizer for the left optimization problem. It is easy to see that A-Π1AA-B*, since Π1A is also a feasible choice for B in the right minimization problem. To prove the other direction, if Ai,Bi* are the ith columns of A,B* respectively then,

A-B* =A-Π2B*=Π2(A-B*)+Π2A
=maxu𝕊m-1i[m]uiΠ2(Ai-Bi*)+i[m]uiΠ2Ai2
maxu𝕊m-1i[m]uiΠ2Ai2=Π2AΠ1A

as required. The first inequality above follows since the column space of Π2A and the column space of Π2(A-B*) are orthogonal. An identical proof also follows for the Frobenius norm.

Monotonicity of Matrix Norms.

The following property of certain matrix norms will be crucial in designing constant factor approximation algorithms for the low-rank approximations.

Definition 4.6 (Monotone matrix norm).

A matrix norm is said to be monotone if and only if

A,B0,A+BA. (14)

Observe that it suffices to check the above condition for all rank-1 PSD matrices B i.e., B=vv for vn. It is well known that all unitarily invariant matrix norms44 4 A matrix norm is unitarily invariant iff A=UAV for all matrices A and all unitary matrices U,V. are monotone (this is because unitarily invariant norms are just norms on the singular values). On the other hand, many other matrix norms including other entry-wise norms Xq or general operator norms Xqp are not necessarily monotone (see Claim A.1 and Claim A.2 for some counterexamples). Perhaps surprisingly, the qq* matrix operator norms are monotone (see Lemma 5.6 for a simple proof of this fact)!

5 Worst-case Approximation Guarantees

5.1 Approximations in Frobenius norm error

We will aim to obtain a bicriteria approximation for the robust low-rank approximation problem given in (3) for the Frobenius norm objective. In what follows q[2,].

minΠ ΠAF2=minΠAF2-AA,Π (15)
s.t. Π is a projection matrix of rank r, and Πq2κ. (16)

We prove the following theorem.

Theorem 5.1.

Suppose the data matrix ARn×m has an (orthogonal) projection Π* of rank at most r such that Π*q2κ and the approximation error OPT:=(I-Π*)AF2. There exists an algorithm that w.h.p. runs in polynomial time and finds a projection matrix Π^ of rank at most r such that for every γ>0,

Π^q2 CG(q)(1+1/γ)κ, and (I-Π^)AF2(2+γ)OPT, (17)

where CG(q)>0 is a constant that only depends on q as given in Theorem 4.1 (for q= this value is at most π/2). Moreover, for any γ>0, there exists an algorithm that w.h.p. runs in polynomial time and finds an rr(1+1/γ)-dimensional orthogonal projection Π^ such that

Π^q2 CG(q)(1+1/γ)κ, and (I-Π^)AF2(1+γ)OPT. (18)

The theorem above will be established by proving a statement about the more general problem of finding a low-rank projection that satisfies any monotone norm constraint that can be approximately separated. While the q2 norm is not monotone, as discussed in Section 4, we will show that applying the more general guarantee on an appropriate monotone norm helps prove Theorem 5.1 above. Let be a monotone matrix norm. Consider the following generalization of problem (19) that given a data matrix An×m, and a parameter κ1, finds a projection

minΠAF2-AA,Πs.t.Π is a projection matrix of rankr, and Πκ. (19)
Definition 5.2.

[α-approximately separable matrix norm] A matrix norm over n×m matrices is α-approximately separable for some α1 iff there exists a (potentially) randomized algorithm that w.h.p. runs in time poly(n,m), and when given a PSD matrix Bn×n and a parameter κ as input will either certify that Bακ, or finds a Zn×n such that (1) B,Z>κ, and (2) M,Zκ for all M s.t. Mκ.

Note that in the above definition, the only potential randomness is from the potential random choices of the algorithm. As we will see later the operator norms that we will consider (e.g., q2 norm and the qq* norm for q2) will be O(1)-approximately separable. This will allow us to construct an appropriate separation oracle for using the Ellipsoid algorithm.

The following general theorem gives an O(1) bicriteria approximation for the problem assuming the monotone matrix norm can be optimized approximately in polynomial time.

Theorem 5.3.

Let be any matrix norm that is monotone and α-approximately separable for some α1. Suppose we are given as input a data matrix ARn×m that has an (orthogonal) projection Π* of rank at most r such that Π*κ and the approximation error OPT:=(I-Π*)AF2. There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection matrix Π^ of rank at most r such that for every γ(0,1),

Π^ α(1+1γ)κ, and (I-Π^)AF2(2+γ)OPT. (20)

Moreover, for any γ(0,1), there exists an algorithm that w.h.p. runs in polynomial time and finds an r(1+1/γ)r-dimensional orthogonal projection Π^ such that

Π^ α(1+1γ)κ, and (I-Π^)AF2(1+γ)OPT. (21)

We remark that the only randomization in the algorithm is in the construction of the separation oracle for the matrix norm constraint Πκ. In particular the algorithm from Theorem 5.3 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (in fact with exponentially small failure probability), and hence in expectation.

We consider the following mathematical programming relaxation for the problem.

minX AF2-AA,X (22)
s.t. tr(X)r (23)
0XI (24)
Xκ (25)

First we observe that this is a valid relaxation to the problem. In fact any feasible projection matrix Π of rank at most r for (19) is a feasible solution to the above program (22)-(25) with the same value. The intended solution here is just X=Π. All the eigenvalues of Π are 0 or 1, since Π is a projection matrix; hence (23), (24) are satisfied. Moreover (25) is satisfied just because of the same constraint as in (19). Finally, the objective value is preserved since

AF2-AA,Π=AF2-tr(AAΠ)=AF2-tr(ΠA(ΠA))=AF2-ΠAF2.

In the above program, the objective (22) and constraints (23)-(24) define a semi-definite program (SDP). Moreover, for (25), we see that for any λ[0,1], by triangle inequality λX1+(1-λ)X2λX1+(1-λ)X2. Hence the set of all X that satisfies constraints (23) - (25) is convex. In general, constraint (25) may be NP-hard to verify for a given PSD matrix X. However, we can use the fact that is approximately separable to get a approximately feasible solution to the program in polynomial time, using the Ellipsoid method.

The following lemma shows that by truncating a solution of the program (22)-(25) to just the terms corresponding to the large eigenvalues, we retain much of the objective.

Lemma 5.4.

Let ε,δ>0, and let M0 have 𝑡𝑟(M)=1. Suppose X satisfies the SDP constraints (23) and (24) and M,X1-ε. Suppose PX1-δ is the projection operator onto the subspace spanned by eigenvectors of X with eigenvalues at least τ:=1-δ. Then we have

I-PX1-δ,Mε1-τ. (26)
Proof.

Let X=i=1nλivivi be the eigendecomposition of X (note that λi0 since X is p.s.d.), and let S={i:λiτ}. We have

(1-ε)tr(M) M,X=iλiviMvi
tr(M) =iviMvi, since {vi:i[n]} is an orthonormal basis
Hence i(λi-1+ε)viMvi 0iSεviMvi+iS(τ+ε-1)viMvi0
iSεviMvi iS(δ-ε)viMvi
Hence, I-PX1-δ,M εδ-εPX1-δ,M.

Substituting the above inequality in PX1-δ,M+I-PX1-δ,M=tr(M)=1, we get

I-PX1-δ,M11+δ-εε=εδ.

Proof of Theorem 5.3.

Let OPT:=εAF2 for some ε[0,1]. Without loss of generality, in the rest of the proof we will assume that AF=1. We will use the Ellipsoid algorithm to approximately solve the relaxation in (22)-(25). As we have explained before, the feasible set is convex. We now show how to design an approximate hyperplane separation oracle for (25); the rest of the constraints just correspond to a simple SDP. Since is α-approximately separable, we have a randomized polynomial time algorithm that given a matrix X^0 , either certifies that X^ακ (e.g., when the SDP value is at most ακ), and otherwise produces a separating hyperplane of the form Z,Xκ that is not satisfied by X^. Let T=poly(n,m) be the number of iterations taken by the Ellipsoid algorithm to produce a solution of value at most OPT (up to exponentially small additive error), assuming access to a separation oracle. Hence from a union bound over all the T iterations, we can use the above randomized polynomial time algorithm for separating (25), and run the Ellipsoid algorithm to find a solution X^ that satisfies X^ακ, and has objective value that is arbitrarily close to OPT. This algorithm runs in polynomial time with high probability (with an exponentially small probability it may not terminate when the separation oracle does not terminate in polynomial time). For the rest of the analysis, we condition on the event that the algorithm terminates in polynomial time.

Set δ:=1/(1+γ). Let X=iλivivi and let S={i:λi1-δ}. Define for each i[S], αi:=vivi,M, where M=AA. We sort the elements of S based on {αi}, and pick greedily the first min{r,|S|} of them to form TS. Our projection matrix will be ΠT=iTvivi.

We first argue that the operator norm constraint is approximately satisfied. By the monotonicity of the we have

ΠT =iTvivi1τi:λi>τλiviviXτακ1-δ=α(1+γ)γκ (27)

Also, from Lemma 5.4, we have

iSvivi,M =I-PX1-δ,Mεδ, and
iS(1-λi)vivi,M i(1-λi)vivi,M1-X,Mε.
Hence, iSλiαi =iSλivivi,M1-ε(1+1δ)=1-(2+γ)ε,

for our choice of δ=1/(1+γ). By our greedy choice of T, we have iTαiiSλiαi, as iSλitr(X)min{r,|S|}, with each λi[0,1]. Thus ΠTAF2(2+γ)ε.

The guarantee in (18) is obtained by returning the projection ΠS=iSvivi. Observe that |S|r/(1-δ) from (29). The operator norm bounds follows using the same argument as (18) with T=S. Moreover, the objective value follows directly from Lemma 5.4. ∎

Proof of Theorem 5.1.

Our goal will be to apply Theorem 5.3 to obtain our required guarantee. However the q2 operator norm is not monotone when q>2; see Claim A.2 for a counter-example. Our crucial insight in this result is we can write down the following mathematical programming relaxation for the problem, where we instead use the qq* norm which we show is indeed monotone!

minX AF2-AA,X (28)
s.t. tr(X)r (29)
0XI (30)
Xqq*κ2 (31)

In the above program, the objective (28) and constraints (29)-(30) define a semi-definite program (SDP). We will show later that we can get O(1) approximate separation oracle for the problem. The following lemma shows that the above SDP is a valid relaxation to the problem.

Lemma 5.5.

Any feasible projection matrix Π of rank r satisfying (19) is a feasible SDP solution with the same value.

Proof.

The intended SDP solution here is just X=Π. All the eigenvalues of Π are 0 or 1, since Π is a projection matrix; hence (29), (30) are satisfied. To verify (31), note that by duality and since Π2=Π,

Πqq* =maxyq1Πyq*=maxyq1,zq1zΠy=maxyq1,zq1Πz,Πy=maxyq1Πy,Πy
=maxyq1Πy22=Πq22,

as required. Finally, the objective value is preserved since

AF2-AA,Π=AF2-tr(AAΠ)=AF2-tr(ΠA(AΠ))=AF2-ΠAF2.

The following lemma shows that the norm in constraint (31) satisfies the monotonicity property (Definition 4.6), so that we can apply Propositon 5.3.

Lemma 5.6 (Monotonicity of qq* operator norm).

For any q1, the operator norm qq* is monotone.

Proof.

Let Bn×n. It suffices to show that for any B0,vn, B+vvqq*Bqq*.

Bqq* =maxyq1,zq1zBy=maxyq1,zq1B1/2z,B1/2ymaxyq1B1/2y,B1/2y=maxyq1yBy.

In other words, the quadratic form is maximized by y=z. Moreover for every y, yT(B+vv)y=yTBy+y,v2yBy. Hence, B+vvqq*Bqq*.

Proof of Theorem 5.1.

We will apply Theorem 5.3 with :=qq*. Lemma 5.5 shows the feasibility of the convex program . From Lemma 5.6, we have that the qq* is monotone. We now show that qq* is O(1)-approximately separable. Recall that the

Xqq*=maxy,zns.t.yq,zq1yz,X.

Crucially, there is an efficient approximate separation oracle for (31) using an approximation algorithm for the qq* operator norm (see Theorem 4.1). The CG(q)-factor SDP-based approximation algorithm immediately gives a CG(q)-factor approximate separation oracle. This SDP-based algorithm either certifies that the given matrix B0 has Bqq*CGκ2 (when the SDP value is at most CGκ2); otherwise (when the SDP value is larger than CGκ2) the algorithm, w.h.p. runs in polynomial time and produces a solution y,zn with yq,zq1 that gives a separating hyperplane of the form y(z),Xκ2 (that B does not satisfy)55 5 The CG(q)-factor SDP-based algorithm has the property that when the SDP value is larger than CGκ2, there is a rounding algorithm that runs in time poly(n,m) and with high probability produces the desired solution y,z. To get the Las Vegas guarantee, we simply run the rounding algorithm repeatedly until it outputs the desired solution.. Hence qq* is α=Oq(1) approximately separable (in particular when q=, we have απ/2). Hence Theorem 5.3 can be applied to finish the proof. ∎

5.2 Approximations in the Spectral norm

We now show how techniques similar to those in Section 5.2 can also be extended to robust low-rank approximations, when the error is measured in spectral norm as opposed to the Frobenius norm. In this section we will use A to denote the spectral norm of matrix A. For convenience of exposition, we will measure the projection error in (3) using the spectral norm A-ΠA as opposed to the squared spectral norm.

Theorem 5.7.

Suppose the data matrix ARn×m has an (orthogonal) projection Π* of rank at most r such that Π*q2κ and the approximation error OPT:=(I-Π*)A. There exists an algorithm that w.h.p. runs in polynomial time, and finds an (orthogonal) projection matrix Π^ of rank at most r such that for every γ(0,1),

Π^q2 CG(q)(1+2/γ)κ, and (I-Π^)A(3+γ)OPT, (32)

where CG(q)>0 is a constant that only depends on q as given in Theorem 4.1. For q= this value is known to be at most π/2.

Moreover, for any γ(0,1), there exists an algorithm that w.h.p. runs in polynomial time and finds an r(1+2γ)r dimensional orthogonal projection Π^ such that

Π^q2 CG(q)(1+2/γ)κ, and (I-Π^)A(1+γ)OPT. (33)

We remark that as in Theorem 5.1, the only randomization in the algorithm from Theorem 5.7 is in the construction of the separation oracle for the matrix norm constraint. In particular the algorithm from Theorem 5.7 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (and hence in expectation).

Proof of Theorem 5.7.

We will use the following mathematical relaxation for the problem.

min λ (34)
s.t. A(I-X)A λI (35)
Xqq* κ2 (36)
tr(X)r, and 0XI (37)

To see that the above program is a valid relaxation for the problem, note that any rank-r projection matrix Π satisfies (37). Moreover as in Lemma 5.5 we see that

Πqq* =maxyq1,zq1zΠy=maxyq1,zq1Πz,Πy=maxyq1Πy22=Πq22.

Finally to see that the objective value is preserved, note that for any projection matrix Π,

A(I-Π)A=AΠΠA=ΠA2=A-ΠA2,

as required. Lemma 5.8 shows that the above program can be solved approximately in polynomial time.

Lemma 5.8.

There exists a universal constant c=c(q)1 and a randomized algorithm that given an instance ARm×n that has a feasible solution X* to the relaxation (34)-(37) with objective value λ*, runs in polynomial time w.h.p., and finds a solution X^ satisfying (37) with objective value arbitrarily close to λ* such that X^qq*cκ2.

Proof.

We will use the Ellipsoid algorithm to approximately solve the relaxation in (34)-(37). Note that the feasible set is convex (including (36), which is just a upper bound constraint on a matrix norm). We now show how to design an approximate hyperplane separation oracle for (36) and a separation oracle for the rest of the constraints. The constraint (36) can be rewritten as yz,Xκ2 for all y,zn such that yq,zq1. Crucially, there is an efficient approximate separation oracle for (36) using an approximation algorithm for the qq* operator norm problem (see Theorem 4.1). As in Theorem 5.1, the c:=CG(q)-factor SDP-based approximation algorithm immediately gives a c-factor approximate separation oracle. This SDP-based algorithm either certifies that the given matrix B0 has Bqq*CGκ2 (when the SDP value is at most CGκ2); otherwise (when the SDP value is larger than CGκ2) the algorithm, w.h.p. runs in polynomial time, and produces a solution y,zn with yq,zq1 that gives a separating hyperplane of the form y(z),Xκ2 (that B does not satisfy). Hence by a union bound, we can assume that (36) can be separated up to a O(1) factor over all the T=poly(n,m) iterations of the Ellipsoid algorithm by an algorithm that w.h.p. runs in polynomial time.

The constraints (37) can be efficiently separated using an algorithm for eigenvalue computations. Finally, given λ,X, (35) can also be efficiently separated by computing the maximum eigenvalue of A(I-X)A. Let v𝕊n-1 be the corresponding eigenvector. If the constraint is violated, the hyperplane separator is of the form

vv,AA-vv,AXA-λ0, i.e., vv,AA-AvvA,X-λ0,

since tr(vvAXA)=tr(AvvAX). This completes the proof. ∎

The proof of Theorem 5.7 also crucially uses the monotonicity of qq* matrix operator norm.

Proof of Theorem 5.7.

Let OPT2:=ε2A2 for some ε(0,1]. Set δ:=2/(2+γ). Lemma 5.8 shows that in polynomial time we obtain a solution X0 satisfying (37), (35) with λOPT, and Xqq*CG(q)κ2 (with exponentially small probability the algorithm does not terminate in polynomial time). The rest of the algorithm (and analysis) conditions on the success of this event. Let X=iλivivi and let S={i:λi1-δ}.

For the rest of the analysis we will assume without loss of generality that A=1. We first show the guarantee in (33). The projection output is just ΠS=iSvivi. Observe that |S|r/(1-δ) from (37). Since the projector we output is just ΠS, each of its associated eigenvalues as at least 1-δ. Hence, the operator norm bounds follows using the monotonicity of the norm since ΠS11-δX. To verify the objective value we see that

i[n](1-λi)AviviA =A(I-X)Aε2A2=ε2
iSδAviviA iS(1-λi)AviviAi[n](1-λi)AviviAε2
Hence AΠSA ε2δ=(1+γ2)ε2,

as required. We now show the guarantee in (32) where we output a projection of rank at most r (with no slack). Let M:=iSAviviA. Let Π be the projection matrix for the subspace corresponding to the best rank r projection of M. The algorithm outputs Π.

Note that ΠΠS, hence by monotonicity, the qq* operator norm constraint is satisfied up to a α:=c factor. Also note that if Π* is the projection that gives the optimal solution to the problem,

M-AΠ*AM-AA+AA-AΠ*Aε2+ε2δ=ε2(1+1δ)=(2+γ2)ε2.

But AΠ*A is a valid approximation of M of rank at most r. Hence, we have that

M-ΠMΠ ε2(1+1δ)
AA-ΠAAΠ AA-ΠΠSAAΠSΠAA-M+M-ΠMΠ
2ε2δ+ε2=(1+2δ)ε2=(3+γ)ε2,

as required.

As before the same ideas also give the following more general theorem for any monotone matrix norm that is approximately separable.

Theorem 5.9.

Let be any matrix norm that is monotone and α-approximately separable for some α1. Suppose the data matrix ARn×m has a projection Π* of rank at most r such that Π*κ and the approximation error OPT:=(I-Π*)A. There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection Π^ of dimension at most r such that for every γ(0,1),

Π^ α(1+2/γ)κ, and (I-Π^)A(3+γ)OPT. (38)

Moreover, for any γ(0,1), there exists an algorithm that w.h.p. runs in polynomial time and finds an orthogonal projection Π^ of rank r(1+2γ)r such that

Π^ α(1+2/γ)κ, and (I-Π^)A(1+γ)OPT. (39)

We omit the proof, since the ideas are identical to Theorem 5.7.

5.3 Recovering the Optimal Projection Matrix

We now show that if the optimal robust low-rank projection has very small error compared to the rth smallest singular value of A, then we can in fact approximately recover the subspace itself up to small error measured in terms of the principal angles. For two subspaces with projection matrices Π1,Π2, the Sin of the canonical angles matrix is given by Π1Π2. These techniques will also be helpful for recovery in the Spiked Covariance model. The following simple corollary will work for both Frobenius norm error and spectral norm error. For this purpose, we will just use A to denote the norm of A, where the unspecified matrix norm is norm in which we are measuring the error – either Frobenius norm or spectral norm.

Corollary 5.10.

Suppose the data matrix ARn×m has an r-dimensional projection Π* such that Π*q2κ, the approximation error OPT:=(I-Π*)A2<ε2A2 and σr(Π*A)θ. There exists an algorithm that w.h.p. runs in polynomial time, and finds a projection Π^ of rank at most r such that

ΠΠ^O(1+α)εAθ. (40)

where the subspace corresponding to Π is a subset of the subspace given by Π* and α is the approximation factor attained by the algorithm in Theorem 5.1 (or Theorem 5.7).

Note that the above bound holds for the spectral norm error and the Frobenius norm error.

Proof.

The algorithm is exactly the same algorithm used in Theorem 5.1. Let Π denote the best robust low-rank subspace for A. We will then use the Davis-Kahan sinΘ theorem about perturbations of singular vectors to show that the subspaces given by Π1 and Π2 are close. Note that the Davis-Kahan theorem states that if Πi is the projection matrix onto eigenspaces of AiAi respectively (i{1,2}) with the least singular values of Π1A1 being at least δ>0 more than the singular values of Π2A2, then for any unitarily invariant norm ,

Π2Π1A1-A2δ.

We would like to apply it with A2=Π^A,A1=Π*A and Π2=Π^,Π1=Π*. We know that by the triangle inequality, for some constant α given by the approximation ratio in Theorem 5.1 (or Theorem 5.7),

Π*A-Π^A ΠA*-A+A-A^εA+αεA(α+1)εA, (41)

where we used the fact that Π^ gives an α-factor approximation to the objective. Moreover, in our case A1=Π*A is itself of rank-r and Π2A2=0. Under the stronger assumption in (40), we have σr(Π*A)θ. Hence we see that (40) holds since

Π^Π* Π*A-Π^Aθ(1+α)εθ.

6 Data Poisoning and Robustness to Adversarial Perturbations at Training time

6.1 Training-time robustness: Approximations in Frobenius norm error

Theorem 6.1.

Suppose q2 and ARn×m is the (unknown) uncorrupted data matrix, with an (orthogonal) projection matrix Π* of rank at most r that is robust i.e., Π*q2κ satisfying A-Π*AF2εAF2 for some ε[0,1]. There exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix A~ s.t. for each column j[m], A~j-Ajqδ, outputs an orthogonal projection Π^ of rank at most r such that for any η>0

Π^q2 O(κ), and A-Π^AF2O(ε+η)AF2+O(1η)δ2κ2m. (42)

To get a multiplicative approximation we will set η=O(ε), and get an extra additive term of δ2κ2m/ε. Here think of δ2κ21mεAF2. Further we remark that the above guarantees are optimal up to constant factors; in particular, the additive factor of O(mδ2κ2) is unavoidable (see Proposition 6.7).

The main challenge here is that while A has a good low-rank projection (in fact a robust one), A~ may be very far from a rank-r matrix (let alone having a robust rank-r approximation). Further, the best robust low-rank approximation of A~ could be very different from the best robust low-rank projection of A. This is because the entry-wise perturbations of δ could be too large in aggregate; for instance, it could be the case that A~F2AF2. Suppose Π* is the best robust low-rank projection of A. We will run the algorithm in the previous section not on the given matrix A~, but on a suitably modified matrix A.

Input: A~, the corrupted n×m data matrix, rank r, robustness parameter κ1 and norm q2. 1. Compute A (using Lemma 6.2) such that A=argminBn×m s.t. Bj-A~jqδ,j[m]BF. 2. Run the algorithm from Theorem 5.1 on A, to obtain a rank-r projection matrix Π^. 3. Output Π^.

Figure 1: Robust rank-r approximations in Frobenius norm under adversarial perturbations during training.
Lemma 6.2.

There is a polynomial time algorithm that given any matrix MRm×n, can find

Γq(M)=minBn×m s.t. Bj-Mjqδ,j[m]BF2,

up to arbitrary accuracy.

Proof.

First we note that since BF2=jBj22, the optimization problem is separable across each of the m samples i.e.,

minBn×m s.t. Bj-Mjqδ,j[m]BF2=j[m]minBjn s.t. Bj-MjqδBj22

We now describe how to solve each of the m subinstances corresponding to the column j[m], which for a given bn is of the form

minznzqδb-z22.

Note that the least-squares objective b-z22 is convex. Moreover the constraint zqδ is also convex; moreover there is a simple separation oracle for this constraint since by duality

zq=maxyn:yq*1y,z=z*z*q*,z, where zi*=sign(zi)|z(i)|q-1i[n].

Hence by using the Ellipsoid algorithm, this problem can be solved in polynomial time. 66 6 In fact Projected Gradient Descent Algorithm can also be used here; see [sra2012fast].

Note that when q=, it is easy to find the matrix A, by just setting

Aij=sign(Mij)max{0,|Mij|-δ},i,j[n].

We will argue that Π* also gives a good low-rank approximation to A. This crucially uses the fact that Π* has bounded q2 norm, which implies the following useful lemma.

Lemma 6.3.

Suppose A,BRn×m are two matrices such that for each column j[m], Aj-Bjqδ, and let Π be any rank-r projection matrix such that Πq2κ. Then for any η(0,1),

(1-η)ΠAF2-(1η-1)δ2κ2mΠBF2(1+η)ΠAF2+(1η+1)δ2κ2m.
Proof.

For each j[m], let Aj,Bj be the jth columns of A and B respectively. Then Π(Aj-Bj)2Πq2Aj-Bjqδκ. Using this along with the triangle inequality we get,

ΠBF2 =j=1mΠ(Bj-Aj)+ΠAj22j=1m(ΠAj2-δκ)2j=1mΠAj2)2-2(δκη)(ηΠAj2)+(δκ)2
(1-η)ΠAF2-(1η-1)δ2κ2m, for any η(0,1).

This proves the first inequality. A similar argument also shows the other inequality. ∎

We now prove that Algorithm 1 finds an approximately optimal robust low-rank projection for unknown, uncorrupted data matrix A.

Proof of Theorem 6.1.

The first step of the algorithm finds the matrix A given by

A=argminBn×m s.t. Bj-A~jqδ,j[m]BF2.

Note that AFAF since A is also a feasible solution for the above minimization. Moreover since Aj-Ajq2δ for each j[m], we get from Lemma 6.3,

Π*AF2 (1-η)Π*AF2-4(1η-1)δ2κ2m, for any η(0,1). (43)

Now we run the algorithm from the previous section (Theorem 5.1) on A. From Theorem 5.1 (with δ=1/2 say), we find a rank-r projection matrix Π with Π2O(κ) such that

A-ΠAF2 3(AF2-(1-η)Π*AF2+4(1η-1)δ2κ2m)
3(A-Π*AF2)+3ηΠ*AF2+12(1η-1)δ2κ2m
3(ε+η)AF2+12(1η-1)δ2κ2m.

However we know that AF2Π*AF2. Hence

ΠAF2 AF2-A-ΠAF2Π*AF2-A-ΠAF2
(1-η)Π*AF2-3(ε+η)AF2-16(1η-1)δ2κ2m
Hence, A-ΠAF2 =AF2-ΠAF2Lemma6.3AF2-(1-η)ΠAF2+(1η+1)δ2κ2m
AF2-(1-η)2Π*AF2+3(ε+η)(1-η)AF2+mδ2κ2(1+1η+16(1-η)(1η-1))
A-Π*AF2+(3ε+5η)AF2+(1+17η)δ2κ2mO(η)AF2+O(1η)δ2κ2m,

for any η4ε. ∎

6.2 Training-time robustness: Approximations in Spectral norm error

We now show guarantees for low-rank approximations in spectral norm error that are similar to Theorem 6.1. However, there is a qualitative difference: we will either find a robust low-dimensional projection of the unknown dataset A, or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a low-dimensional representation that is bad for the unknown data matrix A. We will later see how these guarantees also imply training-time robustness for downstream unsupervised learning applications like spectral clustering, robust mean estimation and learning mixture models. In what follows will refer to the spectral norm.

Theorem 6.4.

Suppose q2 and ARn×m is the (unknown) uncorrupted data matrix, and let Π* have the smallest spectral norm error A-ΠA among (orthogonal) projections of rank at most r that are robust i.e., Πq2κ. There exists an algorithm (Alg. 2) that given as input any (adversarially perturbed) data matrix A~ s.t. for each column j[m], A~j-Ajqδ and a parameter τ>0, w.h.p. runs in polynomial time and outputs either a projection matrix Π^ of rank at most r or outputs Bad Input s.t.

  1. (I)

    if the algorithm outputs a projection Π^ of rank at most r, then it is a near-optimal robust low-rank approximation for the unknown matrix A i.e., for some small universal constant c1,

    η>0,Π^q2cqκ, and (I-Π^)AO(1+1η)(τ+A-Π*A+mδκ)+2ηA. (44)
  2. (II)

    if the algorithm outputs Bad Input , then either the data was poisoned i.e., A-A~>τ, or there is no good robust spectral norm approximation for A i.e., A-ΠA>τ for all rank-r projection matrices Π s.t. Πq2κ.

In particular, if we are promised that A has a good robust projection Π* of value A-Π*AεA, then the algorithm either finds an approximately optimal robust projection Π^ of rank at most r for A with

Π^q2cqκ, and η>0,(I-Π^)AO(1+1η)(A-Π*A+mδκ)+2ηA, (45)

or certifies that the data has been poisoned i.e., A~-A>εA.

Our algorithm just runs the worst-case approximation algorithm from Theorem 5.7 on A~ to find a projection Π^. If the error is less than τ, it outputs Π^; else it certifies that the data is corrupt.

The main feature of the above algorithm is that it is always correct. The algorithm certifies that the input is Bad only when the data has been poisoned i.e., A~ is substantially far from A, or A did not have a good robust low-rank approximation to begin with. More crucially, when it does output a projection matrix Π^, it is guaranteed to be a valid robust projection77 7 In particular it rules out the scenario where the algorithm finds a solution that it thinks is good (on A~), but is in fact bad for the unknown, uncorrupted matrix A. for the unknown matrix A (with an exponentially small probability, the algorithm may not terminate in polynomial time; this happens exactly when the algorithm from Theorem 5.7 fails to terminate in polynomial time). We remark that the additive error term of Ω(δκm) is unavoidable here information-theoretically; see Proposition 6.7 for an example.

Input: A~, the corrupted n×m data matrix, tolerance parameter τ>0, rank r, robustness parameter κ1 and norm q2. 1. Run the algorithm from Theorem 5.7 on A~, to obtain a rank-r projection matrix Π^. 2. If the robust low-rank approximation error on A~, A~-Π^A~τ, output Π^. 3. Otherwise output Bad Input .

Figure 2: Robust rank-r approximations in Spectral norm error under adversarial perturbations in training.

The following is the key lemma that argues that if the projection Π^ gives a small error on A~, it necessarily gives a low-error on A.

Lemma 6.5.

Let δR+ and A,BRn×m such that A-Bqδ. Let Π1,Π2 be projection matrices such that Π1q2,Π2q2κ, and = A-Π1Aε1 and B-Π2Bε2. Then we have that for any η(0,1),

A-Π2AO(1+1η)(ε1+ε2+mδκ)+2ηA, and B-Π1BO(1+1η)(ε1+ε2+mδκ)+2ηB. (46)
Proof.

The projection matrices Π1,Π2 are both robust. For {1,2}

ΠA-ΠB2 Π(A-B)F2=j[m]Π(Aj-Bj)22mκ2δ2
Hence |ΠA-ΠB| mκδ. (47)

Let γ:=mδκ. We also know that A-Π1Aε1.

A-Π1BA-Π1A+Π1A-Π1Bε1+γ
Hence v𝕊n-1, Av-Π1Bv2ε1+γ,and similarly Bv-Π2Av2ε2+γ. (48)

But Bv=Π1Bv+Π1Bv. We have for any η(0,1)

Bv22 =Π1Bv22+Π1Bv22(Av2-ε1-γ)2+Π1Bv22
(1-η)Av22-(ε1+γ)2(1+1η)+Π1Bv22
Similarly, Av22 (1-η)Bv22-(ε2+γ)2(1+1η)+Π2Av22

Combining the two, we get that

Bv22 (1-η)2Bv22-(1+1η)((ε1+γ)2+(ε1+γ)2)+(1-η)Π2Av22+Π1Bv22
(1-η)Π2Av22+Π1Bv22 (2η-η2)Bv22+(1+1η)((ε1+γ)2+(ε1+γ)2)
v𝕊n-1,Π1Bv22 2ηBv22+(1+1η)((ε1+γ)2+(ε1+γ)2).
Hence, B-Π1B2 2ηB2+(1+1η)(ε1+2γ+ε2)2,

as required. A similar statement also follows for A using a symmetric proof. ∎

Proof of Theorem 6.4.

Firstly the algorithm from Theorem 5.7 runs on A~ and with high probability produced a robust projection matrix Π^ (with some exponentially small probability, the algorithm fails to terminate in polynomial time). We now condition on the event that algorithm from Theorem 5.7 terminates in polynomial time. The proof consists of two parts. We first argue that if the algorithm outputs any robust rank-r projection matrix, then it has to be robust for A. Any such Π^ satisfies A~-Π^A~τ. Applying Lemma 6.5 with ε2=τ (B=A~) and ε1=A-Π*A, we have

A-Π^AO(1+1η)(τ+A-Π*A+mδκ)+2ηA.

On the other hand, if the input A~ is not “Bad” i.e., (a) for the unknown matrix A, A-Π*Aτ, and (b) A-A~τ, we now show that the algorithm outputs a good solution for A. In this case we have that A~-Π*A2τ; hence,

A~-Π*A~A~-Π*A+Π*A~-Π*A2τ+j[m]Π*Aj-Π*A~j222τ+mκδ.

Hence, by Lemma 6.5 applied with ε1=τ and ε2= (B=A~), we have that

A-Π^AO(1+1η)(τ+mδκ)+2ηA.

This proves the theorem. The moreover part follows by setting τ:=εA.

In fact, Lemma 6.5 implies a stronger information-theoretic statement about finding a robust low-rank approximation of the unknown, uncorrupted matrix A with low spectral norm (just like Theorem 6.1 for Frobenius norm error). In fact we get a polynomial time algorithm assuming access to a polynomial time algorithm approximation algorithm for solving the following problem: given a matrix A~n×m, find88 8 This problem is reminiscent of the concept of ε-rank [AlonVempala], that corresponds to the smallest rank attainable by changing every entry of the given matrix by at most δ.

minB:Bj-Aj~qδj[m]   minΠ:rank(Π)=r,Πq2κB-ΠB2, (49)

where stands for the spectral norm.

Proposition 6.6.

Suppose q2 and ARn×m is the (unknown) uncorrupted data matrix, and let Π* have the smallest spectral norm error A-ΠA among rank-r projections that are robust i.e., Πq2κ. Suppose further that there is an efficient algorithm for finding an α-factor approximation algorithm for (49). Then there exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix A~ s.t. for each column j[m], A~j-Ajqδ and a parameter τ>0, outputs a robust projection matrix Π^ of rank at most r that is near optimal in approximation error for the unknown matrix A i.e., for some small universal constant c1,

η>0,(I-Π^)AO(1+1η)(αA-Π*A+mδκ)+2ηA. (50)

Moreover, the above bound is achieved information-theoretically by an algorithm (that potentially does not have polynomial running time), by using an inefficient algorithm for problem (49).

We remark that the main difference between the above proposition and Theorem 6.4 is that Proposition 6.6 will always output a good robust projection for A (just like Theorem 6.1 for Frobenius norm error), but the algorithm is not computationally efficient unless (49) can be solved efficiently.

Proof.

Given A~, the algorithm first runs the α-factor approximation algorithm for solving (49) on A~. The uncorrupted matrix A is itself a feasible solution; hence the solution output by the algorithm A has a robust low-rank approximation of error O(α)A-Π*A. Such a robust low-rank projection Π^ for A i.e., a projection for rank at most r with Π^q2O(κ) and A-Π^AO(α)A-ΠA can be found by running Theorem 5.7 on A. Moreover A and A are valid 2δ adversarial perturbations of each other. Now applying Lemma 6.5 with A,Π* and A,Π^ completes the proof. ∎

6.3 Lower Bound for the Additive error in Training with Adversarial Perturbations

We now show that the additive terms of Ω(mδ2κ2) in Theorem 6.1 is unavoidable.

Proposition 6.7.

For any data matrix A with the following two properties:

  1. 1.

    Each column Aj2[1/10,10],

  2. 2.

    There exists Π* of rank 1 and Π*2κ (which is at least 2) satisfying Π*A=A,

for any δ small enough, there exist A as a δ-perturbation of A (i.e., A-Aδ) and a projection matrix Π of rank 1 satisfying

  1. 1.

    Π is more robust than Π*, i.e., Π2<Π*2.

  2. 2.

    We still have ΠA=A but A-ΠAF=Ω(δκm). Since A-A is of rank 2, this also implies a similar lower bound for the spectral norm.

Proof.

Let v be the unit eigenvector of Π* such that v1κ. Without loss of generality, we assume |v1||v2||vn| and =supp(v)/2. Notice that supp(v)2 by this definition such that v20. At the same time, because of the Cauchy-Schwartz inequality, we have v12supp(v)v22 such that κ2/2-1.

Then we consider any δ|v2| and perturb v to another “sparser” vector u whose point-wise absolute value is

(|v1|+δ,,|v|+δ,|v+1|-δ,,|v2|-δ,|v2+1|,,|vn|).

However, since vi may be negative or positive, we define u according to the sign function:

u=(v1+sign(v1)δ,,v+sign(v)δ,v+1-sign(v+1)δ,,v2-sign(v2)δ,v2+1,,vn).

We have u1=i=1|vi|+δ+i=+12|vi|-δ+i>2vi=v1 and

u22 =(|v1|+δ)2++(|v|+δ)2+(|v+1|-δ)2++(|v2|-δ)2+v2+12++vn2
=ivi2+2(i=1|vi|-i=+12|vi|)δ+2δ2.

Since |v1||v2||vn|, this is at least ivi2+2δ2>v22. So let u¯=u/u2 with unit 2 norm and Π=u¯u¯. So Π2=u¯1<v1=Π*2.

Next we consider A. Since Π*A=A, we assume A=[c1v,c2v,,cmv] with coefficient |ci|[1/10,10]. We set A=[c1u,,cmu] such that A-A10δ and ΠA=A.

Finally we lower bound A-ΠAF2. Notice that

u,v=i=1(vi2+|vi|δ)+i=+12(vi2-|vi|δ)+i>2vi2=1+(i=1|vi|-i=+12|vi|)δ.

Thus Πv=u,vu/u22=αu for α=1+(i=1|vi|-i=+12|vi|)δ1+(i=1|vi|-i=+12|vi|)δ+2δ2<1. So we lower bound the distance between v-Πv by counting the entries from v+1 to v2:

i=+12[vi-α(vi-sign(vi)δ)]2=i=+12[(1-α)|vi|+αδ]2.

Since δ|v2|, each term in the summation is at least δ2. So A-ΠAF=c12++cm2δ=Ω(δκm). ∎

7 Robustness to Training Time Corruptions in Unsupervised Learning

7.1 Mean Estimation

Recall that the uncorrupted data matrix An×m has mean vector denoted by μ:=Mean(A). In our setting, every point Ain can potentially be corrupted adversarially up to a δ amount, as measured in norm (more generally q norm for q2). Our input instance is A~n×m where every column j[m] satisfies A~j-Ajqδ. In this section we will show how to use the guarantee from Theorem 6.4 to either compute an accurate estimate of the mean of the unperturbed dataset from the given set of perturbed points, or certify that the data has been poisoned. In particular, our estimate of the mean of the unperturbed dataset will simply be the mean of the perturbed data points when projected onto a robust subspace. We will use the algorithm from Figure 2 to compute such a projection, or certify that the dataset is poisoned.

How much δ can one handle?

Consider, for instance the case when q=. Let μ be the mean of the data points in the matrix A with C representing the n×m matrix with each column being μ. Furthermore denote σ2 to be the data variance, i.e., A-C2=σ2m. Notice that δ is the bound on the norm of the perturbation. If A~ is the corrupted data points, in the worst case the Mean(A) and Mean(A~) could differ by δn, thereby needing δ to be inversely proportional to n. However, if the one dimensional subspace spanned by Mean(A) is κ-robust we can do better. As an extreme case, assume that both A and A~ exactly lie in a κ-robust subspace. Then Mean(A) and Mean(A~) can only differ by κδ. Notice that in this best case scenario we can handle much larger perturbations. In particular, if the mean is a k-sparse unit length vector, then κk and we can potentially handle δ as large as 1/k. Our main result of this section is that we can indeed achieve this best case guarantee or certify that the data has been poisoned, i.e., A-A~ is large.

Why natural certification approaches fail?

A natural approach for robust mean estimation in our model is to simply compute μ^=Mean(A~) as an estimate, and if the variance of A~ around μ^ is not much larger than σ2m then output μ^, otherwise say that the data has been poisoned. This would work for perturbations of the order of 1/n. However, in the regime of interest to us, i.e., perturbations of the order of 1/κ, this could fail miserably. Consider a data set of m points generated as μ+g where μ=(1/k,,1/k,0,,0) and g is a standard normal vector orthogonal to μ and with variance σ2 in every direction orthogonal to μ. Hence, the data lies close to a robust (k sparse in 1/2 sense) one dimensional subspace with variance σ2m. If perturbations of magnitude 1/κ1/k are allowed in each dimension we could construct A~ by taking every point in A and zeroing out the first k coordinates and adding +1/k to the rest of the coordinates. This new data set has also has variance σ2m around the mean. However, Mean(A~) and Mean(A) are quite far. In this case we would like to certify that the data set has been poisoned. In particular, this would follow from certifying that the new data set has no small projection (in spectral norm) onto any robust one dimensional subspace. On the other hand, if the adversary hides the perturbations such that A~ has a good projection on to a robust one dimensional subspace (not necessarily the one spanned by μ), we require to output a good approximation to μ. The guarantee of Theorem 6.4 will be used to achieve this.

The robust mean estimation guarantee from this section will also serve as a subroutine in the application to clustering a mixture of well separated Gaussians and more general data sets, even when every data point is poisoned. We discuss the clustering application in Section 7.2. In what follows, for a vector u, u will denote u2, and for a matrix M we will denote by M the spectral norm of matrix M. The goal of this section is to prove the following theorem.

Theorem 7.1.

Let A be an n×m matrix representing m data points in n dimensions and let μ be the mean of the data points in the matrix A with C representing the n×m matrix with each column being μ. Let Π*=μμ/μ2 be the one dimensional subspace denoting the projection onto μ and assume that Π*q2κ, for some q2. Let A~ be the given input such that for every column j[m] we have Aj-A~jqδ. Furthermore, let σ2>0 be a given upper bound on the variance of the data around the mean, i.e., A-Cσm. Then the algorithm from Figure 3 when run on A~, w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., A~-A=Ω(σm), or outputs an estimate μ^ of the true mean μ such that

μ^-μ2 O(cq)(1+κδσ)max(σ,σμ),

where cq is a constant that depends on q. In particular, the above implies a relative error guarantee of

μ^-μ2μ O(cq)(1+κδσ)max(σμ,σμ).
Proof.

By assumption we know that A-Cσm. This implies that

A-Π*A A-C+C-Π*A
=A-C+Π*(C-A)
2A-C2σm.

Since we set τ=2σm, from the guarantee of Theorem 6.4, we know that if the algorithm outputs Bad Input , the data must be poisoned, i.e., A-A~>2σm. Next suppose that the algorithm outputs a projection matrix Π. Setting μ^:=Mean(ΠA~), and 𝟏 to be the all-ones vector (1,1,,1) we have that

μ^-μ2 =1mj=1m(Aj-ΠA~j)2
1mj=1m(Aj-ΠAj)2+1mj=1m(ΠAj-ΠA~j)2
1m𝟏(A-ΠA)2+1mi=1mΠ(Aj-A~j)2
1mA-ΠA+cqκδ. (51)

Next we make a crucial observation that if Π is good for A~ then it is also good for A and hence A-ΠA is small. This is formally established in Lemma 7.2. Applying the lemma on A and A~ with Π1=Π*, Π2=Π, κ1=κ, κ2=cqκ, and ε=4σm/A we get that

A-ΠA (ε+ε)A+8κδmε.

Substituting into (51) we get that

μ^-μ2 (ε+ε)Am+8κδε+cqκδ
(ε+ε)4σε+8κδε+cqκδ(by writing A in terms of ε)
O(cq)(σ+κδ)(1+1ε). (52)

Next, notice that by triangle inequality,

A A-C+C
(σ+μ)m.

Hence we get that

1ε=A4σm12(1+μσ)12.

Substituting into (52) we get that

μ^-μ2 O(cq)(σ+κδ)(1+(1+μσ)12) (53)
O(cq)(1+κδσ)max(σ,σμ). (54)

From the above we get the relative error guarantee of

μ^-μ2μ O(cq)(1+κδσ)max(σμ,σμ). (55)

Note: We would like to point out that for robust mean estimation, our analysis also shows that in step 2 of the algorithm above, we can replace Mean(ΠA~) with Mean(A~). This is because if the algorithm did not output Bad Input then A~-ΠA~2/m2σ and hence mean of A~ and that of ΠA~ will be close. However, in this case, the subspace spanned by the output vector, i.e., Mean(A) might not be robust and hence susceptible to test time perturbations.

RobustMean(A~,κ,σ2) Input: The corrupted data matrix A~n×m with columns A~i for i[m], and the upper bound κ on the robustness of the subspace Π*q2. 1. Run the algorithm from Figure 2 with τ=2σm,r=1, and κ. 2. If the algorithm outputs Bad Input then terminate and certify that the data has been poisoned. Otherwise let Π be the 1-dimensional subspace output by the algorithm. Return μ^=Mean(ΠA~).

Figure 3: Robust Mean Estimation.
Lemma 7.2.

Fix q2,δ>0,κ1. Let A and A~ be two n×m matrices, each representing m data points in n dimensions such that for every j[m], columns Aj and A~j are close, i.e., Aj-A~jqδ. Furthermore, assume that there exist projection matrices, Π1=vv and Π2=uu such that Π1q2κ1 and Π2q2κ2 and that A-Π1Aε1A and A~-Π2A~ε2A. Then, letting ε=ε1+ε2 and κ=κ1+κ2, it also holds that

A-Π2A O(ε+ε)A+8κδmε (56)
A~-Π1A~ O(ε+ε)A+8κδmε (57)
Proof.

We will show the desired bound on A-Π2A and by symmetry the same bound will also apply to A~-Π1A~. Notice that both Π1 and Π2 are projections onto one dimensional subspaces and a bound on q2 norm of the projection matrices implies that vq*κ1 and uq*κ2, where q* is such that 1/q+1/q*=1. Next, let Π be the projection matrix onto the subspace spanned by v and u. By triangle inequality we have that

A-Π2A A-ΠA+ΠA-Π2A
A-ΠA+ΠA-Π2A~+Π2A~-Π2A
A-ΠA+ΠA-ΠA~+ΠA~-Π2A~+Π2A~-Π2A. (58)

Recall the standard fact that if P1 and P2 are projection matrices on to subspaces S1 and S2 such that S1S2, then for any matrix B, P1BP2B. Using this we to get that

A-ΠA A-Π1Aε1A (59)

and,

Π2A~-ΠA~ Π2A~-A~+A~-ΠA~
2Π2A~-A~2ε2A. (60)

From the closeness of A and A~ and the robustness of Π2 we also know that

Π2A~-Π2A Π2q2δmκ2δm. (61)

Substituting (59), (60), and (61) into (58) we get that

A-Π2Aε1A+2ε2A+κ2δm+ΠA-ΠA~. (62)

Note that if ΠA-ΠA~κδm/ε, then we have the desired bound on A-Π2A. We now look at the case when ΠA-ΠA~>κδm/ε. Notice that Π is the union of robust subspaces and A-A~ has columns bounded in q norm. Hence, the only way ΠA-ΠA~ can be very large is if the q2 norm of the projection matrix of a union of two subspaces (Π) is much larger than the q2 norm of the projection matrices of individual subspaces (Π1 and Π2). For this to happen the two subspaces must be very close to each other and then we can bound A-Π2A in a different way. Formally, we have that

ΠA-ΠA~ =maxz:z=1Π(A-A~)z
=maxz:z=1j=1mzjΠ(Aj-A~j)
maxz:z=1j=1m|zj|Π(Aj-A~j)
Πq2δm. (63)

Next we establish an upper bound on Πq2 in terms of the distance between subspaces Π1=vv and Π2=uu. Suppose u-v=γ and that uv0 (otherwise we work with -u). We also know that vq*κ1 and uq*κ2. Now, Πq2 is the maximum q* norm of any unit vector in the span of v and u. We can write any such vector z as

z=α1v+α2v

where α12+α22=1 and v=u-(uv)vu-(uv)v. Next we have that

u-(uv)v2=1-(uv)2
1-uv=γ22.

Hence we get that for any z in the span of v and u

zq* vq*+vq*
κ1+2γ(uq*+vq*)
22γκ.

The above also establishes that

Πq222γκ.

Substituting into (63) we get that

ΠA-ΠA~22γκδm. (64)

Hence, if ΠA-ΠA~>κδm/ε we must have that

v-u=γ22ε. (65)

In this case we can bound A-Π2A as

A-Π2A A-Π1A+(Π1-Π2)A
ε1A+Π1-Π2A
εA+vv-uuA
=εA+12(v+u)(v-u)+12(v-u)(v+u)A
εA+2v-uA(by triangle inequality and the fact that v+u2)
εA+2γA
εA+42εA.

Tightness of the Guarantee in Theorem 7.1. We close out this section by showing that the dependence on σμ in our bound on mean estimation is necessary even information theoretically. In what follows A will be an n×m matrix with μ=Mean(A) such that Π*=μμ/μ2 has small norm, i.e., Π*2=κ. Furthermore, let C=μ𝟏 and define σ=A-C/m. We will prove the following.

Theorem 7.3.

Fix q=. Let M be the set of n×m matrices A with mean μ that satisfies μ[1,2], variance σ2 around the mean that satisfies σ(0,1/6], and the subspace spanned by μ being κ-robust. Call a perturbation A~ of AM of be valid if A-A~δ. Then, any algorithm that takes as input a valid perturbation A~ of a matrix AM and either certifies that the data is poisoned, i.e., A-A~>8σm or outputs an estimate μ^ of μ must incur an error of

μ-μ^Ω((1+κδσ)max(σ,σμ)),

where μ=Mean(A).

Proof.

We will establish the lower bound by constructing two matrices A and A~, both of which lie in the set , satisfy A-A~=δ for κδ=O(σ), but have means separated by Ω(max(σ,σμmax)), where μmax is the maximum 2 norm among Mean(A) and Mean(A~). In this case, given either A or A~ as input, any provably robust certification procedure cannot output Bad Input and must output an estimate μ^, thereby making Ω(max(σ,σμmax)) error on either A or A~. We next describe our construction.

For a k to be determined later, let μ1=(1/k,1/k,,1/k,0,0,,0). Hence, μ1 is a unit length sparse vector with μ11=k. We define the set of m points in A by generating i.i.d. points of the form μ1+g, where g is a mean zero Gaussian with variance 0 in the first k coordinates and variance σ2 in the other coordinates. Then it is a standard fact that with high probability A-μ1𝟏2σm and that Mean(A) will be σd/m=o(σ)-close to μ1. Next we define the set of points in A~ to be A~j=Aj+δsgn(μ1), where sgn(μ1) is a ±1 vector representing the sign of the corresponding coordinate of μ1. Here we can arbitrarily set sgn(0) to be +1. It is easy to see that Mean(A~) will be o(σ)-close to μ2=μ1+δsgn(μ1), and that A~-μ2𝟏2σm. Next notice that

μ22 =1+δ2n+δk
μ21 =k+δk.

By setting δk=3σ and δn=k we ensure that μ2[1,2] and μ212k. Hence we get that for the matrix Π=μ2μ2/μ22, Π22k. Hence, both A and A~ lie in the set with sparsity bound κ=2k. Furthermore, the fact that δk=3σ, ensures that κδ6σ. Finally, notice that the difference between two means is

μ1-μ2 =δn
=δk1/4
=3σ
=Ω((1+κδσ)max(σ,σμmax)).

We end this section by showing that via an (inefficient) algorithm one can get the same guarantee for mean estimation as in Theorem 7.1 without the need for certification.

Theorem 7.4 (Information Theoretic Upper Bound).

Let A be an n×m matrix representing m data points in n dimensions and let μ be the mean of the data points in the matrix A with C representing the n×m matrix with each column being μ. Let Π*=μμ/μ2 be the one dimensional subspace denoting the projection onto μ and assume that Π*q2κ, for some q2. Let A~ be the given input such that for every column j[m] we have Aj-A~jqδ. Furthermore, let σ2>0 be a given upper bound on the variance of the data around the mean, i.e., A-Cσm. Then there is an (inefficient) exponential time algorithm that takes A~ as input and outputs an estimate μ^ of the true mean μ such that

μ^-μ2 O(cq)(1+κδσ)max(σ,σμ),

where cq is a constant that depends on q. In particular, the above implies a relative error guarantee of

μ^-μ2μ O(cq)(1+κδσ)max(σμ,σμ).
Proof.

In order to establish the theorem above we first optimize (49) exactly. Let A be the matrix and Π be the projection that achieve the minimum of (49). Then we have that

A-ΠA A-Π*A

Furthermore, we also have that

A-Π*A A-C+C-Π*A
2A-C
2σm.

Hence both A and A have good projections onto rank-1 subspaces. Plugging into Lemma 7.2 we get that

A-ΠAO(ε+ε)A+8κδmε,

where ε=2σm/A. Hence, letting μ^=Mean(ΠA) we get from (51) that

μ-μ^ 1mA-ΠA+cqκδ
O(ε+ε)Am+8κδmε+cqκδ.

The rest of the argument proceeds exactly as in the Proof of Theorem 7.1 by writing A in terms of ε, as done in (52). ∎

7.2 Clustering

In this section we will use the guarantee from Theorem 6.4 to show how to perform clustering of a well-clustered instance when every data point in the instance could be corrupted. Our guarantees will apply to clustering a mixture of well separated Gaussians and more general data distributions. In particular, we will show that a robust modification of the popular Lloyd’s algorithm [lloyd1982least] (also known as the k-means algorithm) can be used to perform clustering in our model, thereby providing further evidence towards the widespread applicability of the algorithm. Existing guarantees for using Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] for clustering a mixture of Gaussians and general datasets assume that every pair of means μi,μj are separated by σk, where σ is the maximum variance of the dataset around the mean and k is the number of clusters (see (66) for the formal condition). However, our lower bound example from the previous section shows that such a separation condition is too weak for robust clustering in our noise model. If we view the datasets A and A~ in our lower bound instance as two target clusters, then they have the property that the means are separated by σ>Ω(σk), where k=2, and yet via perturbations we can reach one cluster from the other, making the clustering task meaningless. Hence, we need a separation condition that rules out this scenario. Furthermore, even if one is given the original optimal clustering of the given perturbed dataset, we know from previous section that one must incur a loss of Ω(σmax(1,μ/σ)) in simply estimating the true mean μ of a cluster. This suggests a separation condition of the type ασk, where α depends on (1+μmaxσ) and the guarantee to aim for is to estimate means upto error O(ασ) error. Here μmax is the maximum 2 norm of the any of the k mean vectors. In this section we will show that a modified Lloyd’s combined with our certification procedure can indeed achieve this guarantee or certify that the dataset has been poisoned.

More formally, we will assume that there is a set of m points in n with ground truth clustering C1*,C2*,,Ck*, and means μr=Mean(Cr*) for r[k] and μmax=maxrμr. Let A be the n×m data matrix and C be the matrix of corresponding centers. We will assume that we have an upper bound σ2 on the maximum variance of the data points around their mean, i.e. A-C2σ2m and define α=(1+κδσ)(1+μmaxσ)2/3. We will enforce the spectral stability condition studied in [kumar2010clustering] on our clustering instance. This condition implies that for each pair of clusters Cr*,Cs* with means μr,μs and each point xCr*, x¯ is closer to μr than to μs by a margin Δr,s. Here x¯ is the projection of x onto the line joining μr and μs. For a constant c>0, the c-spectral stability condition requires that for each rs,

Δr,scασk(m|Cr*|+m|Cs*|) (66)

Notice that the above also implies that every pair of means are separated i.e.,

μr-μscασk(m|Cr*|+m|Cs*|).

It is worth mentioning that in the typical analysis of Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] the dependence on α in the separation condition is not needed. However, as discussed before, in our noise model, some dependence on α is unavoidable to get a meaningful clustering guarantee.

Assumptions I: Fix q2. We will assume that we are given access to A~ such that for every j[m], Aj-A~jqδ. Furthermore, define κ to be the robustness, of the subspace spanned by the means μ1,,μk. Formally, let ΠC be the projection matrix for the orthogonal projection onto the space of the means. Then κ is such that ΠCq2κ. Under Assumptions I, we prove the following theorem that applies to any stable dataset as defined in (66).

Theorem 7.5.

Fix q2, and let cq be a constant that depends on q. Let A be a dataset that satisfies c-spectral stability for c>200cq. Under Assumptions I, there is a Lloyd’s style algorithm that takes A~ as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, A-A~=Ω(σm), or outputs a clustering C^1,C2^,,Ck^ and means μ1^,μ2^,,μk^ such that

r=1k|Cr*C^π(r)|O(cq2mkα2c2)
μr-μ^π(r)cqασm|Cr*|.

for an appropriately chosen bijection π:[k][k].

While the above theorem works for any data set that satisfies spectral stability, notice that it leads to a sub optimal mean estimation error of O(ασm/|Cr*|) for each cluster r. For example, when the clusters are balanced, this will lead to a guarantee of O(ασ)k. Next, we show that for data sets that additionally satisfy Gaussian type concentration, we can indeed get O(ασ) estimation error even when each data point is corrupted.

Assumptions II: Let A be a given dataset with optimal clustering C1*,C2*,,Ck*. We will assume that we are given A~ that satisfies Assumptions I. Furthermore, we will assume that Cr*n3 for each r[k] and that for any subset SCr* of points such that |S|>nlogn, we have that AS-CSσ|S|polylog(m,n). Here AS,CS are the matrices A and C restricted to the columns of the points in S. Additionally, we require a pointwise guarantee that for each r[k], and AiCr*, Ai-μr22σ2npolylog(m,n). It is easy to see that mpoly(m,1/wmin) samples generated from a mixture of Gaussians with maximum variance σ2 and minimum mixture weight wmin will, with high probability, satisfy the above assumptions. Under Assumptions II, we prove the following theorem that applies to any stable dataset as defined in (66).

Theorem 7.6.

Fix q2, and let cq be a constant that depends on q. Let A be a dataset that satisfies c-spectral stability for c>200cq. Under Assumptions II, there is a Lloyd’s style algorithm that takes A~ as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, A-A~=Ω(σm), or outputs a clustering C^1,C2^,,Ck^ and means μ1^,μ2^,,μk^ such that

r=1k|Cr*C^π(r)|O(cq2mk2α2c2)
μr-μ^π(r)O~(ασ).

for an appropriately chosen bijection π:[k][k], where we hide a polylogarithmic (in n,m) factor in the O~ notation.

As a corollary we get the following statement about robustly clustering a mixture of Gaussians.

Theorem 7.7.

Fix q2, and let cq be a constant that depends on q. Define M to be a distribution that is a mixture of k Gaussians, i.e., M:=r=1kwrN(μr,Σr). Furthermore, let Σrσ2I and define wmin=minrwr and μmax=maxrμr, α=(1+κδσ)(1+μmaxσ)2/3. Let A be a set poly(n,1/wmin) samples generated i.i.d. from the mixture. If the mixture if well separated, i.e, μr-μscασkpolylog(n/wmin)/wmin for c>200cq, and the means span a κ robust subspace, then given access to A~ such that A~j-Ajqδ, there is a Lloyd’s style algorithm that, w.h.p. runs in polynomial time, and either certifies that the data is poisoned, i.e., A-A~Ω(σm), or outputs a clustering C^1,C2^,,Ck^ and means μ1^,μ2^,,μk^ such that

r=1k|Cr*C^π(r)|O(cq2mk2α2c2)
μr-μ^π(r)O~(ασ).

for an appropriately chosen bijection π:[k][k].

Computing Good Initial Centers. The first step in establishing the above theorems is to compute centers/means that are close to the true ones. A common approach for this step is to use PCA to project the data onto the top-k subspace of the input data matrix, and run any constant factor approximation algorithm for k-means clustering [kumar2010clustering]. However this can be arbitrarily bad if the data is corrupted as in our model. We instead show that by projecting the data onto a robust subspace as output by our guarantee from Theorem 6.4 and then using a k-means approximation algorithm, we do indeed recover good centers. Our initialization algorithm is shown in Figure 4. We next provide proofs for our claims.

Input: The corrupted data matrix A~n×m with columns A~i for i[m], upper bound κ on the robustness of the subspace ΠCq2, upper bound σ2 on the data variance A-C2/m. 1. Run the algorithm from Figure 2 with τ=2σm,r=k, and κ. 2. If the algorithm outputs Bad Input then certify that the data has been poisoned and terminate. 3. If the algorithm outputs a projection matrix Π then project A~ onto Π. Denote A^=ΠA~ as the projected matrix. 4. Run a 10-approximation algorithm for k-means clustering on A^, and obtain k centers ν1,ν2,,νk. 5. Output ν1,ν2,,νk.

Figure 4: Computing initial center estimates.
Theorem 7.8.

Assume that the clustering instance A is c-stable for c>200cq. If Assumptions I hold, then the algorithm in Figure 4 w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., A-A~>2σm, or the algorithm outputs centers ν1,ν2,,νk such that for all r[k],

μr-νπ(r)30cqαkσm|Cr*|.

for an appropriately chosen bijection π.

Proof.

The proof will follow the general outline of Lemma 5.1 of [kumar2010clustering], except that we need to argue following two stronger conditions. Firstly, we need to establish that A~ projected on to Π has cost comparable to that of kσ2m. This will ensure that the approximation algorithm will output a clustering of low cost. Secondly, we also simultaneously need to establish that A~ when projected on to Π has low cost clustering when true means C are used to cluster it. Together with the fact that A~ and A are pointwise close in the projected space, we can then claim that missing out on a good approximation for even a single cluster center of C* will incur a cost of Ω(kσ2m), thereby contradicting the approximation guarantee of the k-means algorithm used in step 2.

Establishing that ΠA~ has low cost with respect to C boils down to showing that Π is good for A given that it is good for A~, a perturbation of A. This statement, established in Theorem 6.4 is the key in analyzing the initialization phase, and is a generalization of Lemma 7.2 to higher dimensional subspaces. Let’s first establish that A~ projected on to Π has a low cost clustering. We have

ΠA~-ΠCF 3kΠA~-ΠC(since both ΠA~ and ΠC have rank at most k)
3k(Π(A~-A)+Π(A-C))
3k(cqδκm+A-C)
23cqkσmcq(1+κδσ)12kσm. (67)

Here the third inequality follows from the fact that for any n×m matrix M, Mmaxm, where max is the maximum 2 norm of a column of M. Furthermore, from the robustness of Π we know that for any j[m],

Π(A~j-Aj)cqκδ.

Next, let’s establish that A-ΠA is small. By triangle inequality we know that

A-ΠCA A-C+C-ΠCA
=A-C+ΠC(C-A)
2A-C2σm.

Furthermore, from the guarantee of Theorem 6.4 we have that for any η(0,1),

A-ΠA O(1+1η)(2σm+A-ΠCA+κδm)+2ηA
O(1+1η)(4σ+κδ)m+2ηA.

Setting η=(5σm/A)2/3 we get that

A-ΠA 4cq(1+κδσ)σ(1+(Aσm)2/3)m
4cqασm. (68)

The last inequality above follows from the fact that

A A-C+C
σm+μmaxm.

Next notice that

ΠA~-CF 3kΠA~-C(since both ΠA~ and C have rank at most k)
3k(Π(A~-A)+ΠA-A+A-C)
3k(cqδκm+5cqασm+σm)
63kcqασm. (69)

Now we are ready to establish the claim of the theorem. From (67) we get that the centers ν1,ν2,,νk will have k-means cost at most 120k(1+κδσ)2cq2σ2m on A~. Furthermore, suppose that there exists a center μr such that every νs is far from it. For any point Ai, let νc(i) be the center in the set {ν1,ν2,,νk} that is closest to the projection of A~i on to Π. Then we have that

120kcq2(1+κδσ)2σ2mAiCrΠA~i-νc(i)2 =AiCrΠA~i-μr+μr-νc(i)2
12|Cr|μr-νc(i)2-AiCrΠA~i-μr2
12|Cr|μr-νc(i)2-ΠA~-CF2
450kα2cq2σ2m-ΠA~-CF2
>120kcq2α2σ2m. (70)

Noticing that α(1+κδσ), we get a contradiction to the fact that μr is far from every νs. This combined with the fact that the clustering instance is c-stable for c>200cq implies that one can find a bijection π:[k][k] between {μ1,,μk} and {ν1,,νk} such that each μi is close to a unique νπ(i). ∎

7.3 Analyzing Lloyd’s Updates

Next we will use the obtained initial centers and run the robust Lloyd’s algorithm starting with these centers as shown in Figure 5. Our goal in this section is to analyze the updates and establish Theorem 7.5 and Theorem 7.6.

Input: The corrupted data matrix A~n×m with columns A~i for i[m], upper bound κ on the robustness of the subspace ΠCq2, upper bound σ2 on the data variance A-C2/m. 1. Let Π be the robust k dimensional projection matrix as computed by the initialization algorithm in Figure 4. 2. Define initial center estimates ν1(0),ν2(0),,νk(0) to be the centers output by the algorithm in Figure 4. 3. For each r[k], define Sr={ΠA~i:ΠA~i-νrΠA~i-νs3,sr}. Update νr(1)=Mean(Sr). 4. For t=2,3, do: (a) For each r[k], compute Sr={A~i:ΠA~i-νr(t-1)ΠA~i-νs(t-1),sr}. (b) For each r[k], update νr(t)=RobustMean(Sr,κ,4σ) //If RobustMean outputs Bad Input then certify that data is poisoned.

Figure 5: Iterative Updates of the Lloyd’s Algorithm.

Overview of Analysis and Challenges. Our analysis of the modified Lloyd’s updates proceeds in two stages: a) a center improvement step, and b) analyzing robust Lloyd’s updates. In (a), we first improve the initial center estimates obtained form the initialization phase to get estimates ν1(1),,νk(1) such that each νr(1) is Δr/2-close to the corresponding μr, where Δr=40cqασm/|Cr*|. In other words, we get a factor k improvement over the initial estimates. This is sketched in step 3 of the algorithm in Figure 5. First, we motivate the need for this intermediate step, since it is not necessary in the analysis of Lloyd’s algorithm for uncorrupted data.

Just as in standard analysis of Lloyd’s updates, we would like to argue that if we have non-trivial estimates of the centers, as obtained from the initialization stage, forming clusters using these points and moving to the means of these clusters will improve the center estimates. To argue this we will crucially rely on the fact that when projected onto Π, A and A~ are close pointwise. Hence, we can come up with a charging argument to assign mistakes made by the current centers on the uncorrupted points to the mistakes made by the centers on the corrupted points. We can then bound the number of such mistakes by observing that on ΠA~, the true means have a small k-means cost. This forces us to work in the projected space Π, but as a result inherently limits the accuracy to which we can obtain center estimates. Notice that if the initialization algorithm outputs Π, then Π is guaranteed to be good overall for A~, in the sense that A~-ΠA~=O(σm). However, Π has no per cluster guarantee, and in general A~r-ΠA~r when restricted to a cluster Cr* could be as large as σm/|Cr*|. Hence, to achieve our goal of estimating the centers upto O~(ασ) accuracy, we also need to work outside of the projection Π at the same time. Due to these conflicting demands, notice that the Lloyd’s updates we analyze in step 4 of the algorithm in Figure 5 perform clustering using current centers in the projected space, but perform robust mean estimation on the original input data.

Furthermore, from our guarantee on robust mean estimation in Theorem 7.1, we know that in the RobustMean step of the algorithm the centers will be accurate upto ασS, where σS is the standard deviation of the uncorrupted data points in Sr around the uncorrupted mean of Sr. As a result we need a stronger argument that not only shows that we have low clustering error given the current estimates, but also helps us argue about the variance of the formed clusters Sr at each step. Such an argument (Lemma 7.9) is a main technical contribution in the analysis.

Unfortunately, the argument (Lemma 7.9) only kicks in when we have much better center estimates than the one provided by the initialization stage, thereby requiring an additional center improvement stage. To argue about the center improvement stage, we use a trick from [awasthi2012improved] and form sets Sr that correspond to points in ΠA~ that are significantly close to one of the centers νr(0) than any other center νs(0). Notice that these sets do not form a partitioning of the data. We then argue that any mistake made by this assignment must have also been made if one had used the true centers μ1,,μk, to cluster ΠA~. Using the fact that the true means have small k-means cost on ΠA~ we can bound the number of such mistakes and hence get sets Sr that have low error, thereby helping us show that the means of these sets will be much closer to the true centers. This is established in Theorem 7.10. The above arguments help us establish Theorem 7.5. We next state the key technical lemma for our analysis.

Lemma 7.9.

Let Π be the robust subspace computed in step (1) of the algorithm in Figure 5. For each cluster Cr* in the optimal clustering of A, define Δr=40cqασm/|Cr*|. Suppose we have center estimates {ν1,ν2,,νk} such that νr-μrβΔr, r[k], and some β<1. When using νis to cluster ΠA~, define Tr,s to be the set of points that are misclassified, w.r.t. the induced clustering on A, i.e., Tsr={i:AiCr* and ΠA~i-νsΠA~i-νr}. There exists a constant c1>0 depending on q such that if the clustering instance is c-stable for c>200cq then we have that |Tsr|c1β2σ2mkc2μr-μs2.

Proof.

Fix sr and let W be the subspace spanned by {μr,μs,νr,νs} with ΠW being the projection matrix for the orthogonal projection on to the subspace. Define A¯i to be the projection of Ai onto the line joining μr and μs. Since W contains μr,μs, this is also the same as the projection of ΠWAi on to the line joining μr and μs. Similarly, define A~¯i to be the projection of A~i on to the line joining μr and μs, and again this is the same as the projection of ΠWA~i on to the line joining μr and μs. We will crucially make use of the fact that

A~¯i-μs-A~¯i-μr Δr,s-O(κδ)Δr,s/2. (71)

The above holds since from c-stability we know that A¯i-μs-A¯i-μr