Abstract
Adversarial or test time robustness measures the susceptibility of a machinelearning system to small perturbations made to the input at test time. This hasattracted much interest on the empirical side, since many existing ML systemsperform poorly under imperceptible adversarial perturbations to the testinputs. On the other hand, our theoretical understanding of this phenomenon islimited, and has mostly focused on supervised learning tasks. In this work we study the problem of computing adversarially robustrepresentations of data. We formulate a natural extension of PrincipalComponent Analysis (PCA) where the goal is to find a low dimensional subspaceto represent the given data with minimum projection error, and that is inaddition robust to small perturbations measured in $\ell_q$ norm (say$q=\infty$). Unlike PCA which is solvable in polynomial time, our formulationis computationally intractable to optimize as it captures the wellstudiedsparse PCA objective. We show the following algorithmic and statistical results.  Polynomial time algorithms in the worstcase that achieve constant factorapproximations to the objective while only violating the robustness constraintby a constant factor.  We prove that our formulation (and algorithms) also enjoy significantstatistical benefits in terms of sample complexity over standard PCA on accountof a "regularization effect", that is formalized using the wellstudied spikedcovariance model.  Surprisingly, we show that our algorithmic techniques can also be maderobust to corruptions in the training data, in addition to yieldingrepresentations that are robust at test time! Here an adversary is allowed tocorrupt potentially every data point up to a specified amount in the $\ell_q$norm. We further apply these techniques for mean estimation and clusteringunder adversarial corruptions to the training data.
Quick Read (beta)
Adversarially Robust Low Dimensional Representations
Abstract
Adversarial or test time robustness measures the susceptibility of a machine learning system to small perturbations made to the input at test time. It is now well known that even when trained on high quality training data, many machine learning systems perform poorly under imperceptible adversarial perturbations to the test inputs [szegedy2013intriguing]. There is a large body of work proposing practical methods to make classifiers adversarially robust. On the other hand, our theoretical understanding of the phenomenon of adversarial robustness is limited, and has mostly focused on supervised learning tasks such as binary classification.
In this work we study the problem of computing adversarially robust representations of data. We formulate a natural extension of Principal Component Analysis (PCA) where the goal is to find a low dimensional linear subspace to represent the given data with minimum projection error (measured in Frobenius norm or spectral norm) and that is in addition robust. When adversarial perturbations are measured in ${\mathrm{\ell}}_{q}$ norm with $q\ge 2$ (say $q=\mathrm{\infty}$), the robustness constraint naturally corresponds to controlling the $q\to 2$ operator norm of the orthogonal projection matrix onto the subspace. Unlike PCA which is solvable in polynomial time, our formulation is computationally intractable to optimize; even when the subspace is onedimensional this captures the wellstudied sparse PCA objective.
We show the following algorithmic and statistical results.

•
Polynomial time algorithms in the worstcase that achieve constant factor approximations to the objective while only violating the robustness constraint by a constant factor. This holds for both the Frobenius norm error, and the spectral norm error objectives.

•
We prove that our formulation (and algorithms) also enjoy significant statistical benefits in terms of sample complexity over standard PCA on account of a “regularization effect” (analogous to sparsity), that is formalized using the wellstudied spiked covariance model.

•
Surprisingly, we show that our algorithmic techniques can also be made robust to corruptions in the training data, in addition to yielding representations that are robust at test time! Here an adversary is allowed to corrupt potentially every data point in ${\mathrm{\ell}}_{\mathrm{\infty}}$ (in general, ${\mathrm{\ell}}_{q}$ norm) up to a specified amount. We further use these techniques to solve two other fundamental unsupervised learning problems, namely mean estimation and clustering, under adversarial corruptions to the training data.
The above results further validate our model and indicate that the study of adversarially robust unsupervised learning could lead to insights that have wide applicability.
1 Introduction
Reliability and trustworthiness of machine learning systems is a key requirement for their secure adoption in day to day life. While robustness to errors of various forms like training data corruptions or data poisoning, drifting data distributions and partial observability are desirable, recently the phenomenon of adversarial robustness or test time robustness has posed a significant hurdle to the design of reliable learning systems. Szegedy et al. [szegedy2013intriguing] identified that learning algorithms like deep neural networks trained on even carefully curated, high quality datasets such as ImageNet [deng2009imagenet] are susceptible at test time to small adversarial corruptions to the test example. These adversarial corruptions are imperceptible to the human eye, yet cause the trained networks to produce an incorrect classification. This represents a potential security vulnerability for critical applications like self driving cars, and secure facial recognition systems.
On the theoretical front, our understanding of adversarial robustness is limited. A series of recent works have started to explore fundamental questions such as the sample complexity of adversarial learning and the inherent computational and statistical tradeoffs needed for achieving adversarial robustness [khim2018adversarial, yin2018rademacher, bubeck2018adversarial, bubeck2018adversarial2, attias2018improved, montasser2019vc], focusing almost exclusively on supervised learning. In contrast there is a substantial body of work formalizing and studying training data corruptions in computer science and statistics, spanning both supervised and unsupervised learning (see e.g., [huber2011robust, tukey1975mathematics, angluin1988learning, kearns1994toward, candes2011robust, diakonikolas2019robust, lai2016agnostic]).
The primary motivation of this work is to formulate and study adversarial robustness in the context of learning data representations and other basic unsupervised learning tasks. A representation given by a function $x\mapsto g(x)$ is considered adversarially robust if for any point $x\in {\mathbb{R}}^{n}$ and its perturbation ${x}^{\prime}$, it holds that if $x\approx {x}^{\prime}$, then $g(x)\approx g({x}^{\prime})$. Downstream learning tasks like classification and regression trained on such robust representations are more likely to have built in robustness. As we will see later, algorithms for learning adversarially robust representations may also have surprising implications for robustness to trainingtime perturbations. Furthermore, requiring a data representation to be adversarially robust is a natural notion of individual fairness [zemel2013learning, madras2018learning, oneto2019learning]. This leads to the following natural question: can we efficiently learn succinct representations of data that are robust to adversarial perturbations? We will formulate and study this question in the context of principal component analysis.
Principal component analysis (PCA) or lowrank approximations is the predominant tool for obtaining succinct data representations, and is used as a preprocessing primitive in almost every machine learning pipeline. Given data in a highdimensional space ${\mathbb{R}}^{n}$ represented by the columns of a matrix $A$, the goal is to find a subspace of dimension at most $r\ll n$ to represent the points, that minimizes the projection error (or reconstruction error) onto the subspace, formalized as
$$\underset{\mathrm{\Pi}\in \mathcal{P}}{\mathrm{min}}\mathit{\hspace{1em}}{\parallel {\mathrm{\Pi}}^{\u27c2}A\parallel}^{2}=\underset{\mathrm{\Pi}\in \mathcal{P}}{\mathrm{min}}{\parallel A\mathrm{\Pi}A\parallel}^{2},\text{where}\mathcal{P}\text{is the set of all orthogonal projection matrices of rank}\le r,$$  (1) 
where the matrix norm $\parallel \cdot \parallel $ is either the Frobenius norm or the spectral norm. The representation of each point $x\in {\mathbb{R}}^{n}$ corresponds to the projection $\mathrm{\Pi}x$ onto the $r$dimensional subspace given by $\mathrm{\Pi}$ (one can also represent the point as an $r$dimensional vector in terms of a basis for $\mathrm{\Pi}$). Following existing literature on adversarial robustness, we model an imperceptible adversarial perturbation ${x}^{\prime}$ of a point $x$ as one for which the ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm of the difference is small, i.e., ${\parallel x{x}^{\prime}\parallel}_{\mathrm{\infty}}\le \delta $, for some fixed $\delta >0$ (more generally, we also consider ${\mathrm{\ell}}_{q}$ norm where $q\ge 2$). Given an $r$dimensional subspace of ${\mathbb{R}}^{n}$ with projection matrix $\mathrm{\Pi}$, the adversarial robustness of $\mathrm{\Pi}$ to $\delta $perturbations in the ${\mathrm{\ell}}_{q}$ norm is then exactly captured by
$\underset{x,{x}^{\prime}:{\parallel x{x}^{\prime}\parallel}_{q}\le \delta}{sup}{\parallel \mathrm{\Pi}(x{x}^{\prime})\parallel}_{2}=\delta {\parallel \mathrm{\Pi}\parallel}_{q\to 2}.$  (2) 
Observe that $\kappa ={\parallel \mathrm{\Pi}\parallel}_{q\to 2}$ characterizes the robustness of the projection $\mathrm{\Pi}$ to perturbations in ${\mathrm{\ell}}_{q}$ norm around every point $x\in {\mathbb{R}}^{n}$. The distance between the projections of $x$ and a $\delta $perturbation ${x}^{\prime}$ of $x$ (in ${\mathrm{\ell}}_{q}$ norm) is upper bounded by $\kappa \delta $. On the other hand, around each point $x$ one can also realize a perturbation ${x}^{\prime}=x+z$ with ${\parallel z\parallel}_{q}\le \delta $ such that ${\parallel \mathrm{\Pi}x\mathrm{\Pi}{x}^{\prime}\parallel}_{2}=\kappa \delta $. We will call $\mathrm{\Pi}$ a $(\kappa ,q)$robust rank$r$ projection when $\mathrm{\Pi}$ is an orthogonal projection matrix of rank at most $r$ with ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa $; when the robustness parameter $\kappa $ and norm $q$ are understood, we will just call it a robust rank$r$ projection. Hence, our algorithmic goal given a data matrix $A\in {\mathbb{R}}^{n\times m}$ composed of $m$ points in ${\mathbb{R}}^{n}$, a robustness parameter $\kappa \ge 1$ and the norm $q\in [2,\mathrm{\infty}]$, is to find a robust lowrank projection with low error formalized by
$\underset{\mathrm{\Pi}}{\mathrm{min}}$  ${\parallel {\mathrm{\Pi}}^{\u27c2}A\parallel}^{2}=\underset{\mathrm{\Pi}}{\mathrm{min}}{\parallel A\mathrm{\Pi}A\parallel}^{2}$  (3)  
s.t.  $\mathrm{\Pi}\text{is an (orthogonal) projection matrix of rank at most}r,\text{and}{\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa .$  (4) 
We will be interested in two versions of the problem, depending on whether we measure the error in Frobenius norm or spectral norm. Recall that the top$r$ terms of the Singular Value Decomposition of $A$ simultaneously solves both of these problems in polynomial time, when there is no additional robustness constraint. We also remark that just as for the PCA objective (1), the above objective (3) can be equivalently rephrased as finding the best approximation among lowrank matrices, but among those with a “robust column space” (see Lemma 4.5).
Our optimization problem (3) while motivated by adversarial robustness, has rich connections to (and implications for) well studied problems like the sparse PCA problem [johnstone2009consistency]. Consider the setting when the perturbations are measured in ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm and rank $r=1$. The robustness constraint on the projection $\mathrm{\Pi}=v{v}^{\top}$ imposes an upper bound of $\kappa $ on the “analytic sparsity” of the direction $v\in {\mathbb{R}}^{n}$ (measured in terms of ratio of ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{2}$ norms). Hence in the special case of $r=1$,
$$\mathrm{min}{\parallel Av{v}^{\top}A\parallel}_{F}^{2}=\text{tr}(A{A}^{\top})\mathrm{max}{v}^{\top}A{A}^{\top}v\text{subject to}{\parallel v\parallel}_{1}\le \kappa ,\text{and}{\parallel v\parallel}_{2}=1.$$  (5) 
The complementary objective is the ${\mathrm{\ell}}_{1}$ version of the maximization sparse PCA objective,^{1}^{1} 1 It is within a factor $2$ of the ${\mathrm{\ell}}_{0}$ version where the constraint ${\parallel v\parallel}_{1}\le \kappa $ is replaced by ${\parallel v\parallel}_{0}\le {\kappa}^{2}$ (see Section 10.3.3 of [Vershynin]). which is notoriously hard in the worstcase [chan2016approximability].
For general $q\ge 2$, requiring robustness places a constraint on the dual ${\mathrm{\ell}}_{{q}^{*}}$ norm of the direction $v$. Moreover for general rank $r\ge 1$, ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}$ gives an upper bound on the ${\mathrm{\ell}}_{{q}^{*}}$ norm of every unit vector in the subspace given by $\mathrm{\Pi}$ (see Lemma 4.3 for the statement, and Lemma 4.4 for an approximate converse). As we will see in Section 2.1, we will design algorithms for finding approximately optimal lowdimensional representations that are indeed robust to testtime adversarial perturbations.
From a practical perspective it is desirable to have robustness to adversarial perturbations at both trainingtime and testtime. Surprisingly, we show that our algorithmic techniques can be used to achieve this! We study a strong model of trainingtime corruptions, where every coordinate of every point in the training data set can be adversarially perturbed up to a specified amount (in general the perturbation of every point is measured in ${\mathrm{\ell}}_{q}$ norm with $q\ge 2$). Apart from the natural motivation of being secure against poisoned data, this model of corruption naturally arises in emerging paradigms such as low precision computation [de2017understanding, de2018high], where only the few most significant bits of each data point is stored in memory, thereby allowing for processing of more training examples. While there have many exciting recent developments in designing efficient robust algorithms for highdimensional statistical problems, they deal with a fundamentally different model of corruption where a small fraction of the points (outliers) are corrupted arbitrarily [diakonikolas2019robust, lai2016agnostic, charikar2017learning, diakonikolas2018sever]. Our algorithms form a new versatile primitive for achieving robustness. We demonstrate this by applying them to get new trainingtime robustness guarantees for fundamental unsupervised learning tasks such as mean estimation and clustering.
2 Our Results
In all the results that follow, it will be instructive to think of the setting when $q=\mathrm{\infty}$ i.e., each point can be perturbed adversarially up to an amount $\delta $ in every coordinate. However, our results will apply for all $q\ge 2$ generally. Intuitively, the larger the choice of $q$, the more unrestricted the adversarial perturbations can be (since ${\parallel x\parallel}_{p}\le {\parallel x\parallel}_{q}$ when $p\ge q$).
2.1 Worstcase Approximations for Robust LowRank Projections.
We first consider the two variants of problem (3) i.e., the Frobenius norm error and spectral norm error versions in the worstcase setting. We give efficient algorithms that attain small constant factor approximation algorithms (approaching $2$ for the Frobenius norm error, and $3$ for the spectral norm error) when we are also allowed to relax the robustness constraint on the projection by a constant factor.
(Informal) Theorem 2.1.
There exist polynomial time algorithms that given a data matrix $A$, $q\mathrm{\ge}\mathrm{2}$ and a robustness parameter $\kappa $, find a projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ s.t.
$\forall \gamma \in (0,1],{\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le O(1/\sqrt{\gamma})\cdot \kappa ,\mathit{\text{and}}{\parallel A\widehat{\mathrm{\Pi}}A\parallel}^{2}\le \left(\alpha +\gamma \right)OPT,$  (6) 
where $O\mathit{}P\mathit{}T$ is the error obtained by the optimal $\mathrm{(}\kappa \mathrm{,}q\mathrm{)}$robust projection of rank at most $r$ ($\alpha \mathrm{=}\mathrm{2}$ for the Frobenius norm error objective and $\alpha \mathrm{=}\mathrm{3}$ for the spectral norm error objective). Moreover, for any $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}$, there exist polynomial time algorithms that find an ${r}^{\mathrm{\prime}}\mathrm{\le}r\mathit{}\mathrm{(}\mathrm{1}\mathrm{+}O\mathit{}\mathrm{(}{\gamma}^{\mathrm{}\mathrm{1}}\mathrm{)}\mathrm{)}$dimensional projection $\widehat{\mathrm{\Pi}}$ that gets an $\mathrm{(}\mathrm{1}\mathrm{+}\gamma \mathrm{)}$approximation to the objective, and relaxes the robustness parameter by $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}\sqrt{\gamma}\mathrm{)}$ factor.
The algorithms for both objectives – Frobenius norm and spectral norm, both use convex relaxations and use similar ideas. However the algorithms for these two different objectives are different (starting with the relaxations) unlike standard PCA. Please see Theorem 5.1 (Frobenius norm objective) and Theorem 5.7 (spectral norm objective) for the formal statements.
Observe that the approximation factor guarantee in Theorem 2.1 is independent of the desired rank $r$. We remark that our algorithmic guarantees are nontrivial even if we do not want to restrict the rank $r$; in particular when $r=n$ (unrestricted) our algorithm finds among all subspaces that are $O(\kappa )$robust, the one with approximately optimal error. The constant factor loss in the robustness parameter depends on the choice of the norm ${\mathrm{\ell}}_{q}$, and remains a constant for all $q\ge 2$. When $q=\mathrm{\infty}$, the constant factor that arises is related to the constant in a variant of the Grothendieck problem [alon2004approximating, nesterov1998semidefinite] (the constant is at most $\pi /2$). Hence the above theorem gives a $(O(1),O(1))$factor bicriteria approximation when we need an approximation of rank at most $r$; however we can also achieve an $((1+\gamma ),O(1/\sqrt{\gamma}))$ by incurring an extra loss in the rank.
Our result also has new implications for approximating the minimization objective for ${\mathrm{\ell}}_{1}$sparse PCA. Sparse PCA has been studied under average case models [BerthetRigollet], as well as worst case models [chan2016approximability]. In particular the special case of our result for $r=1$ provides a small constant factor bicriteria approximation to the minimization version of the sparse PCA problem under ${\mathrm{\ell}}_{1}$ constraint on the sparsity. This is in stark contrast to the approximability of the maximization version of the problem. Even in the special case when $r=1$, the best known polynomial time algorithm gives a $O({n}^{1/3})$ factor approximation in the worstcase (for both the ${\mathrm{\ell}}_{1}$ and ${\mathrm{\ell}}_{0}$ versions). This is true even when the sparsity can be relaxed by a $O(1)$ factor; moreover no constant factor approximation is possible assuming the Small Set Expansion conjecture [chan2016approximability]. See also Appendix B for computational hardness of (3) based on this observation. Furthermore, as we will see in a bit, the minimization version that we study and our constant factor approximations for this problem will also be crucial in various downstream applications in unsupervised learning.
2.2 Statistical Benefits of Robust Projections.
While the operator norm constraint arises naturally when we need robustness to small adversarial perturbations, it also induces a regularization effect due to the “sparsity” of vectors in the robust subspace given by $\mathrm{\Pi}$ (see Lemma 4.4). This brings with it additional statistical benefits, which we demonstrate by studying the classic highdimensional statistical problem of covariance estimation and recovery in the wellstudied spiked covariance model [johnstone2001distribution, aminiwainwright, berthet2013]. The spiked covariance model SCM$(\theta =({\theta}_{\mathrm{min}},{\theta}_{\mathrm{max}}),{\mathrm{\Sigma}}^{*})$ has two sets of parameters:the signal strength specified by the pair ${\theta}_{\mathrm{min}},{\theta}_{\mathrm{max}}$ with ${\theta}_{\mathrm{max}}\ge {\theta}_{\mathrm{min}}\ge 0$, and the unknown covariance matrix ${\mathrm{\Sigma}}^{*}$ of rank $r$ with its eigenvalues in the interview $[{\theta}_{\mathrm{min}},{\theta}_{\mathrm{max}}]$. The top$r$ eigenspace of ${\mathrm{\Sigma}}^{*}$ is given by the projection matrix ${\mathrm{\Pi}}^{*}\in {\mathbb{R}}^{n\times n}$ of rank $r$ that is $(\kappa ,q)$robust for $q\ge 2$. Each sample from the distribution is drawn independently from $n$variate Gaussian $N(0,I+{\mathrm{\Sigma}}^{*})$.
We will be interested in two different goals – detection and the stronger goal of recovery. The goal in detection will be to distinguish w.h.p. whether the sample data was generated by the spiked covariance model SCM($\theta ,{\mathrm{\Sigma}}^{*}$) or from a standard Gaussian $N(0,I)$. Our goal in recovery is to output an estimate $\mathrm{\Sigma}$ for ${\mathrm{\Sigma}}^{*}$, with error measured in terms of the Frobenius norm of $\mathrm{\Sigma}{\mathrm{\Sigma}}^{*}$. We first state the guarantee for $q=\mathrm{\infty}$ and discuss the case of general $q$ at the end of the section.
(Informal) Theorem 2.2.
Fix $q\mathrm{=}\mathrm{\infty}$ and let ${\mathrm{\Pi}}^{\mathrm{*}}$ be a $\mathrm{(}\kappa \mathrm{,}\mathrm{\infty}\mathrm{)}$robust projection of rank $r$ for subspace spanned by the nontrivial eigenvectors of the rank$r$ covariance matrix ${\mathrm{\Sigma}}^{\mathrm{*}}$. For any $\epsilon \mathrm{>}\mathrm{0}$, $O\mathit{}\mathrm{\left(}\frac{{\mathrm{(}\mathrm{1}\mathrm{+}{\theta}_{\mathrm{max}}\mathrm{)}}^{\mathrm{2}}}{{\theta}_{\mathrm{min}}^{\mathrm{2}}}\mathrm{\cdot}{r}^{\mathrm{2}}\mathit{}{\kappa}^{\mathrm{2}}\mathit{}\mathrm{log}\mathit{}n\mathrm{/}{\epsilon}^{\mathrm{2}}\mathrm{\right)}$ samples from SCM$\mathrm{(}{\theta}_{\mathrm{min}}\mathrm{,}{\theta}_{\mathrm{max}}\mathrm{,}{\mathrm{\Sigma}}^{\mathrm{*}}\mathrm{)}$ suffice to recover a covariance matrix $\mathrm{\Sigma}$ having its top$r$ eigenspace given by a $\mathrm{(}\kappa \mathrm{,}\mathrm{\infty}\mathrm{)}$robust rank$r$ projection $\mathrm{\Pi}$ such that w.h.p., ${\mathrm{\parallel}\mathrm{\Sigma}\mathrm{}{\mathrm{\Sigma}}^{\mathrm{*}}\mathrm{\parallel}}_{F}^{\mathrm{2}}\mathrm{\le}\epsilon $. At the same time, $O\mathit{}\mathrm{\left(}\frac{{\mathrm{(}\mathrm{1}\mathrm{+}{\theta}_{\mathrm{max}}\mathrm{)}}^{\mathrm{2}}}{{\theta}_{\mathrm{min}}^{\mathrm{2}}}\mathrm{\cdot}{\kappa}^{\mathrm{2}}\mathit{}\mathrm{log}\mathit{}n\mathrm{/}{\epsilon}^{\mathrm{2}}\mathrm{\right)}$ samples suffice for detection.
Please see Theorem 8.3 and Theorem 8.5 for the formal statements. We also prove our upper bounds are optimal up to $O(\mathrm{log}n)$ factors for $q=\mathrm{\infty}$ (see Theorem 8.6 for the lower bound). Note that for the task of detection, we actually save the additional ${r}^{2}$ factor in the sample complexity compared to recovery.
The above bound shows that when $r\kappa \ll \sqrt{n}$, we get significant statistical benefits compared to traditional covariance estimation in highdimensions, which requires $\mathrm{\Omega}(n)$ samples even for recovering the principal component ($r=1$) up to good accuracy. This sharply contrasts with our current understanding of adversarial robustness in classification problems where current evidence suggests that the sample complexity for learning robust classifiers may be significantly higher compared to the nonrobust case [schmidt2018adversarially, yin2018rademacher, montasser2019vc].
We remark that sample complexity upper bounds are known in the spiked covariance model, when ${\mathrm{\Sigma}}^{*}$ (or the subspace that corresponds to it) satisfies some sparsity constraints [aminiwainwright, VuLei13, Liuetal]. Our results are generally incomparable since we parameterize by the $\mathrm{\infty}\to 2$ operator norm of ${\mathrm{\Pi}}^{*}$ which is naturally motivated from robustness considerations. However somewhat surprisingly, our analysis often matches or in fact improves upon the sample complexity upper bound even in the other parameterization, in certain regimes. See Remark 8.4 for details.
The estimation algorithm in Theorem 2.2 is computationally inefficient. We now show how we can simultaneously achieve computational efficiency and statistical efficiency in finding an estimate $\widehat{\mathrm{\Sigma}}$ of ${\mathrm{\Sigma}}^{*}$; moreover the top$r$ eigenspace of $\widehat{\mathrm{\Sigma}}$ has a robust projection matrix $\widehat{\mathrm{\Pi}}$ that is also close to ${\mathrm{\Pi}}^{*}$.
(Informal) Theorem 2.3.
There is a polynomial time algorithm that given $m$ samples in ${\mathrm{R}}^{n}$ drawn i.i.d. from the spiked covariance model SCM(${\theta}_{\mathrm{min}}\mathrm{,}{\theta}_{\mathrm{max}}\mathrm{,}{\mathrm{\Sigma}}^{\mathrm{*}}$) with the top$r$ eigenspace of ${\mathrm{\Sigma}}^{\mathrm{*}}$ given by a $\mathrm{(}\kappa \mathrm{,}q\mathrm{)}$robust rank$r$ projection ${\mathrm{\Pi}}^{\mathrm{*}}$, finds w.h.p. a rank$r$ covariance matrix $\widehat{\mathrm{\Sigma}}$ with its top$r$ eigenspace being an $\mathrm{(}O\mathit{}\mathrm{(}\kappa \mathrm{)}\mathrm{,}q\mathrm{)}$robust rank$r$ projection $\widehat{\mathrm{\Pi}}$ such that ${\mathrm{\parallel}\widehat{\mathrm{\Sigma}}\mathrm{}{\mathrm{\Sigma}}^{\mathrm{*}}\mathrm{\parallel}}_{F}^{\mathrm{2}}\mathrm{\le}\epsilon $, as long as $m\mathrm{\ge}\frac{c\mathit{}{\mathrm{(}\mathrm{1}\mathrm{+}{\theta}_{\mathrm{max}}\mathrm{)}}^{\mathrm{2}}}{{\theta}_{\mathrm{min}}^{\mathrm{2}}}\mathrm{\cdot}{r}^{\mathrm{2}}\mathit{}{\kappa}^{\mathrm{4}}\mathit{}\mathrm{log}\mathit{}n\mathrm{/}{\epsilon}^{\mathrm{2}}$ for some universal constant $c\mathrm{>}\mathrm{0}$. Moreover $m\mathrm{\ge}\frac{c\mathit{}{\mathrm{(}\mathrm{1}\mathrm{+}{\theta}_{\mathrm{max}}\mathrm{)}}^{\mathrm{2}}}{{\theta}_{\mathrm{min}}^{\mathrm{2}}}\mathrm{\cdot}{\kappa}^{\mathrm{4}}\mathit{}\mathrm{log}\mathit{}n\mathrm{/}{\epsilon}^{\mathrm{2}}$ suffices for the detection problem w.h.p.
Please see Theorems 8.8 and 8.9 for the formal statements. We remark that even for $r=1$, the dependence of ${\kappa}^{4}$ is necessary assuming the Planted Clique assumption, due to known lower bounds from sparse PCA [BerthetRigollet]. This points to an interesting statistical vs computational tradeoff that would be interesting to explore in the other regimes of parameters.
We also generalize our results to $$. We extend the information upper bound in Theorem 2.2 to show $O\left(\frac{{(1+{\theta}_{\mathrm{max}})}^{2}}{{\theta}_{\mathrm{min}}^{2}}\cdot {r}^{2}{\kappa}^{2}\cdot q{n}^{2/q}\mathrm{log}n/{\epsilon}^{2}\right)$ random samples suffice for recovery statistically. On the other hand, we provide an informationtheoretic lower bound showing that this dependence in terms of $n,\kappa ,r$ is almost tight for $q\in [2,\mathrm{\infty})$ (in Theorem 8.7). In particular, we prove that for $q\in [2,\mathrm{\infty})$, the sample complexity incurs a polynomial dependence of ${n}^{2/q}$. Finally, we remark that our polynomial time algorithm in Theorem 2.3 also works for $q\in [2,\mathrm{\infty})$ with an extra factor of ${n}^{4/q}$ (instead of ${n}^{2/q}$) in the sample complexity.
2.3 Robustness to Adversarial Errors during Training
Surprisingly, the notion of testtime robustness and the algorithms that we have developed (Theorem 2.1) also allow us to handle robustness to adversarial perturbations in the training data set (this is often referred to as data poisoning). We propose a corruption model under which every training sample ${A}_{i}\in {\mathbb{R}}^{n}$ can potentially be corrupted adversarially up to a $\delta $ amount, as measured in ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm (more generally ${\mathrm{\ell}}_{q}$ norm for $q\ge 2$). Our input instance is $\stackrel{~}{A}\in {\mathbb{R}}^{n\times m}$ where every column $i\in [m]$ satisfies ${\parallel {\stackrel{~}{A}}_{i}{A}_{i}\parallel}_{q}\le \delta $; we will refer to such an $\stackrel{~}{A}$ as a $\delta $corrupted instance of $A$. Our goal now is to recover a robust lowrank projection for the uncorrupted matrix $A$. We will show that we can in fact output a robust lowrank projection $\widehat{\mathrm{\Pi}}$ that is competitive with the best robust lowrank projection of $A$, even though $A$ is not known to us!
(Informal) Theorem 2.4.
Suppose $q\mathrm{\ge}\mathrm{2}$ and $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ is the unknown uncorrupted data matrix, with a $\mathrm{(}\kappa \mathrm{,}q\mathrm{)}$robust projection matrix ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ satisfying ${\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathit{}A\mathrm{\parallel}}_{F}^{\mathrm{2}}\mathrm{\le}\epsilon \mathit{}{\mathrm{\parallel}A\mathrm{\parallel}}_{F}^{\mathrm{2}}$ for some $\epsilon \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}$. There exists a polynomial time algorithm that given as input a $\delta $corrupted instance $\stackrel{\mathrm{~}}{A}$ of $A$ outputs an $r$dimensional projection $\widehat{\mathrm{\Pi}}$ that is approximately optimal
$\forall \eta >0,{\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le O(\kappa ),\mathit{\text{and}}{\parallel A\widehat{\mathrm{\Pi}}A\parallel}_{F}^{2}\le O(\epsilon +\eta )\cdot {\parallel A\parallel}_{F}^{2}+O(\frac{1}{\eta})\cdot {\delta}^{2}{\kappa}^{2}m.$  (7) 
In particular this gives an $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{)}$ approximation when $$.
To interpret the results, let $q=\mathrm{\infty}$ and consider an uncorrupted dataset $A$ where every column (sample) is a unit vector in ${\mathbb{R}}^{n}$, and let $\kappa \ll \sqrt{n}$ be a small polynomial of $n$, say $\kappa ={n}^{0.1}$. When $\delta =o({n}^{1/2})$ i.e., every coordinate can be perturbed up to $\delta \ll 1/\sqrt{n}$, the total corruption to each point is at most $o(1)$ in Euclidean norm; in this case one would expect that standard PCA applied to $\stackrel{~}{A}$ will approximately recovery a good lowrank approximation. The above Theorem 2.4 on the other hand guarantees to find a good (robust) lowrank approximation for the unknown matrix $A$ even when $\delta =o(1/\kappa )=o({n}^{0.1})$. Note that in this setting every point can be completely overwhelmed by the adversarial noise (in Euclidean norm). Furthermore, the above guarantees are optimal up to constant factors; in particular, Proposition 6.7 shows that the additive factor of $O(m{\delta}^{2}{\kappa}^{2})$ is unavoidable for every $\kappa ,\delta =O(1/\kappa )$. These results together suggest that our notion of robust projections (measured in $q\to 2$ operator norm) is key in understanding the robustness to small adversarial perturbations of every point during training as well.
Our guarantees for spectral norm error in the presence of trainingtime adversarial perturbations are somewhat similar to Theorem 5.1. However, there is a qualitative difference: we will either find a robust lowdimensional projection of the unknown dataset $A$, or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a lowdimensional representation that is bad for the unknown data matrix $A$. In what follows $\parallel \cdot \parallel $ will refer to the spectral norm.
(Informal) Theorem 2.5.
Fix $q\mathrm{\ge}\mathrm{2}$ and $\epsilon \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$. Let $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ be the unknown uncorrupted data matrix with a $\mathrm{(}\kappa \mathrm{,}q\mathrm{)}$robust projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ with $\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathit{}A\mathrm{\parallel}\mathrm{\le}\epsilon \mathit{}\mathrm{\parallel}A\mathrm{\parallel}$. There exists a polynomial time algorithm that given as input a $\delta $perturbed instance $\stackrel{\mathrm{~}}{A}$, outputs either a robust projection $\widehat{\mathrm{\Pi}}$ of rank at most $r$ with
$${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}=O(\kappa ),\mathit{\text{and}}\parallel A\widehat{\mathrm{\Pi}}A\parallel \le O\left(\sqrt{\epsilon}\parallel A\parallel +\sqrt{m}\delta \kappa /\sqrt{\epsilon}\right),$$  (8) 
or certifies that the data has been poisoned i.e., $\mathrm{\parallel}\stackrel{\mathrm{~}}{A}\mathrm{}A\mathrm{\parallel}\mathrm{>}\epsilon \mathit{}\mathrm{\parallel}A\mathrm{\parallel}$.
Please see Theorem 6.4 for a formal statement. Here again, the additive term of $\mathrm{\Omega}(\sqrt{m}\delta \kappa )$ is unavoidable as shown in Proposition 6.7. We remark that informationtheoretically we can design an estimator (that is computationally inefficient) that achieves the stronger qualitative guarantees as in Theorem 2.4. Designing a computationally efficient algorithm to do the same is a tantalising open question that we describe in more detail in Section 2.5.2. Finally, we would like to point out that our algorithms output $(O(\kappa ),q)$robust projections, and as a result we also get robustness to testtime perturbations.
Comparison to Robust PCA. The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix $\stackrel{~}{A}$ is a sum of two matrices, the true matrix $A$ that is lowrank and a sparse corruption matrix $S$ with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of $A$ is spread out – else recovery of $A$ is impossible here. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of $A$) could be corrupted up to some specified magnitude $\delta $. Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of $A$). Hence we believe that our algorithms could give a new robust primitive that is fundamentally different from existing techniques like robust PCA, for downstream applications in ML.
2.4 Applications to Unsupervised Learning.
As a concrete application of our guarantees from the previous section we study mean estimation and clustering under adversarial perturbations to potentially every data point in the training set. For mean estimation, we assume that the uncorrupted data matrix $A\in {\mathbb{R}}^{n\times m}$ has mean vector denoted by $\mu :=\mathrm{\text{Mean}}(A)$. The guarantee below shows that in order to recover a good approximation to $\mu $, the amount of corruption (the $\delta $ parameter) that can be tolerated is directly related to the inherent robustness of the one dimensional subspace spanned by $\mu $. Please see Theorem 7.1 for a formal statement.
(Informal) Theorem 2.6.
[Robust Mean Estimation] Fix $q\mathrm{\ge}\mathrm{2}$ and consider $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ with $\mu \mathrm{=}\mathrm{\text{Mean}}\mathit{}\mathrm{(}A\mathrm{)}$. Let $C\mathrm{=}\mu \mathit{}{\mathrm{1}}^{\mathrm{\top}}$ represent an $n\mathrm{\times}m$ matrix with copies of $\mu $, and $\sigma $ satisfy $\mathrm{\parallel}A\mathrm{}C\mathrm{\parallel}\mathrm{\le}\sigma \mathit{}\sqrt{m}$. Let ${\mathrm{\Pi}}^{\mathrm{*}}\mathrm{=}\mu \mathit{}{\mu}^{\mathrm{\top}}\mathrm{/}{\mathrm{\parallel}\mu \mathrm{\parallel}}^{\mathrm{2}}$ be the one dimensional subspace denoting the projection onto $\mu $ such that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $. There exists a polynomial time algorithm that given as input a $\delta $corrupted instance $\stackrel{\mathrm{~}}{A}$ of $A$, either certifies that the data has been poisoned, i.e., $\mathrm{\parallel}\stackrel{\mathrm{~}}{A}\mathrm{}A\mathrm{\parallel}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or outputs an estimate $\widehat{\mu}$ such that ${\mathrm{\parallel}\widehat{\mu}\mathrm{}\mu \mathrm{\parallel}}_{\mathrm{2}}\mathrm{\le}O\mathit{}\mathrm{(}\mathrm{(}\mathrm{1}\mathrm{+}\frac{\kappa \mathit{}\delta}{\sigma}\mathrm{)}\mathit{}\mathrm{max}\mathit{}\mathrm{\{}\sigma \mathrm{,}\sqrt{\sigma}\mu \mathrm{\parallel}\mathrm{\parallel}\mathrm{missing}\mathrm{\}}\mathrm{)}$.
Equivalently, the above implies a relative error guarantee of $O((1+\frac{\kappa \delta}{\sigma})\mathrm{max}\{\sigma /\parallel \mu \parallel ,\sqrt{\sigma /\parallel \mu \parallel}\})$. Here $\sigma /\parallel \mu \parallel $ captures the noisetosignal ratio. We make a few remarks about the above claim. When $\kappa \delta =O(\sigma )$, we get a relative error guarantee of $O(\mathrm{max}\{\sigma /\parallel \mu \parallel ,\sqrt{\sigma /\parallel \mu \parallel}\})$. Consider the case of $q=\mathrm{\infty}$, and $\mu $ being a $k={\kappa}^{2}$sparse (in the ${\mathrm{\ell}}_{0}$ sense) unit length vector. Such estimation problems concerning sparse mean vectors arise naturally in many applications and have been widely studied in statistics [donoho1992maximum, donoho1994minimax, johnstone1994minimax]. When every point could be corrupted, our guarantee allows for an adversary to add $\sim \sigma /\sqrt{k}$ perturbation per coordinate, as opposed to $\sim \sigma /\sqrt{n}$. Notice that our mean estimation guarantee has a dependence on $\parallel \mu \parallel $. In Theorem 7.3 we show that in our model of corruption, this dependence is unavoidable and hence the error in our mean estimation bound is information theoretically optimal upto constant factors.
Next, we show how the robust mean estimation procedure, along with our guarantees from the previous section, can be used to perform robust clustering of well clustered instances. In particular, we modify the popular Lloyd’s algorithm [lloyd1982least] to get robustness to perturbations of every data point. As is done in the standard Lloyd’s analysis, we study the spectral stability condition proposed by Kumar and Kannan [kumar2010clustering], and studied later in [awasthi2012improved, cohen2017local]. This condition captures, as special cases, mixtures of well separated Gaussians and other distributions, and helps in developing a unified analysis of Lloyd’s updates. Below we state our results assuming equal cluster sizes. Our results hold for the more general case and we defer to Section 7.2 for more details. Let $A\in {\mathbb{R}}^{n\times m}$ be clustered into $k$ clusters of equal sizes with means ${\mu}_{1},{\mu}_{2},\mathrm{\dots},{\mu}_{k}$. Furthermore, let $C\in {\mathbb{R}}^{n\times m}$ be the matrix of corresponding centers for each column of $A$ and let $\sigma $ be such that $\parallel AC\parallel \le \sigma \sqrt{m}$. Then $A$ satisfies $c$spectral stability if for each pair of optimal clusters, say, cluster $r$ and $s$ with means ${\mu}_{r}$ and ${\mu}_{s}$, any point in cluster $r$, when projected onto the line joining ${\mu}_{r}$ and ${\mu}_{s}$ is closer to ${\mu}_{r}$ than ${\mu}_{s}$ by an additive amount of ${\mathrm{\Delta}}_{r,s}:=c\alpha k\cdot \sigma $. Here $\alpha $ is a quantity that captures the signaltonoise ratio and the relative perturbation magnitude, i.e. $(1+\frac{\kappa \delta}{\sigma})$, as in robust mean estimation^{2}^{2} 2 Note that unlike clustering without corruptions, the dependence on $\alpha $ is unavoidable in our model even for $k=1$, i.e., robust mean estimation.. In the special case when $A$ is a set of $m=\mathrm{poly}(n,k)$ points drawn i.i.d. from a mixture of Gaussians with the variance of each Gaussian being bounded by ${\sigma}^{2}$, and with uniform mixture weight $1/k$ each, the separation condition becomes ${\mathrm{\Delta}}_{r,s}=c\alpha k\cdot \text{polylog}(nk)\cdot \sigma $. In the theorem below we denote $\kappa $ to be the robustness, as measured in $\parallel {\parallel}_{q\to 2}$, of the subspace spanned by the true means $\{{\mu}_{1},{\mu}_{2},\mathrm{\dots},{\mu}_{k}\}$.
(Informal) Theorem 2.7.
[Robust Clustering] Fix $q\mathrm{\ge}\mathrm{2}$, and let ${c}_{q}$ be a constant that depends on $q$. Let $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ satisfy $c$spectral stability, for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$. Then given as input a $\delta $corrupted instance $\stackrel{\mathrm{~}}{A}$ of $A$, there is a Lloyd’s style algorithm that either certifies that the dataset is poisoned, i.e, $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or recovers each mean ${\mu}_{r}$ up to error $O\mathit{}\mathrm{(}\alpha \mathit{}\sqrt{k}\mathit{}\sigma \mathrm{)}$. Using the computed centers to cluster $\stackrel{\mathrm{~}}{A}$, we obtain a clustering of $\stackrel{\mathrm{~}}{A}$ such that the corresponding induced clustering on $A$ that misclassifies $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}k\mathrm{)}$ fraction of the points.
In the special case of a mixture of Gaussians with equal mixing weights we get the means upto error $\stackrel{\mathrm{~}}{O}\mathit{}\mathrm{(}\alpha \mathit{}\sigma \mathrm{)}$, where we hide a $\text{\mathit{p}\mathit{o}\mathit{l}\mathit{y}\mathit{l}\mathit{o}\mathit{g}}\mathit{}\mathrm{(}m\mathrm{,}n\mathrm{)}$ factor in the $\stackrel{\mathrm{~}}{O}$ notation. This implies $O\mathit{}\mathrm{(}\mathrm{1}\mathrm{/}{k}^{\mathrm{2}}\mathrm{)}$fraction clustering error.
See Theorem 7.5 and Theorem 7.7 for formal statements that also handle more general cluster sizes and mixing weights.
Finally, as in Section 2.3 we can also prove that information theoretically there is an algorithm (though computationally inefficient) that performs robust mean estimation up to the (near optimal) error of $O((1+\frac{\kappa \delta}{\sigma})\mathrm{max}(\sigma ,\sqrt{\sigma \parallel \mu \parallel}))$ on all instances (it will estimate the uncorrupted mean even for instances that are certified to have been poisoned). See Theorem 7.4 for the formal statement. As a result, we can also cluster wellclustered instances informationtheoretically up to the claimed error above, without the need for certification (See Section 7 for details). Whether this can be achieved in polynomial time is an open question.
2.5 Overview of Challenges and Techniques
We now give a flavor of the technical ideas involved in obtaining our results. For the sake of exposition, we will mainly restrict our attention to the case when the adversarial perturbations are measured in ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm.
2.5.1 Worstcase Constant Factor Approximations
Let us first consider the version of problem (3) of finding a robust rank$r$ projection that has small error measured in Frobenius norm. A natural mathematical programming relaxation is the following:
$\underset{X}{\mathrm{min}}{\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},X\u27e9$  (9)  
subject to  $\text{tr}(X)\le r,\mathrm{\hspace{0.25em}\hspace{0.25em}0}\u2aafX\u2aafI$  
$\text{and}$  ${\parallel X\parallel}_{\mathrm{\infty}\to 2}\le \kappa $  (10) 
This is a valid relaxation for the problem since the constraints are all satisfied by any rank$r$ projection matrix that is robust i.e, ${\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le \kappa $. Moreover this relaxation is convex; the last constraint in particular is an upper bound on a valid norm. Let ${X}^{*}$ be the optimal solution of this program. The first challenge however is that the operator norm constraint (10) is NPhard to verify efficiently. In general, these operator norm ${\parallel \cdot \parallel}_{q\to p}$ computation problems are APXhard for most values of $p,q$. They form a rich class of problems related to the Grothendieck problem [alon2004approximating, nesterov1998semidefinite], and we can use polynomial time $O(1)$ factor approximations that are known for general $q\to 2$ norms with $q\ge 2$ (see Section 4.1).
The bigger challenge here is in producing a projection matrix from ${X}^{*}$ that (a) achieves a good objective value (b) has rank at most $r$, and (c) is $O(\kappa )$robust (bounded $\mathrm{\infty}\to 2$ operator norm). A natural approach for producing a good lowrank solution is to output a rank$r$ projection matrix ${\mathrm{\Pi}}_{r}$, by considering the large singular values of ${X}^{*}$ (e.g., the top$r$ singular vector space of ${X}^{*}$). However we have no control on the robustness of the subspace ${\parallel {\mathrm{\Pi}}_{r}\parallel}_{\mathrm{\infty}\to 2}$. In fact, problem (3) is challenging even without the rank constraint on the subspace (i.e., for $r=n$). The main issue is to relate the $\mathrm{\infty}\to 2$ operator norm of the projection matrix we output to that of the relaxation solution ${\parallel {X}^{*}\parallel}_{\mathrm{\infty}\to 2}$ which is upper bounded by $\kappa $.
Our crucial insight is that we can indeed design such a rounding scheme that achieves all three goals if the norm in the constraint (10) is a monotone norm! A general matrix norm $\u2af4\cdot \u2af4$ is monotone iff
$$\u2af4A+B\u2af4\ge \u2af4A\u2af4,\text{for any pair of PSD matrices}A,B.$$ 
(See Definition 4.6 for details.) This monotonicity property allows us to truncate terms in the eigendecomposition without any loss in robustness $\kappa $, and get fine control on the robustness $\kappa $ when we rescale different PSD terms slightly. Unfortunately however, the $\mathrm{\infty}\to 2$ operator norm is not monotone (see Claim A.2; see also Claim A.1 for counterexamples for various other matrix norms that are not unitarilyinvariant).
Our next important observation is that we can replace the constraint (10) by a similar constraint in terms of the $\mathrm{\infty}\to 1$ norm (more generally $q\to 2$ norm in terms of the $q\to {q}^{*}$ norm). This is because for any matrix $B$, we have that ${\parallel B\parallel}_{q\to 2}^{2}={\parallel B{B}^{\top}\parallel}_{q\to {q}^{*}}$ where ${\mathrm{\ell}}_{{q}^{*}}$ is the dual norm for ${\mathrm{\ell}}_{q}$ and satisfies $\frac{1}{{q}^{*}}+\frac{1}{q}=1$. The main advantage of this reformulation is that the $q\to {q}^{*}$ operator norms are indeed monotone (see Lemma 5.6 for a simple proof). Moreover these norms can be efficiently separated (within an $O(1)$ slack factor) when $q\ge 2$ since ${\parallel M\parallel}_{\mathrm{\infty}\to 1}={\mathrm{max}}_{x,y\in {\{\pm 1\}}^{n}}\u27e8M,x{y}^{\top}\u27e9$.
Our final algorithm approximately solves the mathematical program (9) with constraint (10) replaced by ${\parallel X\parallel}_{\mathrm{\infty}\to 1}\le {\kappa}^{2}$. We produce a lowrank projection matrix by first truncating the solution to the program $\widehat{X}$ to the large singular values, and then picking the $r$ terms that contribute the most towards the objective. Our analysis leverages the monotonicity property to prove $O(\kappa )$robustness, while also achieving an $O(1)$ approximation to the objective. See Theorem 5.3 for an analysis with a general monotone norm constraint, and Theorem 5.1 for our desired problem.
Similar ideas can be used when the error is measured in terms of the spectral norm instead of the Frobenius norm. The objective function in the mathematical relaxation (9) is instead rephrased as $\mathrm{min}\parallel ({A}^{\top}(IX)A\parallel $ where $\parallel \cdot \parallel $ is the spectral norm. Theorem 5.7 gives a slightly different rounding and analysis that again leverages the monotonicity property of $q\to {q}^{*}$ operator norms.
2.5.2 Robustness to TrainingTime Adversarial Perturbations (Data Poisoning)
Let $q=\mathrm{\infty}$. Recall that our input instance $\stackrel{~}{A}$ is a $\delta $corrupted instance obtained from $A$ by potentially corrupting every entry of it. Our goal is to output a robust lowrank projection matrix $\mathrm{\Pi}$ of rank at most $r$ for the uncorrupted matrix $A$, that is not known to us. This question is interesting even from a purely statistical standpoint; but additionally, we would also like our algorithm to run in polynomial time.
Why should this be possible? Suppose the uncorrupted matrix $A$ has a robust lowrank projection ${\mathrm{\Pi}}^{*}$ of small error i.e., $$ (where $\parallel A\parallel $ is either the Frobenius norm or spectral norm). Also assume for just this discussion that the average column (Euclidean) length of $A$ is 1 (or even that each column is of unit length), $\kappa ={n}^{0.1}$ say and $\delta =o({n}^{0.1})$. For any $\kappa $robust projection $\mathrm{\Pi}$, $\mathrm{\Pi}{A}_{j}\approx \mathrm{\Pi}{\stackrel{~}{A}}_{j}$ for each data point $j\in [m]$ . So one could apply the worstcase algorithm to find a robust projection for the corrupted input $\stackrel{~}{A}$, and hope to also get a robust subspace of lowerror for the unknown matrix $A$.
However, there are two major challenges in implementing this strategy. (1) Solution value of $\stackrel{\mathrm{~}}{A}$: the robust projection ${\mathrm{\Pi}}^{*}$ may not achieve low error on the input matrix $\stackrel{~}{A}$; in fact, $\stackrel{~}{A}$ may not have any good robust lowrank approximation – in this case the algorithm output may be useless. This is because the entrywise perturbations could make $A$ and $\stackrel{~}{A}$ far away in aggregate e.g., the spectral norm of $A\stackrel{~}{A}$ could be $\delta \sqrt{nm}\gg \sqrt{m}$ (whereas, even ${\parallel A\parallel}_{F}\approx \sqrt{m})$. (2) Identifiability issue: perhaps more importantly, even if the perturbation $\stackrel{~}{A}$ has a robust lowrank robust projection of small error, we need to argue that this subspace indeed attains small error on $A$! The second issue is crucial in resolving the purely statistical aspect of the question; it involves ruling out the scenario where $\stackrel{~}{A}$ has good robust lowrank approximation that is very different from any for $A$.
Issue (2): Identifiability. To address the second issue, we prove that if the projection $\widehat{\mathrm{\Pi}}$ gives a small error on $\stackrel{~}{A}$, it necessarily gives a lowerror on $A$. Roughly speaking, if there are two datamatrices $A$ and $B$ that are both $\delta $perturbations of each other i.e., ${\parallel AB\parallel}_{\mathrm{\infty}}\le \delta $, then for $\gamma \in (0,1)$
$$ 
where ${\gamma}_{1}={\gamma}_{1}(\gamma ),{\gamma}_{2}={\gamma}_{2}(\gamma )\in (0,1)$. This statement is particularly tricky to show for the spectral norm (See Lemma 6.5 for a formal statement). We know that $\parallel {\mathrm{\Pi}}_{1}A{\mathrm{\Pi}}_{1}B\parallel $ and $\parallel {\mathrm{\Pi}}_{2}A{\mathrm{\Pi}}_{2}B\parallel $ are small since ${\mathrm{\Pi}}_{1},{\mathrm{\Pi}}_{2}$ are robust (see Lemma 4.3); however this does not give a handle on $\parallel A{\mathrm{\Pi}}_{2}A\parallel $. A natural approach is to argue that the actions of ${\mathrm{\Pi}}_{1}A$ and ${\mathrm{\Pi}}_{2}B$ on any unit vector are similar. The difficulty is in handling scenarios where some directions in ${\mathrm{\Pi}}_{2}$ are close to the subspace of ${\mathrm{\Pi}}_{1}$, yet their orthogonal component is small, but spread out (even when $r=1$). Our proof is somewhat indirect; we show that for every direction $v\in {\mathbb{S}}^{n1}$, (1) the lengths ${\parallel Av\parallel}_{2}$ and ${\parallel Bv\parallel}_{2}$ are similar and (2) the difference in the lengths ${\parallel Av\parallel}_{2}{\parallel Bv\parallel}_{2}$ is (approximately) lower bounded by ${\parallel {\mathrm{\Pi}}_{2}^{\u27c2}Av\parallel}_{2}$. This will allow us to conclude that ${\mathrm{\Pi}}_{2}^{\u27c2}A$ is small in spectral norm.
Issue (1): Solution value of $\stackrel{\mathrm{~}}{A}$. To tackle the first question, we do know that there is a data matrix (in particular the uncorrupted matrix $A$) that can be obtained from $\stackrel{~}{A}$ by perturbing each entry by at most $\delta $ (i.e., it is in the ${\mathrm{\ell}}_{\mathrm{\infty}}$ neighborhood of $\stackrel{~}{A}$) that has a robust lowrank approximation of low value $\epsilon {\parallel A\parallel}^{2}$. Let us suppose that we can solve the following optimization problem:
$${A}^{\prime}=\underset{\begin{array}{c}B:{\parallel B\stackrel{~}{A}\parallel}_{\mathrm{\infty}}\le \delta \end{array}}{\text{argmin}}\mathit{\hspace{1em}\hspace{1em}\u2006}\underset{\begin{array}{c}\mathrm{\Pi}:\text{rank}(\mathrm{\Pi})=r,{\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le \kappa \end{array}}{\mathrm{min}}{\parallel B\mathrm{\Pi}B\parallel}^{2}.$$  (11) 
We know that $A$ is a feasible solution, and hence ${A}^{\prime}$ has a robust lowrank approximation of even smaller error. Moreover ${\parallel A{A}^{\prime}\parallel}_{\mathrm{\infty}}\le 2\delta $. This reduces the first issue to a purely computational question of solving the above optimization problem.
We show that when the error is measured in Frobenius norm, we can solve this problem approximately by instead solving a simpler optimization problem of finding ${A}^{\prime}$ that ${\mathrm{min}}_{B}{\parallel B\parallel}_{F}^{2}$ over all matrices $B$ that are valid $\delta $perturbations of $\stackrel{~}{A}$ i.e., ${\parallel AB\parallel}_{\mathrm{\infty}}\le \delta $. The minimizer here just reduces the magnitude of each entry by $\delta $ or until it is $0$. For more general $q\ne \mathrm{\infty}$, this corresponds to a simple convex minimization problem. For the spectral norm problem, we do not have an efficient algorithm. However by running the constant factor approximation algorithm for robust lowrank approximations (in spectral norm) for $\stackrel{~}{A}$, we will either find a good solution that also works for $A$, or we will certify that $\stackrel{~}{A}$ has no good robust lowrank approximation; this certifies that $\parallel A\stackrel{~}{A}\parallel $ is too large i.e., the data was poisoned significantly.
Finally we get the stronger computationally efficient guarantee for the spectral norm error (as for the Frobenius norm error) if we can resolve the following open question.
Open Question. Given $\stackrel{~}{A}$, is there a polynomial time $(O(1),O(1))$ bicriteria approximation algorithm for
$$\underset{\begin{array}{c}B:{\parallel B\stackrel{~}{A}\parallel}_{\mathrm{\infty}}\le \delta \end{array}}{\mathrm{min}}\mathit{\hspace{1em}\hspace{1em}\u2006}\underset{\begin{array}{c}\mathrm{\Pi}:\text{rank}(\mathrm{\Pi})=r,{\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le \kappa \end{array}}{\mathrm{min}}{\parallel B\mathrm{\Pi}B\parallel}^{2},\text{where}\parallel \cdot \parallel \text{stands for the spectral norm}.$$ 
2.5.3 Applications to Unsupervised Learning
Our low rank approximation guarantees that tolerate corruptions to training data have interesting consequences for unsupervised learning, even in the special case when the subspace rank $r$ equals one (for mean estimation). More importantly, these guarantees when used carefully, lead to a provably robust clustering algorithm. Establishing this is quite nontrivial and is technically involved. Here we will mention the key insights.
At a high level we need to ensure (both algorithhmically and in the analysis) that running an iterative procedure on a dataset where every point could be corrupted does not compound the errors. For mean estimation, Theorem 2.5 gives an algorithm that either certifies that the data is poisoned, or outputs a one dimensional subspace $\mathrm{\Pi}$ that is good for both $A$ and $\stackrel{~}{A}$. It is easy to see then that $\widehat{\mu}=\mathrm{\text{Mean}}(\mathrm{\Pi}\stackrel{~}{A})$ will be a good estimate of $\mathrm{\text{Mean}}(A)$. However, this yields a suboptimal bound. We instead prove a stronger version for one dimensional subspaces (Lemma 7.2) that gives the claimed upper bound on the estimation error. We complement this with a matching lower bound in Theorem 7.3 by constructing two $\delta $close datasets, both having low projection errors (${\sigma}^{2}m$ error) onto the $\kappa $robust subspaces, $\kappa \delta =O(\sigma )$ but the means separated by $\mathrm{\Omega}(\mathrm{max}(\sigma ,\sqrt{\sigma {\mu}_{\mathrm{max}}}))$, where ${\mu}_{\mathrm{max}}$ is the maximum ${\mathrm{\ell}}_{2}$ length among the two mean vectors.
Next we will sketch how we achieve the clustering guarantee in Theorem 2.7. We describe the techniques for the special case of equal cluster sizes that captures the main ideas. See Section 7.2 for the general analysis. We analyze a robust variant of the Lloyd’s algorithm. Our initialization algorithm is presented in Figure 4 and the iterative Lloyd’s updates are presented in Figure 5. We proceed in three stages: a) an initialization stage, b) a center improvement stage, and c) analyzing the robust Lloyd’s updates. Each stage poses unique challenges arising from working with $\stackrel{~}{A}$ where each data point is potentially corrupted. The standard way to initialize Lloyd’s algorithm is to project the data onto the best rank$k$ subspace and run a $k$means approximation algorithm^{3}^{3} 3 This initialization is needed for theoretical bounds. In practice, the initial centers are chosen as random data points.. However, when every data point is corrupted, this could be arbitrarily bad. We instead run the algorithm from Figure 2 to compute a robust $k$dimensional $\mathrm{\Pi}$ for $\stackrel{~}{A}$ with small error, or certify that the dataset has been poisoned. We then project $\stackrel{~}{A}$ onto $\mathrm{\Pi}$ and run a constant factor $k$means approximation algorithm on the projected points to compute centers ${\nu}_{1}^{(0)},\mathrm{\dots},{\nu}_{k}^{(0)}$. Using the guarantee of Theorem 2.5, the key is to establish that $\stackrel{~}{A}$ when projected onto $\mathrm{\Pi}$ has a low cost solution when using the true means as centers. Since the true centers are well separated, the estimated centers will be somewhat close to the true centers. This shows that for each $r\in [k]$, the center ${\nu}_{r}^{(0)}$ is close to ${\mu}_{r}$ upto $O(\alpha k\sigma )$.
In the second stage we improve the initial center estimates, by a $\sqrt{k}$ factor, to get ${\nu}_{1}^{(1)},\mathrm{\dots},{\nu}_{k}^{(1)}$ that are $\sim \alpha \sqrt{k}\sigma /2$close to the corresponding true means. This is sketched in step $3$ of the algorithm in Figure 5. While such a step is not needed in standard analysis of Lloyd’s, it is crucial for our robust version. See Section 7.2 for a discussion on this. To argue about the center improvement stage, we use a trick from [awasthi2012improved]. The main technical contribution then is to establish Lemma 7.9 that bounds the clustering error in terms of how close the current centers are to true ones. The lemma is a stronger version of similar statements that appear in [kumar2010clustering, awasthi2012improved]. It simultaneously helps us argue about the clustering error, and also the variance of each current cluster around its mean, a quantity crucial to bound in order to analyze the iterative updates. Here we need a novel “charging” argument to relate the mistakes made by the current centers on corrupted points, to the mistakes they would have made on uncorrupted points. This crucially relies on the fact that $A$ and $\stackrel{~}{A}$ are close to each other in the projected space, since the subspace is robust. Using the center improvement step and Lemma 7.9 we get Theorem 2.7.
To establish the stronger guarantee for mixtures of Gaussians we argue about the iterative updates in two steps. First we analyze the “ideal” updates, as if we had access to the uncorrupted data. This largely follows the analysis in [kumar2010clustering] and helps us argue that if the current center estimates ${\nu}_{r}^{(t)}$ are $\beta \alpha \sqrt{k}\sigma $ close to the corresponding ${\mu}_{r}$ (where $$), then in the next step the ideal updates will lead to an estimate of $\frac{\beta}{3}\alpha \sqrt{k}\sigma $. Next, the key technical contribution of this stage is to show, using Lemma 7.9, that when performing ideal updates, the variance of the formed clusters around their means is bounded even though the clusters themselves are impure! Next, we analyze the actual updates and use the guarantee of the robust mean procedure to argue that when given perturbed set as input, the RobustMean procedure will either certify that the set is poisoned, or will output an estimate that is within $\stackrel{~}{O}(\alpha \sigma )+\frac{\beta}{4}\alpha \sqrt{k}\sigma $. This in turn means that the new estimate output by the RobustMean procedure will be within $\stackrel{~}{O}(\alpha \sigma )+\frac{\beta}{2}\alpha \sqrt{k}\sigma $ of the true mean ${\mu}_{r}$. Hence, the updates will keep improving until the unavoidable error of $\stackrel{~}{O}(\alpha \sigma )$ (Theorem 7.6)
2.5.4 Statistical Model: The Spiked Covariance Model
Let us consider the special case when ${\mathrm{\Sigma}}^{*}=\theta {\mathrm{\Pi}}^{*}$ (hence ${\theta}_{\mathrm{min}}={\theta}_{\mathrm{max}}=\theta $) for the purposes of this discussion. Let us first consider the simpler problem of detection. Observe that if we look at the expectation $\mathbb{E}[A{A}^{\top}]$ (population average), then in the Yes case (when the distribution is $N(0,I+\theta {\mathrm{\Pi}}^{*})$), $\u27e8\mathbb{E}[A{A}^{\top}],{\mathrm{\Pi}}^{*}\u27e9=\u27e8I+\theta {\mathrm{\Pi}}^{*},{\mathrm{\Pi}}^{*}\u27e9=r+\theta r$, whereas in the No case (the distribution is $N(0,I)$), $\u27e8\mathbb{E}[A{A}^{\top}],\mathrm{\Pi}\u27e9=\u27e8I,\mathrm{\Pi}\u27e9=r$ for any rank $r$ projection matrix. One can distinguish between the Yes and No case if we can establish concentration bound for
$$\underset{\begin{array}{c}\mathrm{\Pi}\in \mathcal{P}\end{array}}{sup}\left\u27e8A{A}^{\top},\mathrm{\Pi}\u27e9\u27e8\mathbb{E}[A{A}^{\top}],\mathrm{\Pi}\u27e9\right\mathit{\hspace{1em}}\text{where}\mathcal{P}=\{\mathrm{\Pi}:{\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa ,{\mathrm{\Pi}}^{2}=\mathrm{\Pi},\mathrm{\Pi}={\mathrm{\Pi}}^{\top},\text{rank}(\mathrm{\Pi})=r\}.$$  (12) 
Here the highdimensional Gaussian need not be spherical. It is tricky to directly analyze this quantity for the set of rank$r$ robust projection matrices $\mathcal{P}$; this often introduces extra factors involving $r$ or $\kappa $ (and gave suboptimal bounds in some earlier results). We instead analyze the deviation as in (12) for each vector in an orthonormal basis for $\mathrm{\Pi}$ separately. Using Lemma 4.4, we show this reduces to analyzing deviation bounds for ${\parallel Av\parallel}_{2}^{2}$ over all “analytically” sparse vectors $v$. We can then use tools in high dimensional probability and empirical process theory [LTbook, Vershynin, Mendelson2010] to obtain almost tight upper bounds. Similar ideas also work for the problem of recovering ${\mathrm{\Pi}}^{*}$, when the data distribution is $N(0,I+\theta {\mathrm{\Pi}}^{*})$. Our (inefficient) algorithm just outputs the empirical minimizer of Frobenius norm error among projections in $\mathcal{P}$ as in (12). We use concentration bounds along with DavisKahan theorem for perturbations of singular spaces to give the required recovery guarantees. Our analysis is simple since we only deal with (sparse) vectors, as opposed to (robust) rank$r$ projection matrices. This simpler analysis also gives us a tight upper bound, and allows us to improve upon the upper bound in [VuLei13] for ${q}^{*}=1$ by a factor of $r$.
Our statistical lower bounds shows the tightness of results for $q=\mathrm{\infty}$, but also interestingly shows that a polynomial dependence of ${n}^{2/q}$ (on the dimension $n$) is necessary when $q\in (2,\mathrm{\infty})$. For this statistical lower bound, the insight is that the unit ${\mathrm{\ell}}_{{q}^{*}}$ ball over vectors in ${\mathbb{R}}^{n}$ has a large packing set (in terms of Euclidean distance) because it contains an ${\mathrm{\ell}}_{2}$ ball of radius ${n}^{1/21/{q}^{*}}\gg {n}^{1/2}$ inside it (note that when $q=\mathrm{\infty}$, the radius is ${n}^{1/2}$; this is consistent with the ${\mathrm{\ell}}_{1}$ ball having a small cover in ${\mathrm{\ell}}_{2}$ distance). Based on this packing of vectors along with a clever trick of [VuLei13], we construct a good enough packing set of projection matrices of dimension $r$ to apply Fano’s inequality.
Our computationally efficient algorithm for recovering the unknown robust subspace uses the same mathematical program (9) that we used for the worstcase algorithm along with an additional constraint on ${\parallel X\parallel}_{{q}^{*}}^{{q}^{*}}\le {r}^{{q}^{*}}{\kappa}^{2{q}^{*}}$ where ${\mathrm{\ell}}_{{q}^{*}}$ is the dual norm for ${\mathrm{\ell}}_{q}$ (this is a valid constraint because of Lemma 4.3). This allows us to get a better deviation bound for a quantity of the form (12) where $\mathcal{P}$ is instead the set of feasible SDP solutions. The rest of the analysis is similar to the worstcase algorithm, along with the DavisKahan perturbation bounds for singularspaces. Finally, we remark that the upper bounds arguments can be easily extended to the case of more general covariance structure for the principal components; once we get an estimate for the robust subspace projection ${\mathrm{\Pi}}^{*}$, we can use the empirical covariance matrix from the projected samples to approximately recover ${\mathrm{\Sigma}}^{*}$.
Note that the entrywise ${\mathrm{\ell}}_{{q}^{*}}$ norm in the constraint on ${\parallel X\parallel}_{{q}^{*}}^{{q}^{*}}$ is not a monotone norm for any $q>2$ (see Claim A.1). However on account of the monotonicity property of the $q\to {q}^{*}$ operator norm (and its associated constraint), we get the additional advantage that our algorithm always outputs a robust (or “analytically” sparse) subspace. To the best of our knowledge, previous polynomial time algorithms for spiked covariance model with large $r$ (see e.g., [Liuetal]) do not necessarily output a sparse subspace (they only argue about estimation error).
3 Related Work
Adversarial Robustness. Existing theoretical work on adversarial robustness has almost exclusively focused on supervised learning, and in particular on binary classification. These works include the study of adversarial counterparts of notions such as VC dimension and Rademacher complexity [cullina2018pac, khim2018adversarial, yin2018rademacher], evidence of computational barriers [bubeck2018adversarial, bubeck2018adversarial2, nakkiran2019adversarial, degwekar2019computational] and statistical barriers towards ensuring both low test error and low adversarial test error [tsipras2018robustness], and computationally efficient algorithms for adversarially robust learning of restricted classes such as degree$1$ and degree$2$ polynomial threshold functions [awasthi2019robustness]. Furthermore, recent works also provide evidence that adversarially robust supervised learning requires more training data than its nonrobust counterpart [schmidt2018adversarially, montasser2019vc]. As discussed in Section 2.2, in our setting, it is possible to be robust to test time perturbations and simultaneously enjoy a statistical edge over the nonrobust scenario!
The closest to our work is the result of [garg2018spectral] that studies a particular formulation of adversarially robust features. The authors consider computing, given i.i.d. samples from a distribution, a map $f$ such that, with high probability over a new example $x$ drawn from the same distribution, points close to $x$ get a nearby mapping (in ${\mathrm{\ell}}_{2}$ distance) under $f$. While similar in motivation to our work, the results in [garg2018spectral] do not aim to minimize the projection error and simply require the projection $f$ to be mean zero and variance one to avoid trivial solutions. Furthermore, the authors look at a specific type of spectral embedding given by the top eigenvectors of the Laplacian of an appropriate graph constructed on the training data. The bounds presented for this embedding depend on the eigenvalue gap present in the Laplacian matrix. Finally, it is not clear how to efficiently use the embedding on new test points, as it involves recomputing the Laplacian by incorporating the new point into the training set.
Low Rank Approximations.
There is a large body of work in randomized numerical linear algebra [kannan2017randomized] on methods such as column subset selection [boutsidis2009improved, deshpande2010efficient, boutsidis2014near] and CUR decompositions [drineas2008relative, boutsidis2017optimal] that aim to approximate a given matrix via a low dimensional subspace spanned by a small number of rows/columns of the matrix. However these algorithms do not necessarily yield robust representations; in particular the subspace that is spanned may not be robust in our sense ($q\to 2$ operator norm). There is also a large body of work on fast algorithms for computing low rank approximations [clarkson2017low, price2017fast, musco2017sublinear, song2017low, ban2019ptas]. Some of these works study the problem when the approximation error is measured in a more robust metric such as the ${\mathrm{\ell}}_{1}$ norm as opposed to the Frobenius norm [song2017low]. Again these results are not directly related to the notion of subspaces robustness that we study in this paper.
Sparse PCA. In the high dimensional regime where the number of samples is much less than the dimensionality, several works have pointed out inconsistent behavior of PCA [paul2007asymptotics, nadler2008finite, johnstone2009consistency]. As a result this led to the study of the sparse PCA problem where it is assumed that the leading eigenvector is sparse. This problem is typically studied under under an average case model known as the spiked covariance model [johnstone2001distribution]. In this model the data is assumed to be generated from a Gaussian with covariance matrix $I+\theta v{v}^{\top}$, where the leading eigenvector $v$ is assumed to be a sparse vector and $\theta $ is a parameter characterizing the signal strength. There have been several works that study minimax rates of estimating the leading eigenvector and more general subspaces under the spiked covariance model and various notions of sparsity [aminiwainwright, ma2013sparse, cai2013sparse, shen2013consistency, VuLei12, VuLei13]. In particular, the work of [VuLei12, VuLei13] studies estimation of principal subspaces in the high dimensional regime under two different notions of row/column sparsity of the orthonormal basis, and under the spiked covariance model and its generalizations. There have also been practical methods proposed to perform the above estimation under computational considerations. In particular, the work of [d2005direct] proposes an SDP based approach (without provable guarantees) and has resemblance to our SDP (we impose an additional constraint on $\parallel \cdot {\parallel}_{q\to {q}^{*}}$) studied in Section 8.3. The work of [BerthetRigollet] studies statistical and computational tradeoffs for the detection problem, i.e., given i.i.d. samples from a distribution that is either a standard Gaussian or a spiked model (with one leading sparse eigenvector). They design minimax optimal methods and show that the SDP of [d2005direct] achieves the best possible rate for a polynomial time algorithm assuming the planted clique conjecture. However, their lower bound does not apply to the spiked covariance model. The work of [gao2017sparse] was the first to provide computational lower bounds for sparse PCA under the spiked covariance model. These bounds have been further improved in the work of [brennan2018reducibility, brennan2019optimal]. The work of [Liuetal] studies computationally efficient estimation of subspaces under the spiked covariance and more general models.
Robustness to Corruptions in the Training Data.
There is large body of work, spanning both the theoretical computer science and the statistical communities, that formulates and studies robustness to training data corruptions in the context of both supervised and unsupervised learning. Here, we survey works that are most relevant in the context of our results. For classification problems, the commonly studied models are the random classification noise model [angluin1988learning] and the agnostic learning model [kearns1994toward], that capture corruptions to the labels of training points. More generally, there are also works studying the malicious noise [valiant1985learning, kearns1993learning] and the nasty noise [bshouty2002pac] models that allow for corruptions in both the training points and the corresponding labels. These models led to exciting developments in the design of robust algorithms for classification problems (see e.g., [kalai2008agnostically, klivans2009learning, kalai2012reliable, feldman2009distribution, awasthi2014power, diakonikolas2018learning, daniely2015ptas]). There is also a large body of work on Robust Optimization [ben2009robust], where the input is uncertain and is assumed to belong to a structured uncertainty set. In robust optimization one looks for a single solution that is simultaneously good for all inputs in the uncertainty set, leading to a maxmin formulation of the problem. In our model of corruption, we are interested in instance wise guarantees  for every input $A$ and its corruption $\stackrel{~}{A}$, the algorithm is required to output a solution that is good for $A$. Moreover we are not aware of any results related to PCA in this context.
The problem of robust PCA has received significant attention in recent years [de2003framework, candes2011robust, chandrasekaran2011rank]. Here one assumes that a given corrupted matrix $\stackrel{~}{A}$ is a sum of two matrices, the true matrix $A$ that is lowrank and a sparse corruption matrix $S$ with sparsity pattern being essentially random. The corruptions, although sparse, can be unbounded in magnitude. This necessitates an incoherence type assumption that the “mass” or the principal components of $A$ is spread out – recovery of $A$ is impossible under unbounded sparse corruptions when the signal is localized or sparse. On the other hand, the corruptions may not be sparse in our case. In particular, every data point (in fact every entry of $A$) could be corrupted up to some specified magnitude $\delta $. Here as our results show (particularly Theorem 2.4), localization (or sparsity) of the signal is crucial in tolerating adversarial perturbations in the training data (e.g., a spread out signal can be completely overwhelmed by the corruption in each entry of $A$).
In statistics, Huber’s $\epsilon $contamination model [huber2011robust] is the most widely studied. In this model the dataset is assumed to be generated i.i.d. from a mixture namely, $(1\epsilon )P+\epsilon Q$. Here $P$ is the true distribution and is assumed to be well behaved, for example the Gaussian distribution, and $Q$ is an arbitrary distribution modeling the noise. The study of this model has led to insightful results for a variety of problems. The work of Yatracos [yatracos1985rates] and more recently of [chen2016general] studies general estimation in Huber’s model. More relevant to us are works on mean estimation and clustering in Huber’s model.
Mean Estimation. The classical work of Tukey [tukey1975mathematics] proposed a robust estimator, now known as Tukey’s median, for robust mean estimation of Gaussian data. The more recent work of [chen2018robust] showed that Tukey’s estimator is minimax optimal and also proposed a minimax optimal estimator for robust covariance estimation. Recently, there have been many exciting developments in designing robust estimators of mean and covariance that are also computationally efficient. We discuss a few here. The works of [diakonikolas2019robust, lai2016agnostic] were the first to propose polynomial time algorithms for robust mean and covariance estimation of Gaussian data in Huber’s model, with dimension independent error bounds. This was later extended to more general distributions and the listdecodable setting [charikar2017learning, steinhardt2017resilience], optimal bounds for Gaussian data [diakonikolas2018robustly] and the study of computational/statistical tradeoffs [diakonikolas2017statistical, hopkins2019hard] and robust methodofmoments [kothari2017outlier]. There have also been works providing better sample complexity bounds in the Huber model if the mean vector is sparse [balakrishnan2017computationally, klivans2018efficient]. These works also study estimation in the spiked covariance model under corruptions. More recent developments include a linear time estimator for robust mean estimation [cheng2019high] and fast algorithms for robust covariance estimation [cheng2019faster]. There are also recent works studying computationally efficient robust optimization of more general objectives [diakonikolas2018sever, prasad2018robust]. We would like to point out that in these works (and several other recent works), the model of corruption is different than ours. In particular, rather than assuming that the data contains a few outliers (Huber’s model), in our model an adversary can potentially corrupt every data point up to magnitude $\delta $ (measured in ${\mathrm{\ell}}_{q}$ norm for $q\ge 2$). Hence the guarantees from these works do not translate into our setting.
Clustering. From the computational point of view, the work of Dasgupta [dasgupta1999learning] formulated the goal of clustering data generated from a mixture of wellseparated Gaussians. There is a large body of work on designing efficient algorithms for clustering in this setting, both for Gaussians and more general distributions [arora2005learning, vempala2004spectral, achlioptas2005spectral]. See recent works [regev2017learning, hopkinsli2018, diakonikolas2018list, kothari2017better] for a detailed discussion. The work of Kumar and Kannan [kumar2010clustering] abstracted out a common property of datasets (spectral stability as defined in (66)) that captures mixtures of well separated Gaussians, the planted partitioning model, and other well clustered instances. They showed that a single algorithm, namely the popular Lloyd’s algorithm, with the right initialization, provably computes optimal solution for such stable instances. The separation factor needed for Lloyd’s to work in [kumar2010clustering] was later improved by [awasthi2012improved]. The recent work of [cohen2017local] shows that local search obtains a polynomial time approximation scheme (PTAS) on such spectrally stable instances. However, this in general does not guarantee closeness of the clustering obtained to the optimal one. The works of [kalai2010efficiently, moitra2010settling, belkin2010polynomial] provided algorithms for clustering Gaussian mixtures with no separation requirement. These algorithms, inherently have an exponential dependence on the number of clusters, $k$, in the running time. Building on robust algorithms for mean estimation, there have also been works to perform robust clustering of well separated instances under Huber’s contamination model and its variants [brubaker2009robust, diakonikolas2018list, kothari2017better, kothari2017outlier, hopkinsli2018]. There have also been works in analyzing the EM algorithm for Gaussian mixtures. See [balakrishnan2017statistical] for a detailed discussion. While motivated by the study of the phenomenon of robustness, the above results do not provide guarantees in our model of corruption. As in the case of mean estimation, these results are designed to be robust to a small number of outliers (e.g., a small constant fraction) in the training set. In our corruption model on the other hand, every data point could be potentially corrupted up to magnitude $\delta $ (measured in ${\mathrm{\ell}}_{q}$ norm for $q\ge 2$).
4 Notation and Preliminaries
Norms.
For every $q\ge 1$ and $x\in {\mathbb{R}}^{n}$, we will use ${\parallel x\parallel}_{q}$ to denote the ${\mathrm{\ell}}_{q}$ norm of the vector $x$ i.e., ${\parallel x\parallel}_{q}^{q}={\sum}_{i\in [n]}{{x}_{i}}^{q}$. The dual norm of ${\mathrm{\ell}}_{q}$ is ${\mathrm{\ell}}_{{q}^{*}}$ where $1/{q}^{*}+1/q=1$. We will heavily use Holder’s inequality which states that
$$\text{(H\xf6lder\u2019s inequality)}\mathit{\hspace{1em}}\u27e8u,v\u27e9\le {\parallel u\parallel}_{q}\cdot {\parallel v\parallel}_{{q}^{*}}\mathit{\hspace{1em}}\forall u,v\in {\mathbb{R}}^{n}.$$  (13) 
When not specified, $\parallel x\parallel $ will denote the Euclidean norm of $x$. Further ${\mathbb{S}}^{n1}$ will represent the unit sphere for the Euclidean norm. For convenience, we will use ${\parallel x\parallel}_{0}$ to denote the sparsity i.e., the size of the support of $x$ (note that ${\mathrm{\ell}}_{0}$ is not a valid norm on vectors).
Operator Norms of Matrices.
We will use the following matrix norms. For any $q,p\ge 1$ and any matrix $M\in {\mathbb{R}}^{n\times m}$, we will denote by ${\parallel M\parallel}_{q\to p}={\mathrm{max}}_{y\in {\mathbb{R}}^{m},{\parallel y\parallel}_{q}\le 1}{\parallel My\parallel}_{p}$. By duality of vector norms, we have
$${\parallel M\parallel}_{q\to p}=\underset{y\in {\mathbb{R}}^{m},{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}\underset{z\in {\mathbb{R}}^{n},{\parallel z\parallel}_{{p}^{*}}\le 1}{\mathrm{max}}{z}^{\top}My=\underset{z\in {\mathbb{R}}^{n},{\parallel z\parallel}_{{p}^{*}}\le 1}{\mathrm{max}}\underset{y\in {\mathbb{R}}^{m},{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}{y}^{\top}{M}^{\top}z={\parallel {M}^{\top}\parallel}_{{p}^{*}\to {q}^{*}}.$$ 
When $p=q=2$, this corresponds to the spectral norm of the matrix $M$ i.e., the maximum singular value of $M$. When unspecified, we will use $\parallel M\parallel $ to denote the spectral norm of $M$. (Note that the above equalities from duality also show that the $\parallel A\parallel =\parallel {A}^{\top}\parallel $ i.e., the maximum right singular value is the same as the maximum left singular value).
Entrywise Norms of Matrices.
We will also consider various matrix norms obtained by considering a matrix $M\in {\mathbb{R}}^{m\times n}$ as a vector of size $mn$. In particular, for any $q\ge 1$ we will use ${\parallel M\parallel}_{q}$ to denote the ${\mathrm{\ell}}_{q}$ norm of the “flattened” vector corresponding to $M$ i.e., ${\parallel M\parallel}_{q}^{q}={\sum}_{i=1,j=1}^{m,n}{M(i,j)}^{q}$. The Frobenius norm ${\parallel M\parallel}_{F}={\parallel M\parallel}_{2}$. Moreover for matrices $A,B$, we use $\u27e8A,B\u27e9:=\text{tr}({A}^{\top}B)$ to represent the trace inner product.
High probability bounds.
We will say that an event holds with high probability (w.h.p.) if the probability of failure on a given instance is less than any polynomial of the input parameters e.g., the dimension $n$, and the number of data points $m$. We remark that for our randomized algorithms, we can amplify the success probability to $1\eta $ for any $\eta >0$ by repeating the algorithm $\mathrm{log}(1/\eta )$ times (hence these guarantees will in fact hold with exponentially small failure probability).
4.1 Approximation Algorithms for Operator Norms.
Here we briefly describe some known positive and negative results for approximating the $q\to p$ operator norm of a matrix (sometimes referred to as the $({\mathrm{\ell}}_{q},{\mathrm{\ell}}_{p})$Grothendieck problem. We will say that a randomized algorithm gives an $\alpha $factor approximation for the $q\to p$ operator norm (for some $\alpha \ge 1$) iff for any input matrix $M$ the algorithm outputs with probability at least $(1{n}^{\omega (1)})$ a vector $\widehat{x}\ne 0$ such that ${\parallel M\widehat{x}\parallel}_{p}/{\parallel \widehat{x}\parallel}_{q}\ge \frac{1}{\alpha}\cdot {\parallel M\parallel}_{q\to p}$. There are three simple cases ($q=1,1\le p\le \mathrm{\infty}$, and $p=\mathrm{\infty},1\le q\le \mathrm{\infty}$, and $q=p=2)$, where the norm can be computed in polynomial time. Some notable cases that are known to be intractable are the $\mathrm{\infty}\to 1$ and $\mathrm{\infty}\to 2$ norm (hence also $2\to 1$ by duality). The $\mathrm{\infty}\to 1$ norm is the wellknown Grothendieck’s problem [Gro56] (that is related to the cutnorm of a matrix [alon2004approximating] and has a rich history [krivine1978constantes, braverman2013grothendieck, alon2006quadratic, khot2011grothendieck, pisier2012grothendieck, reeds1991, khot2006sdp]).
More generally, if $$ (nonhypercontractive norms), computing the $q\to p$ norm is NPhard [steinberg2005computation] and in fact the problem exhibits a dichotomy in terms of approximation. Specifically, whenever $2\in [p,q]$, the problem admits constant factor approximation algorithms, whereas whenever $2\notin [p,q]$, the problem is hard to approximate within almost polynomial factors [bhaskara2011approximating]. Regarding approximation algorithms, the work of [steinberg2005computation] builds upon Nesterov’s theorems and extensions [nesterov1998semidefinite, wolkowicz2012handbook] and provides a $1/(\frac{2\sqrt{3}}{\pi}\frac{2}{3})\approx 2.29$ approximation for when $1\le p\le 2\le q\le \mathrm{\infty}$, and for the special case $p=2$ or $q=2$ the factor becomes $\sqrt{\pi /2}\approx 1.25$. Recently, improved upper and (almost matching) lower bounds were proved for many settings of $q,p$ in [bhattiprolu2018approximating, bhattiprolu2018inapproximability].
In what follows let ${\gamma}_{r}:={\mathbb{E}}_{g}[{\parallel g\parallel}_{r}]$ be the ${\mathrm{\ell}}_{r}$ norm of a standard normal random variable $g\sim N(0,1)$ i.e., the $r$th root of the $r$th Gaussian moment, and let ${\mathrm{\ell}}_{{q}^{*}}$ be the dual norm for ${\mathrm{\ell}}_{q}$. Specifically, for $q\ge p$ with $2\in [p,q]$, it is NPhard to approximate the $q\to p$ norm within a factor smaller than $\frac{1}{{\gamma}_{{q}^{*}}{\gamma}_{p}}$. The hardness factor matches the polynomial time algorithms from [steinberg2005computation] for cases when $q$ or $p$ equals 2, and this is encapsulated in Theorem 4.1. These algorithms are based on semidefinite programming (SDP) relaxations. The algorithm for the $q\to {q}^{*}$ norm for example first solves in polynomial time an SDP relaxation and produces a solution $X\u2ab00$ of value $\text{SDPvalue}\ge {\parallel M\parallel}_{q\to {q}^{*}}$; then the algorithm uses $X$ to produce with high probability a vector $y$ with ${\parallel My\parallel}_{{q}^{*}}/{\parallel y\parallel}_{q}\ge \frac{1}{\alpha}\cdot \text{SDPvalue}\ge \frac{1}{\alpha}{\parallel M\parallel}_{q\to {q}^{*}}$, where $\alpha \ge 1$ is the approximation factor.
Theorem 4.1 (Combining [nesterov1998semidefinite, steinberg2005computation]).
For computing the $\mathrm{\infty}\mathrm{\to}\mathrm{2}$ norm, there is a randomized polynomial time algorithm that gives a $\sqrt{\pi \mathrm{/}\mathrm{2}}\mathrm{\approx}\mathrm{1.25}$approximation algorithm, and for the $q\mathrm{\to}\mathrm{2}$ norm there is a randomized polynomial time algorithm that gives a $\mathrm{1}\mathrm{/}{\gamma}_{{q}^{\mathrm{*}}}$factor approximation. Furthermore, when the input matrices are positive semidefinite matrices, there exists randomized polynomial time algorithms that give a $\pi \mathrm{/}\mathrm{2}$ approximation for the $\mathrm{\infty}\mathrm{\to}\mathrm{1}$ norm, and a $\mathrm{1}\mathrm{/}{\gamma}_{{q}^{\mathrm{*}}}^{\mathrm{2}}$ factor approximation for the $q\mathrm{\to}{q}^{\mathrm{*}}$ operator norm respectively. These algorithms are SDPbased randomized algorithms that succeed with high probability for any given instance.
The approximation algorithms from Theorem 4.1 will serve as separation oracles for solving the convex relaxations that we will use for our algorithms. As noted above these approximability results are tight assuming $P\ne NP$ [bhattiprolu2018inapproximability].
4.2 Properties of Robust Projections.
Throughout the paper we will use the term projections and projection matrices to always refer to orthogonal projection matrices on to linear subspaces of ${\mathbb{R}}^{n}$. Next we list and prove some simple properties of subspaces with robust projection matrices i.e., subspaces with ${\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}$ (or more generally $q\to 2$ norm for some $q\ge 2$) that is upper bounded.
For any ${q}^{*}\in [1,2]$, the ratio of the ${\mathrm{\ell}}_{{q}^{*}}$ vs ${\mathrm{\ell}}_{2}$ corresponds to an analytic notion of sparsity. The following claim gives an upper bound on the ${\mathrm{\ell}}_{{q}^{*}}$ norm in terms of the sparsity.
Lemma 4.2 (Analytic Sparsity).
Consider any vector $v\mathrm{\in}{\mathrm{R}}^{n}$ of support size $k$. For any ${q}^{\mathrm{*}}\mathrm{\in}\mathrm{[}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{]}$, we have
$${\parallel v\parallel}_{{q}^{*}}\le {k}^{{\scriptscriptstyle \frac{1}{{q}^{*}}}{\scriptscriptstyle \frac{1}{2}}}{\parallel v\parallel}_{2}.$$ 
In particular, ${\mathrm{\parallel}v\mathrm{\parallel}}_{\mathrm{1}}\mathrm{\le}\sqrt{k}\mathit{}{\mathrm{\parallel}v\mathrm{\parallel}}_{\mathrm{2}}$ for vectors with support size at most $k$.
On the other hand, it is easy to see that the bound given here is tight for any vector that is equally spread out among its support of size $k$.
Proof.
Without loss of generality suppose ${\parallel v\parallel}_{2}=1$ (if $v=0$ it holds trivially). Let $v$ have support $S$ of size $k$. Set $p:=2/{q}^{*}$, and let $u$ be the vector such that ${u}_{i}={{v}_{i}}^{q*}$ for each $i\in [n]$. By Holder’s inequality
$${\parallel v\parallel}_{{q}^{*}}^{{q}^{*}}=\sum _{i\in S}1\cdot {u}_{i}\le {\parallel {\mathrm{\U0001d7cf}}_{S}\parallel}_{{p}^{*}}{\parallel u\parallel}_{p}\le {k}^{1/{p}^{*}}{(\sum _{i}{{v}_{i}}^{p{q}^{*}})}^{1/p}\le {k}^{1{q}^{*}/2}{\parallel v\parallel}_{2}^{2/p}={k}^{1{q}^{*}/2},$$ 
hence establishing the lemma. ∎
Recall that ${\mathrm{\ell}}_{{q}^{*}}$ corresponds to the dual norm for ${\mathrm{\ell}}_{q}$, and ${q}^{*}\in [1,2]$ when $q\ge 2$. The following simple lemma proves two useful properties of robust subspaces i.e., subspaces having projection matrices with bounded $\mathrm{\infty}\to 2$ norm (or more generally $q\to 2$ norm for $q>2$). The first property shows that any two vectors that are close in ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm will have nearby projections onto any subspace that is robust. The second property shows that a subspace is robust (i.e., has a robust projection matrix) exactly when every vector in the subspace is analytically sparse.
Lemma 4.3.
[Properties of Robust Subspaces and Projections] Consider any subspace of $\mathrm{V}\mathrm{\subseteq}{\mathrm{R}}^{n}$ with projection matrix $\mathrm{\Pi}\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}n}$ satisfying ${\mathrm{\parallel}\mathrm{\Pi}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $. We have the following two properties:

I.
Closeness of projections of pertubations: For any vector $v$ and its perturbation $\stackrel{~}{v}$
$${\parallel v\stackrel{~}{v}\parallel}_{q}\le \delta \mathit{\hspace{1em}}\u27f9\mathit{\hspace{1em}}{\parallel \mathrm{\Pi}\stackrel{~}{v}\mathrm{\Pi}v\parallel}_{2}\le \kappa \delta .$$ 
II.
Analytic sparsity: For any $v\in \mathcal{V}$, we have ${\parallel v\parallel}_{{q}^{*}}\le \kappa {\parallel v\parallel}_{2}$, where ${q}^{*}=q/(q1)$. Moreover, if every vector in $\mathcal{V}$ has ${\parallel v\parallel}_{{q}^{*}}\le \kappa {\parallel v\parallel}_{2}$, then ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa $. In particular ${\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le \kappa $ if and only if ${\parallel v\parallel}_{1}\le \kappa $ for all unit vectors $v\in {\mathbb{S}}^{n1}\cap \mathcal{V}$.
Proof.
We first show property (I). Let $u:=v\stackrel{~}{v}$. Then
$${\parallel \mathrm{\Pi}\stackrel{~}{v}\mathrm{\Pi}v\parallel}_{2}=\parallel \mathrm{\Pi}u\parallel \le {\parallel \mathrm{\Pi}\parallel}_{q\to 2}{\parallel u\parallel}_{q}\le \kappa \delta .$$ 
To show property (II), note that by duality of matrix operator norms we have ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}={\parallel {\mathrm{\Pi}}^{\top}\parallel}_{2\to {q}^{*}}={\parallel \mathrm{\Pi}\parallel}_{2\to {q}^{*}}$.
$$\text{Hence}\mathit{\hspace{1em}\hspace{1em}\u2006}\forall v\in {\mathbb{S}}^{n1}\cap \mathcal{V},{\parallel v\parallel}_{{q}^{*}}={\parallel \mathrm{\Pi}v\parallel}_{{q}^{*}}\le {\parallel \mathrm{\Pi}\parallel}_{2\to {q}^{*}}{\parallel v\parallel}_{2}\le \kappa .$$ 
For the converse, if there exists $v\in {\mathbb{S}}^{n1}\cap \mathcal{V}$ s.t. ${\parallel v\parallel}_{{q}^{*}}>\kappa $, then by duality ${\parallel \mathrm{\Pi}\parallel}_{2\to {q}^{*}}={\parallel \mathrm{\Pi}\parallel}_{q\to 2}>\kappa $.
∎
Observe that the robustness condition on the subspace as captured by the $q\to 2$ operator norm bound of its projection matrix $\mathrm{\Pi}$ is basis independent.The following gives a simple sufficient condition on the basis of the subspace that implies robustness of the subspace spanned by them. This relates our robustness of the subspace to alternate notions of sparsity of subspaces that have been studied in the literature on sparse PCA [VuLei12, VuLei13].
Lemma 4.4.
Given any orthonormal basis ${v}_{\mathrm{1}}\mathrm{,}{v}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{v}_{r}$ for a subspace $\mathrm{V}$ such that ${\mathrm{\parallel}{v}_{i}\mathrm{\parallel}}_{{q}^{\mathrm{*}}}\mathrm{\le}\kappa $ for each $i\mathrm{\in}\mathrm{[}r\mathrm{]}$, we have ${\mathrm{\parallel}\mathrm{\Pi}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\sqrt{r}\mathit{}\kappa $.
Proof.
Firstly, $\mathrm{\Pi}={\sum}_{i=1}^{r}{v}_{i}{v}_{i}^{\top}$, and ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}={\parallel \mathrm{\Pi}\parallel}_{2\to {q}^{*}}$. We have
$${\parallel \mathrm{\Pi}\parallel}_{q\to 2}=\underset{u:{\parallel u\parallel}_{q}\le 1}{\mathrm{max}}{\parallel \sum _{i=1}^{r}\u27e8u,{v}_{i}\u27e9{v}_{i}\parallel}_{2}\le \sqrt{\sum _{i=1}^{r}{\u27e8u,{v}_{i}\u27e9}^{2}}\le \sqrt{r}\cdot \underset{u:{\parallel u\parallel}_{q}\le 1}{\mathrm{max}}\underset{v:{\parallel v\parallel}_{{q}^{*}}\le \kappa}{\mathrm{max}}\u27e8u,v\u27e9\le \sqrt{r}\kappa .$$ 
∎
For a given matrix $B\in {\mathbb{R}}^{n\times m}$, let us denote by $\mathrm{\Pi}(B)$ to the projection matrix onto the column space of $B$. The following lemma shows that the best lowrank $(\kappa ,q)$robust projection objective (3) also finds the lowrank approximation that has smallest error among ones with a $(\kappa ,q)$ robust column space.
Lemma 4.5.
Let ${\mathrm{P}}_{r}$ be the set of all rank$r$ projection matrices. Given a data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ and a given parameter $\kappa \mathrm{\ge}\mathrm{1}$, $q\mathrm{>}\mathrm{0}$, we have
$$\underset{\begin{array}{c}\mathrm{\Pi}\in {\mathcal{P}}_{r}\\ {\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa \end{array}}{\mathrm{min}}\parallel A\mathrm{\Pi}A\parallel =\underset{\begin{array}{c}B:\text{\mathit{r}\mathit{a}\mathit{n}\mathit{k}}(B)\le r,\\ {\parallel \mathrm{\Pi}(B)\parallel}_{q\to 2}\le \kappa \end{array}}{\mathrm{min}}\parallel AB\parallel ,$$ 
where $\mathrm{\parallel}M\mathrm{\parallel}$ here stands for the spectral norm. The above statement is also true for the Frobenius norm.
Proof.
Let ${B}^{*}$ be the minimizer for the right minimization problem and let ${\mathrm{\Pi}}_{2}=\mathrm{\Pi}({B}^{*})$ be its projection matrix, and let ${\mathrm{\Pi}}_{1}$ be the minimizer for the left optimization problem. It is easy to see that $\parallel A{\mathrm{\Pi}}_{1}A\parallel \ge \parallel A{B}^{*}\parallel $, since ${\mathrm{\Pi}}_{1}A$ is also a feasible choice for $B$ in the right minimization problem. To prove the other direction, if ${A}_{i},{B}_{i}^{*}$ are the $i$th columns of $A,{B}^{*}$ respectively then,
$\parallel A{B}^{*}\parallel $  $=\parallel A{\mathrm{\Pi}}_{2}{B}^{*}\parallel =\parallel {\mathrm{\Pi}}_{2}(A{B}^{*})+{\mathrm{\Pi}}_{2}^{\u27c2}A\parallel $  
$=\underset{u\in {\mathbb{S}}^{m1}}{\mathrm{max}}{\parallel {\displaystyle \sum _{i\in [m]}}{u}_{i}{\mathrm{\Pi}}_{2}({A}_{i}{B}_{i}^{*})+{\displaystyle \sum _{i\in [m]}}{u}_{i}{\mathrm{\Pi}}_{2}^{\u27c2}{A}_{i}\parallel}_{2}$  
$\ge \underset{u\in {\mathbb{S}}^{m1}}{\mathrm{max}}{\parallel {\displaystyle \sum _{i\in [m]}}{u}_{i}{\mathrm{\Pi}}_{2}^{\u27c2}{A}_{i}\parallel}_{2}=\parallel {\mathrm{\Pi}}_{2}^{\u27c2}A\parallel \ge \parallel {\mathrm{\Pi}}_{1}^{\u27c2}A\parallel $ 
as required. The first inequality above follows since the column space of ${\mathrm{\Pi}}_{2}^{\u27c2}A$ and the column space of ${\mathrm{\Pi}}_{2}(A{B}^{*})$ are orthogonal. An identical proof also follows for the Frobenius norm.
∎
Monotonicity of Matrix Norms.
The following property of certain matrix norms will be crucial in designing constant factor approximation algorithms for the lowrank approximations.
Definition 4.6 (Monotone matrix norm).
A matrix norm $\u2af4\cdot \u2af4$ is said to be monotone if and only if
$$\forall A,B\u2ab00,\u2af4A+B\u2af4\ge \u2af4A\u2af4.$$  (14) 
Observe that it suffices to check the above condition for all rank$1$ PSD matrices $B$ i.e., $B=v{v}^{\top}$ for $v\in {\mathbb{R}}^{n}$. It is well known that all unitarily invariant matrix norms^{4}^{4} 4 A matrix norm $\u2af4\cdot \u2af4$ is unitarily invariant iff $\u2af4A\u2af4=\u2af4UAV\u2af4$ for all matrices $A$ and all unitary matrices $U,V$. are monotone (this is because unitarily invariant norms are just norms on the singular values). On the other hand, many other matrix norms including other entrywise norms ${\parallel X\parallel}_{q}$ or general operator norms ${\parallel X\parallel}_{q\to p}$ are not necessarily monotone (see Claim A.1 and Claim A.2 for some counterexamples). Perhaps surprisingly, the $q\to {q}^{*}$ matrix operator norms are monotone (see Lemma 5.6 for a simple proof of this fact)!
5 Worstcase Approximation Guarantees
5.1 Approximations in Frobenius norm error
We will aim to obtain a bicriteria approximation for the robust lowrank approximation problem given in (3) for the Frobenius norm objective. In what follows $q\in [2,\mathrm{\infty}]$.
$\underset{\mathrm{\Pi}}{\mathrm{min}}$  ${\parallel {\mathrm{\Pi}}^{\u27c2}A\parallel}_{F}^{2}=\underset{\mathrm{\Pi}}{\mathrm{min}}{\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},\mathrm{\Pi}\u27e9$  (15)  
s.t.  $\mathrm{\Pi}\text{is a projection matrix of rank}\le r,\text{and}{\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa .$  (16) 
We prove the following theorem.
Theorem 5.1.
Suppose the data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ has an (orthogonal) projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ such that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $ and the approximation error $O\mathit{}P\mathit{}T\mathrm{:=}{\mathrm{\parallel}\mathrm{(}I\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{)}\mathit{}A\mathrm{\parallel}}_{F}^{\mathrm{2}}$. There exists an algorithm that w.h.p. runs in polynomial time and finds a projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ such that for every $\gamma \mathrm{>}\mathrm{0}$,
${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le \sqrt{{C}_{G}(q)(1+1/\gamma )}\cdot \kappa ,\mathit{\text{and}}{\parallel (I\widehat{\mathrm{\Pi}})A\parallel}_{F}^{2}\le (2+\gamma )OPT,$  (17) 
where ${C}_{G}\mathit{}\mathrm{(}q\mathrm{)}\mathrm{>}\mathrm{0}$ is a constant that only depends on $q$ as given in Theorem 4.1 (for $q\mathrm{=}\mathrm{\infty}$ this value is at most $\pi \mathrm{/}\mathrm{2}$). Moreover, for any $\gamma \mathrm{>}\mathrm{0}$, there exists an algorithm that w.h.p. runs in polynomial time and finds an ${r}^{\mathrm{\prime}}\mathrm{\le}r\mathit{}\mathrm{(}\mathrm{1}\mathrm{+}\mathrm{1}\mathrm{/}\gamma \mathrm{)}$dimensional orthogonal projection $\widehat{\mathrm{\Pi}}$ such that
${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le \sqrt{{C}_{G}(q)(1+1/\gamma )}\cdot \kappa ,\mathit{\text{and}}{\parallel (I\widehat{\mathrm{\Pi}})A\parallel}_{F}^{2}\le (1+\gamma )OPT.$  (18) 
The theorem above will be established by proving a statement about the more general problem of finding a lowrank projection that satisfies any monotone norm constraint that can be approximately separated. While the $q\to 2$ norm is not monotone, as discussed in Section 4, we will show that applying the more general guarantee on an appropriate monotone norm helps prove Theorem 5.1 above. Let $\u2af4\cdot \u2af4$ be a monotone matrix norm. Consider the following generalization of problem (19) that given a data matrix $A\in {\mathbb{R}}^{n\times m}$, and a parameter $\kappa \ge 1$, finds a projection
$$\underset{\mathrm{\Pi}}{\mathrm{min}}{\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},\mathrm{\Pi}\u27e9\text{s.t.}\mathrm{\Pi}\text{is a projection matrix of rank}\le r,\text{and}\u2af4\mathrm{\Pi}\u2af4\le \kappa .$$  (19) 
Definition 5.2.
[$\alpha $approximately separable matrix norm] A matrix norm $\u2af4\cdot \u2af4$ over ${\mathbb{R}}^{n\times m}$ matrices is $\alpha $approximately separable for some $\alpha \ge 1$ iff there exists a (potentially) randomized algorithm that w.h.p. runs in time $\mathrm{poly}(n,m)$, and when given a PSD matrix $B\in {\mathbb{R}}^{n\times n}$ and a parameter $\kappa $ as input will either certify that $\u2af4B\u2af4\le \alpha \kappa $, or finds a $Z\in {\mathbb{R}}^{n\times n}$ such that (1) $\u27e8B,Z\u27e9>\kappa $, and (2) $\u27e8M,Z\u27e9\le \kappa $ for all $M$ s.t. $\u2af4M\u2af4\le \kappa $.
Note that in the above definition, the only potential randomness is from the potential random choices of the algorithm. As we will see later the operator norms that we will consider (e.g., $q\to 2$ norm and the $q\to {q}^{*}$ norm for $q\ge 2$) will be $O(1)$approximately separable. This will allow us to construct an appropriate separation oracle for using the Ellipsoid algorithm.
The following general theorem gives an $O(1)$ bicriteria approximation for the problem assuming the monotone matrix norm $\u2af4\cdot \u2af4$ can be optimized approximately in polynomial time.
Theorem 5.3.
Let $\mathrm{\u2af4}\mathit{}\mathrm{\cdot}\mathit{}\mathrm{\u2af4}$ be any matrix norm that is monotone and $\alpha $approximately separable for some $\alpha \mathrm{\ge}\mathrm{1}$. Suppose we are given as input a data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ that has an (orthogonal) projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ such that $\mathrm{\u2af4}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\u2af4}\mathrm{\le}\kappa $ and the approximation error $O\mathit{}P\mathit{}T\mathrm{:=}{\mathrm{\parallel}\mathrm{(}I\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{)}\mathit{}A\mathrm{\parallel}}_{F}^{\mathrm{2}}$. There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ such that for every $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
$\u2af4\widehat{\mathrm{\Pi}}\u2af4$  $\le \alpha \left(1+\frac{1}{\gamma}\right)\cdot \kappa ,\mathit{\text{and}}{\parallel (I\widehat{\mathrm{\Pi}})A\parallel}_{F}^{2}\le \left(2+\gamma \right)OPT.$  (20) 
Moreover, for any $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, there exists an algorithm that w.h.p. runs in polynomial time and finds an ${r}^{\mathrm{\prime}}\mathrm{\le}\mathrm{(}\mathrm{1}\mathrm{+}\mathrm{1}\mathrm{/}\gamma \mathrm{)}\mathit{}r$dimensional orthogonal projection $\widehat{\mathrm{\Pi}}$ such that
$\u2af4\widehat{\mathrm{\Pi}}\u2af4$  $\le \alpha \left(1+\frac{1}{\gamma}\right)\cdot \kappa ,\mathit{\text{and}}{\parallel (I\widehat{\mathrm{\Pi}})A\parallel}_{F}^{2}\le (1+\gamma )\cdot OPT.$  (21) 
We remark that the only randomization in the algorithm is in the construction of the separation oracle for the matrix norm constraint $\u2af4\mathrm{\Pi}\u2af4\le \kappa $. In particular the algorithm from Theorem 5.3 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (in fact with exponentially small failure probability), and hence in expectation.
We consider the following mathematical programming relaxation for the problem.
$\underset{X}{\mathrm{min}}$  ${\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},X\u27e9$  (22)  
s.t.  $\text{tr}(X)\le r$  (23)  
$0\u2aafX\u2aafI$  (24)  
$\u2af4X\u2af4\le \kappa $  (25) 
First we observe that this is a valid relaxation to the problem. In fact any feasible projection matrix $\mathrm{\Pi}$ of rank at most $r$ for (19) is a feasible solution to the above program (22)(25) with the same value. The intended solution here is just $X=\mathrm{\Pi}$. All the eigenvalues of $\mathrm{\Pi}$ are $0$ or $1$, since $\mathrm{\Pi}$ is a projection matrix; hence (23), (24) are satisfied. Moreover (25) is satisfied just because of the same constraint as in (19). Finally, the objective value is preserved since
$${\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},\mathrm{\Pi}\u27e9={\parallel A\parallel}_{F}^{2}\text{tr}(A{A}^{\top}\mathrm{\Pi})={\parallel A\parallel}_{F}^{2}\text{tr}(\mathrm{\Pi}A{(\mathrm{\Pi}A)}^{\top})={\parallel A\parallel}_{F}^{2}{\parallel \mathrm{\Pi}A\parallel}_{F}^{2}.$$ 
In the above program, the objective (22) and constraints (23)(24) define a semidefinite program (SDP). Moreover, for (25), we see that for any $\lambda \in [0,1]$, by triangle inequality $\u2af4\lambda {X}_{1}+(1\lambda ){X}_{2}\u2af4\le \lambda \u2af4{X}_{1}\u2af4+(1\lambda )\u2af4{X}_{2}\u2af4$. Hence the set of all $X$ that satisfies constraints (23)  (25) is convex. In general, constraint (25) may be NPhard to verify for a given PSD matrix $X$. However, we can use the fact that $\u2af4\cdot \u2af4$ is approximately separable to get a approximately feasible solution to the program in polynomial time, using the Ellipsoid method.
The following lemma shows that by truncating a solution of the program (22)(25) to just the terms corresponding to the large eigenvalues, we retain much of the objective.
Lemma 5.4.
Let $\epsilon \mathrm{,}\delta \mathrm{>}\mathrm{0}$, and let $M\mathrm{\u2ab0}\mathrm{0}$ have $\text{\mathit{t}\mathit{r}}\mathit{}\mathrm{(}M\mathrm{)}\mathrm{=}\mathrm{1}$. Suppose $X$ satisfies the SDP constraints (23) and (24) and $\mathrm{\u27e8}M\mathrm{,}X\mathrm{\u27e9}\mathrm{\ge}\mathrm{1}\mathrm{}\epsilon $. Suppose ${P}_{X}^{\mathrm{1}\mathrm{}\delta}$ is the projection operator onto the subspace spanned by eigenvectors of $X$ with eigenvalues at least $\tau \mathrm{:=}\mathrm{1}\mathrm{}\delta $. Then we have
$$\u27e8I{P}_{X}^{1\delta},M\u27e9\le \frac{\epsilon}{1\tau}.$$  (26) 
Proof.
Let $X={\sum}_{i=1}^{n}{\lambda}_{i}{v}_{i}{v}_{i}^{\top}$ be the eigendecomposition of $X$ (note that ${\lambda}_{i}\ge 0$ since $X$ is p.s.d.), and let $S=\{i:{\lambda}_{i}\ge \tau \}$. We have
$(1\epsilon )\text{tr}(M)$  $\le \u27e8M,X\u27e9={\displaystyle \sum _{i}}{\lambda}_{i}{v}_{i}^{\top}M{v}_{i}$  
$\text{tr}(M)$  $={\displaystyle \sum _{i}}{v}_{i}^{\top}M{v}_{i},\text{since}\{{v}_{i}:i\in [n]\}\text{is an orthonormal basis}$  
$\text{Hence}{\displaystyle \sum _{i}}({\lambda}_{i}1+\epsilon ){v}_{i}^{\top}M{v}_{i}$  $\ge 0\mathit{\hspace{1em}}\u27f9{\displaystyle \sum _{i\in S}}\epsilon {v}_{i}^{\top}M{v}_{i}+{\displaystyle \sum _{i\notin S}}(\tau +\epsilon 1){v}_{i}^{\top}M{v}_{i}\ge 0$  
$\sum _{i\in S}}\epsilon {v}_{i}^{\top}M{v}_{i$  $\ge {\displaystyle \sum _{i\notin S}}(\delta \epsilon ){v}_{i}^{\top}M{v}_{i}$  
$\text{Hence,}\u27e8I{P}_{X}^{1\delta},M\u27e9$  $\le {\displaystyle \frac{\epsilon}{\delta \epsilon}}\cdot \u27e8{P}_{X}^{1\delta},M\u27e9.$ 
Substituting the above inequality in $\u27e8{P}_{X}^{1\delta},M\u27e9+\u27e8I{P}_{X}^{1\delta},M\u27e9=\text{tr}(M)=1$, we get
$$\u27e8I{P}_{X}^{1\delta},M\u27e9\le \frac{1}{1+\frac{\delta \epsilon}{\epsilon}}=\frac{\epsilon}{\delta}.$$ 
∎
Proof of Theorem 5.3.
Let $OPT:=\epsilon {\parallel A\parallel}_{F}^{2}$ for some $\epsilon \in [0,1]$. Without loss of generality, in the rest of the proof we will assume that ${\parallel A\parallel}_{F}=1$. We will use the Ellipsoid algorithm to approximately solve the relaxation in (22)(25). As we have explained before, the feasible set is convex. We now show how to design an approximate hyperplane separation oracle for (25); the rest of the constraints just correspond to a simple SDP. Since $\u2af4\cdot \u2af4$ is $\alpha $approximately separable, we have a randomized polynomial time algorithm that given a matrix $\widehat{X}\u2ab00$ , either certifies that $\u2af4\widehat{X}\u2af4\le \alpha \kappa $ (e.g., when the SDP value is at most $\alpha \kappa $), and otherwise produces a separating hyperplane of the form $\u27e8Z,X\u27e9\le \kappa $ that is not satisfied by $\widehat{X}$. Let $T=\mathrm{poly}(n,m)$ be the number of iterations taken by the Ellipsoid algorithm to produce a solution of value at most $OPT$ (up to exponentially small additive error), assuming access to a separation oracle. Hence from a union bound over all the $T$ iterations, we can use the above randomized polynomial time algorithm for separating (25), and run the Ellipsoid algorithm to find a solution $\widehat{X}$ that satisfies $\u2af4\widehat{X}\u2af4\le \alpha \kappa $, and has objective value that is arbitrarily close to $OPT$. This algorithm runs in polynomial time with high probability (with an exponentially small probability it may not terminate when the separation oracle does not terminate in polynomial time). For the rest of the analysis, we condition on the event that the algorithm terminates in polynomial time.
Set $\delta :=1/(1+\gamma )$. Let $X={\sum}_{i}{\lambda}_{i}{v}_{i}{v}_{i}^{\top}$ and let $S=\{i:{\lambda}_{i}\ge 1\delta \}$. Define for each $i\in [S]$, ${\alpha}_{i}:=\u27e8{v}_{i}{v}_{i}^{\top},M\u27e9$, where $M=A{A}^{\top}$. We sort the elements of $S$ based on $\{{\alpha}_{i}\}$, and pick greedily the first $\mathrm{min}\{r,S\}$ of them to form $T\subseteq S$. Our projection matrix will be ${\mathrm{\Pi}}_{T}={\sum}_{i\in T}{v}_{i}{v}_{i}^{\top}$.
We first argue that the operator norm constraint is approximately satisfied. By the monotonicity of the $\u2af4\cdot \u2af4$ we have
$\u2af4{\mathrm{\Pi}}_{T}\u2af4$  $=\u2af4{\displaystyle \sum _{i\in T}}{v}_{i}{v}_{i}^{\top}\u2af4\le {\displaystyle \frac{1}{\tau}}\u2af4{\displaystyle \sum _{\begin{array}{c}i:{\lambda}_{i}>\tau \end{array}}}{\lambda}_{i}{v}_{i}{v}_{i}^{\top}\u2af4\le {\displaystyle \frac{\u2af4X\u2af4}{\tau}}\le {\displaystyle \frac{\alpha \kappa}{1\delta}}={\displaystyle \frac{\alpha (1+\gamma )}{\gamma}}\cdot \kappa $  (27) 
Also, from Lemma 5.4, we have
$\sum _{i\notin S}}\u27e8{v}_{i}{v}_{i}^{\top},M\u27e9$  $=\u27e8I{P}_{X}^{1\delta},M\u27e9\le {\displaystyle \frac{\epsilon}{\delta}},\text{and}$  
$\sum _{i\in S}}\u27e8(1{\lambda}_{i}){v}_{i}{v}_{i}^{\top},M\u27e9$  $\le {\displaystyle \sum _{i}}\u27e8(1{\lambda}_{i}){v}_{i}{v}_{i}^{\top},M\u27e9\le 1\u27e8X,M\u27e9\le \epsilon .$  
$\text{Hence,}{\displaystyle \sum _{i\in S}}{\lambda}_{i}{\alpha}_{i}$  $={\displaystyle \sum _{i\in S}}{\lambda}_{i}\u27e8{v}_{i}{v}_{i}^{\top},M\u27e9\ge 1\epsilon \left(1+{\displaystyle \frac{1}{\delta}}\right)=1(2+\gamma )\epsilon ,$ 
for our choice of $\delta =1/(1+\gamma )$. By our greedy choice of $T$, we have ${\sum}_{i\in T}{\alpha}_{i}\ge {\sum}_{i\in S}{\lambda}_{i}{\alpha}_{i}$, as ${\sum}_{i\in S}{\lambda}_{i}\le \text{tr}(X)\le \mathrm{min}\{r,S\}$, with each ${\lambda}_{i}\in [0,1]$. Thus ${\parallel {\mathrm{\Pi}}_{T}^{\u27c2}A\parallel}_{F}^{2}\le (2+\gamma )\epsilon $.
The guarantee in (18) is obtained by returning the projection ${\mathrm{\Pi}}_{S}={\sum}_{i\in S}{v}_{i}{v}_{i}^{\top}$. Observe that $S\le r/(1\delta )$ from (29). The operator norm bounds follows using the same argument as (18) with $T=S$. Moreover, the objective value follows directly from Lemma 5.4. ∎
Proof of Theorem 5.1.
Our goal will be to apply Theorem 5.3 to obtain our required guarantee. However the $q\to 2$ operator norm is not monotone when $q>2$; see Claim A.2 for a counterexample. Our crucial insight in this result is we can write down the following mathematical programming relaxation for the problem, where we instead use the ${\parallel \cdot \parallel}_{q\to {q}^{*}}$ norm which we show is indeed monotone!
$\underset{X}{\mathrm{min}}$  ${\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},X\u27e9$  (28)  
s.t.  $\text{tr}(X)\le r$  (29)  
$0\u2aafX\u2aafI$  (30)  
${\parallel X\parallel}_{q\to {q}^{*}}\le {\kappa}^{2}$  (31) 
In the above program, the objective (28) and constraints (29)(30) define a semidefinite program (SDP). We will show later that we can get $O(1)$ approximate separation oracle for the problem. The following lemma shows that the above SDP is a valid relaxation to the problem.
Lemma 5.5.
Any feasible projection matrix $\mathrm{\Pi}$ of rank $r$ satisfying (19) is a feasible SDP solution with the same value.
Proof.
The intended SDP solution here is just $X=\mathrm{\Pi}$. All the eigenvalues of $\mathrm{\Pi}$ are $0$ or $1$, since $\mathrm{\Pi}$ is a projection matrix; hence (29), (30) are satisfied. To verify (31), note that by duality and since ${\mathrm{\Pi}}^{2}=\mathrm{\Pi}$,
${\parallel \mathrm{\Pi}\parallel}_{q\to {q}^{*}}$  $=\underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}{\parallel \mathrm{\Pi}y\parallel}_{{q}^{*}}=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}{z}^{\top}\mathrm{\Pi}y=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}\u27e8\mathrm{\Pi}z,\mathrm{\Pi}y\u27e9=\underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}\u27e8\mathrm{\Pi}y,\mathrm{\Pi}y\u27e9$  
$=\underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}{\parallel \mathrm{\Pi}y\parallel}_{2}^{2}={\parallel \mathrm{\Pi}\parallel}_{q\to 2}^{2},$ 
as required. Finally, the objective value is preserved since
$${\parallel A\parallel}_{F}^{2}\u27e8A{A}^{\top},\mathrm{\Pi}\u27e9={\parallel A\parallel}_{F}^{2}\text{tr}(A{A}^{\top}\mathrm{\Pi})={\parallel A\parallel}_{F}^{2}\text{tr}(\mathrm{\Pi}A{(A\mathrm{\Pi})}^{\top})={\parallel A\parallel}_{F}^{2}{\parallel \mathrm{\Pi}A\parallel}_{F}^{2}.$$ 
∎
The following lemma shows that the norm in constraint (31) satisfies the monotonicity property (Definition 4.6), so that we can apply Propositon 5.3.
Lemma 5.6 (Monotonicity of $q\to {q}^{*}$ operator norm).
For any $q\mathrm{\ge}\mathrm{1}$, the operator norm ${\mathrm{\parallel}\mathrm{\cdot}\mathrm{\parallel}}_{q\mathrm{\to}{q}^{\mathrm{*}}}$ is monotone.
Proof.
Let $B\in {\mathbb{R}}^{n\times n}$. It suffices to show that for any $B\u2ab00,v\in {\mathbb{R}}^{n}$, ${\parallel B+v{v}^{\top}\parallel}_{q\to {q}^{*}}\ge {\parallel B\parallel}_{q\to {q}^{*}}$.
${\parallel B\parallel}_{q\to {q}^{*}}$  $=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}{z}^{\top}By=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}\u27e8{B}^{1/2}z,{B}^{1/2}y\u27e9\le \underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}\u27e8{B}^{1/2}y,{B}^{1/2}y\u27e9=\underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}{y}^{\top}By.$ 
In other words, the quadratic form is maximized by $y=z$. Moreover for every $y$, ${y}^{T}(B+v{v}^{\top})y={y}^{T}By+{\u27e8y,v\u27e9}^{2}\ge {y}^{\top}By$. Hence, ${\parallel B+v{v}^{\top}\parallel}_{q\to {q}^{*}}\ge {\parallel B\parallel}_{q\to {q}^{*}}$.
∎
Proof of Theorem 5.1.
We will apply Theorem 5.3 with $\u2af4\cdot \u2af4:={\parallel \cdot \parallel}_{q\to {q}^{*}}$. Lemma 5.5 shows the feasibility of the convex program . From Lemma 5.6, we have that the ${\parallel \cdot \parallel}_{q\to {q}^{*}}$ is monotone. We now show that ${\parallel \cdot \parallel}_{q\to {q}^{*}}$ is $O(1)$approximately separable. Recall that the
$${\parallel X\parallel}_{q\to {q}^{*}}=\underset{\begin{array}{c}y,z\in {\mathbb{R}}^{n}\text{s.t.}\\ {\parallel y\parallel}_{q},{\parallel z\parallel}_{q}\le 1\end{array}}{\mathrm{max}}\u27e8y{z}^{\top},X\u27e9.$$ 
Crucially, there is an efficient approximate separation oracle for (31) using an approximation algorithm for the $q\to {q}^{*}$ operator norm (see Theorem 4.1). The ${C}_{G}(q)$factor SDPbased approximation algorithm immediately gives a ${C}_{G}(q)$factor approximate separation oracle. This SDPbased algorithm either certifies that the given matrix $B\u2ab00$ has ${\parallel B\parallel}_{q\to {q}^{*}}\le {C}_{G}{\kappa}^{2}$ (when the SDP value is at most ${C}_{G}{\kappa}^{2}$); otherwise (when the SDP value is larger than ${C}_{G}{\kappa}^{2}$) the algorithm, w.h.p. runs in polynomial time and produces a solution ${y}^{\prime},{z}^{\prime}\in {\mathbb{R}}^{n}$ with ${\parallel {y}^{\prime}\parallel}_{q},{\parallel {z}^{\prime}\parallel}_{q}\le 1$ that gives a separating hyperplane of the form $\u27e8{y}^{\prime}{({z}^{\prime})}^{\top},X\u27e9\le {\kappa}^{2}$ (that $B$ does not satisfy)^{5}^{5} 5 The ${C}_{G}(q)$factor SDPbased algorithm has the property that when the SDP value is larger than ${C}_{G}{\kappa}^{2}$, there is a rounding algorithm that runs in time $\mathrm{poly}(n,m)$ and with high probability produces the desired solution ${y}^{\prime},{z}^{\prime}$. To get the Las Vegas guarantee, we simply run the rounding algorithm repeatedly until it outputs the desired solution.. Hence ${\parallel \cdot \parallel}_{q\to {q}^{*}}$ is $\alpha ={O}_{q}(1)$ approximately separable (in particular when $q=\mathrm{\infty}$, we have $\alpha \le \pi /2$). Hence Theorem 5.3 can be applied to finish the proof. ∎
5.2 Approximations in the Spectral norm
We now show how techniques similar to those in Section 5.2 can also be extended to robust lowrank approximations, when the error is measured in spectral norm as opposed to the Frobenius norm. In this section we will use $\parallel A\parallel $ to denote the spectral norm of matrix $A$. For convenience of exposition, we will measure the projection error in (3) using the spectral norm $\parallel A\mathrm{\Pi}A\parallel $ as opposed to the squared spectral norm.
Theorem 5.7.
Suppose the data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ has an (orthogonal) projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ such that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $ and the approximation error $O\mathit{}P\mathit{}T\mathrm{:=}\mathrm{\parallel}\mathrm{(}I\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{)}\mathit{}A\mathrm{\parallel}$. There exists an algorithm that w.h.p. runs in polynomial time, and finds an (orthogonal) projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ such that for every $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le \sqrt{{C}_{G}(q)(1+2/\gamma )}\cdot \kappa ,\mathit{\text{and}}\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le \sqrt{(3+\gamma )}\cdot OPT,$  (32) 
where ${C}_{G}\mathit{}\mathrm{(}q\mathrm{)}\mathrm{>}\mathrm{0}$ is a constant that only depends on $q$ as given in Theorem 4.1. For $q\mathrm{=}\mathrm{\infty}$ this value is known to be at most $\pi \mathrm{/}\mathrm{2}$.
Moreover, for any $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, there exists an algorithm that w.h.p. runs in polynomial time and finds an ${r}^{\mathrm{\prime}}\mathrm{\le}\mathrm{(}\mathrm{1}\mathrm{+}\frac{\mathrm{2}}{\gamma}\mathrm{)}\mathit{}r$ dimensional orthogonal projection $\widehat{\mathrm{\Pi}}$ such that
${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le \sqrt{{C}_{G}(q)(1+2/\gamma )}\cdot \kappa ,\mathit{\text{and}}\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le \sqrt{(1+\gamma )}\cdot OPT.$  (33) 
We remark that as in Theorem 5.1, the only randomization in the algorithm from Theorem 5.7 is in the construction of the separation oracle for the matrix norm constraint. In particular the algorithm from Theorem 5.7 has a Las Vegas guarantee i.e., the algorithm is always correct, and the running time is polynomial with high probability (and hence in expectation).
Proof of Theorem 5.7.
We will use the following mathematical relaxation for the problem.
$\mathrm{min}$  $\lambda $  (34)  
$\text{s.t.}{A}^{\top}(IX)A$  $\u2aaf\lambda I$  (35)  
${\parallel X\parallel}_{q\to {q}^{*}}$  $\le {\kappa}^{2}$  (36)  
$\text{tr}(X)\le r,$  $\text{and}0\u2aafX\u2aafI$  (37) 
To see that the above program is a valid relaxation for the problem, note that any rank$r$ projection matrix $\mathrm{\Pi}$ satisfies (37). Moreover as in Lemma 5.5 we see that
${\parallel \mathrm{\Pi}\parallel}_{q\to {q}^{*}}$  $=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}{z}^{\top}\mathrm{\Pi}y=\underset{{\parallel y\parallel}_{q}\le 1,{\parallel z\parallel}_{q}\le 1}{\mathrm{max}}\u27e8\mathrm{\Pi}z,\mathrm{\Pi}y\u27e9=\underset{{\parallel y\parallel}_{q}\le 1}{\mathrm{max}}{\parallel \mathrm{\Pi}y\parallel}_{2}^{2}={\parallel \mathrm{\Pi}\parallel}_{q\to 2}^{2}.$ 
Finally to see that the objective value is preserved, note that for any projection matrix $\mathrm{\Pi}$,
$$\parallel {A}^{\top}(I\mathrm{\Pi})A\parallel =\parallel {A}^{\top}{\mathrm{\Pi}}^{\u27c2}{\mathrm{\Pi}}^{\u27c2}A\parallel ={\parallel {\mathrm{\Pi}}^{\u27c2}A\parallel}^{2}={\parallel A\mathrm{\Pi}A\parallel}^{2},$$ 
as required. Lemma 5.8 shows that the above program can be solved approximately in polynomial time.
Lemma 5.8.
There exists a universal constant $c\mathrm{=}c\mathit{}\mathrm{(}q\mathrm{)}\mathrm{\ge}\mathrm{1}$ and a randomized algorithm that given an instance $A\mathrm{\in}{\mathrm{R}}^{m\mathrm{\times}n}$ that has a feasible solution ${X}^{\mathrm{*}}$ to the relaxation (34)(37) with objective value ${\lambda}^{\mathrm{*}}$, runs in polynomial time w.h.p., and finds a solution $\widehat{X}$ satisfying (37) with objective value arbitrarily close to ${\lambda}^{\mathrm{*}}$ such that ${\mathrm{\parallel}\widehat{X}\mathrm{\parallel}}_{q\mathrm{\to}{q}^{\mathrm{*}}}\mathrm{\le}c\mathit{}{\kappa}^{\mathrm{2}}$.
Proof.
We will use the Ellipsoid algorithm to approximately solve the relaxation in (34)(37). Note that the feasible set is convex (including (36), which is just a upper bound constraint on a matrix norm). We now show how to design an approximate hyperplane separation oracle for (36) and a separation oracle for the rest of the constraints. The constraint (36) can be rewritten as $\u27e8y{z}^{\top},X\u27e9\le {\kappa}^{2}$ for all $y,z\in {\mathbb{R}}^{n}$ such that ${\parallel y\parallel}_{q},{\parallel z\parallel}_{q}\le 1$. Crucially, there is an efficient approximate separation oracle for (36) using an approximation algorithm for the $q\to {q}^{*}$ operator norm problem (see Theorem 4.1). As in Theorem 5.1, the $c:={C}_{G}(q)$factor SDPbased approximation algorithm immediately gives a $c$factor approximate separation oracle. This SDPbased algorithm either certifies that the given matrix $B\u2ab00$ has ${\parallel B\parallel}_{q\to {q}^{*}}\le {C}_{G}{\kappa}^{2}$ (when the SDP value is at most ${C}_{G}{\kappa}^{2}$); otherwise (when the SDP value is larger than ${C}_{G}{\kappa}^{2}$) the algorithm, w.h.p. runs in polynomial time, and produces a solution ${y}^{\prime},{z}^{\prime}\in {\mathbb{R}}^{n}$ with ${\parallel {y}^{\prime}\parallel}_{q},{\parallel {z}^{\prime}\parallel}_{q}\le 1$ that gives a separating hyperplane of the form $\u27e8{y}^{\prime}{({z}^{\prime})}^{\top},X\u27e9\le {\kappa}^{2}$ (that $B$ does not satisfy). Hence by a union bound, we can assume that (36) can be separated up to a $O(1)$ factor over all the $T=\mathrm{poly}(n,m)$ iterations of the Ellipsoid algorithm by an algorithm that w.h.p. runs in polynomial time.
The constraints (37) can be efficiently separated using an algorithm for eigenvalue computations. Finally, given $\lambda ,X$, (35) can also be efficiently separated by computing the maximum eigenvalue of ${A}^{\top}(IX)A$. Let $v\in {\mathbb{S}}^{n1}$ be the corresponding eigenvector. If the constraint is violated, the hyperplane separator is of the form
$$\u27e8v{v}^{\top},{A}^{\top}A\u27e9\u27e8v{v}^{\top},{A}^{\top}XA\u27e9\lambda \le 0,\text{i.e.,}\u27e8v{v}^{\top},{A}^{\top}A\u27e9\u27e8Av{v}^{\top}{A}^{\top},X\u27e9\lambda \le 0,$$ 
since $\text{tr}(v{v}^{\top}{A}^{\top}XA)=\text{tr}(Av{v}^{\top}{A}^{\top}X)$. This completes the proof. ∎
The proof of Theorem 5.7 also crucially uses the monotonicity of $q\to {q}^{*}$ matrix operator norm.
Proof of Theorem 5.7.
Let $OP{T}^{2}:={\epsilon}^{2}{\parallel A\parallel}^{2}$ for some $\epsilon \in (0,1]$. Set $\delta :=2/(2+\gamma )$. Lemma 5.8 shows that in polynomial time we obtain a solution $X\u2ab00$ satisfying (37), (35) with $\lambda \le OPT$, and ${\parallel X\parallel}_{q\to {q}^{*}}\le {C}_{G}(q){\kappa}^{2}$ (with exponentially small probability the algorithm does not terminate in polynomial time). The rest of the algorithm (and analysis) conditions on the success of this event. Let $X={\sum}_{i}{\lambda}_{i}{v}_{i}{v}_{i}^{\top}$ and let $S=\{i:{\lambda}_{i}\ge 1\delta \}$.
For the rest of the analysis we will assume without loss of generality that $\parallel A\parallel =1$. We first show the guarantee in (33). The projection output is just ${\mathrm{\Pi}}_{S}={\sum}_{i\in S}{v}_{i}{v}_{i}^{\top}$. Observe that $S\le r/(1\delta )$ from (37). Since the projector we output is just ${\mathrm{\Pi}}_{S}$, each of its associated eigenvalues as at least $1\delta $. Hence, the operator norm bounds follows using the monotonicity of the norm since ${\mathrm{\Pi}}_{S}\u2aaf\frac{1}{1\delta}X$. To verify the objective value we see that
$\parallel {\displaystyle \sum _{i\in [n]}}(1{\lambda}_{i}){A}^{\top}{v}_{i}{v}_{i}^{\top}A\parallel $  $=\parallel {A}^{\top}(IX)A\parallel \le {\epsilon}^{2}{\parallel A\parallel}^{2}={\epsilon}^{2}$  
$\parallel {\displaystyle \sum _{i\notin S}}\delta {A}^{\top}{v}_{i}{v}_{i}^{\top}A\parallel $  $\le \parallel {\displaystyle \sum _{i\notin S}}(1{\lambda}_{i}){A}^{\top}{v}_{i}{v}_{i}^{\top}A\parallel \le \parallel {\displaystyle \sum _{i\in [n]}}(1{\lambda}_{i}){A}^{\top}{v}_{i}{v}_{i}^{\top}A\parallel \le {\epsilon}^{2}$  
$\text{Hence}\parallel {A}^{\top}{\mathrm{\Pi}}_{S}^{\u27c2}A\parallel $  $\le {\displaystyle \frac{{\epsilon}^{2}}{\delta}}=\left(1+{\displaystyle \frac{\gamma}{2}}\right){\epsilon}^{2},$ 
as required. We now show the guarantee in (32) where we output a projection of rank at most $r$ (with no slack). Let ${M}^{\prime}:={\sum}_{i\in S}{A}^{\top}{v}_{i}{v}_{i}^{\top}A$. Let ${\mathrm{\Pi}}^{\prime}$ be the projection matrix for the subspace corresponding to the best rank $r$ projection of ${M}^{\prime}$. The algorithm outputs ${\mathrm{\Pi}}^{\prime}$.
Note that ${\mathrm{\Pi}}^{\prime}\u2aaf{\mathrm{\Pi}}_{S}$, hence by monotonicity, the $q\to {q}^{*}$ operator norm constraint is satisfied up to a $\alpha :=c$ factor. Also note that if ${\mathrm{\Pi}}^{*}$ is the projection that gives the optimal solution to the problem,
$$\parallel {M}^{\prime}{A}^{\top}{\mathrm{\Pi}}^{*}A\parallel \le \parallel {M}^{\prime}{A}^{\top}A\parallel +\parallel {A}^{\top}A{A}^{\top}{\mathrm{\Pi}}^{*}A\parallel \le {\epsilon}^{2}+\frac{{\epsilon}^{2}}{\delta}={\epsilon}^{2}(1+\frac{1}{\delta})=(2+\frac{\gamma}{2}){\epsilon}^{2}.$$ 
But ${A}^{\top}{\mathrm{\Pi}}^{*}A$ is a valid approximation of ${M}^{\prime}$ of rank at most $r$. Hence, we have that
$\parallel {M}^{\prime}{\mathrm{\Pi}}^{\prime}{M}^{\prime}{\mathrm{\Pi}}^{\prime}\parallel $  $\le {\epsilon}^{2}(1+\frac{1}{\delta})$  
$\parallel {A}^{\top}A{\mathrm{\Pi}}^{\prime}{A}^{\top}A{\mathrm{\Pi}}^{\prime}\parallel $  $\le \parallel {A}^{\top}A{\mathrm{\Pi}}^{\prime}{\mathrm{\Pi}}_{S}{A}^{\top}A{\mathrm{\Pi}}_{S}{\mathrm{\Pi}}^{\prime}\parallel \le \parallel {A}^{\top}A{M}^{\prime}\parallel +\parallel {M}^{\prime}{\mathrm{\Pi}}^{\prime}{M}^{\prime}{\mathrm{\Pi}}^{\prime}\parallel $  
$\le {\displaystyle \frac{2{\epsilon}^{2}}{\delta}}+{\epsilon}^{2}=(1+\frac{2}{\delta}){\epsilon}^{2}=(3+\gamma ){\epsilon}^{2},$ 
as required.
∎
As before the same ideas also give the following more general theorem for any monotone matrix norm $\u2af4\cdot \u2af4$ that is approximately separable.
Theorem 5.9.
Let $\mathrm{\u2af4}\mathit{}\mathrm{\cdot}\mathit{}\mathrm{\u2af4}$ be any matrix norm that is monotone and $\alpha $approximately separable for some $\alpha \mathrm{\ge}\mathrm{1}$. Suppose the data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ has a projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ such that $\mathrm{\u2af4}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\u2af4}\mathrm{\le}\kappa $ and the approximation error $O\mathit{}P\mathit{}T\mathrm{:=}\mathrm{\parallel}\mathrm{(}I\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{)}\mathit{}A\mathrm{\parallel}$. There exists an algorithm that w.h.p. runs in polynomial time, and finds an orthogonal projection $\widehat{\mathrm{\Pi}}$ of dimension at most $r$ such that for every $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
$\u2af4\widehat{\mathrm{\Pi}}\u2af4$  $\le \sqrt{\alpha (1+2/\gamma )}\cdot \kappa ,\mathit{\text{and}}\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le \sqrt{(3+\gamma )}\cdot OPT.$  (38) 
Moreover, for any $\gamma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$, there exists an algorithm that w.h.p. runs in polynomial time and finds an orthogonal projection $\widehat{\mathrm{\Pi}}$ of rank ${r}^{\mathrm{\prime}}\mathrm{\le}\mathrm{(}\mathrm{1}\mathrm{+}\frac{\mathrm{2}}{\gamma}\mathrm{)}\mathit{}r$ such that
$\u2af4\widehat{\mathrm{\Pi}}\u2af4$  $\le \sqrt{\alpha (1+2/\gamma )}\cdot \kappa ,\mathit{\text{and}}\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le \sqrt{(1+\gamma )}\cdot OPT.$  (39) 
We omit the proof, since the ideas are identical to Theorem 5.7.
5.3 Recovering the Optimal Projection Matrix
We now show that if the optimal robust lowrank projection has very small error compared to the $r$th smallest singular value of $A$, then we can in fact approximately recover the subspace itself up to small error measured in terms of the principal angles. For two subspaces with projection matrices ${\mathrm{\Pi}}_{1},{\mathrm{\Pi}}_{2}$, the Sin of the canonical angles matrix is given by ${\mathrm{\Pi}}_{1}^{\u27c2}{\mathrm{\Pi}}_{2}$. These techniques will also be helpful for recovery in the Spiked Covariance model. The following simple corollary will work for both Frobenius norm error and spectral norm error. For this purpose, we will just use $\u2af4A\u2af4$ to denote the norm of $A$, where the unspecified matrix norm $\u2af4\cdot \u2af4$ is norm in which we are measuring the error – either Frobenius norm or spectral norm.
Corollary 5.10.
Suppose the data matrix $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ has an $r$dimensional projection ${\mathrm{\Pi}}^{\mathrm{*}}$ such that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $, the approximation error $$ and ${\sigma}_{r}\mathit{}\mathrm{(}{\mathrm{\Pi}}^{\mathrm{*}}\mathit{}A\mathrm{)}\mathrm{\ge}\theta $. There exists an algorithm that w.h.p. runs in polynomial time, and finds a projection $\widehat{\mathrm{\Pi}}$ of rank at most $r$ such that
$$\u2af4{\mathrm{\Pi}}^{\u27c2}\widehat{\mathrm{\Pi}}\u2af4\le O(1+\alpha )\cdot \frac{\epsilon \u2af4A\u2af4}{\theta}.$$  (40) 
where the subspace corresponding to $\mathrm{\Pi}$ is a subset of the subspace given by ${\mathrm{\Pi}}^{\mathrm{*}}$ and $\alpha $ is the approximation factor attained by the algorithm in Theorem 5.1 (or Theorem 5.7).
Note that the above bound holds for the spectral norm error and the Frobenius norm error.
Proof.
The algorithm is exactly the same algorithm used in Theorem 5.1. Let $\mathrm{\Pi}$ denote the best robust lowrank subspace for $A$. We will then use the DavisKahan $\mathrm{sin}\mathrm{\Theta}$ theorem about perturbations of singular vectors to show that the subspaces given by ${\mathrm{\Pi}}_{1}$ and ${\mathrm{\Pi}}_{2}$ are close. Note that the DavisKahan theorem states that if ${\mathrm{\Pi}}_{i}$ is the projection matrix onto eigenspaces of ${A}_{i}{A}_{i}^{\top}$ respectively ($i\in \{1,2\}$) with the least singular values of ${\mathrm{\Pi}}_{1}{A}_{1}$ being at least $\delta >0$ more than the singular values of ${\mathrm{\Pi}}_{2}^{\u27c2}{A}_{2}$, then for any unitarily invariant norm $\u2af4\cdot \u2af4$,
$$\u2af4{\mathrm{\Pi}}_{2}^{\u27c2}{\mathrm{\Pi}}_{1}\u2af4\le \frac{\u2af4{A}_{1}{A}_{2}\u2af4}{\delta}.$$ 
We would like to apply it with ${A}_{2}=\widehat{\mathrm{\Pi}}A,{A}_{1}={\mathrm{\Pi}}^{*}A$ and ${\mathrm{\Pi}}_{2}=\widehat{\mathrm{\Pi}},{\mathrm{\Pi}}_{1}={\mathrm{\Pi}}^{*}$. We know that by the triangle inequality, for some constant $\alpha $ given by the approximation ratio in Theorem 5.1 (or Theorem 5.7),
$\u2af4{\mathrm{\Pi}}^{*}A\widehat{\mathrm{\Pi}}A\u2af4$  $\le \u2af4\mathrm{\Pi}{A}^{*}A\u2af4+\u2af4A\widehat{A}\u2af4\le \epsilon \u2af4A\u2af4+\alpha \epsilon \u2af4A\u2af4\le (\alpha +1)\epsilon \u2af4A\u2af4,$  (41) 
where we used the fact that $\widehat{\mathrm{\Pi}}$ gives an $\alpha $factor approximation to the objective. Moreover, in our case ${A}_{1}={\mathrm{\Pi}}^{*}A$ is itself of rank$r$ and ${\mathrm{\Pi}}_{2}^{\u27c2}{A}_{2}=0$. Under the stronger assumption in (40), we have ${\sigma}_{r}({\mathrm{\Pi}}^{*}A)\ge \theta $. Hence we see that (40) holds since
$\u2af4{\widehat{\mathrm{\Pi}}}^{\u27c2}{\mathrm{\Pi}}^{*}\u2af4$  $\le {\displaystyle \frac{\u2af4{\mathrm{\Pi}}^{*}A\widehat{\mathrm{\Pi}}A\u2af4}{\theta}}\le {\displaystyle \frac{(1+\alpha )\epsilon}{\theta}}.$ 
∎
6 Data Poisoning and Robustness to Adversarial Perturbations at Training time
6.1 Trainingtime robustness: Approximations in Frobenius norm error
Theorem 6.1.
Suppose $q\mathrm{\ge}\mathrm{2}$ and $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ is the (unknown) uncorrupted data matrix, with an (orthogonal) projection matrix ${\mathrm{\Pi}}^{\mathrm{*}}$ of rank at most $r$ that is robust i.e., ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $ satisfying ${\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathit{}A\mathrm{\parallel}}_{F}^{\mathrm{2}}\mathrm{\le}\epsilon \mathit{}{\mathrm{\parallel}A\mathrm{\parallel}}_{F}^{\mathrm{2}}$ for some $\epsilon \mathrm{\in}\mathrm{[}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{]}$. There exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix $\stackrel{\mathrm{~}}{A}$ s.t. for each column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$, ${\mathrm{\parallel}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{}{A}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $, outputs an orthogonal projection $\widehat{\mathrm{\Pi}}$ of rank at most $r$ such that for any $\eta \mathrm{>}\mathrm{0}$
${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}$  $\le O(\kappa ),\mathit{\text{and}}{\parallel A\widehat{\mathrm{\Pi}}A\parallel}_{F}^{2}\le O(\epsilon +\eta )\cdot {\parallel A\parallel}_{F}^{2}+O(\frac{1}{\eta})\cdot {\delta}^{2}{\kappa}^{2}m.$  (42) 
To get a multiplicative approximation we will set $\eta =O(\epsilon )$, and get an extra additive term of ${\delta}^{2}{\kappa}^{2}m/\epsilon $. Here think of ${\delta}^{2}{\kappa}^{2}\ll \frac{1}{m}\cdot \epsilon {\parallel A\parallel}_{F}^{2}$. Further we remark that the above guarantees are optimal up to constant factors; in particular, the additive factor of $O(m{\delta}^{2}{\kappa}^{2})$ is unavoidable (see Proposition 6.7).
The main challenge here is that while $A$ has a good lowrank projection (in fact a robust one), $\stackrel{~}{A}$ may be very far from a rank$r$ matrix (let alone having a robust rank$r$ approximation). Further, the best robust lowrank approximation of $\stackrel{~}{A}$ could be very different from the best robust lowrank projection of $A$. This is because the entrywise perturbations of $\delta $ could be too large in aggregate; for instance, it could be the case that ${\parallel \stackrel{~}{A}\parallel}_{F}^{2}\gg {\parallel A\parallel}_{F}^{2}$. Suppose ${\mathrm{\Pi}}^{*}$ is the best robust lowrank projection of $A$. We will run the algorithm in the previous section not on the given matrix $\stackrel{~}{A}$, but on a suitably modified matrix ${A}^{\prime}$.
Lemma 6.2.
There is a polynomial time algorithm that given any matrix $M\mathrm{\in}{\mathrm{R}}^{m\mathrm{\times}n}$, can find
$${\mathrm{\Gamma}}_{q}(M)=\underset{\begin{array}{c}B\in {\mathbb{R}}^{n\times m}\mathit{\text{s.t.}}\\ {\parallel {B}_{j}{M}_{j}\parallel}_{q}\le \delta ,\forall j\in [m]\end{array}}{\mathrm{min}}{\parallel B\parallel}_{F}^{2},$$ 
up to arbitrary accuracy.
Proof.
First we note that since ${\parallel B\parallel}_{F}^{2}={\sum}_{j}{\parallel {B}_{j}\parallel}_{2}^{2}$, the optimization problem is separable across each of the $m$ samples i.e.,
$$\underset{\begin{array}{c}B\in {\mathbb{R}}^{n\times m}\text{s.t.}\\ {\parallel {B}_{j}{M}_{j}\parallel}_{q}\le \delta ,\forall j\in [m]\end{array}}{\mathrm{min}}{\parallel B\parallel}_{F}^{2}=\sum _{j\in [m]}\underset{\begin{array}{c}{B}_{j}\in {\mathbb{R}}^{n}\text{s.t.}\\ {\parallel {B}_{j}{M}_{j}\parallel}_{q}\le \delta \end{array}}{\mathrm{min}}{\parallel {B}_{j}\parallel}_{2}^{2}$$ 
We now describe how to solve each of the $m$ subinstances corresponding to the column $j\in [m]$, which for a given $b\in {\mathbb{R}}^{n}$ is of the form
$$\underset{\begin{array}{c}z\in {\mathbb{R}}^{n}\\ {\parallel z\parallel}_{q}\le \delta \end{array}}{\mathrm{min}}{\parallel bz\parallel}_{2}^{2}.$$ 
Note that the leastsquares objective ${\parallel bz\parallel}_{2}^{2}$ is convex. Moreover the constraint ${\parallel z\parallel}_{q}\le \delta $ is also convex; moreover there is a simple separation oracle for this constraint since by duality
$${\parallel z\parallel}_{q}=\underset{y\in {\mathbb{R}}^{n}:{\parallel y\parallel}_{{q}^{*}}\le 1}{\mathrm{max}}\u27e8y,z\u27e9=\u27e8\frac{{z}^{*}}{{\parallel {z}^{*}\parallel}_{{q}^{*}}},z\u27e9,\text{where}{z}_{i}^{*}=\text{sign}({z}_{i}){z(i)}^{q1}\mathit{\hspace{1em}}\forall i\in [n].$$ 
Hence by using the Ellipsoid algorithm, this problem can be solved in polynomial time. ^{6}^{6} 6 In fact Projected Gradient Descent Algorithm can also be used here; see [sra2012fast]. ∎
Note that when $q=\mathrm{\infty}$, it is easy to find the matrix ${A}^{\prime}$, by just setting
$${A}_{ij}^{\prime}=\text{sign}({M}_{ij})\cdot \mathrm{max}\{0,{M}_{ij}\delta \},\forall i,j\in [n].$$ 
We will argue that ${\mathrm{\Pi}}^{*}$ also gives a good lowrank approximation to ${A}^{\prime}$. This crucially uses the fact that ${\mathrm{\Pi}}^{*}$ has bounded ${\mathrm{\ell}}_{q}\to 2$ norm, which implies the following useful lemma.
Lemma 6.3.
Suppose $A\mathrm{,}B\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ are two matrices such that for each column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$, ${\mathrm{\parallel}{A}_{j}\mathrm{}{B}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $, and let $\mathrm{\Pi}$ be any rank$r$ projection matrix such that ${\mathrm{\parallel}\mathrm{\Pi}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $. Then for any $\eta \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
$$(1\eta ){\parallel \mathrm{\Pi}A\parallel}_{F}^{2}(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m\le {\parallel \mathrm{\Pi}B\parallel}_{F}^{2}\le (1+\eta ){\parallel \mathrm{\Pi}A\parallel}_{F}^{2}+(\frac{1}{\eta}+1){\delta}^{2}{\kappa}^{2}m.$$ 
Proof.
For each $j\in [m]$, let ${A}_{j},{B}_{j}$ be the $j$th columns of $A$ and $B$ respectively. Then ${\parallel \mathrm{\Pi}({A}_{j}{B}_{j})\parallel}_{2}\le {\parallel \mathrm{\Pi}\parallel}_{q\to 2}{\parallel {A}_{j}{B}_{j}\parallel}_{q}\le \delta \kappa $. Using this along with the triangle inequality we get,
${\parallel \mathrm{\Pi}B\parallel}_{F}^{2}$  $={\displaystyle \sum _{j=1}^{m}}{\parallel \mathrm{\Pi}({B}_{j}{A}_{j})+\mathrm{\Pi}{A}_{j}\parallel}_{2}^{2}\ge {\displaystyle \sum _{j=1}^{m}}{({\parallel \mathrm{\Pi}{A}_{j}\parallel}_{2}\delta \kappa )}^{2}\ge {\displaystyle \sum _{j=1}^{m}}{\parallel \mathrm{\Pi}{A}_{j}\parallel}_{2}){}^{2}2({\displaystyle \frac{\delta \kappa}{\sqrt{\eta}}})(\sqrt{\eta}{\parallel \mathrm{\Pi}{A}_{j}\parallel}_{2})+(\delta \kappa ){}^{2}$  
$\ge (1\eta ){\parallel \mathrm{\Pi}A\parallel}_{F}^{2}(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m,\text{for any}\eta \in (0,1).$ 
This proves the first inequality. A similar argument also shows the other inequality. ∎
We now prove that Algorithm 1 finds an approximately optimal robust lowrank projection for unknown, uncorrupted data matrix $A$.
Proof of Theorem 6.1.
The first step of the algorithm finds the matrix ${A}^{\prime}$ given by
$${A}^{\prime}=\underset{\begin{array}{c}B\in {\mathbb{R}}^{n\times m}\text{s.t.}\\ {\parallel {B}_{j}{\stackrel{~}{A}}_{j}\parallel}_{q}\le \delta ,\forall j\in [m]\end{array}}{\text{argmin}}{\parallel B\parallel}_{F}^{2}.$$ 
Note that ${\parallel {A}^{\prime}\parallel}_{F}\le {\parallel A\parallel}_{F}$ since $A$ is also a feasible solution for the above minimization. Moreover since ${\parallel {A}_{j}{A}_{j}^{\prime}\parallel}_{q}\le 2\delta $ for each $j\in [m]$, we get from Lemma 6.3,
${\parallel {\mathrm{\Pi}}^{*}{A}^{\prime}\parallel}_{F}^{2}$  $\ge (1\eta ){\parallel {\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}4(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m,\text{for any}\eta \in (0,1).$  (43) 
Now we run the algorithm from the previous section (Theorem 5.1) on ${A}^{\prime}$. From Theorem 5.1 (with $\delta =1/2$ say), we find a rank$r$ projection matrix $\mathrm{\Pi}$ with ${\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le O(\kappa )$ such that
${\parallel {A}^{\prime}\mathrm{\Pi}{A}^{\prime}\parallel}_{F}^{2}$  $\le 3\left({\parallel {A}^{\prime}\parallel}_{F}^{2}(1\eta ){\parallel {\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}+4(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m\right)$  
$\le 3\left({\parallel A{\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}\right)+3\eta {\parallel {\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}+12(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m$  
$\le 3(\epsilon +\eta ){\parallel A\parallel}_{F}^{2}+12(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m.$ 
However we know that ${\parallel {A}^{\prime}\parallel}_{F}^{2}\ge {\parallel {\mathrm{\Pi}}^{*}{A}^{\prime}\parallel}_{F}^{2}$. Hence
${\parallel \mathrm{\Pi}{A}^{\prime}\parallel}_{F}^{2}$  $\ge {\parallel {A}^{\prime}\parallel}_{F}^{2}{\parallel {A}^{\prime}\mathrm{\Pi}{A}^{\prime}\parallel}_{F}^{2}\ge {\parallel {\mathrm{\Pi}}^{*}{A}^{\prime}\parallel}_{F}^{2}{\parallel {A}^{\prime}\mathrm{\Pi}{A}^{\prime}\parallel}_{F}^{2}$  
$\ge (1\eta ){\parallel {\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}3(\epsilon +\eta ){\parallel A\parallel}_{F}^{2}16(\frac{1}{\eta}1){\delta}^{2}{\kappa}^{2}m$  
$\text{Hence,}{\parallel A\mathrm{\Pi}A\parallel}_{F}^{2}$  $={\parallel A\parallel}_{F}^{2}{\parallel \mathrm{\Pi}A\parallel}_{F}^{2}{\le}_{\begin{array}{c}\text{Lemma}\mathit{\text{6.3}}\end{array}}{\parallel A\parallel}_{F}^{2}(1\eta ){\parallel \mathrm{\Pi}{A}^{\prime}\parallel}_{F}^{2}+(\frac{1}{\eta}+1){\delta}^{2}{\kappa}^{2}m$  
$\le {\parallel A\parallel}_{F}^{2}{(1\eta )}^{2}{\parallel {\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}+3(\epsilon +\eta )(1\eta ){\parallel A\parallel}_{F}^{2}+m{\delta}^{2}{\kappa}^{2}\left(1+\frac{1}{\eta}+16(1\eta )(\frac{1}{\eta}1)\right)$  
$\le {\parallel A{\mathrm{\Pi}}^{*}A\parallel}_{F}^{2}+(3\epsilon +5\eta ){\parallel A\parallel}_{F}^{2}+\left(1+\frac{17}{\eta}\right){\delta}^{2}{\kappa}^{2}m\le O(\eta ){\parallel A\parallel}_{F}^{2}+O(\frac{1}{\eta}){\delta}^{2}{\kappa}^{2}m,$ 
for any $\eta \ge 4\epsilon $. ∎
6.2 Trainingtime robustness: Approximations in Spectral norm error
We now show guarantees for lowrank approximations in spectral norm error that are similar to Theorem 6.1. However, there is a qualitative difference: we will either find a robust lowdimensional projection of the unknown dataset $A$, or we will certify that the dataset has been poisoned substantially. In particular, the algorithm will never output a lowdimensional representation that is bad for the unknown data matrix $A$. We will later see how these guarantees also imply trainingtime robustness for downstream unsupervised learning applications like spectral clustering, robust mean estimation and learning mixture models. In what follows $\parallel \cdot \parallel $ will refer to the spectral norm.
Theorem 6.4.
Suppose $q\mathrm{\ge}\mathrm{2}$ and $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ is the (unknown) uncorrupted data matrix, and let ${\mathrm{\Pi}}^{\mathrm{*}}$ have the smallest spectral norm error $\mathrm{\parallel}A\mathrm{}\mathrm{\Pi}\mathit{}A\mathrm{\parallel}$ among (orthogonal) projections of rank at most $r$ that are robust i.e., ${\mathrm{\parallel}\mathrm{\Pi}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $. There exists an algorithm (Alg. 2) that given as input any (adversarially perturbed) data matrix $\stackrel{\mathrm{~}}{A}$ s.t. for each column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$, ${\mathrm{\parallel}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{}{A}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $ and a parameter $\tau \mathrm{>}\mathrm{0}$, w.h.p. runs in polynomial time and outputs either a projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ or outputs Bad Input s.t.

(I)
if the algorithm outputs a projection $\widehat{\mathrm{\Pi}}$ of rank at most $r$, then it is a nearoptimal robust lowrank approximation for the unknown matrix $A$ i.e., for some small universal constant $c\ge 1$,
$$\forall \eta >0,{\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}\le {c}_{q}\kappa ,\mathit{\text{and}}\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le O\left(1+\frac{1}{\eta}\right)\left(\tau +\parallel A{\mathrm{\Pi}}^{*}A\parallel +\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel A\parallel .$$ (44) 
(II)
if the algorithm outputs Bad Input , then either the data was poisoned i.e., $\parallel A\stackrel{~}{A}\parallel >\tau $, or there is no good robust spectral norm approximation for $A$ i.e., $\parallel A\mathrm{\Pi}A\parallel >\tau $ for all rank$r$ projection matrices $\mathrm{\Pi}$ s.t. ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa $.
In particular, if we are promised that $A$ has a good robust projection ${\mathrm{\Pi}}^{\mathrm{*}}$ of value $\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}^{\mathrm{*}}\mathit{}A\mathrm{\parallel}\mathrm{\le}\epsilon \mathit{}\mathrm{\parallel}A\mathrm{\parallel}$, then the algorithm either finds an approximately optimal robust projection $\widehat{\mathrm{\Pi}}$ of rank at most $r$ for $A$ with
$${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}\le {c}_{q}\kappa ,\mathit{\text{and}}\forall \eta 0,\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le O\left(1+\frac{1}{\eta}\right)(\parallel A{\mathrm{\Pi}}^{*}A\parallel +\sqrt{m}\delta \kappa )+\sqrt{2\eta}\parallel A\parallel ,$$  (45) 
or certifies that the data has been poisoned i.e., $\mathrm{\parallel}\stackrel{\mathrm{~}}{A}\mathrm{}A\mathrm{\parallel}\mathrm{>}\epsilon \mathit{}\mathrm{\parallel}A\mathrm{\parallel}$.
Our algorithm just runs the worstcase approximation algorithm from Theorem 5.7 on $\stackrel{~}{A}$ to find a projection $\widehat{\mathrm{\Pi}}$. If the error is less than $\tau $, it outputs $\widehat{\mathrm{\Pi}}$; else it certifies that the data is corrupt.
The main feature of the above algorithm is that it is always correct. The algorithm certifies that the input is Bad only when the data has been poisoned i.e., $\stackrel{~}{A}$ is substantially far from $A$, or $A$ did not have a good robust lowrank approximation to begin with. More crucially, when it does output a projection matrix $\widehat{\mathrm{\Pi}}$, it is guaranteed to be a valid robust projection^{7}^{7} 7 In particular it rules out the scenario where the algorithm finds a solution that it thinks is good (on $\stackrel{~}{A}$), but is in fact bad for the unknown, uncorrupted matrix $A$. for the unknown matrix $A$ (with an exponentially small probability, the algorithm may not terminate in polynomial time; this happens exactly when the algorithm from Theorem 5.7 fails to terminate in polynomial time). We remark that the additive error term of $\mathrm{\Omega}(\delta \kappa \sqrt{m})$ is unavoidable here informationtheoretically; see Proposition 6.7 for an example.
The following is the key lemma that argues that if the projection $\widehat{\mathrm{\Pi}}$ gives a small error on $\stackrel{~}{A}$, it necessarily gives a lowerror on $A$.
Lemma 6.5.
Let $\delta \mathrm{\in}{\mathrm{R}}_{\mathrm{+}}$ and $A\mathrm{,}B\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ such that ${\mathrm{\parallel}A\mathrm{}B\mathrm{\parallel}}_{q}\mathrm{\le}\delta $. Let ${\mathrm{\Pi}}_{\mathrm{1}}\mathrm{,}{\mathrm{\Pi}}_{\mathrm{2}}$ be projection matrices such that ${\mathrm{\parallel}{\mathrm{\Pi}}_{\mathrm{1}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{,}{\mathrm{\parallel}{\mathrm{\Pi}}_{\mathrm{2}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $, and = $\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}_{\mathrm{1}}\mathit{}A\mathrm{\parallel}\mathrm{\le}{\epsilon}_{\mathrm{1}}$ and $\mathrm{\parallel}B\mathrm{}{\mathrm{\Pi}}_{\mathrm{2}}\mathit{}B\mathrm{\parallel}\mathrm{\le}{\epsilon}_{\mathrm{2}}$. Then we have that for any $\eta \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{)}$,
$$\parallel A{\mathrm{\Pi}}_{2}A\parallel \le O\left(1+\frac{1}{\eta}\right)\left({\epsilon}_{1}+{\epsilon}_{2}+\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel A\parallel ,\mathit{\text{and}}\parallel B{\mathrm{\Pi}}_{1}B\parallel \le O\left(1+\frac{1}{\eta}\right)\left({\epsilon}_{1}+{\epsilon}_{2}+\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel B\parallel .$$  (46) 
Proof.
The projection matrices ${\mathrm{\Pi}}_{1},{\mathrm{\Pi}}_{2}$ are both robust. For $\mathrm{\ell}\in \{1,2\}$
${\parallel {\mathrm{\Pi}}_{\mathrm{\ell}}A{\mathrm{\Pi}}_{\mathrm{\ell}}B\parallel}^{2}$  $\le {\parallel {\mathrm{\Pi}}_{\mathrm{\ell}}(AB)\parallel}_{F}^{2}={\displaystyle \sum _{j\in [m]}}{\parallel {\mathrm{\Pi}}_{\mathrm{\ell}}({A}_{j}{B}_{j})\parallel}_{2}^{2}\le m{\kappa}^{2}{\delta}^{2}$  
$\text{Hence}\left\parallel {\mathrm{\Pi}}_{\mathrm{\ell}}A\parallel \parallel {\mathrm{\Pi}}_{\mathrm{\ell}}B\parallel \right$  $\le \sqrt{m}\kappa \delta .$  (47) 
Let $\gamma :=\sqrt{m}\delta \kappa $. We also know that $\parallel A{\mathrm{\Pi}}_{1}A\parallel \le {\epsilon}_{1}$.
$\parallel A{\mathrm{\Pi}}_{1}B\parallel \le \parallel A{\mathrm{\Pi}}_{1}A\parallel +\parallel {\mathrm{\Pi}}_{1}A{\mathrm{\Pi}}_{1}B\parallel \le {\epsilon}_{1}+\gamma $  
$\text{Hence}\forall v\in {\mathbb{S}}^{n1},$  ${\parallel Av{\mathrm{\Pi}}_{1}Bv\parallel}_{2}\le {\epsilon}_{1}+\gamma ,\text{and similarly}{\parallel Bv{\mathrm{\Pi}}_{2}Av\parallel}_{2}\le {\epsilon}_{2}+\gamma .$  (48) 
But $Bv={\mathrm{\Pi}}_{1}Bv+{\mathrm{\Pi}}_{1}^{\u27c2}Bv$. We have for any $\eta \in (0,1)$
${\parallel Bv\parallel}_{2}^{2}$  $={\parallel {\mathrm{\Pi}}_{1}Bv\parallel}_{2}^{2}+{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}\ge {({\parallel Av\parallel}_{2}{\epsilon}_{1}\gamma )}^{2}+{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}$  
$\ge (1\eta ){\parallel Av\parallel}_{2}^{2}{({\epsilon}_{1}+\gamma )}^{2}(1+\frac{1}{\eta})+{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}$  
$\text{Similarly,}{\parallel Av\parallel}_{2}^{2}$  $\ge (1\eta ){\parallel Bv\parallel}_{2}^{2}{({\epsilon}_{2}+\gamma )}^{2}(1+\frac{1}{\eta})+{\parallel {\mathrm{\Pi}}_{2}^{\u27c2}Av\parallel}_{2}^{2}$ 
Combining the two, we get that
${\parallel Bv\parallel}_{2}^{2}$  $\ge {(1\eta )}^{2}{\parallel Bv\parallel}_{2}^{2}(1+\frac{1}{\eta})\left({({\epsilon}_{1}+\gamma )}^{2}+{({\epsilon}_{1}+\gamma )}^{2}\right)+(1\eta ){\parallel {\mathrm{\Pi}}_{2}^{\u27c2}Av\parallel}_{2}^{2}+{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}$  
$(1\eta ){\parallel {\mathrm{\Pi}}_{2}^{\u27c2}Av\parallel}_{2}^{2}+{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}$  $\le (2\eta {\eta}^{2}){\parallel Bv\parallel}_{2}^{2}+(1+\frac{1}{\eta})\left({({\epsilon}_{1}+\gamma )}^{2}+{({\epsilon}_{1}+\gamma )}^{2}\right)$  
$\forall v\in {\mathbb{S}}^{n1},{\parallel {\mathrm{\Pi}}_{1}^{\u27c2}Bv\parallel}_{2}^{2}$  $\le 2\eta {\parallel Bv\parallel}_{2}^{2}+(1+\frac{1}{\eta})\left({({\epsilon}_{1}+\gamma )}^{2}+{({\epsilon}_{1}+\gamma )}^{2}\right).$  
$\text{Hence,}{\parallel B{\mathrm{\Pi}}_{1}B\parallel}^{2}$  $\le 2\eta {\parallel B\parallel}^{2}+(1+\frac{1}{\eta}){({\epsilon}_{1}+2\gamma +{\epsilon}_{2})}^{2},$ 
as required. A similar statement also follows for $A$ using a symmetric proof. ∎
Proof of Theorem 6.4.
Firstly the algorithm from Theorem 5.7 runs on $\stackrel{~}{A}$ and with high probability produced a robust projection matrix $\widehat{\mathrm{\Pi}}$ (with some exponentially small probability, the algorithm fails to terminate in polynomial time). We now condition on the event that algorithm from Theorem 5.7 terminates in polynomial time. The proof consists of two parts. We first argue that if the algorithm outputs any robust rank$r$ projection matrix, then it has to be robust for $A$. Any such $\widehat{\mathrm{\Pi}}$ satisfies $\parallel \stackrel{~}{A}\widehat{\mathrm{\Pi}}\stackrel{~}{A}\parallel \le \tau $. Applying Lemma 6.5 with ${\epsilon}_{2}=\tau $ ($B=\stackrel{~}{A}$) and ${\epsilon}_{1}=\parallel A{\mathrm{\Pi}}^{*}A\parallel $, we have
$$\parallel A\widehat{\mathrm{\Pi}}A\parallel \le O\left(1+\frac{1}{\eta}\right)\left(\tau +\parallel A{\mathrm{\Pi}}^{*}A\parallel +\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel A\parallel .$$ 
On the other hand, if the input $\stackrel{~}{A}$ is not “Bad” i.e., (a) for the unknown matrix $A$, $\parallel A{\mathrm{\Pi}}^{*}A\parallel \le \tau $, and (b) $\parallel A\stackrel{~}{A}\parallel \le \tau $, we now show that the algorithm outputs a good solution for $A$. In this case we have that $\parallel \stackrel{~}{A}{\mathrm{\Pi}}^{*}A\parallel \le 2\tau $; hence,
$$\parallel \stackrel{~}{A}{\mathrm{\Pi}}^{*}\stackrel{~}{A}\parallel \le \parallel \stackrel{~}{A}{\mathrm{\Pi}}^{*}A\parallel +\parallel {\mathrm{\Pi}}^{*}\stackrel{~}{A}{\mathrm{\Pi}}^{*}A\parallel \le 2\tau +\sqrt{\sum _{j\in [m]}{\parallel {\mathrm{\Pi}}^{*}{A}_{j}{\mathrm{\Pi}}^{*}{\stackrel{~}{A}}_{j}\parallel}_{2}^{2}}\le 2\tau +\sqrt{m}\kappa \delta .$$ 
Hence, by Lemma 6.5 applied with ${\epsilon}_{1}=\tau $ and ${\epsilon}_{2}=$ ($B=\stackrel{~}{A}$), we have that
$$\parallel A\widehat{\mathrm{\Pi}}A\parallel \le O\left(1+\frac{1}{\eta}\right)\left(\tau +\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel A\parallel .$$ 
This proves the theorem. The moreover part follows by setting $\tau :=\epsilon \parallel A\parallel $.
∎
In fact, Lemma 6.5 implies a stronger informationtheoretic statement about finding a robust lowrank approximation of the unknown, uncorrupted matrix $A$ with low spectral norm (just like Theorem 6.1 for Frobenius norm error). In fact we get a polynomial time algorithm assuming access to a polynomial time algorithm approximation algorithm for solving the following problem: given a matrix $\stackrel{~}{A}\in {\mathbb{R}}^{n\times m}$, find^{8}^{8} 8 This problem is reminiscent of the concept of $\epsilon $rank [AlonVempala], that corresponds to the smallest rank attainable by changing every entry of the given matrix by at most $\delta $.
$$\underset{\begin{array}{c}B:{\parallel {B}_{j}\stackrel{~}{{A}_{j}}\parallel}_{q}\le \delta \end{array}\forall j\in [m]}{\mathrm{min}}\mathit{\hspace{1em}\hspace{1em}\u2006}\underset{\begin{array}{c}\mathrm{\Pi}:\text{rank}(\mathrm{\Pi})=r,{\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le \kappa \end{array}}{\mathrm{min}}{\parallel B\mathrm{\Pi}B\parallel}^{2},$$  (49) 
where $\parallel \cdot \parallel $ stands for the spectral norm.
Proposition 6.6.
Suppose $q\mathrm{\ge}\mathrm{2}$ and $A\mathrm{\in}{\mathrm{R}}^{n\mathrm{\times}m}$ is the (unknown) uncorrupted data matrix, and let ${\mathrm{\Pi}}^{\mathrm{*}}$ have the smallest spectral norm error $\mathrm{\parallel}A\mathrm{}\mathrm{\Pi}\mathit{}A\mathrm{\parallel}$ among rank$r$ projections that are robust i.e., ${\mathrm{\parallel}\mathrm{\Pi}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $. Suppose further that there is an efficient algorithm for finding an $\alpha $factor approximation algorithm for (49). Then there exists an algorithm that w.h.p. runs in polynomial time, and given as input any (adversarially perturbed) data matrix $\stackrel{\mathrm{~}}{A}$ s.t. for each column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$, ${\mathrm{\parallel}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{}{A}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $ and a parameter $\tau \mathrm{>}\mathrm{0}$, outputs a robust projection matrix $\widehat{\mathrm{\Pi}}$ of rank at most $r$ that is near optimal in approximation error for the unknown matrix $A$ i.e., for some small universal constant $c\mathrm{\ge}\mathrm{1}$,
$$\forall \eta >0,\parallel (I\widehat{\mathrm{\Pi}})A\parallel \le O\left(1+\frac{1}{\eta}\right)\left(\alpha \parallel A{\mathrm{\Pi}}^{*}A\parallel +\sqrt{m}\delta \kappa \right)+\sqrt{2\eta}\parallel A\parallel .$$  (50) 
Moreover, the above bound is achieved informationtheoretically by an algorithm (that potentially does not have polynomial running time), by using an inefficient algorithm for problem (49).
We remark that the main difference between the above proposition and Theorem 6.4 is that Proposition 6.6 will always output a good robust projection for $A$ (just like Theorem 6.1 for Frobenius norm error), but the algorithm is not computationally efficient unless (49) can be solved efficiently.
Proof.
Given $\stackrel{~}{A}$, the algorithm first runs the $\alpha $factor approximation algorithm for solving (49) on $\stackrel{~}{A}$. The uncorrupted matrix $A$ is itself a feasible solution; hence the solution output by the algorithm ${A}^{\prime}$ has a robust lowrank approximation of error $O(\alpha )\parallel A{\mathrm{\Pi}}^{*}A\parallel $. Such a robust lowrank projection $\widehat{\mathrm{\Pi}}$ for ${A}^{\prime}$ i.e., a projection for rank at most $r$ with ${\parallel \widehat{\mathrm{\Pi}}\parallel}_{q\to 2}\le O(\kappa )$ and $\parallel {A}^{\prime}\widehat{\mathrm{\Pi}}{A}^{\prime}\parallel \le O(\alpha )\parallel A\mathrm{\Pi}A\parallel $ can be found by running Theorem 5.7 on ${A}^{\prime}$. Moreover ${A}^{\prime}$ and $A$ are valid $2\delta $ adversarial perturbations of each other. Now applying Lemma 6.5 with $A,{\mathrm{\Pi}}^{*}$ and ${A}^{\prime},\widehat{\mathrm{\Pi}}$ completes the proof. ∎
6.3 Lower Bound for the Additive error in Training with Adversarial Perturbations
We now show that the additive terms of $\mathrm{\Omega}(m{\delta}^{2}{\kappa}^{2})$ in Theorem 6.1 is unavoidable.
Proposition 6.7.
For any data matrix $A$ with the following two properties:

1.
Each column ${\parallel {A}_{j}\parallel}_{2}\in [1/10,10]$,

2.
There exists ${\mathrm{\Pi}}^{*}$ of rank $1$ and ${\parallel {\mathrm{\Pi}}^{*}\parallel}_{\mathrm{\infty}\to 2}\ge \kappa $ (which is at least 2) satisfying ${\mathrm{\Pi}}^{*}A=A$,
for any $\delta $ small enough, there exist ${A}^{\mathrm{\prime}}$ as a $\delta $perturbation of $A$ (i.e., ${\mathrm{\parallel}A\mathrm{}{A}^{\mathrm{\prime}}\mathrm{\parallel}}_{\mathrm{\infty}}\mathrm{\le}\delta $) and a projection matrix ${\mathrm{\Pi}}^{\mathrm{\prime}}$ of rank 1 satisfying

1.
${\mathrm{\Pi}}^{\prime}$ is more robust than ${\mathrm{\Pi}}^{*}$, i.e., $$.

2.
We still have ${\mathrm{\Pi}}^{\prime}\cdot {A}^{\prime}={A}^{\prime}$ but ${\parallel A{\mathrm{\Pi}}^{\prime}\cdot A\parallel}_{F}=\mathrm{\Omega}(\delta \kappa \sqrt{m})$. Since $A{A}^{\prime}$ is of rank $2$, this also implies a similar lower bound for the spectral norm.
Proof.
Let $v$ be the unit eigenvector of ${\mathrm{\Pi}}^{*}$ such that ${\parallel v\parallel}_{1}\ge \kappa $. Without loss of generality, we assume ${v}_{1}\ge {v}_{2}\mathrm{\cdots}\ge {v}_{n}$ and $\mathrm{\ell}=\lfloor \text{supp}(v)/2\rfloor $. Notice that $\text{supp}(v)\ge 2\mathrm{\ell}$ by this definition such that ${v}_{2\mathrm{\ell}}\ne 0$. At the same time, because of the CauchySchwartz inequality, we have ${\parallel v\parallel}_{1}^{2}\le \text{supp}(v)\cdot {\parallel v\parallel}_{2}^{2}$ such that $\mathrm{\ell}\ge {\kappa}^{2}/21$.
Then we consider any $\delta \le {v}_{2\mathrm{\ell}}$ and perturb $v$ to another “sparser” vector $u$ whose pointwise absolute value is
$$({v}_{1}+\delta ,\mathrm{\dots},{v}_{\mathrm{\ell}}+\delta ,{v}_{\mathrm{\ell}+1}\delta ,\mathrm{\dots},{v}_{2\mathrm{\ell}}\delta ,{v}_{2\mathrm{\ell}+1},\mathrm{\dots},{v}_{n}).$$ 
However, since ${v}_{i}$ may be negative or positive, we define $u$ according to the sign function:
$$u=({v}_{1}+\text{sign}({v}_{1})\cdot \delta ,\mathrm{\dots},{v}_{\mathrm{\ell}}+\text{sign}({v}_{\mathrm{\ell}})\cdot \delta ,{v}_{\mathrm{\ell}+1}\text{sign}({v}_{\mathrm{\ell}+1})\cdot \delta ,\mathrm{\dots},{v}_{2\mathrm{\ell}}\text{sign}({v}_{2\mathrm{\ell}})\cdot \delta ,{v}_{2\mathrm{\ell}+1},\mathrm{\dots},{v}_{n}).$$ 
We have ${\parallel u\parallel}_{1}={\sum}_{i=1}^{\mathrm{\ell}}{v}_{i}+\delta +{\sum}_{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}{v}_{i}\delta +{\sum}_{i>2\mathrm{\ell}}{v}_{i}={\parallel v\parallel}_{1}$ and
${\parallel u\parallel}_{2}^{2}$  $={({v}_{1}+\delta )}^{2}+\mathrm{\cdots}+{({v}_{\mathrm{\ell}}+\delta )}^{2}+{({v}_{\mathrm{\ell}+1}\delta )}^{2}+\mathrm{\cdots}+{({v}_{2\mathrm{\ell}}\delta )}^{2}+{v}_{2\mathrm{\ell}+1}^{2}+\mathrm{\cdots}+{v}_{n}^{2}$  
$={\displaystyle \sum _{i}}{v}_{i}^{2}+2({\displaystyle \sum _{i=1}^{\mathrm{\ell}}}{v}_{i}{\displaystyle \sum _{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}}{v}_{i})\delta +2\mathrm{\ell}{\delta}^{2}.$ 
Since ${v}_{1}\ge {v}_{2}\ge \mathrm{\cdots}\ge {v}_{n}$, this is at least ${\sum}_{i}{v}_{i}^{2}+2\mathrm{\ell}\cdot {\delta}^{2}>{\parallel v\parallel}_{2}^{2}$. So let $\overline{u}=u/{\parallel u\parallel}_{2}$ with unit ${\mathrm{\ell}}_{2}$ norm and ${\mathrm{\Pi}}^{\prime}=\overline{u}\cdot {\overline{u}}^{\top}$. So $$.
Next we consider $A$. Since ${\mathrm{\Pi}}^{*}A=A$, we assume $A=[{c}_{1}\cdot v,{c}_{2}\cdot v,\mathrm{\dots},{c}_{m}\cdot v]$ with coefficient ${c}_{i}\le [1/10,10]$. We set ${A}^{\prime}=[{c}_{1}\cdot u,\mathrm{\dots},{c}_{m}\cdot u]$ such that ${\parallel A{A}^{\prime}\parallel}_{\mathrm{\infty}}\le 10\delta $ and ${\mathrm{\Pi}}^{\prime}{A}^{\prime}={A}^{\prime}$.
Finally we lower bound ${\parallel A{\mathrm{\Pi}}^{\prime}A\parallel}_{F}^{2}$. Notice that
$$\u27e8u,v\u27e9=\sum _{i=1}^{\mathrm{\ell}}({v}_{i}^{2}+{v}_{i}\delta )+\sum _{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}({v}_{i}^{2}{v}_{i}\delta )+\sum _{i>2\mathrm{\ell}}{v}_{i}^{2}=1+(\sum _{i=1}^{\mathrm{\ell}}{v}_{i}\sum _{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}{v}_{i})\delta .$$ 
Thus ${\mathrm{\Pi}}^{\prime}v=\u27e8u,v\u27e9u/{\parallel u\parallel}_{2}^{2}=\alpha u$ for $$. So we lower bound the distance between $v{\mathrm{\Pi}}^{\prime}v$ by counting the $\mathrm{\ell}$ entries from ${v}_{\mathrm{\ell}+1}$ to ${v}_{2\mathrm{\ell}}$:
$$\sum _{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}{[{v}_{i}\alpha ({v}_{i}\text{sign}({v}_{i})\delta )]}^{2}=\sum _{i=\mathrm{\ell}+1}^{2\mathrm{\ell}}{[(1\alpha ){v}_{i}+\alpha \delta ]}^{2}.$$ 
Since $\delta \le {v}_{2\mathrm{\ell}}$, each term in the summation is at least ${\delta}^{2}$. So ${\parallel A{\mathrm{\Pi}}^{\prime}A\parallel}_{F}=\sqrt{{c}_{1}^{2}+\mathrm{\cdots}+{c}_{m}^{2}}\cdot \delta \cdot \sqrt{\mathrm{\ell}}=\mathrm{\Omega}(\delta \kappa \sqrt{m})$. ∎
7 Robustness to Training Time Corruptions in Unsupervised Learning
7.1 Mean Estimation
Recall that the uncorrupted data matrix $A\in {\mathbb{R}}^{n\times m}$ has mean vector denoted by $\mu :=\mathrm{\text{Mean}}(A)$. In our setting, every point ${A}_{i}\in {\mathbb{R}}^{n}$ can potentially be corrupted adversarially up to a $\delta $ amount, as measured in ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm (more generally ${\mathrm{\ell}}_{q}$ norm for $q\ge 2$). Our input instance is $\stackrel{~}{A}\in {\mathbb{R}}^{n\times m}$ where every column $j\in [m]$ satisfies ${\parallel {\stackrel{~}{A}}_{j}{A}_{j}\parallel}_{q}\le \delta $. In this section we will show how to use the guarantee from Theorem 6.4 to either compute an accurate estimate of the mean of the unperturbed dataset from the given set of perturbed points, or certify that the data has been poisoned. In particular, our estimate of the mean of the unperturbed dataset will simply be the mean of the perturbed data points when projected onto a robust subspace. We will use the algorithm from Figure 2 to compute such a projection, or certify that the dataset is poisoned.
How much $\delta $ can one handle?
Consider, for instance the case when $q=\mathrm{\infty}$. Let $\mu $ be the mean of the data points in the matrix $A$ with $C$ representing the $n\times m$ matrix with each column being $\mu $. Furthermore denote ${\sigma}^{2}$ to be the data variance, i.e., ${\parallel AC\parallel}^{2}={\sigma}^{2}m$. Notice that $\delta $ is the bound on the ${\mathrm{\ell}}_{\mathrm{\infty}}$ norm of the perturbation. If $\stackrel{~}{A}$ is the corrupted data points, in the worst case the Mean($A$) and Mean($\stackrel{~}{A}$) could differ by $\delta \sqrt{n}$, thereby needing $\delta $ to be inversely proportional to $\sqrt{n}$. However, if the one dimensional subspace spanned by $\mathrm{\text{Mean}}($A$)$ is $\kappa $robust we can do better. As an extreme case, assume that both $A$ and $\stackrel{~}{A}$ exactly lie in a $\kappa $robust subspace. Then Mean($A$) and Mean($\stackrel{~}{A}$) can only differ by $\kappa \delta $. Notice that in this best case scenario we can handle much larger perturbations. In particular, if the mean is a $k$sparse unit length vector, then $\kappa \sim \sqrt{k}$ and we can potentially handle $\delta $ as large as $\sim 1/\sqrt{k}$. Our main result of this section is that we can indeed achieve this best case guarantee or certify that the data has been poisoned, i.e., $\parallel A\stackrel{~}{A}\parallel $ is large.
Why natural certification approaches fail?
A natural approach for robust mean estimation in our model is to simply compute $\widehat{\mu}=\mathrm{\text{Mean}}(\stackrel{~}{A})$ as an estimate, and if the variance of $\stackrel{~}{A}$ around $\widehat{\mu}$ is not much larger than ${\sigma}^{2}m$ then output $\widehat{\mu}$, otherwise say that the data has been poisoned. This would work for perturbations of the order of $1/\sqrt{n}$. However, in the regime of interest to us, i.e., perturbations of the order of $1/\kappa $, this could fail miserably. Consider a data set of $m$ points generated as $\mu +g$ where $\mu =(1/\sqrt{k},\mathrm{\dots},1/\sqrt{k},0,\mathrm{\dots},0)$ and $g$ is a standard normal vector orthogonal to $\mu $ and with variance ${\sigma}^{2}$ in every direction orthogonal to $\mu $. Hence, the data lies close to a robust ($\sqrt{k}$ sparse in ${\mathrm{\ell}}_{1}/{\mathrm{\ell}}_{2}$ sense) one dimensional subspace with variance ${\sigma}^{2}m$. If ${\mathrm{\ell}}_{\mathrm{\infty}}$perturbations of magnitude $1/\kappa \sim 1/\sqrt{k}$ are allowed in each dimension we could construct $\stackrel{~}{A}$ by taking every point in $A$ and zeroing out the first $k$ coordinates and adding $+1/\sqrt{k}$ to the rest of the coordinates. This new data set has also has variance ${\sigma}^{2}m$ around the mean. However, Mean($\stackrel{~}{A}$) and Mean(A) are quite far. In this case we would like to certify that the data set has been poisoned. In particular, this would follow from certifying that the new data set has no small projection (in spectral norm) onto any robust one dimensional subspace. On the other hand, if the adversary hides the perturbations such that $\stackrel{~}{A}$ has a good projection on to a robust one dimensional subspace (not necessarily the one spanned by $\mu $), we require to output a good approximation to $\mu $. The guarantee of Theorem 6.4 will be used to achieve this.
The robust mean estimation guarantee from this section will also serve as a subroutine in the application to clustering a mixture of well separated Gaussians and more general data sets, even when every data point is poisoned. We discuss the clustering application in Section 7.2. In what follows, for a vector $u$, $\parallel u\parallel $ will denote ${\parallel u\parallel}_{2}$, and for a matrix $M$ we will denote by $\parallel M\parallel $ the spectral norm of matrix $M$. The goal of this section is to prove the following theorem.
Theorem 7.1.
Let $A$ be an $n\mathrm{\times}m$ matrix representing $m$ data points in $n$ dimensions and let $\mu $ be the mean of the data points in the matrix $A$ with $C$ representing the $n\mathrm{\times}m$ matrix with each column being $\mu $. Let ${\mathrm{\Pi}}^{\mathrm{*}}\mathrm{=}\mu \mathit{}{\mu}^{\mathrm{\top}}\mathrm{/}{\mathrm{\parallel}\mu \mathrm{\parallel}}^{\mathrm{2}}$ be the one dimensional subspace denoting the projection onto $\mu $ and assume that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $, for some $q\mathrm{\ge}\mathrm{2}$. Let $\stackrel{\mathrm{~}}{A}$ be the given input such that for every column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$ we have ${\mathrm{\parallel}{A}_{j}\mathrm{}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $. Furthermore, let ${\sigma}^{\mathrm{2}}\mathrm{>}\mathrm{0}$ be a given upper bound on the variance of the data around the mean, i.e., $\mathrm{\parallel}A\mathrm{}C\mathrm{\parallel}\mathrm{\le}\sigma \mathit{}\sqrt{m}$. Then the algorithm from Figure 3 when run on $\stackrel{\mathrm{~}}{A}$, w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., $\mathrm{\parallel}\stackrel{\mathrm{~}}{A}\mathrm{}A\mathrm{\parallel}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or outputs an estimate $\widehat{\mu}$ of the true mean $\mu $ such that
${\parallel \widehat{\mu}\mu \parallel}_{2}$  $\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}(\sigma ,\sqrt{\sigma \parallel \mu \parallel}),$ 
where ${c}_{q}$ is a constant that depends on $q$. In particular, the above implies a relative error guarantee of
$\frac{{\parallel \widehat{\mu}\mu \parallel}_{2}}{\parallel \mu \parallel}$  $\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}({\displaystyle \frac{\sigma}{\parallel \mu \parallel}},\sqrt{{\displaystyle \frac{\sigma}{\parallel \mu \parallel}}}).$ 
Proof.
By assumption we know that $\parallel AC\parallel \le \sigma \sqrt{m}$. This implies that
$\parallel A{\mathrm{\Pi}}^{*}A\parallel $  $\le \parallel AC\parallel +\parallel C{\mathrm{\Pi}}^{*}A\parallel $  
$=\parallel AC\parallel +\parallel {\mathrm{\Pi}}^{*}(CA)\parallel $  
$\le 2\parallel AC\parallel \le 2\sigma \sqrt{m}.$ 
Since we set $\tau =2\sigma \sqrt{m}$, from the guarantee of Theorem 6.4, we know that if the algorithm outputs Bad Input , the data must be poisoned, i.e., $\parallel A\stackrel{~}{A}\parallel >2\sigma \sqrt{m}$. Next suppose that the algorithm outputs a projection matrix $\mathrm{\Pi}$. Setting $\widehat{\mu}:=\mathrm{\text{Mean}}(\mathrm{\Pi}\stackrel{~}{A})$, and $\mathrm{\U0001d7cf}$ to be the allones vector ${(1,1,\mathrm{\dots},1)}^{\top}$ we have that
${\parallel \widehat{\mu}\mu \parallel}_{2}$  $={\displaystyle \frac{1}{m}}{\parallel {\displaystyle \sum _{j=1}^{m}}({A}_{j}\mathrm{\Pi}{\stackrel{~}{A}}_{j})\parallel}_{2}$  
$\le {\displaystyle \frac{1}{m}}{\parallel {\displaystyle \sum _{j=1}^{m}}({A}_{j}\mathrm{\Pi}{A}_{j})\parallel}_{2}+{\displaystyle \frac{1}{m}}{\parallel {\displaystyle \sum _{j=1}^{m}}(\mathrm{\Pi}{A}_{j}\mathrm{\Pi}{\stackrel{~}{A}}_{j})\parallel}_{2}$  
$\le {\displaystyle \frac{1}{m}}{\parallel {\mathrm{\U0001d7cf}}^{\top}(A\mathrm{\Pi}A)\parallel}_{2}+{\displaystyle \frac{1}{m}}{\displaystyle \sum _{i=1}^{m}}{\parallel \mathrm{\Pi}({A}_{j}{\stackrel{~}{A}}_{j})\parallel}_{2}$  
$\le {\displaystyle \frac{1}{\sqrt{m}}}\parallel A\mathrm{\Pi}A\parallel +{c}_{q}\kappa \delta .$  (51) 
Next we make a crucial observation that if $\mathrm{\Pi}$ is good for $\stackrel{~}{A}$ then it is also good for $A$ and hence $\parallel A\mathrm{\Pi}A\parallel $ is small. This is formally established in Lemma 7.2. Applying the lemma on $A$ and $\stackrel{~}{A}$ with ${\mathrm{\Pi}}_{1}={\mathrm{\Pi}}^{*}$, ${\mathrm{\Pi}}_{2}=\mathrm{\Pi}$, ${\kappa}_{1}=\kappa $, ${\kappa}_{2}={c}_{q}\kappa $, and $\epsilon =4\sigma \sqrt{m}/\parallel A\parallel $ we get that
$\parallel A\mathrm{\Pi}A\parallel $  $\le (\epsilon +\sqrt{\epsilon})\parallel A\parallel +8{\displaystyle \frac{\kappa \delta \sqrt{m}}{\sqrt{\epsilon}}}.$ 
Substituting into (51) we get that
${\parallel \widehat{\mu}\mu \parallel}_{2}$  $\le (\epsilon +\sqrt{\epsilon}){\displaystyle \frac{\parallel A\parallel}{\sqrt{m}}}+8{\displaystyle \frac{\kappa \delta}{\sqrt{\epsilon}}}+{c}_{q}\kappa \delta $  
$\le (\epsilon +\sqrt{\epsilon}){\displaystyle \frac{4\sigma}{\epsilon}}+8{\displaystyle \frac{\kappa \delta}{\sqrt{\epsilon}}}+{c}_{q}\kappa \delta (\text{by writing}\parallel A\parallel \text{in terms of}\epsilon )$  
$\le O({c}_{q})(\sigma +\kappa \delta )\left(1+\sqrt{{\displaystyle \frac{1}{\epsilon}}}\right).$  (52) 
Next, notice that by triangle inequality,
$\parallel A\parallel $  $\le \parallel AC\parallel +\parallel C\parallel $  
$\le (\sigma +\parallel \mu \parallel )\sqrt{m}.$ 
Hence we get that
$\sqrt{{\displaystyle \frac{1}{\epsilon}}}=\sqrt{{\displaystyle \frac{\parallel A\parallel}{4\sigma \sqrt{m}}}}\le {\displaystyle \frac{1}{2}}{\left(1+{\displaystyle \frac{\parallel \mu \parallel}{\sigma}}\right)}^{\frac{1}{2}}.$ 
Substituting into (52) we get that
${\parallel \widehat{\mu}\mu \parallel}_{2}$  $\le O({c}_{q})(\sigma +\kappa \delta )\left(1+{\left(1+{\displaystyle \frac{\parallel \mu \parallel}{\sigma}}\right)}^{\frac{1}{2}}\right)$  (53)  
$\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}(\sigma ,\sqrt{\sigma \parallel \mu \parallel}).$  (54) 
From the above we get the relative error guarantee of
$\frac{{\parallel \widehat{\mu}\mu \parallel}_{2}}{\parallel \mu \parallel}$  $\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}({\displaystyle \frac{\sigma}{\parallel \mu \parallel}},\sqrt{{\displaystyle \frac{\sigma}{\parallel \mu \parallel}}}).$  (55) 
∎
Note: We would like to point out that for robust mean estimation, our analysis also shows that in step $2$ of the algorithm above, we can replace $\mathrm{\text{Mean}}(\mathrm{\Pi}\stackrel{~}{A})$ with $\mathrm{\text{Mean}}(\stackrel{~}{A})$. This is because if the algorithm did not output Bad Input then ${\parallel \stackrel{~}{A}\mathrm{\Pi}\stackrel{~}{A}\parallel}_{2}/\sqrt{m}\le 2\sigma $ and hence mean of $\stackrel{~}{A}$ and that of $\mathrm{\Pi}\stackrel{~}{A}$ will be close. However, in this case, the subspace spanned by the output vector, i.e., Mean($A$) might not be robust and hence susceptible to test time perturbations.
Lemma 7.2.
Fix $q\mathrm{\ge}\mathrm{2}\mathrm{,}\delta \mathrm{>}\mathrm{0}\mathrm{,}\kappa \mathrm{\ge}\mathrm{1}$. Let $A$ and $\stackrel{\mathrm{~}}{A}$ be two $n\mathrm{\times}m$ matrices, each representing $m$ data points in $n$ dimensions such that for every $j\mathrm{\in}\mathrm{[}m\mathrm{]}$, columns ${A}_{j}$ and ${\stackrel{\mathrm{~}}{A}}_{j}$ are close, i.e., ${\mathrm{\parallel}{A}_{j}\mathrm{}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $. Furthermore, assume that there exist projection matrices, ${\mathrm{\Pi}}_{\mathrm{1}}\mathrm{=}v\mathit{}{v}^{\mathrm{\top}}$ and ${\mathrm{\Pi}}_{\mathrm{2}}\mathrm{=}u\mathit{}{u}^{\mathrm{\top}}$ such that ${\mathrm{\parallel}{\mathrm{\Pi}}_{\mathrm{1}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}{\kappa}_{\mathrm{1}}$ and ${\mathrm{\parallel}{\mathrm{\Pi}}_{\mathrm{2}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}{\kappa}_{\mathrm{2}}$ and that $\mathrm{\parallel}A\mathrm{}{\mathrm{\Pi}}_{\mathrm{1}}\mathit{}A\mathrm{\parallel}\mathrm{\le}{\epsilon}_{\mathrm{1}}\mathit{}\mathrm{\parallel}A\mathrm{\parallel}$ and $\mathrm{\parallel}\stackrel{\mathrm{~}}{A}\mathrm{}{\mathrm{\Pi}}_{\mathrm{2}}\mathit{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{\le}{\epsilon}_{\mathrm{2}}\mathit{}\mathrm{\parallel}A\mathrm{\parallel}$. Then, letting $\epsilon \mathrm{=}{\epsilon}_{\mathrm{1}}\mathrm{+}{\epsilon}_{\mathrm{2}}$ and $\kappa \mathrm{=}{\kappa}_{\mathrm{1}}\mathrm{+}{\kappa}_{\mathrm{2}}$, it also holds that
$\parallel A{\mathrm{\Pi}}_{2}A\parallel $  $\le O(\epsilon +\sqrt{\epsilon})\parallel A\parallel +{\displaystyle \frac{8\kappa \delta \sqrt{m}}{\sqrt{\epsilon}}}$  (56)  
$\parallel \stackrel{~}{A}{\mathrm{\Pi}}_{1}\stackrel{~}{A}\parallel $  $\le O(\epsilon +\sqrt{\epsilon})\parallel A\parallel +{\displaystyle \frac{8\kappa \delta \sqrt{m}}{\sqrt{\epsilon}}}$  (57) 
Proof.
We will show the desired bound on $\parallel A{\mathrm{\Pi}}_{2}A\parallel $ and by symmetry the same bound will also apply to $\parallel \stackrel{~}{A}{\mathrm{\Pi}}_{1}\stackrel{~}{A}\parallel $. Notice that both ${\mathrm{\Pi}}_{1}$ and ${\mathrm{\Pi}}_{2}$ are projections onto one dimensional subspaces and a bound on $\parallel {\parallel}_{q\to 2}$ norm of the projection matrices implies that ${\parallel v\parallel}_{{q}^{*}}\le {\kappa}_{1}$ and ${\parallel u\parallel}_{{q}^{*}}\le {\kappa}_{2}$, where ${q}^{*}$ is such that $1/q+1/{q}^{*}=1$. Next, let $\mathrm{\Pi}$ be the projection matrix onto the subspace spanned by $v$ and $u$. By triangle inequality we have that
$\parallel A{\mathrm{\Pi}}_{2}A\parallel $  $\le \parallel A\mathrm{\Pi}A\parallel +\parallel \mathrm{\Pi}A{\mathrm{\Pi}}_{2}A\parallel $  
$\le \parallel A\mathrm{\Pi}A\parallel +\parallel \mathrm{\Pi}A{\mathrm{\Pi}}_{2}\stackrel{~}{A}\parallel +\parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}{\mathrm{\Pi}}_{2}A\parallel $  
$\le \parallel A\mathrm{\Pi}A\parallel +\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel +\parallel \mathrm{\Pi}\stackrel{~}{A}{\mathrm{\Pi}}_{2}\stackrel{~}{A}\parallel +\parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}{\mathrm{\Pi}}_{2}A\parallel .$  (58) 
Recall the standard fact that if ${P}_{1}$ and ${P}_{2}$ are projection matrices on to subspaces ${S}_{1}$ and ${S}_{2}$ such that ${S}_{1}\subseteq {S}_{2}$, then for any matrix $B$, $\parallel {P}_{1}B\parallel \le \parallel {P}_{2}B\parallel $. Using this we to get that
$\parallel A\mathrm{\Pi}A\parallel $  $\le \parallel A{\mathrm{\Pi}}_{1}A\parallel \le {\epsilon}_{1}\parallel A\parallel $  (59) 
and,
$\parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}\mathrm{\Pi}\stackrel{~}{A}\parallel $  $\le \parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}\stackrel{~}{A}\parallel +\parallel \stackrel{~}{A}\mathrm{\Pi}\stackrel{~}{A}\parallel $  
$\le 2\parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}\stackrel{~}{A}\parallel \le 2{\epsilon}_{2}\parallel A\parallel .$  (60) 
From the closeness of $A$ and $\stackrel{~}{A}$ and the robustness of ${\mathrm{\Pi}}_{2}$ we also know that
$\parallel {\mathrm{\Pi}}_{2}\stackrel{~}{A}{\mathrm{\Pi}}_{2}A\parallel $  $\le {\parallel {\mathrm{\Pi}}_{2}\parallel}_{q\to 2}\delta \sqrt{m}\le {\kappa}_{2}\delta \sqrt{m}.$  (61) 
Substituting (59), (60), and (61) into (58) we get that
$\parallel A{\mathrm{\Pi}}_{2}A\parallel \le {\epsilon}_{1}\parallel A\parallel +2{\epsilon}_{2}\parallel A\parallel +{\kappa}_{2}\delta \sqrt{m}+\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel .$  (62) 
Note that if $\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel \le \kappa \delta \sqrt{m}/\sqrt{\epsilon}$, then we have the desired bound on $\parallel A{\mathrm{\Pi}}_{2}A\parallel $. We now look at the case when $\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel >\kappa \delta \sqrt{m}/\sqrt{\epsilon}$. Notice that $\mathrm{\Pi}$ is the union of robust subspaces and $A\stackrel{~}{A}$ has columns bounded in $q$ norm. Hence, the only way $\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel $ can be very large is if the $\parallel {\parallel}_{q\to 2}$ norm of the projection matrix of a union of two subspaces ($\mathrm{\Pi}$) is much larger than the $\parallel {\parallel}_{q\to 2}$ norm of the projection matrices of individual subspaces (${\mathrm{\Pi}}_{1}$ and ${\mathrm{\Pi}}_{2}$). For this to happen the two subspaces must be very close to each other and then we can bound $\parallel A{\mathrm{\Pi}}_{2}A\parallel $ in a different way. Formally, we have that
$\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel $  $=\underset{z:\parallel z\parallel =1}{\mathrm{max}}\parallel \mathrm{\Pi}(A\stackrel{~}{A})\cdot z\parallel $  
$=\underset{z:\parallel z\parallel =1}{\mathrm{max}}\parallel {\displaystyle \sum _{j=1}^{m}}{z}_{j}\mathrm{\Pi}({A}_{j}{\stackrel{~}{A}}_{j})\parallel $  
$\le \underset{z:\parallel z\parallel =1}{\mathrm{max}}{\displaystyle \sum _{j=1}^{m}}{z}_{j}\parallel \mathrm{\Pi}({A}_{j}{\stackrel{~}{A}}_{j})\parallel $  
$\le {\parallel \mathrm{\Pi}\parallel}_{q\to 2}\delta \sqrt{m}.$  (63) 
Next we establish an upper bound on ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}$ in terms of the distance between subspaces ${\mathrm{\Pi}}_{1}=v{v}^{\top}$ and ${\mathrm{\Pi}}_{2}=u{u}^{\top}$. Suppose $\parallel uv\parallel =\gamma $ and that $u\cdot v\ge 0$ (otherwise we work with $u$). We also know that ${\parallel v\parallel}_{{q}^{*}}\le {\kappa}_{1}$ and ${\parallel u\parallel}_{{q}^{*}}\le {\kappa}_{2}$. Now, ${\parallel \mathrm{\Pi}\parallel}_{q\to 2}$ is the maximum ${q}^{*}$ norm of any unit vector in the span of $v$ and $u$. We can write any such vector $z$ as
$z={\alpha}_{1}v+{\alpha}_{2}{v}^{\u27c2}$ 
where ${\alpha}_{1}^{2}+{\alpha}_{2}^{2}=1$ and ${v}^{\u27c2}=\frac{u(u\cdot v)v}{\parallel u(u\cdot v)v\parallel}$. Next we have that
${\parallel u(u\cdot v)v\parallel}^{2}=1{(u\cdot v)}^{2}$  
$\ge 1u\cdot v={\displaystyle \frac{{\gamma}^{2}}{2}}.$ 
Hence we get that for any $z$ in the span of $v$ and $u$
${\parallel z\parallel}_{{q}^{*}}$  $\le {\parallel v\parallel}^{{q}^{*}}+{\parallel {v}^{\u27c2}\parallel}_{{q}^{*}}$  
$\le {\kappa}_{1}+{\displaystyle \frac{\sqrt{2}}{\gamma}}\left({\parallel u\parallel}_{{q}^{*}}+{\parallel v\parallel}_{{q}^{*}}\right)$  
$\le {\displaystyle \frac{2\sqrt{2}}{\gamma}}\kappa .$ 
The above also establishes that
$${\parallel \mathrm{\Pi}\parallel}_{q\to 2}\le 2\frac{\sqrt{2}}{\gamma}\kappa .$$ 
Substituting into (63) we get that
$\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel \le {\displaystyle \frac{2\sqrt{2}}{\gamma}}\kappa \delta \sqrt{m}.$  (64) 
Hence, if $\parallel \mathrm{\Pi}A\mathrm{\Pi}\stackrel{~}{A}\parallel >\kappa \delta \sqrt{m}/\sqrt{\epsilon}$ we must have that
$\parallel vu\parallel =\gamma \le 2\sqrt{2}\sqrt{\epsilon}.$  (65) 
In this case we can bound $\parallel A{\mathrm{\Pi}}_{2}A\parallel $ as
$\parallel A{\mathrm{\Pi}}_{2}A\parallel $  $\le \parallel A{\mathrm{\Pi}}_{1}A\parallel +\parallel ({\mathrm{\Pi}}_{1}{\mathrm{\Pi}}_{2})A\parallel $  
$\le {\epsilon}_{1}\parallel A\parallel +\parallel {\mathrm{\Pi}}_{1}{\mathrm{\Pi}}_{2}\parallel \parallel A\parallel $  
$\le \epsilon \parallel A\parallel +\parallel v{v}^{\top}u{u}^{\top}\parallel \parallel A\parallel $  
$=\epsilon \parallel A\parallel +\parallel {\displaystyle \frac{1}{2}}(v+u){(vu)}^{\top}+{\displaystyle \frac{1}{2}}(vu){(v+u)}^{\top}\parallel \parallel A\parallel $  
$\le \epsilon \parallel A\parallel +2\parallel vu\parallel \parallel A\parallel (\text{by triangle inequality and the fact that}\parallel v+u\parallel \le 2)$  
$\le \epsilon \parallel A\parallel +2\gamma \parallel A\parallel $  
$\le \epsilon \parallel A\parallel +4\sqrt{2}\sqrt{\epsilon}\parallel A\parallel .$ 
∎
Tightness of the Guarantee in Theorem 7.1. We close out this section by showing that the dependence on $\sqrt{\sigma \parallel \mu \parallel}$ in our bound on mean estimation is necessary even information theoretically. In what follows $A$ will be an $n\times m$ matrix with $\mu =\mathrm{\text{Mean}}(A)$ such that ${\mathrm{\Pi}}^{*}=\mu {\mu}^{\top}/{\parallel \mu \parallel}^{2}$ has small norm, i.e., ${\parallel {\mathrm{\Pi}}^{*}\parallel}_{\mathrm{\infty}\to 2}=\kappa $. Furthermore, let $C=\mu {\mathrm{\U0001d7cf}}^{\top}$ and define $\sigma =\parallel AC\parallel /\sqrt{m}$. We will prove the following.
Theorem 7.3.
Fix $q\mathrm{=}\mathrm{\infty}$. Let $\mathrm{M}$ be the set of $n\mathrm{\times}m$ matrices $A$ with mean $\mu $ that satisfies $\mathrm{\parallel}\mu \mathrm{\parallel}\mathrm{\in}\mathrm{[}\mathrm{1}\mathrm{,}\mathrm{2}\mathrm{]}$, variance ${\sigma}^{\mathrm{2}}$ around the mean that satisfies $\sigma \mathrm{\in}\mathrm{(}\mathrm{0}\mathrm{,}\mathrm{1}\mathrm{/}\mathrm{6}\mathrm{]}$, and the subspace spanned by $\mu $ being $\kappa $robust. Call a perturbation $\stackrel{\mathrm{~}}{A}$ of $A\mathrm{\in}\mathrm{M}$ of be valid if ${\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}}_{\mathrm{\infty}}\mathrm{\le}\delta $. Then, any algorithm that takes as input a valid perturbation $\stackrel{\mathrm{~}}{A}$ of a matrix $A\mathrm{\in}\mathrm{M}$ and either certifies that the data is poisoned, i.e., $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{>}\mathrm{8}\mathit{}\sigma \mathit{}\sqrt{m}$ or outputs an estimate $\widehat{\mu}$ of $\mu $ must incur an error of
$\parallel \mu \widehat{\mu}\parallel \ge \mathrm{\Omega}\left((1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}(\sigma ,\sqrt{\sigma \parallel \mu \parallel})\right),$ 
where $\mu \mathrm{=}\mathrm{\text{Mean}}\mathit{}\mathrm{(}A\mathrm{)}$.
Proof.
We will establish the lower bound by constructing two matrices $A$ and $\stackrel{~}{A}$, both of which lie in the set $\mathcal{M}$, satisfy $\parallel A\stackrel{~}{A}\parallel =\delta $ for $\kappa \delta =O(\sigma )$, but have means separated by $\mathrm{\Omega}(\mathrm{max}(\sigma ,\sqrt{\sigma {\mu}_{\mathrm{max}}}))$, where ${\mu}_{\mathrm{max}}$ is the maximum ${\mathrm{\ell}}_{2}$ norm among Mean($A$) and $\mathrm{\text{Mean}}(\stackrel{~}{A})$. In this case, given either $A$ or $\stackrel{~}{A}$ as input, any provably robust certification procedure cannot output Bad Input and must output an estimate $\widehat{\mu}$, thereby making $\mathrm{\Omega}(\mathrm{max}(\sigma ,\sqrt{\sigma {\mu}_{\mathrm{max}}}))$ error on either $A$ or $\stackrel{~}{A}$. We next describe our construction.
For a $k$ to be determined later, let ${\mu}_{1}=(1/\sqrt{k},1/\sqrt{k},\mathrm{\dots},1/\sqrt{k},0,0,\mathrm{\dots},0)$. Hence, ${\mu}_{1}$ is a unit length sparse vector with ${\parallel {\mu}_{1}\parallel}_{1}=\sqrt{k}$. We define the set of $m$ points in $A$ by generating i.i.d. points of the form ${\mu}_{1}+g$, where $g$ is a mean zero Gaussian with variance $0$ in the first $k$ coordinates and variance ${\sigma}^{2}$ in the other coordinates. Then it is a standard fact that with high probability $\parallel A{\mu}_{1}{\mathrm{\U0001d7cf}}^{\top}\parallel \le 2\sigma \sqrt{m}$ and that Mean(A) will be $\sigma \sqrt{d/m}=o(\sigma )$close to ${\mu}_{1}$. Next we define the set of points in $\stackrel{~}{A}$ to be ${\stackrel{~}{A}}_{j}={A}_{j}+\delta sgn({\mu}_{1})$, where $sgn({\mu}_{1})$ is a $\pm 1$ vector representing the sign of the corresponding coordinate of ${\mu}_{1}$. Here we can arbitrarily set $sgn(0)$ to be $+1$. It is easy to see that Mean($\stackrel{~}{A}$) will be $o(\sigma )$close to ${\mu}_{2}={\mu}_{1}+\delta sgn({\mu}_{1})$, and that $\parallel \stackrel{~}{A}{\mu}_{2}{\mathrm{\U0001d7cf}}^{\top}\parallel \le 2\sigma \sqrt{m}$. Next notice that
${\parallel {\mu}_{2}\parallel}^{2}$  $=1+{\delta}^{2}n+\delta \sqrt{k}$  
${\parallel {\mu}_{2}\parallel}_{1}$  $=\sqrt{k}+\delta \sqrt{k}.$ 
By setting $\delta \sqrt{k}=3\sigma $ and $\delta n=\sqrt{k}$ we ensure that $\parallel {\mu}_{2}\parallel \in [1,2]$ and ${\parallel {\mu}_{2}\parallel}_{1}\le 2\sqrt{k}$. Hence we get that for the matrix $\mathrm{\Pi}={\mu}_{2}{\mu}_{2}^{\top}/{\parallel {\mu}_{2}\parallel}^{2}$, ${\parallel \mathrm{\Pi}\parallel}_{\mathrm{\infty}\to 2}\le 2\sqrt{k}$. Hence, both $A$ and $\stackrel{~}{A}$ lie in the set $\mathcal{M}$ with sparsity bound $\kappa =2\sqrt{k}$. Furthermore, the fact that $\delta \sqrt{k}=3\sigma $, ensures that $\kappa \delta \le 6\sigma $. Finally, notice that the difference between two means is
$\parallel {\mu}_{1}{\mu}_{2}\parallel $  $=\delta \sqrt{n}$  
$=\sqrt{\delta}{k}^{1/4}$  
$=\sqrt{3\sigma}$  
$=\mathrm{\Omega}\left((1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}(\sigma ,\sqrt{\sigma {\mu}_{\mathrm{max}}})\right).$ 
∎
We end this section by showing that via an (inefficient) algorithm one can get the same guarantee for mean estimation as in Theorem 7.1 without the need for certification.
Theorem 7.4 (Information Theoretic Upper Bound).
Let $A$ be an $n\mathrm{\times}m$ matrix representing $m$ data points in $n$ dimensions and let $\mu $ be the mean of the data points in the matrix $A$ with $C$ representing the $n\mathrm{\times}m$ matrix with each column being $\mu $. Let ${\mathrm{\Pi}}^{\mathrm{*}}\mathrm{=}\mu \mathit{}{\mu}^{\mathrm{\top}}\mathrm{/}{\mathrm{\parallel}\mu \mathrm{\parallel}}^{\mathrm{2}}$ be the one dimensional subspace denoting the projection onto $\mu $ and assume that ${\mathrm{\parallel}{\mathrm{\Pi}}^{\mathrm{*}}\mathrm{\parallel}}_{q\mathrm{\to}\mathrm{2}}\mathrm{\le}\kappa $, for some $q\mathrm{\ge}\mathrm{2}$. Let $\stackrel{\mathrm{~}}{A}$ be the given input such that for every column $j\mathrm{\in}\mathrm{[}m\mathrm{]}$ we have ${\mathrm{\parallel}{A}_{j}\mathrm{}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $. Furthermore, let ${\sigma}^{\mathrm{2}}\mathrm{>}\mathrm{0}$ be a given upper bound on the variance of the data around the mean, i.e., $\mathrm{\parallel}A\mathrm{}C\mathrm{\parallel}\mathrm{\le}\sigma \mathit{}\sqrt{m}$. Then there is an (inefficient) exponential time algorithm that takes $\stackrel{\mathrm{~}}{A}$ as input and outputs an estimate $\widehat{\mu}$ of the true mean $\mu $ such that
${\parallel \widehat{\mu}\mu \parallel}_{2}$  $\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}(\sigma ,\sqrt{\sigma \parallel \mu \parallel}),$ 
where ${c}_{q}$ is a constant that depends on $q$. In particular, the above implies a relative error guarantee of
$\frac{{\parallel \widehat{\mu}\mu \parallel}_{2}}{\parallel \mu \parallel}$  $\le O({c}_{q})(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\mathrm{max}({\displaystyle \frac{\sigma}{\parallel \mu \parallel}},\sqrt{{\displaystyle \frac{\sigma}{\parallel \mu \parallel}}}).$ 
Proof.
In order to establish the theorem above we first optimize (49) exactly. Let ${A}^{\prime}$ be the matrix and $\mathrm{\Pi}$ be the projection that achieve the minimum of (49). Then we have that
$\parallel {A}^{\prime}\mathrm{\Pi}{A}^{\prime}\parallel $  $\le \parallel A{\mathrm{\Pi}}^{*}A\parallel $ 
Furthermore, we also have that
$\parallel A{\mathrm{\Pi}}^{*}A\parallel $  $\le \parallel AC\parallel +\parallel C{\mathrm{\Pi}}^{*}A\parallel $  
$\le 2\parallel AC\parallel $  
$\le 2\sigma \sqrt{m}.$ 
Hence both $A$ and ${A}^{\prime}$ have good projections onto rank$1$ subspaces. Plugging into Lemma 7.2 we get that
$\parallel A\mathrm{\Pi}A\parallel \le O(\epsilon +\sqrt{\epsilon})\parallel A\parallel +{\displaystyle \frac{8\kappa \delta \sqrt{m}}{\sqrt{\epsilon}}},$ 
where $\epsilon =2\sigma \sqrt{m}/\parallel A\parallel $. Hence, letting $\widehat{\mu}=\mathrm{\text{Mean}}(\mathrm{\Pi}{A}^{\prime})$ we get from (51) that
$\parallel \mu \widehat{\mu}\parallel $  $\le {\displaystyle \frac{1}{\sqrt{m}}}\parallel A\mathrm{\Pi}A\parallel +{c}_{q}\kappa \delta $  
$\le O(\epsilon +\sqrt{\epsilon}){\displaystyle \frac{\parallel A\parallel}{\sqrt{m}}}+{\displaystyle \frac{8\kappa \delta \sqrt{m}}{\sqrt{\epsilon}}}+{c}_{q}\kappa \delta .$ 
The rest of the argument proceeds exactly as in the Proof of Theorem 7.1 by writing $\parallel A\parallel $ in terms of $\epsilon $, as done in (52). ∎
7.2 Clustering
In this section we will use the guarantee from Theorem 6.4 to show how to perform clustering of a wellclustered instance when every data point in the instance could be corrupted. Our guarantees will apply to clustering a mixture of well separated Gaussians and more general data distributions. In particular, we will show that a robust modification of the popular Lloyd’s algorithm [lloyd1982least] (also known as the $k$means algorithm) can be used to perform clustering in our model, thereby providing further evidence towards the widespread applicability of the algorithm. Existing guarantees for using Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] for clustering a mixture of Gaussians and general datasets assume that every pair of means ${\mu}_{i},{\mu}_{j}$ are separated by $\sim \sigma \sqrt{k}$, where $\sigma $ is the maximum variance of the dataset around the mean and $k$ is the number of clusters (see (66) for the formal condition). However, our lower bound example from the previous section shows that such a separation condition is too weak for robust clustering in our noise model. If we view the datasets $A$ and $\stackrel{~}{A}$ in our lower bound instance as two target clusters, then they have the property that the means are separated by $\sim \sqrt{\sigma}>\mathrm{\Omega}(\sigma \sqrt{k})$, where $k=2$, and yet via ${\mathrm{\ell}}_{\mathrm{\infty}}$ perturbations we can reach one cluster from the other, making the clustering task meaningless. Hence, we need a separation condition that rules out this scenario. Furthermore, even if one is given the original optimal clustering of the given perturbed dataset, we know from previous section that one must incur a loss of $\mathrm{\Omega}(\sigma \cdot \mathrm{max}(1,\sqrt{\parallel \mu \parallel /\sigma}))$ in simply estimating the true mean $\mu $ of a cluster. This suggests a separation condition of the type $\sim \alpha \sigma \sqrt{k}$, where $\alpha $ depends on $(1+\frac{{\mu}_{\mathrm{max}}}{\sigma})$ and the guarantee to aim for is to estimate means upto error $O(\alpha \sigma )$ error. Here ${\mu}_{\mathrm{max}}$ is the maximum ${\mathrm{\ell}}_{2}$ norm of the any of the $k$ mean vectors. In this section we will show that a modified Lloyd’s combined with our certification procedure can indeed achieve this guarantee or certify that the dataset has been poisoned.
More formally, we will assume that there is a set of $m$ points in ${\mathbb{R}}^{n}$ with ground truth clustering ${C}_{1}^{*},{C}_{2}^{*},\mathrm{\dots},{C}_{k}^{*}$, and means ${\mu}_{r}=\mathrm{\text{Mean}}({C}_{r}^{*})$ for $r\in [k]$ and ${\mu}_{\mathrm{max}}={\mathrm{max}}_{r}\parallel {\mu}_{r}\parallel $. Let $A$ be the $n\times m$ data matrix and $C$ be the matrix of corresponding centers. We will assume that we have an upper bound ${\sigma}^{2}$ on the maximum variance of the data points around their mean, i.e. ${\parallel AC\parallel}^{2}\le {\sigma}^{2}m$ and define $\alpha =(1+\frac{\kappa \delta}{\sigma}){(1+\frac{{\mu}_{\mathrm{max}}}{\sigma})}^{2/3}$. We will enforce the spectral stability condition studied in [kumar2010clustering] on our clustering instance. This condition implies that for each pair of clusters ${C}_{r}^{*},{C}_{s}^{*}$ with means ${\mu}_{r},{\mu}_{s}$ and each point $x\in {C}_{r}^{*}$, $\overline{x}$ is closer to ${\mu}_{r}$ than to ${\mu}_{s}$ by a margin ${\mathrm{\Delta}}_{r,s}$. Here $\overline{x}$ is the projection of $x$ onto the line joining ${\mu}_{r}$ and ${\mu}_{s}$. For a constant $c>0$, the $c$spectral stability condition requires that for each $r\ne s$,
${\mathrm{\Delta}}_{r,s}\ge c\alpha \sigma \sqrt{k}\left({\displaystyle \frac{\sqrt{m}}{\sqrt{{C}_{r}^{*}}}}+{\displaystyle \frac{\sqrt{m}}{\sqrt{{C}_{s}^{*}}}}\right)$  (66) 
Notice that the above also implies that every pair of means are separated i.e.,
$$\parallel {\mu}_{r}{\mu}_{s}\parallel \ge c\alpha \sigma \sqrt{k}\left(\frac{\sqrt{m}}{\sqrt{{C}_{r}^{*}}}+\frac{\sqrt{m}}{\sqrt{{C}_{s}^{*}}}\right).$$ 
It is worth mentioning that in the typical analysis of Lloyd’s algorithm [kumar2010clustering, awasthi2012improved] the dependence on $\alpha $ in the separation condition is not needed. However, as discussed before, in our noise model, some dependence on $\alpha $ is unavoidable to get a meaningful clustering guarantee.
Assumptions I: Fix $q\ge 2$. We will assume that we are given access to $\stackrel{~}{A}$ such that for every $j\in [m]$, ${\parallel {A}_{j}{\stackrel{~}{A}}_{j}\parallel}_{q}\le \delta $. Furthermore, define $\kappa $ to be the robustness, of the subspace spanned by the means ${\mu}_{1},\mathrm{\dots},{\mu}_{k}$. Formally, let ${\mathrm{\Pi}}_{C}$ be the projection matrix for the orthogonal projection onto the space of the means. Then $\kappa $ is such that ${\parallel {\mathrm{\Pi}}_{C}\parallel}_{q\to 2}\le \kappa $. Under Assumptions I, we prove the following theorem that applies to any stable dataset as defined in (66).
Theorem 7.5.
Fix $q\mathrm{\ge}\mathrm{2}$, and let ${c}_{q}$ be a constant that depends on $q$. Let $A$ be a dataset that satisfies $c$spectral stability for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$. Under Assumptions I, there is a Lloyd’s style algorithm that takes $\stackrel{\mathrm{~}}{A}$ as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or outputs a clustering ${\widehat{C}}_{\mathrm{1}}\mathrm{,}\widehat{{C}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{C}_{k}}$ and means $\widehat{{\mu}_{\mathrm{1}}}\mathrm{,}\widehat{{\mu}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{\mu}_{k}}$ such that
$\sum _{r=1}^{k}}{C}_{r}^{*}\mathrm{\u25b3}{\widehat{C}}_{\pi (r)}\le O\left({\displaystyle \frac{{c}_{q}^{2}m}{k{\alpha}^{2}{c}^{2}}}\right)$  
$\parallel {\mu}_{r}{\widehat{\mu}}_{\pi (r)}\parallel \le {c}_{q}\alpha {\displaystyle \frac{\sigma \sqrt{m}}{\sqrt{{C}_{r}^{*}}}}.$ 
for an appropriately chosen bijection $\pi \mathrm{:}\mathrm{[}k\mathrm{]}\mathrm{\to}\mathrm{[}k\mathrm{]}$.
While the above theorem works for any data set that satisfies spectral stability, notice that it leads to a sub optimal mean estimation error of $O(\alpha \sigma \sqrt{m}/\sqrt{{C}_{r}^{*}})$ for each cluster $r$. For example, when the clusters are balanced, this will lead to a guarantee of $O(\alpha \sigma )\sqrt{k}$. Next, we show that for data sets that additionally satisfy Gaussian type concentration, we can indeed get $O(\alpha \sigma )$ estimation error even when each data point is corrupted.
Assumptions II: Let $A$ be a given dataset with optimal clustering ${C}_{1}^{*},{C}_{2}^{*},\mathrm{\dots},{C}_{k}^{*}$. We will assume that we are given $\stackrel{~}{A}$ that satisfies Assumptions I. Furthermore, we will assume that $\parallel {C}_{r}^{*}\parallel \ge {n}^{3}$ for each $r\in [k]$ and that for any subset $S\subset {C}_{r}^{*}$ of points such that $S>n\mathrm{log}n$, we have that $\parallel {A}_{S}{C}_{S}\parallel \le \sigma \sqrt{S}\cdot \mathrm{poly}\mathrm{log}(m,n)$. Here ${A}_{S},{C}_{S}$ are the matrices $A$ and $C$ restricted to the columns of the points in $S$. Additionally, we require a pointwise guarantee that for each $r\in [k]$, and ${A}_{i}\in {C}_{r}^{*}$, ${\parallel {A}_{i}{\mu}_{r}\parallel}^{2}\le 2{\sigma}^{2}n\cdot \mathrm{poly}\mathrm{log}(m,n)$. It is easy to see that $m\ge \mathrm{poly}(m,1/{w}_{\mathrm{min}})$ samples generated from a mixture of Gaussians with maximum variance ${\sigma}^{2}$ and minimum mixture weight ${w}_{\mathrm{min}}$ will, with high probability, satisfy the above assumptions. Under Assumptions II, we prove the following theorem that applies to any stable dataset as defined in (66).
Theorem 7.6.
Fix $q\mathrm{\ge}\mathrm{2}$, and let ${c}_{q}$ be a constant that depends on $q$. Let $A$ be a dataset that satisfies $c$spectral stability for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$. Under Assumptions II, there is a Lloyd’s style algorithm that takes $\stackrel{\mathrm{~}}{A}$ as input, w.h.p. runs in polynomial time, and either certifies that the dataset is poisoned, i.e, $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{=}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or outputs a clustering ${\widehat{C}}_{\mathrm{1}}\mathrm{,}\widehat{{C}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{C}_{k}}$ and means $\widehat{{\mu}_{\mathrm{1}}}\mathrm{,}\widehat{{\mu}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{\mu}_{k}}$ such that
$\sum _{r=1}^{k}}{C}_{r}^{*}\mathrm{\u25b3}{\widehat{C}}_{\pi (r)}\le O\left({\displaystyle \frac{{c}_{q}^{2}m}{{k}^{2}{\alpha}^{2}{c}^{2}}}\right)$  
$\parallel {\mu}_{r}{\widehat{\mu}}_{\pi (r)}\parallel \le \stackrel{~}{O}(\alpha \sigma ).$ 
for an appropriately chosen bijection $\pi \mathrm{:}\mathrm{[}k\mathrm{]}\mathrm{\to}\mathrm{[}k\mathrm{]}$, where we hide a polylogarithmic (in $n\mathrm{,}m$) factor in the $\stackrel{\mathrm{~}}{O}$ notation.
As a corollary we get the following statement about robustly clustering a mixture of Gaussians.
Theorem 7.7.
Fix $q\mathrm{\ge}\mathrm{2}$, and let ${c}_{q}$ be a constant that depends on $q$. Define $\mathrm{M}$ to be a distribution that is a mixture of $k$ Gaussians, i.e., $\mathrm{M}\mathrm{:=}{\mathrm{\sum}}_{r\mathrm{=}\mathrm{1}}^{k}{w}_{r}\mathit{}\mathrm{N}\mathit{}\mathrm{(}{\mu}_{r}\mathrm{,}{\mathrm{\Sigma}}_{r}\mathrm{)}$. Furthermore, let ${\mathrm{\Sigma}}_{r}\mathrm{\u2aaf}{\sigma}^{\mathrm{2}}\mathit{}I$ and define ${w}_{\mathrm{min}}\mathrm{=}{\mathrm{min}}_{r}\mathit{}{w}_{r}$ and ${\mu}_{\mathrm{max}}\mathrm{=}{\mathrm{max}}_{r}\mathit{}\mathrm{\parallel}{\mu}_{r}\mathrm{\parallel}$, $\alpha \mathrm{=}\mathrm{(}\mathrm{1}\mathrm{+}\frac{\kappa \mathit{}\delta}{\sigma}\mathrm{)}\mathit{}{\mathrm{(}\mathrm{1}\mathrm{+}\frac{{\mu}_{\mathrm{max}}}{\sigma}\mathrm{)}}^{\mathrm{2}\mathrm{/}\mathrm{3}}$. Let $A$ be a set $\mathrm{poly}\mathit{}\mathrm{(}n\mathrm{,}\mathrm{1}\mathrm{/}{w}_{\mathrm{min}}\mathrm{)}$ samples generated i.i.d. from the mixture. If the mixture if well separated, i.e, $\mathrm{\parallel}{\mu}_{r}\mathrm{}{\mu}_{s}\mathrm{\parallel}\mathrm{\ge}c\mathit{}\alpha \mathit{}\sigma \mathit{}\sqrt{k}\mathrm{\cdot}\mathrm{poly}\mathit{}\mathrm{log}\mathit{}\mathrm{(}n\mathrm{/}{w}_{m\mathit{}i\mathit{}n}\mathrm{)}\mathrm{/}\sqrt{{w}_{\mathrm{min}}}$ for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$, and the means span a $\kappa $ robust subspace, then given access to $\stackrel{\mathrm{~}}{A}$ such that ${\mathrm{\parallel}{\stackrel{\mathrm{~}}{A}}_{j}\mathrm{}{A}_{j}\mathrm{\parallel}}_{q}\mathrm{\le}\delta $, there is a Lloyd’s style algorithm that, w.h.p. runs in polynomial time, and either certifies that the data is poisoned, i.e., $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{\ge}\mathrm{\Omega}\mathit{}\mathrm{(}\sigma \mathit{}\sqrt{m}\mathrm{)}$, or outputs a clustering ${\widehat{C}}_{\mathrm{1}}\mathrm{,}\widehat{{C}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{C}_{k}}$ and means $\widehat{{\mu}_{\mathrm{1}}}\mathrm{,}\widehat{{\mu}_{\mathrm{2}}}\mathrm{,}\mathrm{\dots}\mathrm{,}\widehat{{\mu}_{k}}$ such that
$\sum _{r=1}^{k}}{C}_{r}^{*}\mathrm{\u25b3}{\widehat{C}}_{\pi (r)}\le O\left({\displaystyle \frac{{c}_{q}^{2}m}{{k}^{2}{\alpha}^{2}{c}^{2}}}\right)$  
$\parallel {\mu}_{r}{\widehat{\mu}}_{\pi (r)}\parallel \le \stackrel{~}{O}(\alpha \sigma ).$ 
for an appropriately chosen bijection $\pi \mathrm{:}\mathrm{[}k\mathrm{]}\mathrm{\to}\mathrm{[}k\mathrm{]}$.
Computing Good Initial Centers. The first step in establishing the above theorems is to compute centers/means that are close to the true ones. A common approach for this step is to use PCA to project the data onto the top$k$ subspace of the input data matrix, and run any constant factor approximation algorithm for $k$means clustering [kumar2010clustering]. However this can be arbitrarily bad if the data is corrupted as in our model. We instead show that by projecting the data onto a robust subspace as output by our guarantee from Theorem 6.4 and then using a $k$means approximation algorithm, we do indeed recover good centers. Our initialization algorithm is shown in Figure 4. We next provide proofs for our claims.
Theorem 7.8.
Assume that the clustering instance $A$ is $c$stable for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$. If Assumptions I hold, then the algorithm in Figure 4 w.h.p. runs in polynomial time, and either certifies that the data has been poisoned, i.e., $\mathrm{\parallel}A\mathrm{}\stackrel{\mathrm{~}}{A}\mathrm{\parallel}\mathrm{>}\mathrm{2}\mathit{}\sigma \mathit{}\sqrt{m}$, or the algorithm outputs centers ${\nu}_{\mathrm{1}}\mathrm{,}{\nu}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\nu}_{k}$ such that for all $r\mathrm{\in}\mathrm{[}k\mathrm{]}$,
$\parallel {\mu}_{r}{\nu}_{\pi (r)}\parallel \le 30{c}_{q}\alpha \sqrt{k}{\displaystyle \frac{\sigma \sqrt{m}}{\sqrt{{C}_{r}^{*}}}}.$ 
for an appropriately chosen bijection $\pi $.
Proof.
The proof will follow the general outline of Lemma 5.1 of [kumar2010clustering], except that we need to argue following two stronger conditions. Firstly, we need to establish that $\stackrel{~}{A}$ projected on to $\mathrm{\Pi}$ has cost comparable to that of $k{\sigma}^{2}m$. This will ensure that the approximation algorithm will output a clustering of low cost. Secondly, we also simultaneously need to establish that $\stackrel{~}{A}$ when projected on to $\mathrm{\Pi}$ has low cost clustering when true means $C$ are used to cluster it. Together with the fact that $\stackrel{~}{A}$ and $A$ are pointwise close in the projected space, we can then claim that missing out on a good approximation for even a single cluster center of ${C}^{*}$ will incur a cost of $\mathrm{\Omega}(k{\sigma}^{2}m)$, thereby contradicting the approximation guarantee of the $k$means algorithm used in step 2.
Establishing that $\mathrm{\Pi}\stackrel{~}{A}$ has low cost with respect to $C$ boils down to showing that $\mathrm{\Pi}$ is good for $A$ given that it is good for $\stackrel{~}{A}$, a perturbation of $A$. This statement, established in Theorem 6.4 is the key in analyzing the initialization phase, and is a generalization of Lemma 7.2 to higher dimensional subspaces. Let’s first establish that $\stackrel{~}{A}$ projected on to $\mathrm{\Pi}$ has a low cost clustering. We have
${\parallel \mathrm{\Pi}\stackrel{~}{A}\mathrm{\Pi}C\parallel}_{F}$  $\le \sqrt{3k}\parallel \mathrm{\Pi}\stackrel{~}{A}\mathrm{\Pi}C\parallel (\text{since both}\mathrm{\Pi}\stackrel{~}{A}\text{and}\mathrm{\Pi}C\text{have rank at most}k)$  
$\le \sqrt{3k}\left(\parallel \mathrm{\Pi}(\stackrel{~}{A}A)\parallel +\parallel \mathrm{\Pi}(AC)\parallel \right)$  
$\le \sqrt{3k}({c}_{q}\delta \kappa \sqrt{m}+\parallel AC\parallel )$  
$\le 2\sqrt{3}{c}_{q}\sqrt{k}\sigma \sqrt{m}\le {c}_{q}(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\sqrt{12k}\sigma \sqrt{m}.$  (67) 
Here the third inequality follows from the fact that for any $n\times m$ matrix $M$, $\parallel M\parallel \le {\mathrm{\ell}}_{\mathrm{max}}\sqrt{m}$, where ${\mathrm{\ell}}_{\mathrm{max}}$ is the maximum ${\mathrm{\ell}}_{2}$ norm of a column of $M$. Furthermore, from the robustness of $\mathrm{\Pi}$ we know that for any $j\in [m]$,
$$\parallel \mathrm{\Pi}({\stackrel{~}{A}}_{j}{A}_{j})\parallel \le {c}_{q}\kappa \delta .$$ 
Next, let’s establish that $\parallel A\mathrm{\Pi}A\parallel $ is small. By triangle inequality we know that
$\parallel A{\mathrm{\Pi}}_{C}A\parallel $  $\le \parallel AC\parallel +\parallel C{\mathrm{\Pi}}_{C}A\parallel $  
$=\parallel AC\parallel +\parallel {\mathrm{\Pi}}_{C}(CA)\parallel $  
$\le 2\parallel AC\parallel \le 2\sigma \sqrt{m}.$ 
Furthermore, from the guarantee of Theorem 6.4 we have that for any $\eta \in (0,1)$,
$\parallel A\mathrm{\Pi}A\parallel $  $\le O(1+{\displaystyle \frac{1}{\eta}})\left(2\sigma \sqrt{m}+\parallel A{\mathrm{\Pi}}_{C}A\parallel +\kappa \delta \sqrt{m}\right)+\sqrt{2\eta}\parallel A\parallel $  
$\le O(1+{\displaystyle \frac{1}{\eta}})\left(4\sigma +\kappa \delta \right)\sqrt{m}+\sqrt{2\eta}\parallel A\parallel .$ 
Setting $\eta ={(5\sigma \sqrt{m}/\parallel A\parallel )}^{2/3}$ we get that
$\parallel A\mathrm{\Pi}A\parallel $  $\le 4{c}_{q}(1+{\displaystyle \frac{\kappa \delta}{\sigma}})\sigma \left(1+{\left({\displaystyle \frac{\parallel A\parallel}{\sigma \sqrt{m}}}\right)}^{2/3}\right)\sqrt{m}$  
$\le 4{c}_{q}\alpha \sigma \sqrt{m}.$  (68) 
The last inequality above follows from the fact that
$\parallel A\parallel $  $\le \parallel AC\parallel +\parallel C\parallel $  
$\le \sigma \sqrt{m}+{\mu}_{\mathrm{max}}\sqrt{m}.$ 
Next notice that
${\parallel \mathrm{\Pi}\stackrel{~}{A}C\parallel}_{F}$  $\le \sqrt{3k}\parallel \mathrm{\Pi}\stackrel{~}{A}C\parallel (\text{since both}\mathrm{\Pi}\stackrel{~}{A}\text{and}C\text{have rank at most}k)$  
$\le \sqrt{3k}(\parallel \mathrm{\Pi}(\stackrel{~}{A}A)\parallel +\parallel \mathrm{\Pi}AA\parallel +\parallel AC\parallel )$  
$\le \sqrt{3k}({c}_{q}\delta \kappa \sqrt{m}+5{c}_{q}\alpha \sigma \sqrt{m}+\sigma \sqrt{m})$  
$\le 6\sqrt{3k}{c}_{q}\alpha \sigma \sqrt{m}.$  (69) 
Now we are ready to establish the claim of the theorem. From (67) we get that the centers ${\nu}_{1},{\nu}_{2},\mathrm{\dots},{\nu}_{k}$ will have $k$means cost at most $120k{(1+\frac{\kappa \delta}{\sigma})}^{2}{c}_{q}^{2}{\sigma}^{2}m$ on $\stackrel{~}{A}$. Furthermore, suppose that there exists a center ${\mu}_{r}$ such that every ${\nu}_{s}$ is far from it. For any point ${A}_{i}$, let ${\nu}_{c(i)}$ be the center in the set $\{{\nu}_{1},{\nu}_{2},\mathrm{\dots},{\nu}_{k}\}$ that is closest to the projection of ${\stackrel{~}{A}}_{i}$ on to $\mathrm{\Pi}$. Then we have that
$120k{c}_{q}^{2}{(1+{\displaystyle \frac{\kappa \delta}{\sigma}})}^{2}{\sigma}^{2}m\ge {\displaystyle \sum _{{A}_{i}\in {C}_{r}}}{\parallel \mathrm{\Pi}{\stackrel{~}{A}}_{i}{\nu}_{c(i)}\parallel}^{2}$  $={\displaystyle \sum _{{A}_{i}\in {C}_{r}}}{\parallel \mathrm{\Pi}{\stackrel{~}{A}}_{i}{\mu}_{r}+{\mu}_{r}{\nu}_{c(i)}\parallel}^{2}$  
$\ge {\displaystyle \frac{1}{2}}{C}_{r}{\parallel {\mu}_{r}{\nu}_{c(i)}\parallel}^{2}{\displaystyle \sum _{{A}_{i}\in {C}_{r}}}{\parallel \mathrm{\Pi}{\stackrel{~}{A}}_{i}{\mu}_{r}\parallel}^{2}$  
$\ge {\displaystyle \frac{1}{2}}{C}_{r}{\parallel {\mu}_{r}{\nu}_{c(i)}\parallel}^{2}{\parallel \mathrm{\Pi}\stackrel{~}{A}C\parallel}_{F}^{2}$  
$\ge 450k{\alpha}^{2}{c}_{q}^{2}{\sigma}^{2}m{\parallel \mathrm{\Pi}\stackrel{~}{A}C\parallel}_{F}^{2}$  
$>120k{c}_{q}^{2}{\alpha}^{2}{\sigma}^{2}m.$  (70) 
Noticing that $\alpha \ge (1+\frac{\kappa \delta}{\sigma})$, we get a contradiction to the fact that ${\mu}_{r}$ is far from every ${\nu}_{s}$. This combined with the fact that the clustering instance is $c$stable for $c>200{c}_{q}$ implies that one can find a bijection $\pi :[k]\mapsto [k]$ between $\{{\mu}_{1},\mathrm{\dots},{\mu}_{k}\}$ and $\{{\nu}_{1},\mathrm{\dots},{\nu}_{k}\}$ such that each ${\mu}_{i}$ is close to a unique ${\nu}_{\pi (i)}$. ∎
7.3 Analyzing Lloyd’s Updates
Next we will use the obtained initial centers and run the robust Lloyd’s algorithm starting with these centers as shown in Figure 5. Our goal in this section is to analyze the updates and establish Theorem 7.5 and Theorem 7.6.
Overview of Analysis and Challenges. Our analysis of the modified Lloyd’s updates proceeds in two stages: a) a center improvement step, and b) analyzing robust Lloyd’s updates. In (a), we first improve the initial center estimates obtained form the initialization phase to get estimates ${\nu}_{1}^{(1)},\mathrm{\dots},{\nu}_{k}^{(1)}$ such that each ${\nu}_{r}^{(1)}$ is $\sim {\mathrm{\Delta}}_{r}/2$close to the corresponding ${\mu}_{r}$, where ${\mathrm{\Delta}}_{r}=40{c}_{q}\alpha \sigma \sqrt{m}/\sqrt{{C}_{r}^{*}}$. In other words, we get a factor $\sqrt{k}$ improvement over the initial estimates. This is sketched in step $3$ of the algorithm in Figure 5. First, we motivate the need for this intermediate step, since it is not necessary in the analysis of Lloyd’s algorithm for uncorrupted data.
Just as in standard analysis of Lloyd’s updates, we would like to argue that if we have nontrivial estimates of the centers, as obtained from the initialization stage, forming clusters using these points and moving to the means of these clusters will improve the center estimates. To argue this we will crucially rely on the fact that when projected onto $\mathrm{\Pi}$, $A$ and $\stackrel{~}{A}$ are close pointwise. Hence, we can come up with a charging argument to assign mistakes made by the current centers on the uncorrupted points to the mistakes made by the centers on the corrupted points. We can then bound the number of such mistakes by observing that on $\mathrm{\Pi}\stackrel{~}{A}$, the true means have a small $k$means cost. This forces us to work in the projected space $\mathrm{\Pi}$, but as a result inherently limits the accuracy to which we can obtain center estimates. Notice that if the initialization algorithm outputs $\mathrm{\Pi}$, then $\mathrm{\Pi}$ is guaranteed to be good overall for $\stackrel{~}{A}$, in the sense that $\parallel \stackrel{~}{A}\mathrm{\Pi}\stackrel{~}{A}\parallel =O(\sigma \sqrt{m})$. However, $\mathrm{\Pi}$ has no per cluster guarantee, and in general $\parallel {\stackrel{~}{A}}_{r}\mathrm{\Pi}{\stackrel{~}{A}}_{r}\parallel $ when restricted to a cluster ${C}_{r}^{*}$ could be as large as $\sigma \sqrt{m}/\sqrt{{C}_{r}^{*}}$. Hence, to achieve our goal of estimating the centers upto $\stackrel{~}{O}(\alpha \sigma )$ accuracy, we also need to work outside of the projection $\mathrm{\Pi}$ at the same time. Due to these conflicting demands, notice that the Lloyd’s updates we analyze in step $4$ of the algorithm in Figure 5 perform clustering using current centers in the projected space, but perform robust mean estimation on the original input data.
Furthermore, from our guarantee on robust mean estimation in Theorem 7.1, we know that in the RobustMean step of the algorithm the centers will be accurate upto $\sim \alpha {\sigma}_{S}$, where ${\sigma}_{S}$ is the standard deviation of the uncorrupted data points in ${S}_{r}$ around the uncorrupted mean of ${S}_{r}$. As a result we need a stronger argument that not only shows that we have low clustering error given the current estimates, but also helps us argue about the variance of the formed clusters ${S}_{r}$ at each step. Such an argument (Lemma 7.9) is a main technical contribution in the analysis.
Unfortunately, the argument (Lemma 7.9) only kicks in when we have much better center estimates than the one provided by the initialization stage, thereby requiring an additional center improvement stage. To argue about the center improvement stage, we use a trick from [awasthi2012improved] and form sets ${S}_{r}$ that correspond to points in $\mathrm{\Pi}\stackrel{~}{A}$ that are significantly close to one of the centers ${\nu}_{r}^{(0)}$ than any other center ${\nu}_{s}^{(0)}$. Notice that these sets do not form a partitioning of the data. We then argue that any mistake made by this assignment must have also been made if one had used the true centers ${\mu}_{1},\mathrm{\dots},{\mu}_{k}$, to cluster $\mathrm{\Pi}\stackrel{~}{A}$. Using the fact that the true means have small $k$means cost on $\mathrm{\Pi}\stackrel{~}{A}$ we can bound the number of such mistakes and hence get sets ${S}_{r}$ that have low error, thereby helping us show that the means of these sets will be much closer to the true centers. This is established in Theorem 7.10. The above arguments help us establish Theorem 7.5. We next state the key technical lemma for our analysis.
Lemma 7.9.
Let $\mathrm{\Pi}$ be the robust subspace computed in step (1) of the algorithm in Figure 5. For each cluster ${C}_{r}^{\mathrm{*}}$ in the optimal clustering of $A$, define ${\mathrm{\Delta}}_{r}\mathrm{=}\mathrm{40}\mathit{}{c}_{q}\mathit{}\alpha \mathit{}\sigma \mathit{}\sqrt{m}\mathrm{/}\sqrt{\mathrm{}{C}_{r}^{\mathrm{*}}\mathrm{}}$. Suppose we have center estimates $\mathrm{\{}{\nu}_{\mathrm{1}}\mathrm{,}{\nu}_{\mathrm{2}}\mathrm{,}\mathrm{\dots}\mathrm{,}{\nu}_{k}\mathrm{\}}$ such that $\mathrm{\parallel}{\nu}_{r}\mathrm{}{\mu}_{r}\mathrm{\parallel}\mathrm{\le}\beta \mathit{}{\mathrm{\Delta}}_{r}$, $\mathrm{\forall}r\mathrm{\in}\mathrm{[}k\mathrm{]}$, and some $$. When using ${\nu}_{i}$s to cluster $\mathrm{\Pi}\mathit{}\stackrel{\mathrm{~}}{A}$, define ${T}_{r\mathrm{,}s}$ to be the set of points that are misclassified, w.r.t. the induced clustering on $A$, i.e., ${T}_{s\mathrm{\to}r}\mathrm{=}\mathrm{\{}i\mathrm{:}{A}_{i}\mathrm{\in}{C}_{r}^{\mathrm{*}}\mathit{}\mathit{\text{and}}\mathit{}\mathrm{\parallel}\mathrm{\Pi}\mathit{}{\stackrel{\mathrm{~}}{A}}_{i}\mathrm{}{\nu}_{s}\mathrm{\parallel}\mathrm{\le}\mathrm{\parallel}\mathrm{\Pi}\mathit{}{\stackrel{\mathrm{~}}{A}}_{i}\mathrm{}{\nu}_{r}\mathrm{\parallel}\mathrm{\}}$. There exists a constant ${c}_{\mathrm{1}}\mathrm{>}\mathrm{0}$ depending on $q$ such that if the clustering instance is $c$stable for $c\mathrm{>}\mathrm{200}\mathit{}{c}_{q}$ then we have that $\mathrm{}{T}_{s\mathrm{\to}r}\mathrm{}\mathrm{\le}\frac{{c}_{\mathrm{1}}\mathit{}{\beta}^{\mathrm{2}}\mathit{}{\sigma}^{\mathrm{2}}\mathit{}m}{k\mathit{}{c}^{\mathrm{2}}\mathit{}{\mathrm{\parallel}{\mu}_{r}\mathrm{}{\mu}_{s}\mathrm{\parallel}}^{\mathrm{2}}}$.
Proof.
Fix $s\ne r$ and let $W$ be the subspace spanned by $\{{\mu}_{r},{\mu}_{s},{\nu}_{r},{\nu}_{s}\}$ with ${\mathrm{\Pi}}_{W}$ being the projection matrix for the orthogonal projection on to the subspace. Define ${\overline{A}}_{i}$ to be the projection of ${A}_{i}$ onto the line joining ${\mu}_{r}$ and ${\mu}_{s}$. Since $W$ contains ${\mu}_{r},{\mu}_{s}$, this is also the same as the projection of ${\mathrm{\Pi}}_{W}{A}_{i}$ on to the line joining ${\mu}_{r}$ and ${\mu}_{s}$. Similarly, define ${\overline{\stackrel{~}{A}}}_{i}$ to be the projection of ${\stackrel{~}{A}}_{i}$ on to the line joining ${\mu}_{r}$ and ${\mu}_{s}$, and again this is the same as the projection of ${\mathrm{\Pi}}_{W}{\stackrel{~}{A}}_{i}$ on to the line joining ${\mu}_{r}$ and ${\mu}_{s}$. We will crucially make use of the fact that
$\parallel {\overline{\stackrel{~}{A}}}_{i}{\mu}_{s}\parallel \parallel {\overline{\stackrel{~}{A}}}_{i}{\mu}_{r}\parallel $  $\ge {\mathrm{\Delta}}_{r,s}O(\kappa \delta )\ge {\mathrm{\Delta}}_{r,s}/2.$  (71) 
The above holds since from $c$stability we know that $\parallel {\overline{A}}_{i}{\mu}_{s}\parallel \parallel {\overline{A}}_{i}{\mu}_{r}$