Computationally efficient sparse clustering

Abstract

We study statistical and computational limits of clustering when the means ofthe centres are sparse and their dimension is possibly much larger than thesample size. Our theoretical analysis focuses on the simple model $X_i = z_i\theta + \varepsilon_i$, $z_i \in \{-1,1\}$, $\varepsilon_i \thicksim\mathcal{N}(0, I)$, which has two clusters with centres $\theta$ and $-\theta$. We provide a finite sample analysis of a new sparse clustering algorithmbased on sparse PCA and show that it achieves the minimax optimal misclusteringrate in the regime $\|\theta\| \rightarrow \infty$, matching asymptotically theBayes error. Our results require the sparsity to grow slower than the square root of thesample size. Using a recent framework for computational lower bounds---thelow-degree likelihood ratio---we give evidence that this condition is necessaryfor any polynomial-time clustering algorithm to succeed below the BBPthreshold. This complements existing evidence based on reductions andstatistical query lower bounds. Compared to these existing results, we cover awider set of parameter regimes and give a more precise understanding of theruntime required and the misclustering error achievable. We also discuss extensions of our results to more than two clusters.

Quick Read (beta)

loading the full paper ...