Guaranteed Recovery of Unambiguous Clusters

Abstract

Clustering is often a challenging problem because of the inherent ambiguityin what the "correct" clustering should be. Even when the number of clusters$K$ is known, this ambiguity often still exists, particularly when there isvariation in density among different clusters, and clusters have multiplerelatively separated regions of high density. In this paper we propose aninformation-theoretic characterization of when a $K$-clustering is ambiguous,and design an algorithm that recovers the clustering whenever it isunambiguous. This characterization formalizes the situation when two highdensity regions within a cluster are separable enough that they look more liketwo distinct clusters than two truly distinct clusters in the clustering. Thealgorithm first identifies $K$ partial clusters (or "seeds") using adensity-based approach, and then adds unclustered points to the initial $K$partial clusters in a greedy manner to form a complete clustering. We implementand test a version of the algorithm that is modified to effectively handleoverlapping clusters, and observe that it requires little parameter selectionand displays improved performance on many datasets compared to widely usedalgorithms for non-convex cluster recovery.

Quick Read (beta)

loading the full paper ...