Segment Anything without Supervision

Abstract

The Segment Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B's ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP by 3.9% on SA-1B.
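The bottom-up ("conquer") step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`merge_until`, `build_hierarchy`), the use of cosine similarity between mean feature vectors, and the decreasing threshold schedule are all assumptions made for illustration, and spatial adjacency between patches is ignored for brevity. Given feature vectors for the patches inside one top-down segment, it merges the most similar groups at each threshold and records one partition per granularity level.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def merge_until(clusters, features, threshold):
    """Greedily merge the most similar pair of clusters while their
    centroid similarity is at least `threshold` (hypothetical rule)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        centroids = [mean([features[i] for i in c]) for c in clusters]
        best, pair = -2.0, None  # cosine similarity is never below -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroids[i], centroids[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def build_hierarchy(features, thresholds):
    """Return one partition per threshold, from fine to coarse.
    Each partition is a list of clusters of patch indices."""
    clusters = [[i] for i in range(len(features))]
    levels = []
    for t in sorted(thresholds, reverse=True):
        clusters = merge_until(clusters, features, t)
        levels.append([sorted(c) for c in clusters])
    return levels

# Toy example: four 2-D "patch features" forming two tight pairs.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
levels = build_hierarchy(features, thresholds=[0.9, 0.0])
print(levels)
```

Because each level starts from the partition produced at the previous, stricter threshold, coarser groups are always unions of finer ones, which is what makes the resulting masks hierarchical.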
