Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Abstract

Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any image. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at https://sites.google.com/view/diffseg/home.
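
To make the idea of merging attention maps by KL divergence concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes the attention maps have already been extracted and resized to a common resolution (that step is not shown), treats each map as a probability distribution over pixels, and greedily groups maps whose symmetric KL divergence falls below a threshold, repeating until the number of groups stops shrinking. The function names (`symmetric_kl`, `merge_pass`, `iterative_merge`, `to_segmentation`), the use of a symmetric KL, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two maps."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))


def merge_pass(maps, threshold):
    """One greedy pass: assign each map to the first anchor it is close to."""
    anchors = []  # first map seen in each group
    groups = []   # all maps assigned to each group
    for m in maps:
        for i, anchor in enumerate(anchors):
            if symmetric_kl(m, anchor) < threshold:
                groups[i].append(m)
                break
        else:
            # No existing anchor is close enough: start a new group.
            anchors.append(m)
            groups.append([m])
    # Represent every group by the mean of its members.
    return np.stack([np.mean(g, axis=0) for g in groups])


def iterative_merge(maps, threshold=1.0, max_iters=10):
    """Repeat merge passes until the number of proposals stops shrinking.

    maps: array of shape (N, H, W) with non-negative entries.
    threshold: hypothetical cut-off, chosen here only for illustration.
    """
    maps = np.asarray(maps, dtype=float)
    # Normalize each map so it sums to 1 over all pixels.
    maps = maps / maps.sum(axis=(1, 2), keepdims=True)
    for _ in range(max_iters):
        merged = merge_pass(maps, threshold)
        converged = len(merged) == len(maps)
        maps = merged
        if converged:
            break
    return maps


def to_segmentation(proposals):
    """Label each pixel with the index of the proposal that is largest there."""
    return np.argmax(proposals, axis=0)
```

In this sketch, `to_segmentation(iterative_merge(maps))` returns an integer label map of shape (H, W). A lower threshold yields more, finer segments, and a higher one merges more aggressively.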
