Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Abstract

Recent diffusion-based generative models combined with vision-language models are capable of creating realistic images from natural language prompts. While these models are trained on large internet-scale datasets, such pre-trained models are not directly introduced to any semantic localization or grounding. Most current approaches for localization or grounding rely on human-annotated localization information in the form of bounding boxes or segmentation masks. The exceptions are a few unsupervised methods that utilize architectures or loss functions geared towards localization, but they need to be trained separately. In this work, we explore how off-the-shelf diffusion models, trained with no exposure to such localization information, are capable of grounding various semantic phrases with no segmentation-specific re-training. We introduce an inference-time optimization process that is capable of generating segmentation masks conditioned on natural language. We evaluate our proposal, Peekaboo, for unsupervised semantic segmentation on the Pascal VOC dataset. In addition, we evaluate for referring segmentation on the RefCOCO dataset. In summary, we present a first zero-shot, open-vocabulary, unsupervised (no localization information) semantic grounding technique leveraging diffusion-based generative models with no re-training. Our code will be released publicly.
