ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Abstract

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and find that they significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP is about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.
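
The baseline idea mentioned in the abstract, producing a mask by comparing text and patch embeddings, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `patch_text_masks`, the array shapes, and the toy embeddings are assumptions made purely for illustration, and cosine similarity with a per-patch argmax is one plausible way to perform the comparison.

```python
import numpy as np

def patch_text_masks(patch_emb, text_emb, grid_hw):
    """Label each patch with the class whose text embedding is most similar.

    patch_emb: (num_patches, dim) array of patch embeddings (assumed shape).
    text_emb:  (num_classes, dim) array of class text embeddings (assumed shape).
    grid_hw:   (height, width) of the patch grid, with height * width == num_patches.
    """
    # L2-normalise both sets so the dot product equals cosine similarity.
    p = patch_emb / np.linalg.norm(patch_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sim = p @ t.T  # (num_patches, num_classes)
    h, w = grid_hw
    # Highest-scoring class per patch, reshaped back to the patch grid.
    return sim.argmax(axis=-1).reshape(h, w)

# Toy stand-ins for CLIP outputs: 3 classes, 4 patches on a 2x2 grid.
text_emb = np.eye(3, 8)                      # hypothetical class embeddings
labels = np.array([0, 1, 2, 1])              # class each toy patch is built from
patch_emb = 2.0 * text_emb[labels] + 0.1     # patches close to their class embedding
mask = patch_text_masks(patch_emb, text_emb, (2, 2))
```

In a real pipeline the two inputs would come from CLIP's image and text encoders, and the patch-level mask would still need upsampling to pixel resolution; both steps are outside the scope of this sketch.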
