Abstract
Open-vocabulary dense prediction tasks, including object detection and image segmentation, have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when the vision-language alignment of CLIP is transferred from global image representations to local region representations for open-vocabulary dense prediction, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we present an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Building on this analysis, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of a CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf enables a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
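To make the self-distillation idea above concrete, the following is a minimal sketch and not the released implementation. It assumes a dense feature map given as a grid of feature vectors, region boxes in normalized coordinates, and one image-level embedding per cropped region. Plain average pooling over grid cells and a cosine-similarity loss are stand-in assumptions here, and the names `pool_region`, `cosine_similarity`, and `self_distillation_loss` are hypothetical.

```python
import math


def pool_region(feature_map, box):
    """Average the feature vectors of the grid cells covered by a box.

    feature_map: H x W grid (nested lists) of C-dimensional vectors.
    box: (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    Average pooling is an assumption made for this sketch.
    """
    h, w = len(feature_map), len(feature_map[0])
    x0, y0, x1, y1 = box
    # Map normalized coordinates to cell ranges, covering at least one cell.
    c0 = min(int(x0 * w), w - 1)
    r0 = min(int(y0 * h), h - 1)
    c1 = max(c0 + 1, min(math.ceil(x1 * w), w))
    r1 = max(r0 + 1, min(math.ceil(y1 * h), h))
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(v[i] for v in cells) / len(cells) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, 1e-12)  # guard against zero vectors


def self_distillation_loss(feature_map, boxes, crop_embeddings):
    """Mean (1 - cosine similarity) between each pooled region feature
    (student side) and the image-level embedding of the matching crop
    (teacher side). The loss form is an assumption of this sketch."""
    losses = [
        1.0 - cosine_similarity(pool_region(feature_map, box), emb)
        for box, emb in zip(boxes, crop_embeddings)
    ]
    return sum(losses) / len(losses)


# Toy example: a 2 x 2 feature map with 2-dimensional features.
toy_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
toy_boxes = [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 1.0, 1.0)]  # top and bottom halves
toy_teacher = [[2.0, 0.0], [1.0, 0.0]]  # hypothetical crop embeddings
toy_loss = self_distillation_loss(toy_map, toy_boxes, toy_teacher)
```

In this toy setup the top region already matches its crop embedding in direction, while the bottom region is orthogonal to its crop embedding, so only the bottom region contributes to the loss. In training, the gradient of such a loss would update the dense features, with the crop embeddings treated as fixed targets.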