DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Abstract

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge. Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our methods on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP
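
To make the pixel-text matching idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each pixel of a feature map and each class prompt has already been embedded into a shared space, and it scores every pixel against every text embedding by cosine similarity, the measure CLIP uses for image-text matching. The function names (`l2_normalize`, `pixel_text_score_map`) and the toy feature values are hypothetical, chosen only for illustration.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel feature and every text feature.

    pixel_feats: nested list of shape (H, W, C), a stand-in for a visual
                 feature map (assumption: produced by some image encoder).
    text_feats:  nested list of shape (K, C), one embedding per class prompt.
    Returns a nested list of shape (H, W, K).
    """
    texts = [l2_normalize(t) for t in text_feats]
    score_map = []
    for row in pixel_feats:
        score_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            score_row.append([sum(a * b for a, b in zip(p, t)) for t in texts])
        score_map.append(score_row)
    return score_map


# Toy inputs: a 2x2 feature map with C=2 channels and K=2 text embeddings.
pixel_feats = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 1.0], [0.5, 0.0]],
]
text_feats = [[1.0, 0.0], [0.0, 1.0]]

scores = pixel_text_score_map(pixel_feats, text_feats)

# Taking the best-matching text per pixel gives a coarse label map.
labels = [
    [max(range(len(s)), key=s.__getitem__) for s in row]
    for row in scores
]
```

In this sketch the score map has one channel per text prompt, so it can be read as a low-resolution segmentation or passed on as an extra input to a dense prediction head. How the score maps are actually used to guide training is described in the full paper, not here.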
