MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Abstract

This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation has two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive learning, which focuses on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive learning from the perspective of the training objective, as both use the visual encoder for feature alignment, and it is thus able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Empirically, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
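
To make the core idea concrete, here is a minimal toy sketch of distilling full-image features into features predicted from a masked image. It is not the authors' implementation: the scalar "encoder", the zeroed mask token, the squared-error loss, and the moving-average teacher update are all illustrative assumptions, and every function name is hypothetical.

```python
import random

def encode(patches, weight):
    # Hypothetical toy encoder: each patch feature mixes the patch itself
    # with the mean of all patches, so masked positions still see context.
    dim = len(patches[0])
    mean = [sum(p[d] for p in patches) / len(patches) for d in range(dim)]
    return [[weight * (p[d] + mean[d]) for d in range(dim)] for p in patches]

def mask_patches(patches, masked_idx):
    # Assumption: a masked patch is replaced by zeros (stand-in for a mask token).
    return [[0.0] * len(p) if i in masked_idx else list(p)
            for i, p in enumerate(patches)]

def distill_loss(student_out, teacher_out, masked_idx):
    # Assumption: mean squared error, computed at masked positions only.
    total, count = 0.0, 0
    for i in masked_idx:
        for s, t in zip(student_out[i], teacher_out[i]):
            total += (s - t) ** 2
            count += 1
    return total / count

def ema_update(teacher_w, student_w, momentum=0.99):
    # Assumption: the teacher is a moving average of the student.
    return momentum * teacher_w + (1.0 - momentum) * student_w

random.seed(0)
patches = [[random.random() for _ in range(4)] for _ in range(6)]
masked_idx = {1, 4}
student_w, teacher_w = 0.5, 0.8

teacher_out = encode(patches, teacher_w)                            # full image
student_out = encode(mask_patches(patches, masked_idx), student_w)  # masked image
loss = distill_loss(student_out, teacher_out, masked_idx)
new_teacher_w = ema_update(teacher_w, student_w)
```

In this sketch the teacher sees every patch while the student sees only the visible ones, so minimizing `loss` pushes the student to infer the features of hidden patches from their context.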
