Zero-Shot Detection via Vision and Language Knowledge Distillation

Abstract

Zero-shot image classification has made promising progress by training aligned image and text encoders. The goal of this work is to advance zero-shot object detection, which aims to detect novel objects without bounding box or mask annotations. We propose ViLD, a training method via Vision and Language knowledge Distillation. We distill the knowledge from a pre-trained zero-shot image classification model (e.g., CLIP) into a two-stage detector (e.g., Mask R-CNN). Our method aligns the region embeddings in the detector to the text and image embeddings inferred by the pre-trained model. We use the text embeddings as the detection classifier, obtained by feeding category names into the pre-trained text encoder. We then minimize the distance between the region embeddings and image embeddings, obtained by feeding region proposals into the pre-trained image encoder. During inference, we include text embeddings of novel categories in the detection classifier for zero-shot detection. We benchmark the performance on the LVIS dataset by holding out all rare categories as novel categories. ViLD obtains 16.1 mask AP$_r$ with a Mask R-CNN (ResNet-50 FPN) for zero-shot detection, outperforming the supervised counterpart by 3.8. The model can directly transfer to other datasets, achieving 72.2 AP$_{50}$, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively.
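The two alignment objectives described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy 3-dimensional embeddings, the softmax temperature, and the choice of an L1 distance for the distillation term are all assumptions made for illustration (the abstract only says "distance"). In a real system the embeddings would come from the pre-trained text and image encoders and from the detector's region head.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def classify_region(region_emb, text_embs, temperature=0.01):
    """Use category text embeddings as the classifier.

    Returns a softmax over cosine similarities between one region
    embedding and each category's text embedding. The temperature
    value is an assumed hyperparameter, not taken from the abstract.
    """
    names = list(text_embs)
    logits = [cosine(region_emb, text_embs[n]) / temperature for n in names]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {n: e / total for n, e in zip(names, exps)}

def distill_loss(region_embs, image_embs):
    """Mean L1 distance between paired region and image embeddings.

    L1 is an assumption here; any distance would fit the abstract.
    """
    total = 0.0
    for r, t in zip(region_embs, image_embs):
        total += sum(abs(x - y) for x, y in zip(r, t))
    return total / len(region_embs)

# Toy text embeddings (hypothetical values standing in for encoder outputs).
base_text = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
novel_text = {"zebra": [0.0, 0.0, 1.0]}

# At inference, novel-category embeddings are simply added to the classifier.
region = [0.1, 0.2, 0.9]
probs = classify_region(region, {**base_text, **novel_text})
predicted = max(probs, key=probs.get)
```

In this sketch, extending the classifier to a novel category requires no retraining: only the dictionary of text embeddings changes, which is the property that makes zero-shot detection possible in this setup.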
