Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling

Abstract

Open-vocabulary instance segmentation aims to segment novel classes without mask annotations. It is an important step toward reducing laborious human supervision. Most existing works first pretrain a model on captioned images covering many novel classes and then finetune it on limited base classes with mask annotations. However, the high-level textual information learned from caption pretraining alone cannot effectively encode the details required for pixel-wise segmentation. To address this, we propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images. Thus, our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model. To account for noise in pseudo masks, we design a robust student model that selectively distills mask knowledge by estimating the mask noise levels, hence mitigating the adverse impact of noisy pseudo masks. Through extensive experiments, we show the effectiveness of our framework, which significantly improves the mAP score by 4.5% on MS-COCO and 5.1% on the large-scale Open Images & Conceptual Captions datasets compared to the state of the art.
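
The two ideas above (aligning caption words with candidate masks, then down-weighting noisy pseudo masks) can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of cosine similarity, the arg-max assignment, and the 1 / (1 + noise) weighting are all assumptions chosen for illustration, since the abstract does not specify them. Embeddings are plain lists of floats.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def assign_pseudo_masks(word_embeddings, mask_embeddings):
    """Hypothetical alignment step: for each caption word, pick the candidate
    mask whose visual embedding is most similar to the word embedding.

    Returns a dict mapping word -> (mask_index, similarity_score).
    Assumes mask_embeddings is non-empty.
    """
    assignments = {}
    for word, w_vec in word_embeddings.items():
        scores = [cosine(w_vec, m_vec) for m_vec in mask_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        assignments[word] = (best, scores[best])
    return assignments


def noise_weighted_loss(per_mask_losses, noise_levels):
    """Hypothetical robust loss: weight each pseudo-mask loss by 1 / (1 + noise),
    so masks with higher estimated noise contribute less to training.
    """
    weights = [1.0 / (1.0 + n) for n in noise_levels]
    weighted = sum(w * l for w, l in zip(weights, per_mask_losses))
    return weighted / sum(weights)


# Toy usage: two caption words and two candidate masks.
words = {"dog": [1.0, 0.0], "frisbee": [0.0, 1.0]}
masks = [[0.9, 0.1], [0.2, 0.8]]
pseudo_labels = assign_pseudo_masks(words, masks)
loss = noise_weighted_loss([1.0, 3.0], [0.0, 1.0])
```

In this toy setup the noisier second mask receives half the weight of the first, which is one simple way to express "selective" distillation; a learned noise estimate could take the place of the fixed values.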
