Abstract
Although weakly supervised object detection (WSOD) is a promising step toward avoiding strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, which extends traditional WSOD to detect novel concepts and to utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies: dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is applied to dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, we propose a customized location-oriented weakly supervised region proposal network that utilizes high-level semantic layouts from the category-agnostic Segment Anything Model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art results compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning, achieving performance on par with or even better than well-established fully supervised open-vocabulary object detection (FSOVOD).
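To make the idea of region-level vision-language alignment under image-level supervision more concrete, the following is a minimal sketch and not the WSOVOD implementation. It assumes a generic two-stream multiple-instance formulation (in the style of WSDDN) in which region features are compared with concept text embeddings by cosine similarity. The function names, the `temperature` parameter, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def image_level_scores(region_feats, text_embeds, temperature=0.1):
    """Aggregate region-to-concept similarities into image-level scores.

    Assumption: a WSDDN-style two-stream aggregation, used here only as
    an illustrative stand-in for a multiple-instance network.
    """
    regions = [l2_normalize(r) for r in region_feats]
    concepts = [l2_normalize(t) for t in text_embeds]
    num_regions, num_concepts = len(regions), len(concepts)

    # sim[i][c]: temperature-scaled cosine similarity of region i and concept c.
    sim = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in concepts]
        for r in regions
    ]

    # Classification stream: softmax over concepts for each region.
    cls = [softmax(row) for row in sim]
    # Detection stream: softmax over regions for each concept.
    det = [softmax([sim[i][c] for i in range(num_regions)])
           for c in range(num_concepts)]

    # Image-level score per concept: sum over regions of the stream product.
    return [
        sum(cls[i][c] * det[c][i] for i in range(num_regions))
        for c in range(num_concepts)
    ]
```

Because the concept set enters only through `text_embeds`, a sketch of this kind can score concepts that were not seen during training simply by supplying their text embeddings. The image-level scores are what an image-level label loss would supervise.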