Detecting Twenty-thousand Classes using Image-level Supervision

Abstract

Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning. Code is available at https://github.com/facebookresearch/Detic.
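
To make the idea of supervising a detector's classifier with image-level labels concrete, the following is a minimal sketch and not the released implementation. It assumes one prediction-independent assignment rule: the image label is applied to the proposal with the largest area, chosen by geometry alone. That rule, the binary cross-entropy loss, and all function names (`box_area`, `largest_proposal_index`, `image_label_loss`) are illustrative assumptions that the abstract does not specify.

```python
import math


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def largest_proposal_index(proposals):
    """Select a proposal by geometry only, never by classifier scores.

    Assumption for this sketch: the largest box receives the image label.
    """
    return max(range(len(proposals)), key=lambda i: box_area(proposals[i]))


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def image_label_loss(proposals, logits, image_label):
    """Binary cross-entropy over all classes on the selected proposal.

    logits[i][c] is the classifier logit of proposal i for class c.
    image_label is the class index given by the classification dataset.
    """
    i = largest_proposal_index(proposals)
    loss = 0.0
    for c, z in enumerate(logits[i]):
        target = 1.0 if c == image_label else 0.0
        # BCE with logits: softplus(z) - target * z
        loss += softplus(z) - target * z
    return loss


if __name__ == "__main__":
    proposals = [(0, 0, 10, 10), (0, 0, 100, 50), (5, 5, 20, 20)]
    logits = [[0.0, 0.0, 0.0], [-5.0, 5.0, -5.0], [0.0, 0.0, 0.0]]
    print(image_label_loss(proposals, logits, image_label=1))
```

Because the proposal is selected without consulting the classifier's scores, this term needs no label-to-box matching step and can in principle sit alongside the usual box-supervised detection loss in any detector that produces proposals and per-proposal class logits.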
