Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
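
The abstract names a language-guided query selection step but does not spell out how it works. The sketch below is one plausible reading, offered as an assumption rather than the paper's implementation: each image token is scored by its highest dot-product similarity to any text token, and the top-scoring tokens are kept as decoder queries. The function name `select_queries`, the helper `dot`, and the toy feature vectors are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to the text.

    Assumption (not taken from the abstract): a token's score is its
    maximum dot-product similarity over all text tokens.
    """
    scores = [
        max(dot(img, txt) for txt in text_feats)
        for img in image_feats
    ]
    # Sort token indices by score, highest first, and keep the top ones.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_queries]


# Toy example: four 2-d image tokens and two 2-d text tokens.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.2, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)
```

In this toy setting the first and third image tokens align best with the text, so their indices are the ones returned; a real detector would operate on learned, high-dimensional features instead.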
