DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Abstract

We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive approach to denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves $48.3$ AP in $12$ epochs and $51.0$ AP in $36$ epochs on COCO with a ResNet-50 backbone and multi-scale features, a significant improvement of $\textbf{+4.9}$ \textbf{AP} and $\textbf{+2.4}$ \textbf{AP}, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$ \textbf{AP}) and \texttt{test-dev} ($\textbf{63.3}$ \textbf{AP}). Compared to other models on the leaderboard, DINO uses a significantly smaller model and less pre-training data while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
