UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Abstract

Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to unsupervisedly pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches from the original image. During the pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off multi-task learning of classification and localization in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch which is jointly optimized with patch detection. (2) To perform multi-query localization, we introduce UP-DETR from single-query patch and extend it to multi-query patches with object query shuffle and attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher precision on PASCAL VOC and COCO datasets. The code will be available soon.
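
The data side of the random query patch detection pretext task can be illustrated with a minimal sketch. It only covers sampling patches and building their box targets; it is not the authors' implementation. The function names, the patch size range (min_frac, max_frac), and the normalized (cx, cy, w, h) target format are assumptions made for illustration and are not stated in the abstract.

```python
import random


def sample_query_patches(height, width, num_patches,
                         min_frac=0.1, max_frac=0.5, rng=None):
    """Sample random patch boxes inside an image of the given size.

    Returns pixel boxes (x0, y0, x1, y1) and normalized (cx, cy, w, h)
    targets. The size range and target format are illustrative assumptions.
    """
    rng = rng or random.Random()
    boxes, targets = [], []
    for _ in range(num_patches):
        # Patch size as a random fraction of the image size (assumed range).
        w = max(1, int(width * rng.uniform(min_frac, max_frac)))
        h = max(1, int(height * rng.uniform(min_frac, max_frac)))
        # Top-left corner chosen so the patch stays inside the image.
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        boxes.append((x0, y0, x0 + w, y0 + h))
        targets.append(((x0 + w / 2) / width, (y0 + h / 2) / height,
                        w / width, h / height))
    return boxes, targets


def crop(image, box):
    """Crop a patch from an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 48x64 "image" whose pixels record their own (row, column) position.
image = [[(y, x) for x in range(64)] for y in range(48)]
boxes, targets = sample_query_patches(48, 64, num_patches=3,
                                      rng=random.Random(0))
patches = [crop(image, b) for b in boxes]
```

In a pre-training loop along these lines, each cropped patch would be encoded and supplied to the decoder as a query, and the matching target box would supervise the localization output. No labels are needed because the crop coordinates are known.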
