Abstract
Object detection with transformers (DETR) achieves performance competitive with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to pre-train DETR in an unsupervised manner (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches in the original image. During pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off the multi-task learning of classification and localization in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch that is jointly optimized with patch detection. (2) To perform multi-query localization, we first introduce UP-DETR with a single query patch and then extend it to multiple query patches with object query shuffle and an attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR, with faster convergence and higher precision on the PASCAL VOC and COCO datasets. The code will be available soon.
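
To make the pretext task concrete, the following is a minimal sketch of how random query patches and their localization targets might be generated, together with one plausible form of the attention mask for multiple query patches. It is an illustration under stated assumptions, not the released implementation: the function names `sample_query_patches` and `group_attention_mask`, the patch size range (`min_frac`, `max_frac`), the normalized (cx, cy, w, h) box format, and the even split of object queries into one group per patch are all hypothetical choices that the abstract does not specify.

```python
import random


def sample_query_patches(img_w, img_h, num_patches,
                         min_frac=0.1, max_frac=0.5, rng=None):
    """Sample random crop boxes from an image of size img_w x img_h.

    Returns (pixel_boxes, targets). Each pixel box is (x0, y0, x1, y1);
    each target is the same box as normalized (cx, cy, w, h).
    The size range and target format are assumptions of this sketch.
    """
    rng = rng or random.Random()
    pixel_boxes, targets = [], []
    for _ in range(num_patches):
        # Patch size is a random fraction of the image size (at least 1 px).
        w = max(1, int(img_w * rng.uniform(min_frac, max_frac)))
        h = max(1, int(img_h * rng.uniform(min_frac, max_frac)))
        # Top-left corner is chosen so the patch stays inside the image.
        x0 = rng.randint(0, img_w - w)
        y0 = rng.randint(0, img_h - h)
        pixel_boxes.append((x0, y0, x0 + w, y0 + h))
        # The crop location itself is the free localization label.
        targets.append(((x0 + w / 2) / img_w, (y0 + h / 2) / img_h,
                        w / img_w, h / img_h))
    return pixel_boxes, targets


def group_attention_mask(num_queries, num_patches):
    """Build a boolean mask where True means "attention is blocked".

    Assumption: object queries are split evenly into one group per query
    patch, and a query may attend only to queries in its own group.
    """
    if num_queries % num_patches != 0:
        raise ValueError("num_queries must be divisible by num_patches")
    group = num_queries // num_patches
    return [[(i // group) != (j // group) for j in range(num_queries)]
            for i in range(num_queries)]


if __name__ == "__main__":
    boxes, targets = sample_query_patches(640, 480, num_patches=2,
                                          rng=random.Random(0))
    print(boxes)
    print(targets)
    for row in group_attention_mask(num_queries=4, num_patches=2):
        print(row)
```

In this sketch the crops would be passed through the frozen backbone and added to the decoder's object queries, while `targets` supply the boxes that the model is trained to predict. The mask keeps queries assigned to different patches from attending to one another.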