Efficient Decoder-free Object Detection with Transformers

Abstract

Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, at the price of a considerable computation burden for inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that takes an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages for the first time. We simplify object detection into an encoder-only single-level anchor-based dense prediction problem by centering around two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL outperforms DETR by 2.5% AP with 28% computation cost reduction and more than $10\times$ fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains over 5.5% AP gain while cutting down 70% computation cost.
