Abstract
Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2 AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt