Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Abstract

Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple, convolution-free backbone network useful for many dense prediction tasks. Unlike the recently proposed Transformer model (e.g., ViT), which is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformers to various dense prediction tasks. PVT has several merits compared to prior art. (1) Different from ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNNs and Transformers, making it a unified convolution-free backbone for various vision tasks that can simply replace CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic segmentation, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT can serve as an alternative and useful backbone for pixel-level prediction and facilitate future research. Code is available at https://github.com/whai362/PVT.
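
To make point (1) concrete, the following minimal sketch counts the tokens a shrinking pyramid produces at each stage and compares them with a single-scale, ViT-style patch grid. The abstract does not state any of the numbers used here: the 224x224 input, the per-stage strides of 4, 8, 16, and 32, and the 16-pixel patch size are assumptions chosen for illustration, and both function names are hypothetical. The sketch only does the resolution arithmetic; it is not the authors' implementation.

```python
def pyramid_token_grid(height, width, strides=(4, 8, 16, 32)):
    """Return (stride, rows, cols, tokens) for each pyramid stage.

    The default strides are an illustrative assumption, not a value
    taken from the abstract.
    """
    stages = []
    for stride in strides:
        rows, cols = height // stride, width // stride
        stages.append((stride, rows, cols, rows * cols))
    return stages


def single_scale_token_grid(height, width, patch=16):
    """Return (rows, cols, tokens) for one fixed patch size (ViT-style).

    The 16-pixel patch is likewise an assumption for illustration.
    """
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


if __name__ == "__main__":
    # Early stages keep a fine grid for dense prediction; later stages
    # shrink it, so fewer tokens need to be processed.
    for stride, rows, cols, tokens in pyramid_token_grid(224, 224):
        print(f"stride {stride:2d}: {rows}x{cols} grid, {tokens} tokens")

    rows, cols, tokens = single_scale_token_grid(224, 224)
    print(f"single scale: {rows}x{cols} grid, {tokens} tokens")
```

Under these assumptions the first stage keeps a 56x56 grid, sixteen times as many positions as the 14x14 single-scale grid, while the last stage is reduced to 7x7. This is the trade-off the abstract describes: high output resolution early, reduced computation on the later feature maps.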
