Abstract
Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data). We open-source our project.
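
To make the receptive-field claim concrete, the following is a minimal one-dimensional sketch, not the DiNAT implementation. It assumes each layer attends to `kernel_size` tokens spaced `dilation` apart, so one layer widens the receptive field by `(kernel_size - 1) * dilation` tokens. The function name `receptive_field` and the geometric dilation schedule are illustrative assumptions, not the configuration used in DiNAT.

```python
def receptive_field(kernel_size, dilations):
    """1D receptive field (in tokens) after stacking one layer per dilation value.

    Assumes every layer attends to `kernel_size` tokens spaced `dilation` apart,
    so each layer extends the field by (kernel_size - 1) * dilation tokens.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


kernel_size, num_layers = 3, 4

# Plain neighborhood attention: dilation 1 in every layer (linear growth).
na_only = receptive_field(kernel_size, [1] * num_layers)

# Hypothetical geometric schedule: dilation kernel_size**i in layer i.
dilated = receptive_field(kernel_size, [kernel_size ** i for i in range(num_layers)])

print(na_only, dilated)
```

Under these assumptions, every layer still attends to exactly `kernel_size` tokens per query, so the per-layer attention cost is unchanged. Only the spacing between attended tokens differs, which is what lets the receptive field grow geometrically with depth rather than linearly.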