Dilated Neighborhood Attention Transformer

Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over attention-based baselines such as NAT and Swin, as well as modern convolutional baseline ConvNeXt. Our Large model is ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation, and faster in throughput. We believe combinations of NA and DiNA have the potential to empower various tasks beyond those presented in this paper. To support and encourage research in this direction, in vision and beyond, we open-source our project at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
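
To make the "exponential receptive field at no additional cost" claim concrete, the following is a minimal sketch of which tokens a query attends to under NA and DiNA. It is not the project's implementation. It assumes a 1D sequence, and it assumes that a dilated neighborhood consists of the nearest tokens sharing the query's index modulo the dilation, with the window clamped at the borders so that every token attends to exactly kernel_size tokens. The function names and parameters (dilated_neighborhood, receptive_field, kernel_size, dilation) are hypothetical and chosen for illustration only.

```python
def dilated_neighborhood(i, length, kernel_size, dilation=1):
    """Indices attended to by token i in a 1D sequence (illustrative sketch).

    Assumption: the neighborhood holds the kernel_size nearest tokens whose
    index is congruent to i modulo dilation; dilation=1 is plain NA.
    """
    if kernel_size * dilation > length:
        raise ValueError("kernel_size * dilation must not exceed length")
    # Tokens sharing i's residue form a strided subsequence of the input.
    group = list(range(i % dilation, length, dilation))
    pos = i // dilation  # position of token i inside its group
    # Centre the window on i, then clamp it so it stays inside the group.
    start = min(max(pos - kernel_size // 2, 0), len(group) - kernel_size)
    return group[start:start + kernel_size]


def receptive_field(i, length, kernel_size, dilations):
    """Input tokens that can influence token i after stacking one layer per
    entry of dilations (first entry is the first layer)."""
    field = {i}
    # Walk from the last layer back to the input, expanding the set.
    for dilation in reversed(dilations):
        field = {
            j
            for token in field
            for j in dilated_neighborhood(token, length, kernel_size, dilation)
        }
    return sorted(field)


if __name__ == "__main__":
    # Same number of attended tokens per query, wider span with dilation.
    print(dilated_neighborhood(8, 16, kernel_size=3, dilation=1))  # [7, 8, 9]
    print(dilated_neighborhood(8, 16, kernel_size=3, dilation=4))  # [4, 8, 12]
    # Three layers, kernel size 3, 81 tokens, query in the middle.
    print(len(receptive_field(40, 81, 3, [1, 1, 1])))  # 7
    print(len(receptive_field(40, 81, 3, [1, 3, 9])))  # 27
```

Under these assumptions, each query attends to kernel_size tokens regardless of dilation, so the per-layer attention cost does not change. Three undilated layers see 7 tokens (linear growth), while dilations of 1, 3, and 9 see 27 tokens, which is kernel_size raised to the number of layers.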
