QuadTree Attention for Vision Transformers

Abstract

Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependencies. However, their quadratic computational complexity poses a major obstacle to applying them to vision tasks requiring dense predictions, such as object detection, feature matching, and stereo. We introduce QuadTree Attention, which reduces the computational complexity from quadratic to linear. Our quadtree transformer builds token pyramids and computes attention in a coarse-to-fine manner. At each level, the top K patches with the highest attention scores are selected, such that at the next level, attention is only evaluated within the relevant regions corresponding to these top K patches. We demonstrate that quadtree attention achieves state-of-the-art performance in various vision tasks, e.g., a 4.0% improvement in feature matching on ScanNet, about a 50% FLOPs reduction in stereo matching, a 0.4-1.5% improvement in top-1 accuracy on ImageNet classification, a 1.2-1.8% improvement on COCO object detection, and a 0.7-2.4% improvement on semantic segmentation over previous state-of-the-art transformers. The code is available at https://github.com/Tangshitao/QuadtreeAttention.
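The coarse-to-fine selection described in the abstract can be illustrated with a small NumPy sketch. This is a simplified, two-level illustration under stated assumptions, not the authors' implementation: the pyramid is built by 2x2 average pooling, only the finest level produces the output, and the function names (`avg_pool_2x2`, `group_children`, `two_level_topk_attention`) are hypothetical. The released code may construct the pyramid and combine information across levels differently.

```python
import numpy as np

def avg_pool_2x2(x):
    """Average-pool an (H, W, C) map over 2x2 blocks, giving (H/2, W/2, C)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def group_children(x):
    """Group fine tokens by coarse parent: (H, W, C) -> (H/2 * W/2, 4, C)."""
    h, w, c = x.shape
    hc, wc = h // 2, w // 2
    return x.reshape(hc, 2, wc, 2, c).transpose(0, 2, 1, 3, 4).reshape(hc * wc, 4, c)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_level_topk_attention(q, k, v, top_k):
    """Two-level coarse-to-fine attention (illustrative assumption, see lead-in).

    q, k, v: arrays of shape (H, W, C) with even H and W.
    top_k:   number of coarse key patches kept per coarse query patch.
    """
    h, w, c = q.shape
    hc, wc = h // 2, w // 2
    n_coarse = hc * wc

    # Coarse level: score every coarse query patch against every coarse key patch.
    q_coarse = avg_pool_2x2(q).reshape(n_coarse, c)
    k_coarse = avg_pool_2x2(k).reshape(n_coarse, c)
    coarse_scores = q_coarse @ k_coarse.T / np.sqrt(c)        # (Nc, Nc)

    # Keep the top_k highest-scoring coarse key patches for each coarse query.
    top_idx = np.argsort(-coarse_scores, axis=1)[:, :top_k]   # (Nc, top_k)

    # Fine level: each fine query attends only to the children of the
    # key patches selected by its coarse parent (4 * top_k fine keys).
    q_fine = group_children(q)                                # (Nc, 4, C)
    k_sel = group_children(k)[top_idx].reshape(n_coarse, top_k * 4, c)
    v_sel = group_children(v)[top_idx].reshape(n_coarse, top_k * 4, c)

    fine_scores = np.einsum("nqc,nkc->nqk", q_fine, k_sel) / np.sqrt(c)
    attn = softmax(fine_scores, axis=-1)                      # (Nc, 4, 4 * top_k)
    out = np.einsum("nqk,nkc->nqc", attn, v_sel)              # (Nc, 4, C)

    # Undo the parent grouping to recover the (H, W, C) layout.
    return out.reshape(hc, wc, 2, 2, c).transpose(0, 2, 1, 3, 4).reshape(h, w, c)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 8, 16)) for _ in range(3))
out = two_level_topk_attention(q, k, v, top_k=2)
print(out.shape)
```

In this sketch each fine query is scored against 4 * top_k keys instead of all H * W keys. The coarse stage here still compares every pair of coarse patches, so the sketch alone does not show the linear complexity the abstract reports; that depends on the deeper pyramid, which is not reproduced here. Setting top_k to the number of coarse patches makes the fine level equivalent to ordinary full attention, which is a convenient sanity check.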
