LookupViT: Compressing visual information to a limited number of tokens

Abstract

Vision Transformers (ViT) have emerged as the de facto choice for numerous industry-grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general-purpose vision transformer block that operates by compressing information from higher-resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is enabled through a bidirectional cross-attention mechanism. The approach offers multiple advantages: (a) it is easy to implement on standard ML accelerators (GPUs/TPUs) via standard high-level operators, (b) it is applicable to standard ViT and its variants, and thus generalizes to various tasks, and (c) it can handle different tokenization and attention approaches. LookupViT also offers flexibility for the compressed tokens, enabling performance-computation trade-offs in a single trained model. We show LookupViT's effectiveness on multiple domains: (a) image classification (ImageNet-1K and ImageNet-21K), (b) video classification (Kinetics400 and Something-Something V2), and (c) image captioning (COCO-Captions) with a frozen encoder. LookupViT provides a $2\times$ reduction in FLOPs while maintaining or improving accuracy across these domains. In addition, LookupViT demonstrates out-of-the-box robustness and generalization on image classification (ImageNet-C,R,A,O), improving by up to $4\%$ over ViT.
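
To make the quadratic-cost argument concrete, the following is a rough back-of-the-envelope sketch, not the paper's FLOP accounting. It counts only the multiply-accumulates (MACs) of the attention matrix products and ignores projections, MLPs, and normalization. The function names, the three-step layout (gather, process, scatter), and the sizes 196, 25, and 768 are illustrative assumptions; the abstract does not specify them, nor whether the two cross-attention directions share any computation.

```python
def self_attention_macs(num_tokens: int, dim: int) -> int:
    """MACs for Q @ K^T plus weights @ V among num_tokens tokens."""
    # Both products cost num_tokens * num_tokens * dim, hence the quadratic term.
    return 2 * num_tokens * num_tokens * dim


def cross_attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """MACs for one cross-attention from num_queries queries to num_keys keys."""
    return 2 * num_queries * num_keys * dim


def lookup_style_macs(num_lookup: int, num_compressed: int, dim: int) -> int:
    """Hypothetical cost model for a compress-then-process block.

    Assumed steps (not confirmed by the abstract):
      1. compressed tokens attend to the higher-resolution tokens,
      2. compressed tokens run full self-attention,
      3. higher-resolution tokens attend back to the compressed tokens.
    """
    gather = cross_attention_macs(num_compressed, num_lookup, dim)
    process = self_attention_macs(num_compressed, dim)
    scatter = cross_attention_macs(num_lookup, num_compressed, dim)
    return gather + process + scatter


if __name__ == "__main__":
    # Illustrative sizes only; none of these are taken from the paper.
    n, m, d = 196, 25, 768
    full = self_attention_macs(n, d)
    compressed = lookup_style_macs(n, m, d)
    print(f"full self-attention: {full:,} MACs")
    print(f"compressed variant:  {compressed:,} MACs")
    print(f"ratio: {full / compressed:.2f}x")
```

Under this simplified model, the compressed variant scales linearly in the number of higher-resolution tokens for a fixed compressed-token count, which is the intuition behind the savings. The ratio it prints covers attention products only and should not be read as the end-to-end $2\times$ FLOPs reduction reported above.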
