Hydra Attention: Efficient Attention with Many Heads

Abstract

While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which, in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
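The linear cost claimed above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: softmax is replaced by a decomposable product kernel, and queries and keys are L2-normalized per token. The function names (`l2_normalize`, `hydra_attention`, `per_feature_heads_quadratic`) are hypothetical and are not taken from the authors' code. With one head per feature, each head's attention matrix is an outer product of two length-T vectors, so the sum over tokens can be computed once and reused for every query.

```python
import math


def l2_normalize(rows):
    """Scale each token's feature vector to unit L2 norm (assumed kernel)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # avoid dividing by zero
        out.append([x / norm for x in row])
    return out


def hydra_attention(q, k, v):
    """Linear-cost sketch. q, k, v are T x D nested lists (T tokens, D features)."""
    q, k = l2_normalize(q), l2_normalize(k)
    d = len(v[0])
    # Global summary: one pass over all tokens, O(T * D) multiplications.
    kv = [sum(k[t][f] * v[t][f] for t in range(len(k))) for f in range(d)]
    # Gate the shared summary with each query: another O(T * D) pass.
    return [[q_t[f] * kv[f] for f in range(d)] for q_t in q]


def per_feature_heads_quadratic(q, k, v):
    """Reference: D heads of width 1, building an explicit T x T matrix per head."""
    q, k = l2_normalize(q), l2_normalize(k)
    t_count, d = len(q), len(q[0])
    out = [[0.0] * d for _ in range(t_count)]
    for f in range(d):
        # Attention matrix for head f (no softmax, by assumption).
        attn = [[q[i][f] * k[j][f] for j in range(t_count)] for i in range(t_count)]
        for i in range(t_count):
            out[i][f] = sum(attn[i][j] * v[j][f] for j in range(t_count))
    return out
```

Under these assumptions the two functions return the same values up to floating-point rounding, but the first never materializes a T x T matrix: it needs on the order of T·D multiplications, while the reference needs on the order of T²·D.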
