On the Relationship between Self-Attention and Convolutional Layers

Abstract

Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as powerful as any convolutional layer. Our numerical experiments then show that the phenomenon also occurs in practice, corroborating our analysis. Our code is publicly available.
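
The expressivity claim can be illustrated numerically with a small sketch. This is not the paper's proof or its released code. It assumes that each head's attention score depends only on the relative position between query and key pixels and peaks sharply at one fixed shift, so that a K x K kernel is covered by K^2 heads, one per kernel position. The helper names `conv2d_valid` and `attention_as_conv` and the sharpness parameter `alpha` are hypothetical, the kernel size is assumed odd, and only interior pixels are compared so that padding does not enter the picture.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Direct convolution (cross-correlation) over interior pixels.
    x: (H, W, Din), kernel: (K, K, Din, Dout)."""
    H, W, _ = x.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1, kernel.shape[3]))
    for a in range(K):
        for b in range(K):
            out += x[a:a + H - K + 1, b:b + W - K + 1] @ kernel[a, b]
    return out

def attention_as_conv(x, kernel, alpha=50.0):
    """Sum of K*K attention heads, each focused on one relative shift.
    alpha is an assumed sharpness: large values make the softmax nearly one-hot."""
    H, W, Din = x.shape
    K = kernel.shape[0]
    c = K // 2  # kernel centre (K assumed odd)
    grid = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack(grid, axis=-1).reshape(-1, 2)   # (N, 2) pixel positions
    tokens = x.reshape(-1, Din)                       # (N, Din) flattened pixels
    rel = coords[None, :, :] - coords[:, None, :]     # rel[q, k] = key pos - query pos
    out = np.zeros((H * W, kernel.shape[3]))
    for a in range(K):
        for b in range(K):
            shift = np.array([a - c, b - c])          # the shift this head attends to
            scores = -alpha * np.sum((rel - shift) ** 2, axis=-1)
            scores -= scores.max(axis=1, keepdims=True)  # stabilise the softmax
            probs = np.exp(scores)
            probs /= probs.sum(axis=1, keepdims=True)
            # kernel[a, b] plays the role of this head's value/output projection
            out += probs @ tokens @ kernel[a, b]
    out = out.reshape(H, W, -1)
    return out[c:H - c, c:W - c]                      # keep interior pixels only

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 7, 3))
kernel = rng.normal(size=(3, 3, 3, 4))
diff = np.max(np.abs(attention_as_conv(x, kernel) - conv2d_valid(x, kernel)))
print("max abs difference on interior pixels:", diff)
```

With a large `alpha`, each head's softmax concentrates almost entirely on the single pixel at its shift, so the summed head outputs should agree with the direct convolution up to floating-point error. Lowering `alpha` spreads the attention over neighbouring pixels and the two outputs drift apart.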
