Abstract
Recent trends of incorporating attention mechanisms in vision have ledresearchers to reconsider the supremacy of convolutional layers as a primarybuilding block. Beyond helping CNNs to handle longrange dependencies,Ramachandran et al. (2019) showed that attention can completely replaceconvolution and achieve stateoftheart performance on vision tasks. Thisraises the question: do learned attention layers operate similarly toconvolutional layers? This work provides evidence that attention layers canperform convolution and, indeed, they often learn to do so in practice.Specifically, we prove that a multihead selfattention layer with sufficientnumber of heads is at least as powerful as any convolutional layer. Ournumerical experiments then show that the phenomenon also occurs in practice,corroborating our analysis. Our code is publicly available.
Quick Read (beta)
On the Relationship between SelfAttention and Convolutional Layers
Abstract
Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle longrange dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve stateoftheart performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multihead selfattention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that the phenomenon also occurs in practice, corroborating our analysis. Our code is publicly available^{1}^{1} 1 https://github.com/epfml/attentioncnn/tree/arxivv1 .
capbtabboxtable[][\FBwidth]
On the Relationship between SelfAttention and Convolutional Layers
JeanBaptiste Cordonnier, Andreas Loukas & Martin Jaggi 

École Polytechnique Fédérale de Lausanne (EPFL) 
{first.last}@epfl.ch 
1 Introduction
Recent advances in Natural Language Processing (NLP) are largely attributed to the rise of the transformer (Vaswani et al., 2017). Pretrained to solve an unsupervised task on large corpora of text, transformerbased architectures, such as GPT2 (Radford et al., 2018), BERT (Devlin et al., 2018) and TransformerXL (Dai et al., 2019), seem to possess the capacity to learn the underlying structure of text and, as a consequence, to learn representations that generalize across tasks. The key difference between transformers and previous methods, such as recurrent neural networks (Hochreiter & Schmidhuber, 1997) and convolutional neural networks (CNN), is that the former can simultaneously attend to every word of their input sequence. This is made possible thanks to the attention mechanism—originally introduced in Neural Machine Translation to better handle longrange dependencies (Bahdanau et al., 2015). With selfattention in particular, the similarity of two words in a sequence is captured by an attention score measuring the distance of their representations. The representation of each word is then updated based on those words whose attention score is highest.
Inspired by its capacity to learn meaningful interdependencies between words, researchers have recently considered utilizing selfattention in vision tasks. Selfattention was first added to CNN by either using channelbased attention (Hu et al., 2018) or nonlocal relationships across the image (Wang et al., 2018). More recently, Bello et al. (2019) augmented CNNs by replacing some convolutional layers with selfattention layers, leading to improvements on image classification and object detection tasks. Interestingly, Ramachandran et al. (2019) noticed that, even though stateofthe art results are reached when attention and convolutional features are combined, under same computation and model size constraints, selfattentiononly architectures also reach competitive image classification accuracy.
These findings raise the question, do selfattention layers process images in a similar manner to convolutional layers? From a theoretical perspective, one could argue that transfomers have the capacity to simulate any function—including a CNN. Indeed, Pérez et al. (2019) showed that a multilayer attentionbased architecture with additive positional encodings is Turing complete under some strong theoretical assumptions, such as unbounded precision arithmetic. Unfortunately, universality results do not reveal how a machine solves a task, only that it has the capacity to do so. Thus, the question of how selfattention layers actually process images remains open.
Contributions.
In this work, we put forth theoretical and empirical evidence that selfattention layers can (and do) learn to behave similar to convolutional layers:

I.
From a theoretical perspective, we provide a constructive proof showing that selfattention layers can express any convolutional layers.
Specifically, we show that a single multihead selfattention layer using relative positional encoding can be reparametrized to express any convolutional layer. Our insights lead to a relative positional encoding, that we refer to as quadratic encoding, that is very efficient in terms of size.

II.
Our experiments show that the first few layers of attentiononly architectures (Ramachandran et al., 2019) do learn to attend on gridlike pattern around each query pixel, similar to our theoretical construction.
Strikingly, this behavior is confirmed both for our quadratic encoding, but also for relative encoding that is learned during training. Our results seem to suggest that localized convolution is the right inductive bias for the first few layers of an image classifying network. For deeper layers, on the other hand, longrange as well as horizontallysymmetric interdependencies become more relevant.
For reproducibility purposes, our code is publicly available on GitHub^{2}^{2} 2 https://github.com/epfml/attentioncnn/tree/arxivv1 .
2 Background on Attention Mechanisms for Vision
We here recall the mathematical formulation of selfattention layers and emphasize the role of positional encodings.
2.1 The MultiHead SelfAttention Layer
Let $\bm{X}\in {\mathbb{R}}^{T\times {D}_{\text{\mathit{i}\mathit{n}}}}$ be an input matrix consisting of $T$ tokens in of ${D}_{\text{\mathit{i}\mathit{n}}}$ dimensions each. While in NLP each token corresponds to a word in a sentence, the same formalism can be applied to any sequence of $T$ discrete objects, e.g. pixels. A selfattention layer maps any query token $t\in [T]$ from ${D}_{\text{\mathit{i}\mathit{n}}}$ to ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ dimensions as follows:
$\mathrm{Self}\mathrm{Attention}{(\bm{X})}_{t,:}$  $:=\mathrm{softmax}\left({\bm{A}}_{t,:}\right)\bm{X}{\bm{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}},$  (1) 
where we refer to the elements of the $T\times T$ matrix
$\bm{A}$  $:=\bm{X}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{\bm{X}}^{\top}$  (2) 
as attention scores and the softmax output^{3}^{3} 3 $\mathrm{softmax}{\left({\bm{A}}_{t,:}\right)}_{k}=\text{exp}({\bm{A}}_{t,k})/{\sum}_{p}\text{exp}({\bm{A}}_{t,p})$ as attention probabilities. The layer is parametrized by a query matrix ${\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}\in {\mathbb{R}}^{{D}_{\text{\mathit{i}\mathit{n}}}\times {D}_{k}}$, a key matrix ${\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}\in {\mathbb{R}}^{{D}_{\text{\mathit{i}\mathit{n}}}\times {D}_{k}}$ and a value matrix ${\bm{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}}\in {\mathbb{R}}^{{D}_{\text{\mathit{i}\mathit{n}}}\times {D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}}$.For simplicity, we exclude any residual connections, batch normalization and constant factors.
A key property of the selfattention model described above is that it is equivariant to reordering, that is, it gives the same output independently of how the $T$ input tokens are shuffled. This is problematic for cases we expect the order of things to matter. To alleviate the limitation, a positional encoding is learned for each token in the sequence (or pixel in an image), and added to the representation of the token itself before applying selfattention
$\bm{A}$  $:=(\bm{X}+\bm{P}){\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{(\bm{X}+\bm{P})}^{\top},$  (3) 
where $\bm{P}\in {\mathbb{R}}^{T\times {D}_{\text{\mathit{i}\mathit{n}}}}$ contains the embedding vectors for each position. More generally, $\bm{P}$ may be substituted by any function that returns a vector representation of the position.
It has been found beneficial in practice to replicate this selfattention mechanism into multiple heads, each being able to focus on different parts of the input by using different query, key and value matrices. In multihead selfattention, the output of the ${N}_{h}$ heads of output dimension ${D}_{h}$ are concatenated and projected to dimension ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ as follows:
$\mathrm{MHSA}(\bm{X}):=\underset{h\in [{N}_{h}]}{concat}\left[{\mathrm{Self}\mathrm{Attention}}_{h}(\bm{X})\right]{\bm{W}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}+{\bm{b}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$  (4) 
and two new parameters are introduced: the projection matrix ${\bm{W}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}\in {\mathbb{R}}^{{N}_{h}{D}_{h}\times {D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}}$ and a bias term ${\bm{b}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}\in {\mathbb{R}}^{{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}}$.
2.2 Attention for Images
Convolutional layers are the de facto choice for building neural networks that operate on images. We recall that, given an image tensor $\bm{X}\in {\mathbb{R}}^{W\times H\times {D}_{\text{\mathit{i}\mathit{n}}}}$ of width $W$, height $H$ and ${D}_{\text{\mathit{i}\mathit{n}}}$ channels, the output of a convolutional layer for pixel $(i,j)$ is given by
$\mathrm{Conv}{(\bm{X})}_{i,j,:}:={\displaystyle \sum _{({\delta}_{1},{\delta}_{2})\in \mathrm{\Delta}{\mathrm{\Delta}}_{K}}}{\bm{W}}_{{\delta}_{1},{\delta}_{2},:,:}{\bm{X}}_{i+{\delta}_{1},j+{\delta}_{2},:}+\bm{b},$  (5) 
where $\bm{W}$ is the $K\times K\times {D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}\times {D}_{\text{\mathit{i}\mathit{n}}}$ weight tensor ^{4}^{4} 4 To simplify notation, we index the first two dimensions of the tensor from $\lfloor K/2\rfloor $ to $\lfloor K/2\rfloor $., $\bm{b}\in {\mathbb{R}}^{{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}}$ is the bias vector and the set
$$\mathrm{\Delta}{\mathrm{\Delta}}_{K}:=[\lfloor \frac{K}{2}\rfloor ,\mathrm{\cdots},\lfloor \frac{K}{2}\rfloor ]\times [\lfloor \frac{K}{2}\rfloor ,\mathrm{\cdots},\lfloor \frac{K}{2}\rfloor ]$$ 
contains all possible shifts appearing when convolving the image with a $K\times K$ kernel.
In the following, we review how selfattention can be adapted from 1D sequences to images.
With images, rather than tokens, we have query and key pixels $\bm{q},\bm{k}\in [W]\times [H]$. Accordingly, the input is a tensor $\bm{X}$ of dimension $W\times H\times {D}_{\text{\mathit{i}\mathit{n}}}$ and each attention score associates a query and a key pixel.
To keep the formulas consistent with the 1D case, we abuse notation and slice tensors by using a 2D index vector: if $\bm{p}=(i,j)$, we write ${\bm{X}}_{\bm{p},:}$ and ${\bm{A}}_{\bm{p},:}$ to mean ${\bm{X}}_{i,j,:}$ and ${\bm{A}}_{i,j,:,:}$, respectively. With this notation in place, the multihead self attention layer output at pixel $\bm{q}$ can be expressed as follows:
$\mathrm{Self}\mathrm{Attention}{(\bm{X})}_{\bm{q},:}$  $={\displaystyle \sum _{\bm{k}}}\mathrm{softmax}{\left({\bm{A}}_{\bm{q},:}\right)}_{\bm{k}}{\bm{X}}_{\bm{k},:}{\bm{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}}$  (6) 
and accordingly for the multihead case.
2.3 Positional Encoding for Images
There are two types of positional encoding that has been used in transformerbased architectures: the absolute and relative encoding (see also creftypecap 1 in the Appendix).
With absolute encodings, a (fixed or learned) vector ${\bm{P}}_{\bm{p},:}$ is assigned to each pixel $\bm{p}$. The computation of the attention scores we saw in creftype 2 can then be decomposed as follows:
${\bm{A}}_{\bm{q},\bm{k}}^{\mathrm{abs}}$  $=({\bm{X}}_{\bm{q},:}+{\bm{P}}_{\bm{q},:}){\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{({\bm{X}}_{\bm{k},:}+{\bm{P}}_{\bm{k},:})}^{\top}$  
$={\bm{X}}_{\bm{q},:}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{\bm{X}}_{\bm{k},:}^{\top}+{\bm{X}}_{\bm{q},:}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{\bm{P}}_{\bm{k},:}^{\top}+{\bm{P}}_{\bm{q},:}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{\bm{X}}_{\bm{k},:}+{\bm{P}}_{\bm{q},:}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}^{\top}{\bm{P}}_{\bm{k},:}$  (7) 
where $\bm{q}$ and $\bm{k}$ correspond to the query and key pixels, respectively.
The relative positional encoding was introduced by Dai et al. (2019). The main idea is to only consider the position difference between the query pixel (pixel we compute the representation of) and the key pixel (pixel we attend) instead of the absolute position of the key pixel:
${\bm{A}}_{\bm{q},\bm{k}}^{\mathrm{rel}}$  $:={\bm{X}}_{\bm{q},:}^{\top}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}^{\top}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}{\bm{X}}_{\bm{k},:}+{\bm{X}}_{\bm{q},:}^{\top}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}^{\top}{\widehat{\bm{W}}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}{\bm{r}}_{\bm{\delta}}+{\bm{u}}^{\top}{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}{\bm{X}}_{\bm{k},:}+{\bm{v}}^{\top}{\widehat{\bm{W}}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}{\bm{r}}_{\bm{\delta}}$  (8) 
In this manner, the attention scores only depend on the shift $\bm{\delta}:=\bm{q}\bm{k}$. Above, the learnable vectors $\bm{u}$ and $\bm{v}$ are unique for each head, whereas for every shift $\bm{\delta}$ the relative positional encoding ${\bm{r}}_{\bm{\delta}}\in {\mathbb{R}}^{{D}_{p}}$ is shared by all layers and heads. Moreover, now the key weights are split into two types: ${\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}$ pertain to the input and ${\widehat{\bm{W}}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}$ to the relative position of pixels.
3 SelfAttention as a Convolutional Layer
This section derives sufficient conditions such that a multihead selfattention layer can simulate a convolutional layer. Our main result is the following:
Theorem 1.
A multihead selfattention layer with ${N}_{h}$ heads of dimension ${D}_{h}$, output dimension ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ and a relative positional encoding of dimension ${D}_{p}\mathrm{\ge}\mathrm{3}$ can express any convolutional layer of kernel size $\sqrt{{N}_{h}}\mathrm{\times}\sqrt{{N}_{h}}$ and $\mathrm{min}\mathit{}\mathrm{(}{D}_{h}\mathrm{,}{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}\mathrm{)}$ output channels.
The theorem is proven constructively by selecting the parameters of the multihead selfattention layer so that the latter acts like a convolutional layer. In the proposed construction, the attention scores of each selfattention head should attend to a different relative shift within the set $\mathrm{\Delta}{\mathrm{\Delta}}_{K}={\{\lfloor K/2\rfloor ,\mathrm{\dots},\lfloor K/2\rfloor \}}^{2}$ of all pixel shifts in a $K\times K$ kernel. The exact condition can be found in the statement of Lemma 1.
Then, Lemma 2 shows that the aforementioned condition is satisfied for the relative positional encoding that we refer to as the quadratic encoding:
${\bm{v}}^{(h)}:={\alpha}^{(h)}(1,2{\mathbf{\Delta}}_{1}^{(h)},2{\mathbf{\Delta}}_{2}^{(h)})\mathit{\hspace{1em}}{\bm{r}}_{\bm{\delta}}$  $:=({\parallel \bm{\delta}\parallel}^{2},{\bm{\delta}}_{1},{\bm{\delta}}_{2})\mathit{\hspace{1em}}{\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}={\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}:=\mathrm{\U0001d7ce}\mathit{\hspace{1em}}\widehat{{\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}}:=\bm{I}$  (9) 
The learned parameters ${\mathbf{\Delta}}^{(h)}=({\mathbf{\Delta}}_{1}^{(h)},{\mathbf{\Delta}}_{2}^{(h)})$ and ${\alpha}^{(h)}$ determine the center and width of attention of each head, respectively. On the other hand, $\bm{\delta}=({\bm{\delta}}_{1},{\bm{\delta}}_{2})$ is fixed and expresses the relative shift between query and key pixels.
It is important to stress that the above encoding is not the only one for which the conditions of Lemma 1 are satisfied. In fact, in our experiments, the relative encoding learned by the neural network also matched the conditions of the lemma (despite being different from the quadratic encoding). Nevertheless, the encoding defined above is very efficient in terms of size, as only ${D}_{p}=3$ dimensions suffice to encode the relative position of pixels, while also reaching similar or better empirical performance (than the learned one). Though we lack a formal proof, we conjecture that every encoding that satisfies Lemma 1 should have at least three dimensions.
Remark for the 1D case.
Convolutional layers acting on sequences are commonly used in the literature for text (Kim, 2014), as well as audio (van den Oord et al., 2016) and time series (Franceschi et al., 2019). Theorem 1 can be straightforwardly extended to show that multihead selfattention with ${N}_{h}$ heads can also simulate a 1D convolutional layer with a kernel of size $K={N}_{h}$ with $\mathrm{min}({D}_{h},{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}})$ output channels using a positional encoding of dimension ${D}_{p}\ge 2$. Since we have not tested empirically if the preceding construction matches the behavior of 1D selfattention in practice, we cannot claim that it actually learns to convolve an input sequence—only that it has the capacity to do so.
3.1 Proof of Main Theorem
Lemma 1.
Consider a multihead selfattention layer consisting of ${N}_{h}\mathrm{=}{K}^{\mathrm{2}}$ heads, ${D}_{h}\mathrm{\ge}{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ and let $\mathbf{f}\mathrm{:}\mathrm{[}{N}_{h}\mathrm{]}\mathrm{\to}\mathrm{\Delta}\mathit{}{\mathrm{\Delta}}_{K}$ be a bijective mapping of heads onto shifts. Further, suppose that for every head the following holds:
$\mathrm{softmax}{({\bm{A}}_{\bm{q},:}^{(h)})}_{\bm{k}}=\{\begin{array}{cc}1\hfill & \mathit{\text{if}}\bm{f}(h)=\bm{q}\bm{k}\hfill \\ 0\hfill & \mathit{\text{otherwise}}.\hfill \end{array}$  (10) 
Then, for any convolutional layer with a $K\mathrm{\times}K$ kernel and ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ output channels, there exists ${\mathrm{\{}{\mathbf{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}}^{\mathrm{(}h\mathrm{)}}\mathrm{\}}}_{h\mathrm{\in}\mathrm{[}{N}_{h}\mathrm{]}}$ such that $\mathrm{MHSA}\mathit{}\mathrm{(}\mathbf{X}\mathrm{)}\mathrm{=}\mathrm{Conv}\mathit{}\mathrm{(}\mathbf{X}\mathrm{)}$ for every $\mathbf{X}\mathrm{\in}{\mathrm{R}}^{W\mathrm{\times}H\mathrm{\times}{D}_{\text{\mathit{i}\mathit{n}}}}$.
Proof.
Our first step will be to rework the expression of the MultiHead SelfAttention operator from equation (1) and equation (4) such that the effect of the multiple heads becomes more transparent:
$\mathrm{MHSA}(\bm{X})$  $={\bm{b}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}+{\displaystyle \sum _{h\in [{N}_{h}]}}\mathrm{softmax}({\bm{A}}^{(h)})\bm{X}\underset{{\bm{W}}^{(h)}}{\underset{\u23df}{{\bm{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}}^{(h)}{\bm{W}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}[(h1){D}_{h}+1:h{D}_{h}+1]}}$  (11) 
Note that each head’s value matrix ${\bm{W}}_{\text{\mathit{v}\mathit{a}\mathit{l}}}^{(h)}\in {\mathbb{R}}^{{D}_{\text{\mathit{i}\mathit{n}}}\times {D}_{h}}$ and each block of the projection matrix ${\bm{W}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ of dimension ${D}_{h}\times {D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ are learned. Assuming that ${D}_{h}\ge {D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$, we can replace each pair of matrices by a learned matrix ${\bm{W}}^{(h)}$ for each head. We consider one output pixel of the multihead selfattention:
$\mathrm{MHSA}{(\bm{X})}_{\bm{q},:}={\displaystyle \sum _{h\in [{N}_{h}]}}\left({\displaystyle \sum _{\bm{k}}}\mathrm{softmax}{({\bm{A}}_{\bm{q},:}^{(h)})}_{\bm{k}}{\bm{X}}_{\bm{k},:}\right){\bm{W}}^{(h)}+{\bm{b}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$  (12) 
Due to the conditions of the Lemma, for the $h$th attention head the attention probability is one when $\bm{k}=\bm{q}\bm{f}(h)$ and zero otherwise. The layer’s output at pixel $\bm{q}$ is thus equal to
$\mathrm{MHSA}{(\bm{X})}_{\bm{q}}$  $={\displaystyle \sum _{h\in [{N}_{h}]}}{\bm{X}}_{\bm{q}\bm{f}(h),:}{\bm{W}}^{(h)}+{\bm{b}}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$  (13) 
For $K=\sqrt{{N}_{h}}$, the above can be seen to be equivalent to a convolutional layer expressed in eq. 5: there is a one to one mapping (implied by map $\bm{f}$) between the matrices ${\bm{W}}^{(h)}$ for $h=[{N}_{h}]$ and the matrices ${\bm{W}}_{{k}_{1},{k}_{2},:,:}$ for all $({k}_{1},{k}_{2})\in {[K]}^{2}.$ ∎
Remark about ${D}_{h}$ and ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$.
It is frequent in transformerbased architectures to set ${D}_{h}={D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}/{N}_{h}$, hence $$. In that case, ${\bm{W}}^{(h)}$ can be seen to be of rank ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}{D}_{h}$, which does not suffice to express every convolutional layer with ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ channels. Nevertheless, it can be seen that any ${D}_{h}$ out of ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ outputs of $\mathrm{MHSA}(\bm{X})$ can express the output of any convolutional layer with ${D}_{h}$ output channels. To cover both cases, in the statement of the main theorem we assert that the output channels of the convolutional layer should be $\mathrm{min}({D}_{h},{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}})$. In practice, we advise to concatenate heads of dimension ${D}_{h}={D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ instead of splitting the ${D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ dimensions among heads to have exact reparametrization and no “unused” channels.
Lemma 2.
There exists a relative encoding scheme ${\mathrm{\{}{\mathbf{r}}_{\mathbf{\delta}}\mathrm{\in}{\mathrm{R}}^{{D}_{p}}\mathrm{\}}}_{\mathbf{\delta}\mathrm{\in}{\mathrm{Z}}^{\mathrm{2}}}$ with ${D}_{p}\mathrm{\ge}\mathrm{3}$ and parameters ${\mathbf{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}\mathrm{,}{\mathbf{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}\mathrm{,}{\widehat{\mathbf{W}}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}\mathrm{,}\mathbf{u}$ with ${D}_{p}\mathrm{\le}{D}_{k}$ such that, for every $\mathrm{\Delta}\mathrm{\in}\mathrm{\Delta}\mathit{}{\mathrm{\Delta}}_{K}$ there exists some vector $\mathbf{v}$ (conditioned on $\mathrm{\Delta}$) yielding $\mathrm{softmax}\mathit{}{\mathrm{(}{\bm{A}}_{\mathbf{q}\mathrm{,}\mathrm{:}}\mathrm{)}}_{\mathbf{k}}\mathrm{=}\mathrm{1}$ if $\mathbf{q}\mathrm{}\mathbf{k}\mathrm{=}\mathrm{\Delta}$ and zero, otherwise.
Proof.
We show by construction the existence of a ${D}_{p}=3$ dimensional relative encoding scheme yielding the required attention probabilities.
As the attention probabilities are independent of the input tensor $\bm{X}$, we set ${\bm{W}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}={\bm{W}}_{\text{\mathit{q}\mathit{r}\mathit{y}}}=\mathrm{\U0001d7ce}$ which leaves only the last term of creftype 8. Setting ${\widehat{\bm{W}}}_{\text{\mathit{k}\mathit{e}\mathit{y}}}\in {\mathbb{R}}^{{D}_{k}\times {D}_{p}}$ to the identity matrix (with appropriate row padding), yields ${\bm{A}}_{\bm{q},\bm{k}}={\bm{v}}^{\top}{\bm{r}}_{\bm{\delta}}$ where $\bm{\delta}:=\bm{q}\bm{k}$. Above, we have assumed that ${D}_{p}\le {D}_{k}$ such that no information from ${\bm{r}}_{\bm{\delta}}$ is lost.
Now, suppose that we could write:
${\bm{A}}_{\bm{q},\bm{k}}=\alpha ({\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}+c)$  (14) 
for some constant $c$. In the above expression, the maximum attention score over ${\bm{A}}_{\bm{q},:}$ is $\alpha c$ and it is reached for ${\bm{A}}_{\bm{q},\bm{k}}$ with $\bm{\delta}=\mathbf{\Delta}$. On the other hand, the $\alpha $ coefficient can be used to scale arbitrarily the difference between ${\bm{A}}_{\bm{q},\mathbf{\Delta}}$ and the other attention scores.
In this way, for $\bm{\delta}=\mathbf{\Delta}$, we have
$\underset{\alpha \to \mathrm{\infty}}{lim}\mathrm{softmax}{({\bm{A}}_{\bm{q},:})}_{\bm{k}}$  $=\underset{\alpha \to \mathrm{\infty}}{lim}{\displaystyle \frac{{e}^{\alpha ({\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}+c)}}{{\sum}_{{\bm{k}}^{\prime}}{e}^{\alpha ({\parallel (\bm{q}{\bm{k}}^{\prime})\mathbf{\Delta}\parallel}^{2}+c)}}}$  
$=\underset{\alpha \to \mathrm{\infty}}{lim}{\displaystyle \frac{{e}^{\alpha {\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}}{e}^{\alpha c}}{{\sum}_{{\bm{k}}^{\prime}}{e}^{\alpha {\parallel (\bm{q}{\bm{k}}^{\prime})\mathbf{\Delta}\parallel}^{2}}{e}^{\alpha c}}}$  
$=\underset{\alpha \to \mathrm{\infty}}{lim}{\displaystyle \frac{{e}^{\alpha {\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}}}{{\sum}_{{\bm{k}}^{\prime}}{e}^{\alpha {\parallel (\bm{q}{\bm{k}}^{\prime})\mathbf{\Delta}\parallel}^{2}}}}={\displaystyle \frac{1}{1+{lim}_{\alpha \to \mathrm{\infty}}{\sum}_{{\bm{k}}^{\prime}\ne \bm{k}}{e}^{\alpha {\parallel (\bm{q}{\bm{k}}^{\prime})\mathbf{\Delta}\parallel}^{2}}}}=1$ 
and for $\bm{\delta}\ne \mathbf{\Delta}$, the equation becomes ${lim}_{\alpha \to \mathrm{\infty}}\mathrm{softmax}{({\bm{A}}_{\bm{q},:})}_{\bm{k}}=0,$ exactly as needed to satisfy the lemma statement.
What remains is to prove that there exist $\bm{v}$ and ${\{{\bm{r}}_{\bm{\delta}}\}}_{\bm{\delta}\in {\mathcal{Z}}^{2}}$ for which creftype 14 holds. Expanding the rhs of the equation, we have $\alpha ({\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}+c)=\alpha ({\parallel \bm{\delta}\parallel}^{2}+{\parallel \mathbf{\Delta}\parallel}^{2}2\u27e8\bm{\delta},\mathbf{\Delta}\u27e9+c).$ Now if we set $\bm{v}=\alpha (1,2{\mathbf{\Delta}}_{1},2{\mathbf{\Delta}}_{2})$ and ${\bm{r}}_{\bm{\delta}}=({\parallel \bm{\delta}\parallel}^{2},{\bm{\delta}}_{1},{\bm{\delta}}_{2}),$ then
$${\bm{A}}_{\bm{q},\bm{k}}={\bm{v}}^{\top}{\bm{r}}_{\bm{\delta}}=\alpha ({\parallel \bm{\delta}\parallel}^{2}2{\mathbf{\Delta}}_{1}{\bm{\delta}}_{1}2{\mathbf{\Delta}}_{2}{\bm{\delta}}_{2})=\alpha ({\parallel \bm{\delta}\parallel}^{2}2\u27e8\bm{\delta},\mathbf{\Delta}\u27e9)=\alpha ({\parallel \bm{\delta}\mathbf{\Delta}\parallel}^{2}{\parallel \mathbf{\Delta}\parallel}^{2}),$$ 
which matches creftype 14 with $c={\parallel \mathbf{\Delta}\parallel}^{2}$ and the proof is concluded. ∎
4 Experiments
The aim of this section is to validate the applicability of our theoretical results—which state that selfattention can perform convolution—and to examine whether selfattention layers in practice do actually learn to operate like convolutional layers, when being trained on standard image classification tasks. In particular, we study the relationship between selfattention and convolution with quadratic and learned relative positional encodings. We find that for both cases, the attention probabilities learned tend to respect the conditions of Lemma 1, corroborating our hypothesis.
4.1 Implementation Details
We study a fully attentional model consisting of six multihead selfattention layers. As it has already been shown by Bello et al. (2019) that combining attention features with convolutional features improves performance on Cifar100 and ImageNet, we do not focus on attaining stateoftheart performance. Nevertheless, to validate that our model learns a meaningful classifier we compare it to the standard ResNet18 (He et al., 2015) on the CIFAR10 dataset (Krizhevsky et al., ). In all experiments, we use a $2\times 2$ invertible downsampling (Jacobsen et al., 2018) on the input to reduce the size of the image as storing the attention coefficient tensor requires a large amount of GPU memory. The fixed size representation of the input image is computed as the average pooling of the last layer representations and given to a linear classifier.
We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch Transformers^{5}^{5} 5 https://github.com/huggingface/pytorchtransformers. We release our code on Github^{6}^{6} 6 https://github.com/epfml/attentioncnn/tree/arxivv1 and all hyperparameters are in creftypecap 2 in the Appendix.
4.2 Quadratic Encoding
As a first step, we aim to verify that, with the relative position encoding introduced in equation (9), attention layers learn to behave like convolutional layers. We train nine attention heads at each layer to be on par with the $3\times 3$ kernels used predominantly by the ResNet architecture. The center of attention of each head $h$ is initialized to ${\mathrm{\Delta}}^{(h)}\sim \mathcal{N}(\mathrm{\U0001d7ce},2{\bm{I}}_{2})$.
creftypecap 2 shows how the initial positions of the heads (different colors) at layer 4 changed during training. We can see that after optimization, the heads attend on specific pixel of the image forming a grid around the query pixel. Our intuition that SelfAttention applied to images learn convolutional filter around the queried pixel is then confirmed.
creftypecap 3 displays all attention head at each layer of the model at the end of the training. It can be seen that in the first few layers the heads tend to focus on local patterns (layers 1 and 2), while deeper layers (layers 36) also attend to larger patterns by positioning the center of attention further from the queried pixel position. We also include in the Appendix a plot of the attention positions for a higher number of heads (${N}_{h}=16$), creftypecap 10 displays both local patterns similar to CNN and long range dependencies. Interestingly, attention heads do not overlap and seem to take an arrangement maximizing the coverage of the input space.
To verify that our selfattention model performs equally well as a small ResNet (creftypecap 5), in creftypecap 6 we display the evolution of the test accuracy on CIFAR10 over the 300 epochs of training. The ResNet is faster to converge, but we cannot ascertain whether this corresponds to an inherent property of the architecture or an artifact of the adopted optimization procedures. Our implementation could be optimized to exploit the locality of Gaussian attention probabilities and reduce significantly the number of FLOPS.
Models  accuracy  # of params  # of FLOPS 

ResNet18  0.938  11.2M  1.1B 
SA quadratic  0.938  12.1M  6.2B 
SA learned  0.918  12.3M  6.2B 
4.3 Learned Relative Positional Encoding
We move on to study the positional encoding used in practice by fullyattentional models on images.
We implemented the 2D relative positional encoding scheme used by (Ramachandran et al., 2019; Bello et al., 2019): we learn a $\lfloor {D}_{p}/2\rfloor $ position encoding vector for each row and each column pixel shift. Hence the relative positional encoding of a key pixel at position $\bm{k}$ with a query pixel at position $\bm{q}$ is the concatenation of the row shift embedding ${\bm{\delta}}_{1}$ and the column shift embedding ${\bm{\delta}}_{2}$ (where $\bm{\delta}=\bm{k}\bm{q}$). We chose ${D}_{p}={D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}=400$ in the experiment. We differ from the (unpublished) implementation described by Ramachandran et al. (2019) in the following points: (i) we do not use convolution stem and ResNet bottlenecks for downsampling, but only a $2\times 2$ invertible downsampling layer (Jacobsen et al., 2018) at input, (ii) we use ${D}_{h}={D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}$ instead of ${D}_{h}={D}_{\text{\mathit{o}\mathit{u}\mathit{t}}}/{N}_{h}$ backed from our theory that the effective number of learned filters is $\mathrm{min}({D}_{h},{D}_{\text{\mathit{o}\mathit{u}\mathit{t}}})$, (iii) the attention scores are computed using only the relative positions of the pixels and not the data. As seen in creftypecap 5, our implementation achieves accuracy close to that of ResNet18.
The attention probabilities of each head at each layer are displayed on creftypecap 6. The figure confirms our hypothesis for the first two layers and partially for the third: even when left to learn the encoding from the data, certain selfattention heads (depicted on the left) learn to attend to individual pixels, closely matching the condition of Lemma 1 and thus Theorem 1. At the same time, other heads pay attention to horizontallysymmetric but nonlocalized patterns, as well as to longrange pixel interdependencies. The phenomenon is particularly prominent for layers four to six, where the behavior of selfattention can be seen to deviate from that of convolution. We also notice that vertical symmetry is much more rare in the learned attention probabilities of high layers. This matches the intuition that, for image classification, distinguishing between what is above or below something is more crucial than what is left or right. Finally, the fact that some of the heads in the last two layers seem to be redundant, likely indicating that the computational and space complexity of the model could be amenable to further reduction, for example by pruning.
5 Conclusion
We showed that selfattention layers applied to images can express any convolutional layer (given sufficiently many heads) and that learned fullyattentional models do behave similar to CNN in practice. More generally, fullyattentional models seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as the filters—similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting directions for future work include translating existing insights from the rich CNNs literature back to transformers on various data modalities, including images, text and time series. Also, though we currently lack the computational resources to do so, we would be interested to test whether our findings are replicated for datasets of largerscale, such as ImageNet and COCO.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, 2015.
 Bello et al. (2019) Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention Augmented Convolutional Networks. arXiv:1904.09925 [cs], April 2019.
 Dai et al. (2017) Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.
 Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. TransformerXL: Attentive Language Models Beyond a FixedLength Context. CoRR, abs/1901.02860, 2019.
 Devlin et al. (2018) Jacob Devlin, MingWei Chang, Kenton Lee, and Kristina Toutanova. BERT: pretraining of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
 Franceschi et al. (2019) JeanYves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for multivariate time series. In NeurIPS 2019, 2019.
 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural Computation, 9(8):1735–1780, 1997.
 Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. Squeezeandexcitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 1822, 2018, pp. 7132–7141, 2018.
 Jacobsen et al. (2018) JörnHenrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. irevnet: Deep invertible networks. In International Conference on Learning Representations, 2018.
 Kim (2014) Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
 (12) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar10 (canadian institute for advanced research).
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
 Pérez et al. (2019) Jorge Pérez, Javier Marinkovic, and Pablo Barceló. On the turing completeness of modern neural network architectures. CoRR, abs/1901.03429, 2019.
 Radford et al. (2018) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2018.
 Ramachandran et al. (2019) Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Standalone selfattention in vision models. CoRR, abs/1906.05909, 2019.
 van den Oord et al. (2016) Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
 Wang et al. (2018) Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Nonlocal neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 1822, 2018, pp. 7794–7803, 2018.
 Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.
 Zampieri (2019) Luca Zampieri. Geometric deep learning for volumetric computational fluid dynamics. pp. 67, 2019.
Appendix A Appendix
A.1 Generalized Quadratic Positional Encoding
We noticed the similarity of the attention probabilities in the quadratic positional encoding (creftypecap 3) to isotropic bivariate Gaussian distributions with bounded support:
$\mathrm{softmax}{({\bm{A}}_{\bm{q},:})}_{\bm{k}}={\displaystyle \frac{{e}^{\alpha {\parallel (\bm{k}\bm{q})\mathbf{\Delta}\parallel}^{2}}}{{\sum}_{{\bm{k}}^{\prime}\in [W]\times [H]}{e}^{\alpha {\parallel ({\bm{k}}^{\prime}\bm{q})\mathbf{\Delta}\parallel}^{2}}}}.$  (15) 
Building on this observation, we further extended our attention mechanism to nonisotropic Gaussian distribution over pixel positions. Each head is parametrized by a center of attention $\mathbf{\Delta}$ and a covariance matrix $\mathbf{\Sigma}$ to obtain the following attention scores,
${\bm{A}}_{\bm{q},\bm{k}}={\displaystyle \frac{1}{2}}{(\bm{\delta}\mathbf{\Delta})}^{\top}{\mathbf{\Sigma}}^{1}(\bm{\delta}\mathbf{\Delta})={\displaystyle \frac{1}{2}}{\bm{\delta}}^{\top}{\mathbf{\Sigma}}^{1}\bm{\delta}+{\bm{\delta}}^{\top}{\mathbf{\Sigma}}^{1}\mathbf{\Delta}{\displaystyle \frac{1}{2}}{\mathbf{\Delta}}^{\top}{\mathbf{\Sigma}}^{1}\mathbf{\Delta},$  (16) 
where, once more, $\bm{\delta}=\bm{k}\bm{q}$. The last term can be discarded because the softmax is shift invariant and we rewrite the attention coefficient as a dot product between the head target vector $\bm{v}$ and the relative position encoding ${\bm{r}}_{\bm{\delta}}$ (consisting of the first and second order combinations of the shift in pixels $\bm{\delta}$):
$$\bm{v}=\frac{1}{2}{(2{({\mathbf{\Sigma}}^{1}\mathbf{\Delta})}_{1},2{({\mathbf{\Sigma}}^{1}\mathbf{\Delta})}_{2},{\mathbf{\Sigma}}_{1,1}^{1},{\mathbf{\Sigma}}_{2,2}^{1},2\cdot {\mathbf{\Sigma}}_{1,2}^{1})}^{\top}\text{and}{\bm{r}}_{\bm{\delta}}={({\bm{\delta}}_{1},{\bm{\delta}}_{2},{\bm{\delta}}_{1}^{2},{\bm{\delta}}_{2}^{2},{\bm{\delta}}_{1}{\bm{\delta}}_{2})}^{\top}.$$ 
Evaluation.
We trained our model using this generalized quadratic relative position encoding. We were curious to see if, using the above encoding the selfattention model would learn to attend to nonisotropic groups of pixels—thus forming unseen patterns in CNNs. Each head was parametrized by $\mathbf{\Delta}\in {\mathbb{R}}^{2}$ and ${\mathbf{\Sigma}}^{1/2}\in {\mathbb{R}}^{2\times 2}$ to ensure that the covariance matrix remained positive semidefinite. We initialized the center of attention to ${\mathbf{\Delta}}^{(h)}\sim \mathcal{N}(\mathrm{\U0001d7ce},2{\bm{I}}_{2})$ and ${\mathbf{\Sigma}}^{1/2}={\bm{I}}_{2}+\mathcal{N}(\mathrm{\U0001d7ce},0.01{\bm{I}}_{2})$ so that initial attention probabilities were close to an isotropic Gaussian. creftypecap 7 shows that the network did learn nonisotropic attention probability patterns, especially in high layers. Nevertheless, the fact that we do not obtain any performance improvement seems to suggest that attention nonisotropy is not particularly helpful in practice—the quadratic positional encoding suffices.
Pruning degenerated heads.
We noticed that certain nonisotropic attention heads attended on “nonintuitive” patches of pixels: either attending a very thin stripe of pixels, when ${\mathbf{\Sigma}}^{1}$ was almost singular, or attending all pixels uniformly, when ${\mathbf{\Sigma}}^{1}$ was close to $\mathrm{\U0001d7ce}$ (i.e. constant attention scores). We asked ourselves, are such attention patterns indeed useful for the model or are these heads degenerated and unused? To find out, we pruned all heads having largest eigenvalues smaller than ${10}^{5}$ or condition number (ratio of the biggest and smallest eigenvalues) greater than ${10}^{5}$. Specifically in our model with 6layer and 9heads each, we pruned $[2,4,1,2,6,0]$ heads from the first to the last layer. This means that these layers cannot express a $3\times 3$ kernel anymore. As shown in yellow on creftype 5, this ablation initially hurts a bit the performance, probably due to off biases, but after a few epochs of continued training with a smaller learning rate (divided by 10) the accuracy recovers its unpruned value. Hence, without sacrificing performance, we reduce the size of the parameters and the number of FLOPS by a fourth.
Models  accuracy  # of params  # of FLOPS 

ResNet18  0.938  11.2M  1.1B 
SA quadratic  0.938  12.1M  6.2B 
SA quadratic generalized  0.934  12.1M  6.2B 
SA quadratic generalized pruned  0.934  9.7M  4.9B 
SA learned  0.918  12.3M  6.2B 
A.2 Increasing the Number of Heads
For completeness, we also tested increasing the number of heads of our architecture from 9 to 16.
Similar to Figure 3, we see that the network distinguishes two main types of attention patterns. Localized heads (i.e., those that attend to nearly individual pixels) appear more frequently in the first few layers. The selfattention layer uses these heads to act in a manner similar to how convolutional layers do. Heads with lesslocalized attention become more common at higher layers.
A.3 Positional Encoding References
Model  type of positional encoding  relative  

sinusoids  learned  quadratic  
Vaswani et al. (2017)  ✓  
Radford et al. (2018)  ✓  
Devlin et al. (2018)  ✓  
Dai et al. (2019)  ✓  ✓  
Yang et al. (2019)  ✓  ✓  
Bello et al. (2019)  ✓  ✓  
Ramachandran et al. (2019)  ✓  ✓  
Our work  ✓  ✓  ✓ 
A.4 Hyperparameters Used in Our Experiments
Hyperparameters  
number of layers  6 
number of heads  9 
hidden dimension  400 
intermediate dimension  512 
invertible pooling width  2 
dropout probability  0.1 
layer normalization epsilon  ${10}^{12}$ 
number of epochs  300 
batch size  100 
learning rate  0.1 
weight decay  0.0001 
momentum  0.9 
cosine decay  ✓ 
linear warm up ratio  0.05 