Abstract
We introduce a new local sparse attention layer that preservestwodimensional geometry and locality. We show that by just replacing the denseattention layer of SAGAN with our construction, we obtain very significant FID,Inception score and pure visual improvements. FID score is improved from$18.65$ to $15.94$ on ImageNet, keeping all other parameters the same. Thesparse attention patterns that we propose for our new layer are designed usinga novel information theoretic criterion that uses information flow graphs. Wealso present a novel way to invert Generative Adversarial Networks withattention. Our method extracts from the attention layer of the discriminator asaliency map, which we use to construct a new loss function for the inversion.This allows us to visualize the newly introduced attention heads and show thatthey indeed capture interesting aspects of twodimensional geometry of realimages.
Quick Read (beta)
Your Local GAN:
Designing Two Dimensional Local Attention Mechanisms for Generative Models
Abstract
We introduce a new local sparse attention layer that preserves twodimensional geometry and locality. We show that by just replacing the dense attention layer of SAGAN with our construction, we obtain very significant FID, Inception score and pure visual improvements. FID score is improved from 18.65 to 15.94 on ImageNet, keeping all other parameters the same. The sparse attention patterns that we propose for our new layer are designed using a novel information theoretic criterion that uses information flow graphs.
We also present a novel way to invert Generative Adversarial Networks with attention. Our method uses the attention layer of the discriminator to create an innovative loss function. This allows us to visualize the newly introduced attention heads and show that they indeed capture interesting aspects of twodimensional geometry of real images.
1 Introduction
Generative Adversarial Networks [10] are making significant progress on modeling and generating natural images [26, 4]. Transposed convolutional layers are a fundmamental architectural component since they capture spatial invariance, a key property of natural images [18, 15, 27]. The central limitation (e.g. as argued in [26]) is that convolutions fail to model complex geometries and longdistance dependencies– the canonical example is generating dogs with fewer or more than four legs.
To compensate for this limitation, attention layers [25] have been introduced in deep generative models [26, 4]. Attention enables the modeling of long range spatial dependencies in a single layer which automatically finds correlated parts of the image even if they are far apart. First introduced in SAGAN [26] and further improved in BigGAN [4], attention layers have led to some of the best known GANs currently available.
Attention layers have a few limitations. The first is that they are computationally inefficient: Standard dense attention requires memory and time complexity that scales quadratically in the size of the input. Second, dense attention layers are statistically inefficient: A significant number of training samples is required to train attention layers, a problem that becomes more pronounced when multiple attention heads or layers are introduced [6]. Statistical inefficiency also stems from the fact that dense attention does not benefit from locality, since most dependencies in images relate to nearby neighborhoods of pixels. Recent work indicates that most attention layer heads learn to attend mainly to local neighborhoods [24].
To mitigate these limitations, sparse attention layers were recently introduced in Sparse Transformers [6]. In that paper, different types of sparse attention kernels were introduced and used to obtain excellent results for images, text and audio data. They key observation we make is that the patterns that were introduced in Sparse Transformers are actually designed for onedimensional data, such as textsequences. Sparse Transformers [6] were applied to images by reshaping tensors in a way that significantly distorts distances of the twodimensional grid of image pixels. Therefore, local sparse attention kernels introduced in Sparse Transformers fail to capture image locality.
Our Contributions:

•
We introduce a new local sparse attention layer that preserves twodimensional image locality and can support good information flow through attention steps.

•
To design our attention patterns we use the information theoretic framework of Information Flow Graphs [8]. This quantifies how information can flow through multiple steps and preserve twodimensional locality. We visualize learned attention maps and show that different heads indeed learn different aspects of the geometry of generated images.

•
We modify SAGAN [26] using our new twodimensional sparse attention layers to introduce YLGSAGAN. We empirically show that this change yields significant benefits. We train on ImageNet128 and we achieve 14.53% improvement to the FID score of SAGAN and 8.95% improvement in Inception score, by only changing the attention layer while maintaining all other parameters of the architecture. Our ablation study shows that indeed the benefits come from two dimensional inductive bias and not from introducing multiple attention heads. Furthermore, YLGSAGAN achieves this performance in $800k$ training steps as opposed to $1300k$ for SAGAN and hence reduces the training time by approximately $40\%$.

•
To visualize our attention maps on natural images, we came across the problem of inverting a generator: given an image $x$, how to find a latent code $z$ so that $G(z)$ is as close as possible to $x$. The natural inversion process of performing gradient descent on this loss works in small GANs [3, 20, 19, 14] but has been notoriously failing in bigger models with attention like SAGAN^{1}^{1} 1 This fact is folklore, known at least among researchers who try to solve inverse problems using deep generative models. There are, of course numerous other ways to invert, like training an encoder, but also show poor performance on modern GANs with attention. . We present a solution to the GAN inversion problem: We use the attention layer of the discriminator to obtain a weighting on the loss function that subsequently we use to invert with gradient descent. We empirically show excellent inversion results for numerous cases where standard gradient descent inversion fails.
We opensource our code and our models to encourage further research in this area. The code is available under the repository: https://github.com/giannisdaras/ylg.
2 Background
Dense Attention. Given matrices $X\in {\mathbb{R}}^{{N}_{X}\times {E}_{X}},Y\in {\mathbb{R}}^{{N}_{Y}\times {E}_{Y}}$, attention of $X$ to $Y$, updates the vector representation of $X$ by integrating the vector representation of $Y$. In this paper, $X,Y$ are intermediate image representations. More specifically, attention of $X$ to $Y$ associates the following matrices with the inputs: The key matrix ${X}_{K}=X\cdot {W}_{K}$, the query matrix ${Y}_{Q}=Y\cdot {W}_{Q}$ and the value matrix ${Y}_{V}=X\cdot {W}_{V}$ where ${W}_{K}\in {\mathbb{R}}^{{E}_{X}\times E},{W}_{Q}\in {\mathbb{R}}^{{E}_{Y}\times E},{W}_{V}\in {\mathbb{R}}^{E\times {E}_{V}}$ are learnable weight matrices. Intuitively, queries are compared with keys and values translate the result of this comparison to a new vector representation of $X$ that integrates information from $Y$. Mathematically, the output of the attention is the matrix: ${X}^{\prime}=\sigma ({X}_{Q}\cdot {Y}_{K}^{T})\cdot {Y}_{V}$ where $\sigma (\cdot )$ denotes the softmax operation along the last axis.
Sparsified Attention. The quadratic complexity of attention to the size of the input is due to the calculation of the matrix ${A}_{X,Y}={X}_{Q}\cdot {Y}_{K}^{T},\in {\mathbb{R}}^{{N}_{X}\times {N}_{Y}}$. Instead of performing this calculation jointly, we can split attention in multiple steps. At each step $i$, we attend to a subset of input positions, specified by a binary mask ${M}_{i}\in {\{0,1\}}^{{N}_{X}\times {N}_{Y}}$. Mathematically, at step $i$ we calculate matrix ${A}_{X,Y}^{i}$, where: ${A}_{X,Y}^{i}[a,b]=\{\begin{array}{cc}{A}_{X,Y}[a,b],{M}^{i}[a,b]=1\hfill & \\ \mathrm{\infty},{M}^{i}[a,b]=0\hfill & \end{array}$. In this expression, $\mathrm{\infty}$ means that after the softmax, this position will be zeroed and thus not contribute to the calculation of the output matrix. The design of the masks $\{{M}^{i}\}$ is key in reducing the number of positions attended.
There are several ways that we can use the matrices ${A}_{X,Y}^{i}$ to perform multistep attention [6] in practice. The simplest is to have separate attention heads [25] calculating the different matrices $\{{A}_{X,Y}^{i}\}$ in parallel and then concatenate along the feature dimension. We use this approach in this paper.
3 Your Local GAN
3.1 Full Information Attention Sparsification
As explained, an attention sparsification in $p$ steps is described by binary masks $\{{M}^{1},\mathrm{\dots},{M}^{p}\}$. The question is how to design a good set of masks for these attention steps. We introduce a tool from information theory to guide this design.
Information Flow Graphs are directed acyclic graphs introduced in [8] to model distributed storage systems through network information flow [1]. For our problem, this graph models how information flows across attention steps. For a given set of masks $\{{M}^{1},\mathrm{\dots},{M}^{p}\}$, we create a multipartite graph $G(V=\{{V}^{0},{V}^{1},\mathrm{\dots},{V}^{p}\},E)$ where directed connections between ${V}^{i},{V}^{i+1}$ are determined by mask ${M}^{i}$. Each group of vertices in partition ${V}^{i}$ corresponds to attention tokens of step $i$.
We say that an attention sparsification has Full Information if its corresponding Information Flow Graph has a directed path from every node $a\in {V}^{0}$ to every node $b\in {V}^{p}$. Please note that the Fixed pattern [6] shown in subfigure 2 does not have Full Information: there is no path from node 1 of ${V}^{0}$ to node 2 of ${V}^{2}$.
Sparse attention is usually considered as a way to reduce the computational overhead of dense attention at a hopefully small performance loss. However, we show that attention masks chosen with a bias toward twodimensional locality, can surprisingly outperform dense attention layers (compare the second and the third row of Table 1). This is an example of what we call the statistical inefficiency of dense attention. Sparse attention layers with locality create better inductive bias and hence can perform better in the finite sample regime. In the limit of infinite data, dense attention can always simulate sparse attention or perform better, in the same way that a fully connected layer can simulate a convolutional layer for a possible selection of weights.
We design the sparse patterns of YLG as the natural extensions of the patterns of [6] while ensuring that the corresponding Information Flow Graph supports Full Information. The first pattern, which we call Left to Right (LTR), extends the pattern of [6] to a bidirectional context. The second pattern, which we call Right to Left (RTL), is a transposed version of LTR. The corresponding $9\times 9$ masks and associated Information Flow Graphs are presented in subfigures 2, 2 (LTR) and 2, 2 (RTL). These patterns allow attention only to $n\sqrt{n}$ positions, significantly reducing the quadratic complexity of dense attention. It is possible to create very sparse Full Information graphs using multiple attention steps, but designing them and training them remains open for future work; in this paper we focus on twostep factorizations. We include more details about information flow graphs and how we use them to design attention patterns in the Appendix.
3.2 TwoDimensional Locality
The factorization patterns of Sparse Transformers [6] and their Full Information extensions illustrated in Figure 2 are fundamentally matched to onedimensional data, such as textsequences.
The standard way to apply these layers on images is to reshape the three dimensional image tensors (having three color channels) to a twodimensional tensor $X\in {R}^{N\times C}$ that enters attention. This corresponds to $N$ tokens, each containing a $C$dimensional representation of a region of the input image. This reshape arranges these $N$ tokens linearly, significantly distorting which parts of the image are nearby in two dimensions. This behavior is illustrated in the subfigure at the left of Figure 3.
We argue that this is the reason that onedimensional sparsifications are not ideal for images. In fact, the authors of [6] mention that the Fixed Pattern (Figure 2) was designed for textsequences and not for images. Our central finding is that these patterns can work very well for images, if their two dimensional structure is correctly considered.
The question is therefore how to take twodimensional locality into account. We could create twodimensional attention patterns directly on a grid but this would have significant computational overhead and also prevent us from extending one dimensional sparsifications that are known to work well [12, 6]. Instead, we modify one dimensional sparsifications to become aware of twodimensional locality with the following trick: (i) we enumerate pixels of the image based on their Manhattan distance from the pixel at location (0, 0) (breaking ties using row priority), (ii) shift the indices of any given onedimensional sparsification to match the Manhattan distance enumeration instead of the reshape enumeration, and (iii) apply this new one dimensional sparsification pattern, that respects twodimensional locality, to the onedimensional reshaped version of the image. We call this procedure ESA (Enumerate, Shift, Apply) and illustrate it in Figure 3.
The ESA trick introduces some distortion compared to a true twodimensional distance. We found however that this was not too limiting, at least for $128\times 128$ resolution. On the other hand, ESA offers an important implementation advantage: it theoretically allows the use of onedimensional blocksparse kernels [11]. Currently these kernels exist only for GPUs, but making them work for TPUs is still under development.
4 Experimental Validation
We conduct experiments on the challenging ImageNet [21] dataset. We choose SAGAN [26] as the baseline for our models because, unlike BigGAN [4] it has official opensource Tensorflow code. BigGAN is not opensource and therefore training or modifying this architecture was not possible^{2}^{2} 2 Note that there is an ‘unofficial’ BigGAN that is open in PyTorch. However, that implementation uses gradient checkpointing and requires $8$ V100 GPUS for $15$ days to train. We simply did not have such computing resources. We believe, however, that YLG can be easily combined with BigGAN (by simply replacing its dense attention layer) and will yield an even better model..
In all our experiments, we change only the attention layer of SAGAN, keeping all the other hyperparameters unchanged (the number of parameters is not affected). We trained all models for up to 1,500,000 steps on a TPUv38 Pod, using a $1{e}^{4}$ learning rate for generator and $4{e}^{4}$ for the discriminator. For all the models we report the best performance obtained, even if it was obtained at an earlier point during training.
Attention Mechanism
We start with the Fixed Pattern (Figure 2) and modify it: First, we create Full Information extensions (Section 3.1), yielding the patterns LeftToRight (LTR) and RightToLeft (RTL) (Figures 2 and 2 respectively). We implement multistep attention in parallel using different heads. Since each pattern is a twostep sparsification, this yields 4 attention heads. To encourage diversity of learned patterns, we use each pattern twice, so the total number of heads in our new attention layer is 8. We use our ESA procedure (Section 3.2) to render these patterns aware of two dimensional geometry.
NonSquare Attention
In SAGAN, the query image and the key image in the attention layer have different dimensions. This complicates things, because the sparsification patterns we discuss are designed for selfattention, where the number of query and key nodes is the same. Specifically, for SAGAN the query image is $32\times 32$ and the key image is $16\times 16$. We deal with this in the simplest possible way: we create masks for the $16\times 16$ image and we shift these masks to cover the area of the $32\times 32$ image. Thus every $16\times 16$ block of the $32\times 32$ query image attends with full information to the $16\times 16$ key image.
# Heads  FID  Inception  

SAGAN  1  18.65  52.52 
SAGAN  8  20.09  46.01 
YLGSAGAN  8  15.94  57.22 
YLG  No ESA  8  17.47  51.09 
YLG  Strided  8  16.64  55.21 
Results: As shown in Table 1, YLGSAGAN (3rd row) outperforms SAGAN by a large margin measured by both FID and Inception score. Specifically, YLGSAGAN increases Inception score to 57.22 ($8.95\%$ improvement) and improves FID to 15.94 ($14.53\%$ improvement). Qualitatively, we observe really goodlooking samples for categories with simple geometries and homogeneity. Intuitively, a twodimensional locality can benefit importantly categories such as valleys or mountains, because usually the image transitions for these categories are smoother compared to others and thus the dependencies are mostly local.
Additionally to the significantly improved scores, one important benefit of using YLG sparse layer instead of a dense attention layer, is that we observe significant reduction of the training time needed for the model to reach it’s optimal performance. SAGAN reached it’s best FID score after more that 1.3 million training steps while YLGSAGAN reaches its’ optimal score after only 865,000 steps ($\approx 40\%$ reduction to the training time). Figure 4 illustrates SAGAN and YLGSAGAN FID and Inception score as a function of the training time.
We create two collages to display samples from our YLG version of SAGAN. At the Upper Panel of Figure 7, we show dogs of different breeds generated by our YLGSAN. At the Lower Panel, we use YLGSAGAN to generate samples from randomly chosen classes of the ImageNet dataset.
4.1 Ablation Study
Number of Attention Heads
The Original SAGAN implementation used a singleheaded attention mechanism. In YLG, we use multiple heads to perform parallel multistep sparse attention. Previous work has shown that multiple heads increased performance for Natural Language Processing tasks [25]. To understand how multiple heads affect SAGAN performance, we train an 8 head version of SAGAN. The results are reported in the second row of Table 1. Multiple heads actually worsen significantly the performance of the original SAGAN, reducing Inception score from 52.52 to 46.01. We provide a posthoc interpretation of this result. The image embedding of the query vector of SAGAN has only 32 vector positions. By using 8 heads, each head gets only 4 positions for its’ vector representation. Our intuition is that a 4positions vector representation is not sufficient for effective encoding of the image information for a dense head and that accounts for the decrease in performance. It is important to note that YLGSAGAN does not suffer from this problem. The reason is that each head is sparse, which means that only attends to a percentage of the positions that dense head attends to. Thus, a smaller vector representation does not worsen performance. Having multiple divergent sparse heads allows YLG layer to discover complex dependencies in the image space throughout the multistep attention.
TwoDimensional Locality
As described in Section 3.2 YLG uses the ESA procedure, to adapt 1D sparse patterns to data with 2D structure. Our motivation was that gridlocality could help our sparse attention layer to better model local regions. In order to validate this experimentally, we trained a version of YLG without the ESA procedure. We call this model YLG  No ESA. The results are shown in 4th row of Table 1: without the ESA procedure, the performance of YLG is about the same with the original SAGAN. This experiment indicates that ESA trick is essential for using onedimensional sparse patterns for gridstructured data. If ESA framework is used, FID improves from $17.47$ to $15.94$ and Inception score from $51.09$ to $57.22$, without any other difference in the model architecture. Thus, ESA is a plugandplay framework that achieves great performance boosts to both FID and Inception score metrics. ESA allows the utilization of fast sparse onedimensional patterns that were found to work well for textsequences to be adapted to images, with great performance benefits. In section 5.1, we visualize attention maps to showcase how our model utilizes ESA framework in practice.
Sparse Patterns
Our YLG layer uses the LTR and RTL patterns (Figures 2 and 2 respectively). Our intuition is that using multiple patterns at the same time increases performance because the model will be able to discover dependencies using multiple different paths. To test this intuition, we ran an experiment using the Full Information extension of the Strided [6] pattern. We choose this pattern because it was found to be effective for modeling images [6] due to its’ periodic structure. As with LTR and RTL patterns, we extend the Strided pattern so that it has Full Information^{3}^{3} 3 We include visualizations of the Full Information Strided Pattern in the Appendix.. We refer to the YLG model that instead of LTR and RTL patterns, has 8 heads implementing the Strided pattern as YLG  Strided. For our experiment, we use again the ESA trick. We report the results on the 5th row of Table 1. YLG  Strided importantly surpasses SAGAN both in FID and Inception score, however, it is still behind YLG. Although in the Sparse Transformers [6] it has been claimed that strided pattern is more suitable for images than the patterns we use in YLG, this experiment strongly suggests that it is the gridlocality which makes the difference, as both models are far better than SAGAN. Also, this experiment indicates that multiple sparse patterns can boost performance compared to using a single sparse pattern. To be noted, using multiple different patterns at the same attention layer requires scaling the number of heads as well. Although YLG variations of SAGAN were not impacted negatively by the increase of attention heads, more severe upscaling of the number of heads could potentially harm performance, similarly to how 8 heads harmed performance of SAGAN.
5 Inverting Generative Models with Attention
We are interested in visualizing our sparse attention on real images, not just generated ones. This leads naturally to the problem of projecting an image on the range of a generator, also called inversion. Given a real image $x\in {\mathbb{R}}^{n}$ and a generator $G(z)$, inversion corresponds to finding a latent variable ${z}^{*}\in {\mathbb{R}}^{k}$, so that $G({z}^{*})\in {\mathbb{R}}^{n}$ approximates the given image $x$ as well as possible. One approach for inversion is to try to solve the following nonconvex optimization problem:
$$\underset{{z}^{*}}{\text{argmin}}\{{\parallel G({z}^{*})x\parallel}^{2}\}.$$  (1) 
To solve this optimization problem, we can perform gradient descent from a random initalization ${z}_{0}$ to minimize this projection distance in the latent space. This approach was introduced independently in several papers [16, 3, 20] and further generalized to solve inverse problems beyond inversion [3, 20, 19, 14]. Very recent research [13, 23] demonstrated that for fully connected generators with random weights and sufficient layer expansion, gradient descent will provably converge to the correct optimal inversion.
Unfortunately, this theory does not apply for generators that have attention layers. Even empirically, inversion by gradient descent fails for bigger generative models like SAGAN and YLGSAGAN. As we show in our experiments the optimizer gets trapped in local minimima producing reconstructions that only vaguely resemble the target image. Other approaches for inversion have been tried in the literature, like training jointly an encoder [9] but none of these methods have been known to successfully invert complex generative models with attention layers.
We propose a novel inversion method that uses the discriminator to solve the minimization problem in an different representation space. Interestingly, the discriminator yields representations with a smoother loss landscape, especially if we use the attention layer in a special way. In more detail: We begin with a random latent variable $z$ and a given real image $x$. We denote with ${D}^{0}$ the Discriminator network up to, but not including, the attention layer and obtain the representations ${D}^{0}(G(z))$ and ${D}^{0}(x)$. We could perform gradient descent to minimize the distance of these discriminator representations:
$${\parallel {D}^{0}(G(z)){D}^{0}(x)\parallel}^{2}.$$ 
We found, however, that we can use the attention map of the real image to further enhance inversion. We will use the example of the SAGAN architecture to illustrate this. Inside the SAGAN Discriminator’s attention, an attention map $M\in {\mathbb{R}}^{32\times 32\times 16\times 16}$ is calculated. For each pixel of the $32\times 32$ image, this attention map is a distribution over the pixels of the $16\times 16$ image. We can use this attention map to extract a saliency map. For each pixel of the $16\times 16$ image, we can average the probabilities from all the pixels of the $32\times 32$ image and create a probability distribution of shape $16\times 16$. We denote this distribution with the letter $S$. Intuitively, this distribution represents how important each pixel of the image is to the discriminator.
Our proposed inversion algorithm is to perform gradient descent to minimize the discriminator embedding distance, weighted by these saliency maps:
$${\parallel \left({D}^{0}(G(z)){D}^{0}(x)\right)\cdot {S}^{\prime}\parallel}^{2},$$  (2) 
where ${S}^{\prime}$ is a projected version of saliency map $S$ to the dimensions of ${D}^{0}(x)$. We actually calculate one saliency map ${S}^{\prime}$ per head and use their sum as the final loss function that we optimize for inversion. More details are included in the Appendix.
5.1 Inversion as lens to attention
Given an arbitrary real image, we can now solve for a $z$ yielding a similar generated image from the generator, and visualize the attention maps.
We explain our approach using an example of a real image of a redshank (Figure 5). Figure 5 shows how the standard method for inverting generators [3] fails: the beak, legs, and rocks are missing. Figure 5 shows the result of our method. Using the $z$ that we found using inversion, we can project maps of the attention layer back to the original image to get valuable insight into how the YLG layers work.
First, we analyze the differences between the YLGSAGAN attention heads. For each attention head of the generator, we create a saliency map as described above and use these maps to analyze the attention mechanism. As shown in Figure 5, the head7 in the generator is mostly ignoring background focusing on the bird. Other heads function differently: The saliency map of head2 (Figure 5) shows that this head attends globally. We also find that there are heads that that attend quite sparsely, for example, head5 attends only to 56 background pixels.
We present a second inversion, this time an indigo bird (Figure 6). Figure 6 shows how the standard method [3] for inverting fails: the head of the bird and the branch are not reconstructed. We also illustrate where specific query points attend to. We first illustrate that the the model exploited the local bias of ESA: We plot the attention map for query point $(0,0)$ for generatorhead0. This point, indicated with a blue dot, is part of the background. We clearly see a local bias in the positions this point attends to. Another example of twodimensional local attention is shown in Figure 6. This figure illustrates the attention map of generatorhead4 for a query point on the body of the bird (blue dot). This point attends to the edges of the bird body and to the bird head.
Finally, Figure 6 shows that there are query points that attend to longdistance, demonstrating that the attention mechanism is capable of exploiting both locality and longdistance relationships when these appear in the image.
6 Related Work
There has been a flourishing of novel ideas on making attention mechanisms more efficient. Dai et al. [7] separate inputs into chunks and associate a state vector with previous chunks of the input. Attention is performed per chunk, but information exchange between chunks is possible via the state vector. Guo et al. [12] show that a starshaped topology can reduce attention cost from $O({n}^{2})$ to $O(n)$ in text sequences. Interestingly, this topology does have full information, under our framework. Sukhbaatar et al. [24] introduced the idea of a learnable adaptive span for each attention layer. Calian et al. [5] proposed a fast randomized algorithm that exploits spatial coherence and sparsity to design sparse approximations. We believe that all these methods can be possibly combined with YLG, but so far nothing has been demonstrated to improve generative models in a plugandplay way that this work shows.
7 Conclusions and Future Work
We introduced a new type of local sparse attention layer designed for twodimensional data. We believe that our layer will be widely applicable for any model with attention that works on twodimensional data. An interesting future direction is the design of attention layers, thought of as multistep networks. The two conflicting objectives are to make these networks as sparse as possible (for computational and statistical efficiency) but also support good information flow. We introduced information flow graphs as a mathematical abstraction and proposed full information as a desired criterion for such attention networks.
Finally, we presented a novel way to solve the inversion problem for GANs. Our technique uses the discriminator in two ways: First, using its attention to obtain pixel importance and second, as a smoothing representation of the inversion loss landscape. This new inversion method allowed us to visualize our network on approximations of real images and also to test how good a generative model is in this important coverage task. We believe that this is the first key step towards using generative models for inverse problems and we plan to explore this further in the future.
References
 [1] (2000) Network information flow. IEEE Transactions on information theory 46 (4), pp. 1204–1216. Cited by: §3.1.
 [2] (201910) Seeing What a GAN Cannot Generate. arXiv eprints, pp. arXiv:1910.11626. External Links: 1910.11626 Cited by: §8.4.1.
 [3] (2017) Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 537–546. Cited by: 4th item, Figure 5, §5.1, §5.1, §5, Figure 11, Figure 11, §8.1.2, §8.4.1.
 [4] (201809) Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv eprints, pp. arXiv:1809.11096. External Links: 1809.11096 Cited by: §1, §1, §4, §8.2.
 [5] (201905) SCRAM: Spatially Coherent Randomized Attention Maps. arXiv eprints, pp. arXiv:1905.10308. External Links: 1905.10308 Cited by: §6.
 [6] (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §1, §1, §2, Figure 2, Figure 2, §3.1, §3.1, §3.2, §3.2, §3.2, §4.1, Figure 10, Figure 10, Figure 10, §8.3, §8.3.
 [7] (201901) TransformerXL: Attentive Language Models Beyond a FixedLength Context. arXiv eprints, pp. arXiv:1901.02860. External Links: 1901.02860 Cited by: §6.
 [8] (2010) Network coding for distributed storage systems. IEEE transactions on information theory 56 (9), pp. 4539–4551. Cited by: 2nd item, §3.1, §8.5.
 [9] (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §5.
 [10] (201406) Generative Adversarial Networks. arXiv eprints, pp. arXiv:1406.2661. External Links: 1406.2661 Cited by: §1.
 [11] (2017) Gpu kernels for blocksparse weights. arXiv preprint arXiv:1711.09224. Cited by: §3.2.
 [12] (201902) StarTransformer. arXiv eprints, pp. arXiv:1902.09113. External Links: 1902.09113 Cited by: §3.2, §6, §8.5.
 [13] (2019) Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory. Cited by: §5.
 [14] (2018) Taskaware compressed sensing with generative adversarial networks. In ThirtySecond AAAI Conference on Artificial Intelligence, Cited by: 4th item, §5.
 [15] (201812) A StyleBased Generator Architecture for Generative Adversarial Networks. arXiv eprints, pp. arXiv:1812.04948. External Links: 1812.04948 Cited by: §1.
 [16] (2017) Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782. Cited by: §5.
 [17] (2018) Image transformer. arXiv preprint arXiv:1802.05751. Cited by: §6.
 [18] (201511) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv eprints, pp. arXiv:1511.06434. External Links: 1511.06434 Cited by: §1.
 [19] (2019) GANbased projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5602–5611. Cited by: 4th item, §5.
 [20] (2017) One network to solve them all–solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5888–5897. Cited by: 4th item, §5.
 [21] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.
 [22] (201409) Very Deep Convolutional Networks for LargeScale Image Recognition. arXiv eprints, pp. arXiv:1409.1556. External Links: 1409.1556 Cited by: §8.4.1.
 [23] (2019) Surfing: iterative optimization over incrementally trained deep networks. arXiv preprint arXiv:1907.08653. Cited by: §5.
 [24] (201905) Adaptive Attention Span in Transformers. arXiv eprints, pp. arXiv:1905.07799. External Links: 1905.07799 Cited by: §1, §6.
 [25] (201706) Attention Is All You Need. arXiv eprints, pp. arXiv:1706.03762. External Links: 1706.03762 Cited by: §1, §2, §4.1.
 [26] (201805) SelfAttention Generative Adversarial Networks. arXiv eprints, pp. arXiv:1805.08318. External Links: 1805.08318 Cited by: 3rd item, §1, §1, §4, §6, §8.3.
 [27] (201612) StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv eprints, pp. arXiv:1612.03242. External Links: 1612.03242 Cited by: §1.
 [28] (201907) Lookahead Optimizer: k steps forward, 1 step back. arXiv eprints, pp. arXiv:1907.08610. External Links: 1907.08610 Cited by: §8.1.3.
8 Appendix
8.1 A closer look to our inversion method
This subsection aims to explain technical details of our inversion technique and clarify the details of our approach.
We begin with a recap of our method. Given a real image we pass it to the discriminator and we extract the attention map from the attention layer. This attention map contains for every point of the query image, a probability distribution over the pixels of the key image. We can then convert this attention map to a saliency map: by averaging the attention each key point gets from all the query points, we can get a probability distribution over the ”importance” of the pixels of the key image. We denote this saliency map with $S$. Our proposed inversion algorithm is to perform gradient descent to minimize the discriminator embedding distance, weighted by this salience map:
$${\parallel \left({D}^{0}(G(z)){D}^{0}(x)\right)\cdot {S}^{\prime}\parallel}^{2},$$ 
where ${S}^{\prime}$ is a projected version of saliency map $S$, $x$ is the image, and ${D}^{0}$ is the Discriminator network up to, but not including, the attention layer.
8.1.1 Multiple heads and saliency map
There are some practical considerations that we need to address before illustrating that our inversion method indeed works: the most important of which is how the saliency map $S$ looks like.
In our analysis of the YLG attention layers, we explain that because of the Full Information property, our patterns are able, potentially, to discover a dependency between any two pixels of an image. If that is true, we should expect that in the general case our saliency map, generated by the average of all heads, allocates nonzero weights to all image pixels. The important question becomes whether this joint saliency map weights more the pixels that are important for a visually convincing inversion. For example, in case of a bird flying with a bluesky in the background, we should be ready to accept a small error in some point in the clouds of the sky but not a bird deformation that will make the inverted image look unrealistic. Therefore, our saliency map should allocate more weight in the bird than in it allocates in the background sky.
We already showed in Section 5.1 that different heads specialize in discovering important image parts (for example, some heads learn to focus on local neighbhoords, important shape edges, background, etc.) so extracting a saliency map $S$ by averaging all heads usually leads in a uniform distribution over pixels, which is not helping inversion. Figure 8 shows the saliency map jointly all heads of the attention layer of the discriminator produce. Although the bird receives a bit more attention than the background, it is not clear how this map would help weight our loss for inversion. However, as illustrated in 8, there are heads that produce far more meaningful saliency maps for a goodlooking inversion. There is a drawback here as well though; if we use that head only, we completely miss the background.
To address this problem, we find two solutions that work quite well.

•
Solution 1: calculate Equation 2 separately for each head and then add the losses. In that case, the new loss function is given by the following equation:
$$\sum _{i}{\parallel \left({D}^{0}(G(z)){D}^{0}(x)\right)\cdot {S}_{i}^{\prime}\parallel}^{2},$$ (3) where ${S}_{i}^{\prime}$ is the saliency map extracted from head $i$.

•
Solution 2: Examine manually the saliency maps for each head and remove the heads that are attending mainly to noncrucial for the inversion areas, such as homogeneous backgrounds.
8.1.2 More inversion visualizations
We present several inversions for different categories of real images at Figure 9. In all our Figures, we use Solution 1 as it has the advantage that it does not require human supervision.
With our method, we can effectively invert real world scenes. We tested the standard inversion method [3] for these images as well and the results were far less impressive for all images. Especially for the dogs, we noted complete failure of the previous approach, similar to what we illustrate in Figure 11.
8.1.3 Experiments setup
In this subsection, we will briefly describe the experimental setup for our inversion technique. We choose to use the recently introduced Lookahead [28] optimizer as we find that it reduces the number of different seeds we have to try for a successful inversion. For the vast majority of the examined real images, we are able to get a satisfying inversion by trying at most 4 different seeds. We set the learning rate to $0.05$ and we update for maximum $1500$ steps. On a single V100 GPU, a single image inversion takes less than a minute to complete. We choose to invert realworld images that were not present in the training set. We initialize our latent variables from a truncated normal distribution, as explained in 8.2.
8.2 Truncation and how it helps inversion
In the BigGAN [4] paper, the authors observed that latent variables sampled from a truncated normal distribution generated generally more photorealistic images compared to ones generated from the normal distribution which was used during the training. This socalled truncation trick (resampling the values with magnitude above a chosen threshold) leads to improvement in sample quality at the cost of reduction in sample variety. For the generated images of YLG presented in this paper, we also utilized this trick.
Interestingly, the truncation trick can help inversion as well under some conditions. If the original image has good quality, then according to the truncation trick, it is more probable to be generated by a latent variable sampled from a truncated normal (where values which fall outside a range are resampled to fall inside that range) than the standard normal distribution $N(0,I)$. For that reason, in our inversions we start our trainable latent variable from a sample of the truncated normal distribution. We found experimentally that setting the truncation threshold to two standard deviations from the median (in our case 0), is a good tradeoff between producing photorealistic images and having enough diversity to invert an arbitrary real world image.
8.3 Strided Pattern
In the ablation studies of our paper, we train a model we name YLG  Strided. For this model, we report better results than the baseline SAGAN [26] model and slightly worse results than the proposed YLG model. The purpose of this section is to give more information on how YLG and YLG  Strided differ.
First of all, the only difference between YLG and YLG Strided is the choosing of attention masks for the attention heads: both models implement 2step attention patterns with Full Information and twodimensional locality using the ESA framework.
YLG model uses the RTL and LTR patterns introduced in the paper (see Figures 2, 2). Each pattern corresponds to a twostep attention: in our implementation of multistep attention we compute steps in parallel using multiple heads, so in total we need 8 attention heads for YLG. In YLG  Strided instead of using different patterns (RTL and LTR), we stick with using a single attention pattern. Our motivation is to: (i) investigate whether using multiple attention patterns simultaneously affects performance, (ii) discover whether the performance differences between onedimensional sparse patterns reported in the literature remain when the patterns are rendered to be aware of twodimensional geometry. To explore (i), (ii) a natural choice was to work with the Strided pattern proposed in Sparse Transformers [6] as it was found to be (i) effective for modeling images and (ii) more suitable than the Fixed pattern (see Figure 2a), on which we built to invent LTR, RTL.
We illustrate the Strided pattern, as proposed in Sparse Transformers [6], in Figures 10, 10. For a fair comparison with LTR, RTL we need to expand Strided pattern in order for it to have Full Information. Figures 10, 10 illustrate this expansion. The pattern illustrated in this Figure is exactly the pattern that YLG  Strided uses. Note that this pattern attends to the same order of positions, $O(n\sqrt{n})$, as LTR and RTL. For one to one comparison with YLG, YLG  Strided has also 8 heads: the Full Information pattern is implemented 4 times, as we need 2 heads for a 2step pattern. As already mentioned, we also use ESA framework for YLG  Strided.
8.4 Things that did not work
In this subsection, we present several ideas, relevant to the paper, that we experimented on and found that their results were not satisfying. Our motivation is to inform the research community about the observed shortcomings of these approaches so that other researchers can reformulate them, reject them or compare their findings with ours.
8.4.1 Weighted inversion at the generator space
We already discussed that our key idea for the inversion: we pass a real image to the discriminator, extract the attention map, convert the attention map to a saliency distribution $S$ and we perform gradient descent to minimize the discriminator embedding distance, weighted by this saliency map:
$${\parallel \left({D}^{0}(G(z)){D}^{0}(x)\right)\cdot {S}^{\prime}\parallel}^{2},$$ 
where ${S}^{\prime}$ is a projected version of saliency map $S$, $x$ is the image, and ${D}^{0}$ is the Discriminator network up to, but not including, the attention layer. In practise, we use Equation 3 for the reasons we explained in Section 8.1 but for the rest of this Section we will overlook this detail as it is not important for our point.
Equation 2 implies that the inversion takes place in the embedding space of the Discriminator. However, naturally one might wonder if we could use the saliency map $S$ to weight the inversion of the Generator, in other words, if we could perform gradient descent on:
$${\parallel \left(G(z)x\right)\cdot {S}^{\prime \prime}\parallel}^{2},$$  (4) 
where ${S}^{\prime \prime}$ is a projected version of $S$ to match the dimensions of the Generator network.
In our experiments, we find that this approach generally leads to inversions of poor quality. To illustrate this, we present inversions of an image of a real husky from the the weighted generator inversion, the weighted discriminator inversion and standard inversion method [3] at Figure 11.
There are several reasons that could explain the quality gap when we change from inversion to the space of the Discriminator to that of the Generator. First of all, the saliency map we use to weight our loss is extracted from the Discriminator, which means that the weights reflect what the Discriminator network considers important at that stage. Therefore, it is reasonable to expect that this saliency map would be more accurate to describe what is important for the input of the attention of the discriminator than to the output of the Generator. Also note that due to the layers of the Discriminator before the attention, the images of the output of the generator and the input of the attention of the Discriminator can be quite different. Finally, the Discriminator may provide an ”easier” embedding space for inversion. The idea of using a different embedding space than the output of the Generator it is not new; activations from VGG16 [22] have also been used for inversion [2]. Our novelty is that we use the Discriminator instead of another pretrained model to work on a new embedding space.
8.4.2 Combination of dense and sparse heads
In our paper, we provide strong experimental evidence that multistep twodimensional sparse local heads can be more efficient than the conventional dense attention layer. We justify this evidence theoretically by modelling the multistep attention with Information Flow Graphs and indicating the implications of Full Information. Naturally, one might wonder what would happen if we combine YLG attention with dense attention. To answer this question, we split heads into two groups, the local  sparse heads and the dense ones. Specifically, we use 4 heads that implement the RTL, LTR patterns (see paper for more details) and 4 dense heads and we train this variation of SAGAN. We use the same setup as with our other experiments. We report FID 19.21 and Inception: 51.23. These scores are far behind than the scores of YLG and thus we did not see any benefit continuing the research in this direction.
8.4.3 Different resolution heads
One idea we believed it would be interesting was to train SAGAN with a multiheaded dense attention layer of different resolution heads. In simple words, that means that in this attention layer some heads have a wider vector representation than others. Our motivation was that the different resolutions could have helped enforcing locality in a different way; we expected the heads with the narrow hidden representations to learn to attend only locally and the wider heads to be able to recover longrange dependencies.
In SAGAN, the number of channels in the query vector is 32, so for an 8head attention layer normally each head would get 4 positions. We split the 8 heads into two equal groups: the narrow and the wide heads. In our experiment, narrow heads get only 2 positions for their vector representation while wide heads get 6. After training on the same setup with our other experiments, we obtain FID 19.57 and Inception score: 50.93. These scores are slightly worse than the original SAGAN, but are far better than SAGAN with dense 8head attention which achieved FID 20.09 and Inception 46.01, as mentioned in the ablation study.
At least in our preliminary experiments, different resolution heads were not found to help very much. Perhaps they can be combined with YLG attention but we more research would be needed in this direction.
8.5 Information Flow Graphs
We found that thinking about sparse attention as a network with multiple stages is helpful in visualizing how information of different tokens is attended and combined. We use Information Flow Graphs (IFGs) that were introduced in [8] for modeling how distributed storage codes preserve data. In full generality, IFGs are directed acyclic graphs with capacitated directed edges. Each storage node is represented with two copies of a vertex (${x}_{\text{in}}$ and ${x}_{\text{out}}$) connected by a directed edge with capacity equal to the amount of information that can be stored into that node. The key insight is that a multistage attention network can be considered a storage network since intermediate tokens are representing combinations of tokens at the previous stage. The IFGs we use in this paper are a special case: every token of every stage of an attention layer is represented by a storage node. Since all the tokens have the same size, we can eliminate vertex splitting and compactly represent each storage node by a single vertex, as shown in Figure 10.
Full information is a design requirement that we found to be helpful in designing attention networks. It simply means that any single input token is connected with a directed path to any output token and hence information (of entropy equal to one token representation) can flow from any one input into any one output. As we discussed in the paper, we found that previously used sparse attention patterns did not have this property and we augmented them to obtain the patterns we use. A stronger requirement would be that any pair of input nodes is connected to any pair of output nodes with two edgedisjoint paths. This would mean that flow of two tokens can be supported from any input to any output. Note that a fully connected network can support this for any pair or even for any set of $k$ inputoutput pairs for $\forall k\le n$.
An interesting example is the star transformer [12] where all $n$ input tokens are connected to a single intermediate node which is then connected to all output tokens. This information flow graph has $2n$ directed edges and can indeed support full information. However, it cannot support a flow of $2$ tokens for any pair, since there is a bottleneck at the intermediate node. We believe that enforcing good information flow for pairs or higher size sets improves the design of attention networks and we plan to investigate this further in the future.