We introduce a new local sparse attention layer that preservestwo-dimensional geometry and locality. We show that by just replacing the denseattention layer of SAGAN with our construction, we obtain very significant FID,Inception score and pure visual improvements. FID score is improved from$18.65$ to $15.94$ on ImageNet, keeping all other parameters the same. Thesparse attention patterns that we propose for our new layer are designed usinga novel information theoretic criterion that uses information flow graphs. Wealso present a novel way to invert Generative Adversarial Networks withattention. Our method extracts from the attention layer of the discriminator asaliency map, which we use to construct a new loss function for the inversion.This allows us to visualize the newly introduced attention heads and show thatthey indeed capture interesting aspects of two-dimensional geometry of realimages.
Quick Read (beta)
Your Local GAN:
Designing Two Dimensional Local Attention Mechanisms for Generative Models
We introduce a new local sparse attention layer that preserves two-dimensional geometry and locality. We show that by just replacing the dense attention layer of SAGAN with our construction, we obtain very significant FID, Inception score and pure visual improvements. FID score is improved from 18.65 to 15.94 on ImageNet, keeping all other parameters the same. The sparse attention patterns that we propose for our new layer are designed using a novel information theoretic criterion that uses information flow graphs.
We also present a novel way to invert Generative Adversarial Networks with attention. Our method uses the attention layer of the discriminator to create an innovative loss function. This allows us to visualize the newly introduced attention heads and show that they indeed capture interesting aspects of two-dimensional geometry of real images.
Generative Adversarial Networks  are making significant progress on modeling and generating natural images [26, 4]. Transposed convolutional layers are a fundmamental architectural component since they capture spatial invariance, a key property of natural images [18, 15, 27]. The central limitation (e.g. as argued in ) is that convolutions fail to model complex geometries and long-distance dependencies– the canonical example is generating dogs with fewer or more than four legs.
To compensate for this limitation, attention layers  have been introduced in deep generative models [26, 4]. Attention enables the modeling of long range spatial dependencies in a single layer which automatically finds correlated parts of the image even if they are far apart. First introduced in SAGAN  and further improved in BigGAN , attention layers have led to some of the best known GANs currently available.
Attention layers have a few limitations. The first is that they are computationally inefficient: Standard dense attention requires memory and time complexity that scales quadratically in the size of the input. Second, dense attention layers are statistically inefficient: A significant number of training samples is required to train attention layers, a problem that becomes more pronounced when multiple attention heads or layers are introduced . Statistical inefficiency also stems from the fact that dense attention does not benefit from locality, since most dependencies in images relate to nearby neighborhoods of pixels. Recent work indicates that most attention layer heads learn to attend mainly to local neighborhoods .
To mitigate these limitations, sparse attention layers were recently introduced in Sparse Transformers . In that paper, different types of sparse attention kernels were introduced and used to obtain excellent results for images, text and audio data. They key observation we make is that the patterns that were introduced in Sparse Transformers are actually designed for one-dimensional data, such as text-sequences. Sparse Transformers  were applied to images by reshaping tensors in a way that significantly distorts distances of the two-dimensional grid of image pixels. Therefore, local sparse attention kernels introduced in Sparse Transformers fail to capture image locality.
We introduce a new local sparse attention layer that preserves two-dimensional image locality and can support good information flow through attention steps.
To design our attention patterns we use the information theoretic framework of Information Flow Graphs . This quantifies how information can flow through multiple steps and preserve two-dimensional locality. We visualize learned attention maps and show that different heads indeed learn different aspects of the geometry of generated images.
We modify SAGAN  using our new two-dimensional sparse attention layers to introduce YLG-SAGAN. We empirically show that this change yields significant benefits. We train on ImageNet-128 and we achieve 14.53% improvement to the FID score of SAGAN and 8.95% improvement in Inception score, by only changing the attention layer while maintaining all other parameters of the architecture. Our ablation study shows that indeed the benefits come from two dimensional inductive bias and not from introducing multiple attention heads. Furthermore, YLG-SAGAN achieves this performance in training steps as opposed to for SAGAN and hence reduces the training time by approximately .
To visualize our attention maps on natural images, we came across the problem of inverting a generator: given an image , how to find a latent code so that is as close as possible to . The natural inversion process of performing gradient descent on this loss works in small GANs [3, 20, 19, 14] but has been notoriously failing in bigger models with attention like SAGAN11 1 This fact is folklore, known at least among researchers who try to solve inverse problems using deep generative models. There are, of course numerous other ways to invert, like training an encoder, but also show poor performance on modern GANs with attention. . We present a solution to the GAN inversion problem: We use the attention layer of the discriminator to obtain a weighting on the loss function that subsequently we use to invert with gradient descent. We empirically show excellent inversion results for numerous cases where standard gradient descent inversion fails.
We open-source our code and our models to encourage further research in this area. The code is available under the repository: https://github.com/giannisdaras/ylg.
Dense Attention. Given matrices , attention of to , updates the vector representation of by integrating the vector representation of . In this paper, are intermediate image representations. More specifically, attention of to associates the following matrices with the inputs: The key matrix , the query matrix and the value matrix where are learnable weight matrices. Intuitively, queries are compared with keys and values translate the result of this comparison to a new vector representation of that integrates information from . Mathematically, the output of the attention is the matrix: where denotes the softmax operation along the last axis.
Sparsified Attention. The quadratic complexity of attention to the size of the input is due to the calculation of the matrix . Instead of performing this calculation jointly, we can split attention in multiple steps. At each step , we attend to a subset of input positions, specified by a binary mask . Mathematically, at step we calculate matrix , where: . In this expression, means that after the softmax, this position will be zeroed and thus not contribute to the calculation of the output matrix. The design of the masks is key in reducing the number of positions attended.
3 Your Local GAN
3.1 Full Information Attention Sparsification
As explained, an attention sparsification in steps is described by binary masks . The question is how to design a good set of masks for these attention steps. We introduce a tool from information theory to guide this design.
Information Flow Graphs are directed acyclic graphs introduced in  to model distributed storage systems through network information flow . For our problem, this graph models how information flows across attention steps. For a given set of masks , we create a multi-partite graph where directed connections between are determined by mask . Each group of vertices in partition corresponds to attention tokens of step .
We say that an attention sparsification has Full Information if its corresponding Information Flow Graph has a directed path from every node to every node . Please note that the Fixed pattern  shown in sub-figure 2 does not have Full Information: there is no path from node 1 of to node 2 of .
Sparse attention is usually considered as a way to reduce the computational overhead of dense attention at a hopefully small performance loss. However, we show that attention masks chosen with a bias toward two-dimensional locality, can surprisingly outperform dense attention layers (compare the second and the third row of Table 1). This is an example of what we call the statistical inefficiency of dense attention. Sparse attention layers with locality create better inductive bias and hence can perform better in the finite sample regime. In the limit of infinite data, dense attention can always simulate sparse attention or perform better, in the same way that a fully connected layer can simulate a convolutional layer for a possible selection of weights.
We design the sparse patterns of YLG as the natural extensions of the patterns of  while ensuring that the corresponding Information Flow Graph supports Full Information. The first pattern, which we call Left to Right (LTR), extends the pattern of  to a bi-directional context. The second pattern, which we call Right to Left (RTL), is a transposed version of LTR. The corresponding masks and associated Information Flow Graphs are presented in sub-figures 2, 2 (LTR) and 2, 2 (RTL). These patterns allow attention only to positions, significantly reducing the quadratic complexity of dense attention. It is possible to create very sparse Full Information graphs using multiple attention steps, but designing them and training them remains open for future work; in this paper we focus on two-step factorizations. We include more details about information flow graphs and how we use them to design attention patterns in the Appendix.
3.2 Two-Dimensional Locality
The standard way to apply these layers on images is to reshape the three dimensional image tensors (having three color channels) to a two-dimensional tensor that enters attention. This corresponds to tokens, each containing a -dimensional representation of a region of the input image. This reshape arranges these tokens linearly, significantly distorting which parts of the image are nearby in two dimensions. This behavior is illustrated in the sub-figure at the left of Figure 3.
We argue that this is the reason that one-dimensional sparsifications are not ideal for images. In fact, the authors of  mention that the Fixed Pattern (Figure 2) was designed for text-sequences and not for images. Our central finding is that these patterns can work very well for images, if their two dimensional structure is correctly considered.
The question is therefore how to take two-dimensional locality into account. We could create two-dimensional attention patterns directly on a grid but this would have significant computational overhead and also prevent us from extending one dimensional sparsifications that are known to work well [12, 6]. Instead, we modify one dimensional sparsifications to become aware of two-dimensional locality with the following trick: (i) we enumerate pixels of the image based on their Manhattan distance from the pixel at location (0, 0) (breaking ties using row priority), (ii) shift the indices of any given one-dimensional sparsification to match the Manhattan distance enumeration instead of the reshape enumeration, and (iii) apply this new one dimensional sparsification pattern, that respects two-dimensional locality, to the one-dimensional reshaped version of the image. We call this procedure ESA (Enumerate, Shift, Apply) and illustrate it in Figure 3.
The ESA trick introduces some distortion compared to a true two-dimensional distance. We found however that this was not too limiting, at least for resolution. On the other hand, ESA offers an important implementation advantage: it theoretically allows the use of one-dimensional block-sparse kernels . Currently these kernels exist only for GPUs, but making them work for TPUs is still under development.
4 Experimental Validation
We conduct experiments on the challenging ImageNet  dataset. We choose SAGAN  as the baseline for our models because, unlike BigGAN  it has official open-source Tensorflow code. BigGAN is not open-source and therefore training or modifying this architecture was not possible22 2 Note that there is an ‘unofficial’ BigGAN that is open in PyTorch. However, that implementation uses gradient checkpointing and requires V100 GPUS for days to train. We simply did not have such computing resources. We believe, however, that YLG can be easily combined with BigGAN (by simply replacing its dense attention layer) and will yield an even better model..
In all our experiments, we change only the attention layer of SAGAN, keeping all the other hyper-parameters unchanged (the number of parameters is not affected). We trained all models for up to 1,500,000 steps on a TPUv3-8 Pod, using a learning rate for generator and for the discriminator. For all the models we report the best performance obtained, even if it was obtained at an earlier point during training.
We start with the Fixed Pattern (Figure 2) and modify it: First, we create Full Information extensions (Section 3.1), yielding the patterns Left-To-Right (LTR) and Right-To-Left (RTL) (Figures 2 and 2 respectively). We implement multi-step attention in parallel using different heads. Since each pattern is a two-step sparsification, this yields 4 attention heads. To encourage diversity of learned patterns, we use each pattern twice, so the total number of heads in our new attention layer is 8. We use our ESA procedure (Section 3.2) to render these patterns aware of two dimensional geometry.
In SAGAN, the query image and the key image in the attention layer have different dimensions. This complicates things, because the sparsification patterns we discuss are designed for self-attention, where the number of query and key nodes is the same. Specifically, for SAGAN the query image is and the key image is . We deal with this in the simplest possible way: we create masks for the image and we shift these masks to cover the area of the image. Thus every block of the query image attends with full information to the key image.
|YLG - No ESA||8||17.47||51.09|
|YLG - Strided||8||16.64||55.21|
Results: As shown in Table 1, YLG-SAGAN (3rd row) outperforms SAGAN by a large margin measured by both FID and Inception score. Specifically, YLG-SAGAN increases Inception score to 57.22 ( improvement) and improves FID to 15.94 ( improvement). Qualitatively, we observe really good-looking samples for categories with simple geometries and homogeneity. Intuitively, a two-dimensional locality can benefit importantly categories such as valleys or mountains, because usually the image transitions for these categories are smoother compared to others and thus the dependencies are mostly local.
Additionally to the significantly improved scores, one important benefit of using YLG sparse layer instead of a dense attention layer, is that we observe significant reduction of the training time needed for the model to reach it’s optimal performance. SAGAN reached it’s best FID score after more that 1.3 million training steps while YLG-SAGAN reaches its’ optimal score after only 865,000 steps ( reduction to the training time). Figure 4 illustrates SAGAN and YLG-SAGAN FID and Inception score as a function of the training time.
We create two collages to display samples from our YLG version of SAGAN. At the Upper Panel of Figure 7, we show dogs of different breeds generated by our YLG-SAN. At the Lower Panel, we use YLG-SAGAN to generate samples from randomly chosen classes of the ImageNet dataset.
4.1 Ablation Study
Number of Attention Heads
The Original SAGAN implementation used a single-headed attention mechanism. In YLG, we use multiple heads to perform parallel multi-step sparse attention. Previous work has shown that multiple heads increased performance for Natural Language Processing tasks . To understand how multiple heads affect SAGAN performance, we train an 8 head version of SAGAN. The results are reported in the second row of Table 1. Multiple heads actually worsen significantly the performance of the original SAGAN, reducing Inception score from 52.52 to 46.01. We provide a post-hoc interpretation of this result. The image embedding of the query vector of SAGAN has only 32 vector positions. By using 8 heads, each head gets only 4 positions for its’ vector representation. Our intuition is that a 4-positions vector representation is not sufficient for effective encoding of the image information for a dense head and that accounts for the decrease in performance. It is important to note that YLG-SAGAN does not suffer from this problem. The reason is that each head is sparse, which means that only attends to a percentage of the positions that dense head attends to. Thus, a smaller vector representation does not worsen performance. Having multiple divergent sparse heads allows YLG layer to discover complex dependencies in the image space throughout the multi-step attention.
As described in Section 3.2 YLG uses the ESA procedure, to adapt 1-D sparse patterns to data with 2-D structure. Our motivation was that grid-locality could help our sparse attention layer to better model local regions. In order to validate this experimentally, we trained a version of YLG without the ESA procedure. We call this model YLG - No ESA. The results are shown in 4th row of Table 1: without the ESA procedure, the performance of YLG is about the same with the original SAGAN. This experiment indicates that ESA trick is essential for using one-dimensional sparse patterns for grid-structured data. If ESA framework is used, FID improves from to and Inception score from to , without any other difference in the model architecture. Thus, ESA is a plug-and-play framework that achieves great performance boosts to both FID and Inception score metrics. ESA allows the utilization of fast sparse one-dimensional patterns that were found to work well for text-sequences to be adapted to images, with great performance benefits. In section 5.1, we visualize attention maps to showcase how our model utilizes ESA framework in practice.
Our YLG layer uses the LTR and RTL patterns (Figures 2 and 2 respectively). Our intuition is that using multiple patterns at the same time increases performance because the model will be able to discover dependencies using multiple different paths. To test this intuition, we ran an experiment using the Full Information extension of the Strided  pattern. We choose this pattern because it was found to be effective for modeling images  due to its’ periodic structure. As with LTR and RTL patterns, we extend the Strided pattern so that it has Full Information33 3 We include visualizations of the Full Information Strided Pattern in the Appendix.. We refer to the YLG model that instead of LTR and RTL patterns, has 8 heads implementing the Strided pattern as YLG - Strided. For our experiment, we use again the ESA trick. We report the results on the 5th row of Table 1. YLG - Strided importantly surpasses SAGAN both in FID and Inception score, however, it is still behind YLG. Although in the Sparse Transformers  it has been claimed that strided pattern is more suitable for images than the patterns we use in YLG, this experiment strongly suggests that it is the grid-locality which makes the difference, as both models are far better than SAGAN. Also, this experiment indicates that multiple sparse patterns can boost performance compared to using a single sparse pattern. To be noted, using multiple different patterns at the same attention layer requires scaling the number of heads as well. Although YLG variations of SAGAN were not impacted negatively by the increase of attention heads, more severe up-scaling of the number of heads could potentially harm performance, similarly to how 8 heads harmed performance of SAGAN.
5 Inverting Generative Models with Attention
We are interested in visualizing our sparse attention on real images, not just generated ones. This leads naturally to the problem of projecting an image on the range of a generator, also called inversion. Given a real image and a generator , inversion corresponds to finding a latent variable , so that approximates the given image as well as possible. One approach for inversion is to try to solve the following non-convex optimization problem:
To solve this optimization problem, we can perform gradient descent from a random initalization to minimize this projection distance in the latent space. This approach was introduced independently in several papers [16, 3, 20] and further generalized to solve inverse problems beyond inversion [3, 20, 19, 14]. Very recent research [13, 23] demonstrated that for fully connected generators with random weights and sufficient layer expansion, gradient descent will provably converge to the correct optimal inversion.
Unfortunately, this theory does not apply for generators that have attention layers. Even empirically, inversion by gradient descent fails for bigger generative models like SAGAN and YLG-SAGAN. As we show in our experiments the optimizer gets trapped in local minimima producing reconstructions that only vaguely resemble the target image. Other approaches for inversion have been tried in the literature, like training jointly an encoder  but none of these methods have been known to successfully invert complex generative models with attention layers.
We propose a novel inversion method that uses the discriminator to solve the minimization problem in an different representation space. Interestingly, the discriminator yields representations with a smoother loss landscape, especially if we use the attention layer in a special way. In more detail: We begin with a random latent variable and a given real image . We denote with the Discriminator network up to, but not including, the attention layer and obtain the representations and . We could perform gradient descent to minimize the distance of these discriminator representations:
We found, however, that we can use the attention map of the real image to further enhance inversion. We will use the example of the SAGAN architecture to illustrate this. Inside the SAGAN Discriminator’s attention, an attention map is calculated. For each pixel of the image, this attention map is a distribution over the pixels of the image. We can use this attention map to extract a saliency map. For each pixel of the image, we can average the probabilities from all the pixels of the image and create a probability distribution of shape . We denote this distribution with the letter . Intuitively, this distribution represents how important each pixel of the image is to the discriminator.
Our proposed inversion algorithm is to perform gradient descent to minimize the discriminator embedding distance, weighted by these saliency maps:
where is a projected version of saliency map to the dimensions of . We actually calculate one saliency map per head and use their sum as the final loss function that we optimize for inversion. More details are included in the Appendix.
5.1 Inversion as lens to attention
Given an arbitrary real image, we can now solve for a yielding a similar generated image from the generator, and visualize the attention maps.
We explain our approach using an example of a real image of a redshank (Figure 5). Figure 5 shows how the standard method for inverting generators  fails: the beak, legs, and rocks are missing. Figure 5 shows the result of our method. Using the that we found using inversion, we can project maps of the attention layer back to the original image to get valuable insight into how the YLG layers work.
First, we analyze the differences between the YLG-SAGAN attention heads. For each attention head of the generator, we create a saliency map as described above and use these maps to analyze the attention mechanism. As shown in Figure 5, the head-7 in the generator is mostly ignoring background focusing on the bird. Other heads function differently: The saliency map of head-2 (Figure 5) shows that this head attends globally. We also find that there are heads that that attend quite sparsely, for example, head-5 attends only to 5-6 background pixels.
We present a second inversion, this time an indigo bird (Figure 6). Figure 6 shows how the standard method  for inverting fails: the head of the bird and the branch are not reconstructed. We also illustrate where specific query points attend to. We first illustrate that the the model exploited the local bias of ESA: We plot the attention map for query point for generator-head-0. This point, indicated with a blue dot, is part of the background. We clearly see a local bias in the positions this point attends to. Another example of two-dimensional local attention is shown in Figure 6. This figure illustrates the attention map of generator-head-4 for a query point on the body of the bird (blue dot). This point attends to the edges of the bird body and to the bird head.
Finally, Figure 6 shows that there are query points that attend to long-distance, demonstrating that the attention mechanism is capable of exploiting both locality and long-distance relationships when these appear in the image.
6 Related Work
There has been a flourishing of novel ideas on making attention mechanisms more efficient. Dai et al.  separate inputs into chunks and associate a state vector with previous chunks of the input. Attention is performed per chunk, but information exchange between chunks is possible via the state vector. Guo et al.  show that a star-shaped topology can reduce attention cost from to in text sequences. Interestingly, this topology does have full information, under our framework. Sukhbaatar et al.  introduced the idea of a learnable adaptive span for each attention layer. Calian et al.  proposed a fast randomized algorithm that exploits spatial coherence and sparsity to design sparse approximations. We believe that all these methods can be possibly combined with YLG, but so far nothing has been demonstrated to improve generative models in a plug-and-play way that this work shows.
7 Conclusions and Future Work
We introduced a new type of local sparse attention layer designed for two-dimensional data. We believe that our layer will be widely applicable for any model with attention that works on two-dimensional data. An interesting future direction is the design of attention layers, thought of as multi-step networks. The two conflicting objectives are to make these networks as sparse as possible (for computational and statistical efficiency) but also support good information flow. We introduced information flow graphs as a mathematical abstraction and proposed full information as a desired criterion for such attention networks.
Finally, we presented a novel way to solve the inversion problem for GANs. Our technique uses the discriminator in two ways: First, using its attention to obtain pixel importance and second, as a smoothing representation of the inversion loss landscape. This new inversion method allowed us to visualize our network on approximations of real images and also to test how good a generative model is in this important coverage task. We believe that this is the first key step towards using generative models for inverse problems and we plan to explore this further in the future.
-  (2000) Network information flow. IEEE Transactions on information theory 46 (4), pp. 1204–1216. Cited by: §3.1.
-  (2019-10) Seeing What a GAN Cannot Generate. arXiv e-prints, pp. arXiv:1910.11626. External Links: Cited by: §8.4.1.
-  (2017) Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 537–546. Cited by: 4th item, Figure 5, §5.1, §5.1, §5, Figure 11, Figure 11, §8.1.2, §8.4.1.
-  (2018-09) Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv e-prints, pp. arXiv:1809.11096. External Links: Cited by: §1, §1, §4, §8.2.
-  (2019-05) SCRAM: Spatially Coherent Randomized Attention Maps. arXiv e-prints, pp. arXiv:1905.10308. External Links: Cited by: §6.
-  (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §1, §1, §2, Figure 2, Figure 2, §3.1, §3.1, §3.2, §3.2, §3.2, §4.1, Figure 10, Figure 10, Figure 10, §8.3, §8.3.
-  (2019-01) Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. arXiv e-prints, pp. arXiv:1901.02860. External Links: Cited by: §6.
-  (2010) Network coding for distributed storage systems. IEEE transactions on information theory 56 (9), pp. 4539–4551. Cited by: 2nd item, §3.1, §8.5.
-  (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §5.
-  (2014-06) Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1406.2661. External Links: Cited by: §1.
-  (2017) Gpu kernels for block-sparse weights. arXiv preprint arXiv:1711.09224. Cited by: §3.2.
-  (2019-02) Star-Transformer. arXiv e-prints, pp. arXiv:1902.09113. External Links: Cited by: §3.2, §6, §8.5.
-  (2019) Global guarantees for enforcing deep generative priors by empirical risk. IEEE Transactions on Information Theory. Cited by: §5.
-  (2018) Task-aware compressed sensing with generative adversarial networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: 4th item, §5.
-  (2018-12) A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1812.04948. External Links: Cited by: §1.
-  (2017) Precise recovery of latent vectors from generative adversarial networks. arXiv preprint arXiv:1702.04782. Cited by: §5.
-  (2018) Image transformer. arXiv preprint arXiv:1802.05751. Cited by: §6.
-  (2015-11) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1511.06434. External Links: Cited by: §1.
-  (2019) GAN-based projector for faster recovery with convergence guarantees in linear inverse problems. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5602–5611. Cited by: 4th item, §5.
-  (2017) One network to solve them all–solving linear inverse problems using deep projection models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5888–5897. Cited by: 4th item, §5.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Cited by: §4.
-  (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-prints, pp. arXiv:1409.1556. External Links: Cited by: §8.4.1.
-  (2019) Surfing: iterative optimization over incrementally trained deep networks. arXiv preprint arXiv:1907.08653. Cited by: §5.
-  (2019-05) Adaptive Attention Span in Transformers. arXiv e-prints, pp. arXiv:1905.07799. External Links: Cited by: §1, §6.
-  (2017-06) Attention Is All You Need. arXiv e-prints, pp. arXiv:1706.03762. External Links: Cited by: §1, §2, §4.1.
-  (2018-05) Self-Attention Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1805.08318. External Links: Cited by: 3rd item, §1, §1, §4, §6, §8.3.
-  (2016-12) StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. arXiv e-prints, pp. arXiv:1612.03242. External Links: Cited by: §1.
-  (2019-07) Lookahead Optimizer: k steps forward, 1 step back. arXiv e-prints, pp. arXiv:1907.08610. External Links: Cited by: §8.1.3.
8.1 A closer look to our inversion method
This subsection aims to explain technical details of our inversion technique and clarify the details of our approach.
We begin with a recap of our method. Given a real image we pass it to the discriminator and we extract the attention map from the attention layer. This attention map contains for every point of the query image, a probability distribution over the pixels of the key image. We can then convert this attention map to a saliency map: by averaging the attention each key point gets from all the query points, we can get a probability distribution over the ”importance” of the pixels of the key image. We denote this saliency map with . Our proposed inversion algorithm is to perform gradient descent to minimize the discriminator embedding distance, weighted by this salience map:
where is a projected version of saliency map , is the image, and is the Discriminator network up to, but not including, the attention layer.
8.1.1 Multiple heads and saliency map
There are some practical considerations that we need to address before illustrating that our inversion method indeed works: the most important of which is how the saliency map looks like.
In our analysis of the YLG attention layers, we explain that because of the Full Information property, our patterns are able, potentially, to discover a dependency between any two pixels of an image. If that is true, we should expect that in the general case our saliency map, generated by the average of all heads, allocates non-zero weights to all image pixels. The important question becomes whether this joint saliency map weights more the pixels that are important for a visually convincing inversion. For example, in case of a bird flying with a blue-sky in the background, we should be ready to accept a small error in some point in the clouds of the sky but not a bird deformation that will make the inverted image look unrealistic. Therefore, our saliency map should allocate more weight in the bird than in it allocates in the background sky.
We already showed in Section 5.1 that different heads specialize in discovering important image parts (for example, some heads learn to focus on local neighbhoords, important shape edges, background, etc.) so extracting a saliency map by averaging all heads usually leads in a uniform distribution over pixels, which is not helping inversion. Figure 8 shows the saliency map jointly all heads of the attention layer of the discriminator produce. Although the bird receives a bit more attention than the background, it is not clear how this map would help weight our loss for inversion. However, as illustrated in 8, there are heads that produce far more meaningful saliency maps for a good-looking inversion. There is a drawback here as well though; if we use that head only, we completely miss the background.
To address this problem, we find two solutions that work quite well.
Solution 1: calculate Equation 2 separately for each head and then add the losses. In that case, the new loss function is given by the following equation:
where is the saliency map extracted from head .
Solution 2: Examine manually the saliency maps for each head and remove the heads that are attending mainly to non-crucial for the inversion areas, such as homogeneous backgrounds.
8.1.2 More inversion visualizations
We present several inversions for different categories of real images at Figure 9. In all our Figures, we use Solution 1 as it has the advantage that it does not require human supervision.
With our method, we can effectively invert real world scenes. We tested the standard inversion method  for these images as well and the results were far less impressive for all images. Especially for the dogs, we noted complete failure of the previous approach, similar to what we illustrate in Figure 11.
8.1.3 Experiments setup
In this subsection, we will briefly describe the experimental setup for our inversion technique. We choose to use the recently introduced Lookahead  optimizer as we find that it reduces the number of different seeds we have to try for a successful inversion. For the vast majority of the examined real images, we are able to get a satisfying inversion by trying at most 4 different seeds. We set the learning rate to and we update for maximum steps. On a single V100 GPU, a single image inversion takes less than a minute to complete. We choose to invert real-world images that were not present in the training set. We initialize our latent variables from a truncated normal distribution, as explained in 8.2.
8.2 Truncation and how it helps inversion
In the BigGAN  paper, the authors observed that latent variables sampled from a truncated normal distribution generated generally more photo-realistic images compared to ones generated from the normal distribution which was used during the training. This so-called truncation trick (resampling the values with magnitude above a chosen threshold) leads to improvement in sample quality at the cost of reduction in sample variety. For the generated images of YLG presented in this paper, we also utilized this trick.
Interestingly, the truncation trick can help inversion as well under some conditions. If the original image has good quality, then according to the truncation trick, it is more probable to be generated by a latent variable sampled from a truncated normal (where values which fall outside a range are resampled to fall inside that range) than the standard normal distribution . For that reason, in our inversions we start our trainable latent variable from a sample of the truncated normal distribution. We found experimentally that setting the truncation threshold to two standard deviations from the median (in our case 0), is a good trade-off between producing photo-realistic images and having enough diversity to invert an arbitrary real world image.
8.3 Strided Pattern
In the ablation studies of our paper, we train a model we name YLG - Strided. For this model, we report better results than the baseline SAGAN  model and slightly worse results than the proposed YLG model. The purpose of this section is to give more information on how YLG and YLG - Strided differ.
First of all, the only difference between YLG and YLG Strided is the choosing of attention masks for the attention heads: both models implement 2-step attention patterns with Full Information and two-dimensional locality using the ESA framework.
YLG model uses the RTL and LTR patterns introduced in the paper (see Figures 2, 2). Each pattern corresponds to a two-step attention: in our implementation of multi-step attention we compute steps in parallel using multiple heads, so in total we need 8 attention heads for YLG. In YLG - Strided instead of using different patterns (RTL and LTR), we stick with using a single attention pattern. Our motivation is to: (i) investigate whether using multiple attention patterns simultaneously affects performance, (ii) discover whether the performance differences between one-dimensional sparse patterns reported in the literature remain when the patterns are rendered to be aware of two-dimensional geometry. To explore (i), (ii) a natural choice was to work with the Strided pattern proposed in Sparse Transformers  as it was found to be (i) effective for modeling images and (ii) more suitable than the Fixed pattern (see Figure 2a), on which we built to invent LTR, RTL.
We illustrate the Strided pattern, as proposed in Sparse Transformers , in Figures 10, 10. For a fair comparison with LTR, RTL we need to expand Strided pattern in order for it to have Full Information. Figures 10, 10 illustrate this expansion. The pattern illustrated in this Figure is exactly the pattern that YLG - Strided uses. Note that this pattern attends to the same order of positions, , as LTR and RTL. For one to one comparison with YLG, YLG - Strided has also 8 heads: the Full Information pattern is implemented 4 times, as we need 2 heads for a 2-step pattern. As already mentioned, we also use ESA framework for YLG - Strided.
8.4 Things that did not work
In this subsection, we present several ideas, relevant to the paper, that we experimented on and found that their results were not satisfying. Our motivation is to inform the research community about the observed shortcomings of these approaches so that other researchers can re-formulate them, reject them or compare their findings with ours.
8.4.1 Weighted inversion at the generator space
We already discussed that our key idea for the inversion: we pass a real image to the discriminator, extract the attention map, convert the attention map to a saliency distribution and we perform gradient descent to minimize the discriminator embedding distance, weighted by this saliency map:
where is a projected version of saliency map , is the image, and is the Discriminator network up to, but not including, the attention layer. In practise, we use Equation 3 for the reasons we explained in Section 8.1 but for the rest of this Section we will overlook this detail as it is not important for our point.
Equation 2 implies that the inversion takes place in the embedding space of the Discriminator. However, naturally one might wonder if we could use the saliency map to weight the inversion of the Generator, in other words, if we could perform gradient descent on:
where is a projected version of to match the dimensions of the Generator network.
In our experiments, we find that this approach generally leads to inversions of poor quality. To illustrate this, we present inversions of an image of a real husky from the the weighted generator inversion, the weighted discriminator inversion and standard inversion method  at Figure 11.
There are several reasons that could explain the quality gap when we change from inversion to the space of the Discriminator to that of the Generator. First of all, the saliency map we use to weight our loss is extracted from the Discriminator, which means that the weights reflect what the Discriminator network considers important at that stage. Therefore, it is reasonable to expect that this saliency map would be more accurate to describe what is important for the input of the attention of the discriminator than to the output of the Generator. Also note that due to the layers of the Discriminator before the attention, the images of the output of the generator and the input of the attention of the Discriminator can be quite different. Finally, the Discriminator may provide an ”easier” embedding space for inversion. The idea of using a different embedding space than the output of the Generator it is not new; activations from VGG16  have also been used for inversion . Our novelty is that we use the Discriminator instead of another pre-trained model to work on a new embedding space.
8.4.2 Combination of dense and sparse heads
In our paper, we provide strong experimental evidence that multi-step two-dimensional sparse local heads can be more efficient than the conventional dense attention layer. We justify this evidence theoretically by modelling the multi-step attention with Information Flow Graphs and indicating the implications of Full Information. Naturally, one might wonder what would happen if we combine YLG attention with dense attention. To answer this question, we split heads into two groups, the local - sparse heads and the dense ones. Specifically, we use 4 heads that implement the RTL, LTR patterns (see paper for more details) and 4 dense heads and we train this variation of SAGAN. We use the same setup as with our other experiments. We report FID 19.21 and Inception: 51.23. These scores are far behind than the scores of YLG and thus we did not see any benefit continuing the research in this direction.
8.4.3 Different resolution heads
One idea we believed it would be interesting was to train SAGAN with a multi-headed dense attention layer of different resolution heads. In simple words, that means that in this attention layer some heads have a wider vector representation than others. Our motivation was that the different resolutions could have helped enforcing locality in a different way; we expected the heads with the narrow hidden representations to learn to attend only locally and the wider heads to be able to recover long-range dependencies.
In SAGAN, the number of channels in the query vector is 32, so for an 8-head attention layer normally each head would get 4 positions. We split the 8 heads into two equal groups: the narrow and the wide heads. In our experiment, narrow heads get only 2 positions for their vector representation while wide heads get 6. After training on the same setup with our other experiments, we obtain FID 19.57 and Inception score: 50.93. These scores are slightly worse than the original SAGAN, but are far better than SAGAN with dense 8-head attention which achieved FID 20.09 and Inception 46.01, as mentioned in the ablation study.
At least in our preliminary experiments, different resolution heads were not found to help very much. Perhaps they can be combined with YLG attention but we more research would be needed in this direction.
8.5 Information Flow Graphs
We found that thinking about sparse attention as a network with multiple stages is helpful in visualizing how information of different tokens is attended and combined. We use Information Flow Graphs (IFGs) that were introduced in  for modeling how distributed storage codes preserve data. In full generality, IFGs are directed acyclic graphs with capacitated directed edges. Each storage node is represented with two copies of a vertex ( and ) connected by a directed edge with capacity equal to the amount of information that can be stored into that node. The key insight is that a multi-stage attention network can be considered a storage network since intermediate tokens are representing combinations of tokens at the previous stage. The IFGs we use in this paper are a special case: every token of every stage of an attention layer is represented by a storage node. Since all the tokens have the same size, we can eliminate vertex splitting and compactly represent each storage node by a single vertex, as shown in Figure 10.
Full information is a design requirement that we found to be helpful in designing attention networks. It simply means that any single input token is connected with a directed path to any output token and hence information (of entropy equal to one token representation) can flow from any one input into any one output. As we discussed in the paper, we found that previously used sparse attention patterns did not have this property and we augmented them to obtain the patterns we use. A stronger requirement would be that any pair of input nodes is connected to any pair of output nodes with two edge-disjoint paths. This would mean that flow of two tokens can be supported from any input to any output. Note that a fully connected network can support this for any pair or even for any set of input-output pairs for .
An interesting example is the star transformer  where all input tokens are connected to a single intermediate node which is then connected to all output tokens. This information flow graph has directed edges and can indeed support full information. However, it cannot support a flow of tokens for any pair, since there is a bottleneck at the intermediate node. We believe that enforcing good information flow for pairs or higher size sets improves the design of attention networks and we plan to investigate this further in the future.