TransGAN: Two Transformers Can Make One Strong GAN

Abstract

The recent explosive interest in transformers has suggested their potential to become powerful "universal" models for computer vision tasks, such as classification, detection, and segmentation. But how much further can transformers go: are they ready to take on more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we conduct the first pilot study in building a GAN \textbf{completely free of convolutions}, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed \textbf{TransGAN}, consists of a memory-friendly transformer-based generator that progressively increases feature resolution while decreasing embedding dimension, and a patch-level discriminator that is also transformer-based. We then demonstrate that TransGAN notably benefits from data augmentations (more than standard GANs do), a multi-task co-training strategy for the generator, and a locally initialized self-attention that emphasizes the neighborhood smoothness of natural images. Equipped with those findings, TransGAN can effectively scale up to bigger models and high-resolution image datasets. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones. Specifically, TransGAN sets a \textbf{new state-of-the-art} IS of 10.10 and FID of 25.32 on STL-10. It also reaches a competitive IS of 8.64 and FID of 11.89 on CIFAR-10, and an FID of 12.23 on CelebA $64\times64$. We conclude with a discussion of the current limitations and future potential of TransGAN. The code is available at \url{https://github.com/VITA-Group/TransGAN}.
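To make the "increasing resolution, decreasing embedding dimension" idea concrete, the following minimal sketch tabulates one possible stage schedule for such a generator. The starting resolution (8x8), starting embedding dimension (1024), number of stages, and the rule that each 2x upsampling divides the embedding dimension by 4 are all illustrative assumptions, not values stated in the abstract; `stage_schedule` is a hypothetical helper, not part of the released code.

```python
def stage_schedule(base_res=8, base_dim=1024, num_stages=3, scale=2):
    """Return a list of (resolution, embed_dim, num_tokens), one per stage.

    Assumption (illustrative only): each upsampling step multiplies height
    and width by `scale` and divides the embedding dimension by scale**2,
    so tokens * embed_dim stays constant from stage to stage.
    """
    stages = []
    res, dim = base_res, base_dim
    for _ in range(num_stages):
        # One token per spatial position of the feature map.
        stages.append((res, dim, res * res))
        res *= scale
        dim //= scale ** 2
    return stages


for res, dim, tokens in stage_schedule():
    print(f"{res}x{res}: tokens={tokens}, embed_dim={dim}, values={tokens * dim}")
```

Under these assumptions the number of tokens grows fourfold per stage while the per-token width shrinks fourfold, which is one way a generator could keep the size of its feature maps bounded as resolution increases. Self-attention cost still grows with the number of tokens, so this sketch only accounts for feature-map size, not attention memory.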
