Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether this observation extends to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). We observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce novel regularization techniques for training GANs with ViTs. Empirically, our approach, named ViTGAN, achieves performance comparable to the state-of-the-art CNN-based StyleGAN2 on the CIFAR-10, CelebA, and LSUN bedroom datasets.