Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state-of-the-art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models - the previous state-of-the-art in fast text-to-image synthesis - in terms of sample quality and speed.