Zero-Shot Text-to-Image Generation

Abstract

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
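
As a rough illustration of what "a single stream of data" could mean in practice, the sketch below concatenates text tokens and image tokens into one sequence and lists the (context, next token) pairs an autoregressive model would be trained on. The vocabulary sizes, the offsetting of image ids, and the function names are all assumptions made for this example; the abstract does not specify any of them, and no transformer is implemented here.

```python
from typing import List, Tuple

# Hypothetical vocabulary sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 1000
IMAGE_VOCAB_SIZE = 512


def build_stream(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Concatenate text and image tokens into one sequence.

    Image ids are shifted by TEXT_VOCAB_SIZE so the two vocabularies
    do not collide (an assumption of this sketch).
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token out of range")
    if any(not 0 <= i < IMAGE_VOCAB_SIZE for i in image_tokens):
        raise ValueError("image token out of range")
    return list(text_tokens) + [TEXT_VOCAB_SIZE + i for i in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Return (context, target) pairs: each token is predicted from
    all tokens that precede it in the stream."""
    return [(stream[:k], stream[k]) for k in range(1, len(stream))]


if __name__ == "__main__":
    stream = build_stream([5, 7], [0, 3])
    for context, target in next_token_pairs(stream):
        print(context, "->", target)
```

Because the text tokens come first in the stream, every image token is predicted with the full caption in its context, which is how one model could condition image generation on text under these assumptions.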
