Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

  • 2024-03-05 18:45:39
  • Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
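
As a rough illustration of the straight-line formulation mentioned in the abstract, the following minimal sketch interpolates linearly between a data sample and a noise sample and computes the constant velocity along that path. The abstract does not specify the sampler, so the timestep distribution used here (a sigmoid applied to a Gaussian draw, which favors intermediate noise levels) is an assumption chosen only to show what a non-uniform sampler could look like. All function names are hypothetical, and the convention that t = 0 is data and t = 1 is noise is likewise assumed.

```python
import math
import random


def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Velocity of the straight path; it is constant in t: noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def sample_timestep(rng, mean=0.0, std=1.0):
    """Assumed non-uniform sampler: sigmoid of a Gaussian draw.

    Compared with uniform sampling on (0, 1), this places more mass on
    intermediate t. It is an illustrative choice, not taken from the abstract.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]                       # toy "data" vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # standard Gaussian noise
t = sample_timestep(rng)
z_t = interpolate(x0, noise, t)             # noisy sample at time t
v = velocity_target(x0, noise)              # regression target for a velocity model
```

In a training loop under these assumptions, a network would receive `z_t` and `t` and be regressed onto `v`; changing `mean` and `std` shifts which noise levels are visited most often.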

 
