Abstract
Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.