Abstract
This technical report introduces PIXART-{\delta}, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-{\alpha} model. PIXART-{\alpha} is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-{\delta} significantly accelerates inference, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-{\delta} achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over PIXART-{\alpha}. Additionally, PIXART-{\delta} is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-{\delta} can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-{\delta} offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.