Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Abstract

We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaled properly. We reexamine the design spaces of image tokenizers, the scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with a downsample ratio of 16, reconstruction quality of 0.94 rFID, and codebook usage of 97% on the ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on the ImageNet 256x256 benchmark, outperforming popular diffusion models such as LDM and DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetic quality images, demonstrating competitive performance in visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve a 326% - 414% speedup. We release all models and code to facilitate the open-source community of visual generation and multimodal foundation models.
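
To make the tokenizer figures concrete, the short sketch below works out how many discrete tokens an autoregressive model would have to predict for one image, assuming a square image and a tokenizer that downsamples each spatial dimension by the stated ratio. The function name `token_grid` is hypothetical and is not taken from the released code.

```python
def token_grid(image_size: int, downsample_ratio: int) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square image.

    Assumes the tokenizer reduces height and width by the same
    integer factor, so the token grid is also square.
    """
    if image_size % downsample_ratio != 0:
        raise ValueError("image_size must be divisible by downsample_ratio")
    side = image_size // downsample_ratio
    return side, side * side


# ImageNet 256x256 with a downsample ratio of 16
side, total = token_grid(256, 16)
print(f"{side} x {side} grid, {total} tokens per image")
```

Under these assumptions, a 256x256 image becomes a 16 x 16 grid, that is, a sequence of 256 tokens generated one at a time by next-token prediction.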
