Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Abstract

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, lowering Fréchet inception distance (FID) from 18.65 to 1.73 and raising inception score (IS) from 80.4 to 350.2, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
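
To make the scaling-law claim concrete: a power law of the form L = a * N^b becomes a straight line in log-log space, so a Pearson correlation near -1 between log N and log L indicates a clean power-law fit with a negative exponent. The sketch below illustrates that check under stated assumptions. It assumes the reported coefficient is measured in log-log space, which the abstract does not state explicitly. The helper names (`pearson`, `loglog_fit`) are hypothetical, and the data points are synthetic, generated from an exact power law purely for illustration; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def loglog_fit(sizes, losses):
    """Least-squares line through (log size, log loss).

    Returns (slope, intercept, r), where slope estimates the power-law
    exponent b, exp(intercept) estimates the coefficient a, and r is the
    Pearson correlation in log-log space.
    """
    log_x = [math.log(s) for s in sizes]
    log_y = [math.log(v) for v in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y)) / sum(
        (x - mean_x) ** 2 for x in log_x
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept, pearson(log_x, log_y)


# Synthetic illustration data (NOT from the paper): an exact power law
# L = 2.0 * N^(-0.1) evaluated at four hypothetical model sizes.
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [2.0 * s ** -0.1 for s in sizes]

slope, intercept, r = loglog_fit(sizes, losses)
print(f"exponent={slope:.4f}, coefficient={math.exp(intercept):.4f}, r={r:.6f}")
```

With real measurements the points scatter around the line, so the correlation falls slightly short of -1; a value such as -0.998 would mean the log-log relationship is very nearly linear.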
