LinFusion: 1 GPU, 1 Minute, 16K Image

Abstract

Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save training cost and better leverage pre-trained models, we initialize our models from, and distill knowledge from, pre-trained Stable Diffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion delivers satisfactory zero-shot cross-resolution generation performance, generating high-resolution images, e.g., at 16K resolution. Moreover, it is highly compatible with pre-trained SD components, such as ControlNet and IP-Adapter, requiring no adaptation efforts. Code is available at https://github.com/Huage001/LinFusion.
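To make the complexity argument concrete, the following is a minimal NumPy sketch of a generic normalized, non-causal linear attention layer. It is not the LinFusion implementation: the function names, the `elu(x) + 1` feature map, and the `eps` stabilizer are assumptions chosen for illustration. The sketch shows how reordering the matrix products avoids materializing the n x n attention matrix, so cost grows linearly with the number of spatial tokens n, while the normalization keeps each token's attention weights summing to one.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a common positive feature map for linear attention.
    # Assumption for illustration; the abstract does not specify one.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Normalized, non-causal linear attention.

    q, k: (n, d) queries and keys; v: (n, d_v) values.
    Cost is O(n * d * d_v) rather than O(n^2 * d).
    """
    phi_q = feature_map(q)
    phi_k = feature_map(k)
    # Non-causal: every token sees the same global summaries of all tokens.
    kv = phi_k.T @ v             # (d, d_v)
    k_sum = phi_k.sum(axis=0)    # (d,)
    numer = phi_q @ kv           # (n, d_v)
    # Normalization: divide by the total (unnormalized) attention mass.
    denom = phi_q @ k_sum + eps  # (n,)
    return numer / denom[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    """Same computation with the explicit (n, n) weight matrix, for checking."""
    weights = feature_map(q) @ feature_map(k).T
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((64, 8))
k = rng.standard_normal((64, 8))
v = rng.standard_normal((64, 4))
out = linear_attention(q, k, v)
print(out.shape, np.allclose(out, quadratic_reference(q, k, v)))
```

The two functions compute the same result; only the linear variant avoids memory that scales with n squared, which is the property that matters when n is the number of latent pixels in a very high-resolution image.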
