Exploring Diffusion Transformer Designs via Grafting

Abstract

Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: https://grafting.stanford.edu
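
To make the depth-restructuring case study concrete, the following is a minimal, illustrative sketch of what turning a pair of sequential residual blocks into one parallel block could look like. It is not the paper's implementation: each block is reduced to a single residual branch acting on a plain list of floats, the merge rule (summing both branch outputs onto the shared input) is an assumption, and every name here (`sequential_pair`, `parallel_pair`, `pair_up`, `scale`) is hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Branch = Callable[[Vector], Vector]  # stand-in for a block's residual branch


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def sequential_pair(x: Vector, f1: Branch, f2: Branch) -> Vector:
    """Two residual blocks applied one after the other (depth 2)."""
    h = add(x, f1(x))
    return add(h, f2(h))


def parallel_pair(x: Vector, f1: Branch, f2: Branch) -> Vector:
    """Both branches read the same input; outputs are summed (depth 1).

    Assumption: a plain sum is used to merge the branches.
    """
    return add(x, add(f1(x), f2(x)))


def pair_up(branches: List[Branch]) -> List[Tuple[Branch, Branch]]:
    """Group consecutive branches into pairs, halving the effective depth."""
    if len(branches) % 2 != 0:
        raise ValueError("expected an even number of blocks")
    return [(branches[i], branches[i + 1]) for i in range(0, len(branches), 2)]


def scale(factor: float) -> Branch:
    """Toy branch that multiplies its input by a constant."""
    return lambda v: [factor * t for t in v]


if __name__ == "__main__":
    x = [1.0, 2.0]
    f1, f2 = scale(0.5), scale(0.25)
    print(sequential_pair(x, f1, f2))
    print(parallel_pair(x, f1, f2))
    print(len(pair_up([f1, f2, f1, f2])))
```

The two forms generally produce different outputs for the same weights, which is consistent with the abstract's framing that a restructured model is obtained via grafting under a small compute budget rather than by rewiring alone.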
