Abstract
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
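The inference-time ensembling described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: that the router produces one score per expert, that the scores are normalized with a softmax, and that the expert predictions are combined as a weighted sum. All names (`ensemble_predict`, `router`, the toy experts) are hypothetical and stand in for trained networks.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: an expert maps (noisy sample, timestep) to a prediction.
Expert = Callable[[Sequence[float], float], Vector]
# Hypothetical signature: the router maps the same inputs to one logit per expert.
Router = Callable[[Sequence[float], float], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(x_t: Sequence[float], t: float,
                     experts: Sequence[Expert], router: Router) -> Vector:
    """Combine expert predictions as a router-weighted sum (assumed form)."""
    weights = softmax(router(x_t, t))
    outputs = [expert(x_t, t) for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(x_t))
    ]


# Toy stand-ins for two independently trained experts.
experts = [
    lambda x, t: [1.0 * v for v in x],
    lambda x, t: [3.0 * v for v in x],
]
# A router that scores both experts equally, so each gets weight 0.5.
uniform_router = lambda x, t: [0.0, 0.0]

combined = ensemble_predict([1.0, -2.0], 0.5, experts, uniform_router)
```

With equal router scores, the combined prediction is the plain average of the two expert outputs. A router that strongly favors one expert would make the ensemble follow that expert almost exclusively.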