Abstract
Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. In practice, however, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating the pivotal attention modules of a DiT, we find that the locality of attention maps is closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows into a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication. Project page: dit-generalization.github.io/.