Abstract
Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
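To make the two denoisers named above concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes a noise model of the form x_noisy = x0 + sigma * eps with standard Gaussian eps, which the abstract does not specify. It also takes "optimal linear denoiser" to mean the standard minimum-mean-squared-error affine estimator (a Wiener filter) built from the empirical mean and covariance. The function names `optimal_denoiser` and `linear_denoiser_matrix` are hypothetical.

```python
import numpy as np


def optimal_denoiser(x_noisy, train, sigma):
    """Posterior mean E[x0 | x_noisy] when x0 is uniform over the rows of `train`.

    Assumes x_noisy = x0 + sigma * eps with eps ~ N(0, I) (an assumption of
    this sketch). The result is a softmax-weighted average of training points.
    """
    sq_dist = np.sum((train - x_noisy) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma ** 2)
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ train


def linear_denoiser_matrix(train, sigma):
    """Return (mu, W) for the MMSE affine denoiser x -> mu + W @ (x - mu).

    W = cov @ inv(cov + sigma^2 I), with cov the empirical pixel covariance.
    Row i of W shows how much each input pixel contributes to output pixel i.
    """
    mu = train.mean(axis=0)
    centered = train - mu
    cov = centered.T @ centered / train.shape[0]
    dim = cov.shape[0]
    # cov and (cov + sigma^2 I) commute, so solving from the left gives W.
    W = np.linalg.solve(cov + sigma ** 2 * np.eye(dim), cov)
    return mu, W
```

At small sigma the softmax weights in `optimal_denoiser` collapse onto the nearest training image, which is the memorization behavior described above. In `linear_denoiser_matrix`, W depends on the data only through its covariance, so reshaping a row of W to the image grid is one way to inspect how pixel correlations shape the denoiser's effective receptive field.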