Abstract
Diffusion models have shown incredible capabilities as generative models; indeed, they power the current state-of-the-art models on text-conditioned image generation such as Imagen and DALL-E 2. In this work we review, demystify, and unify the understanding of diffusion models across both variational and score-based perspectives. We first derive Variational Diffusion Models (VDM) as a special case of a Markovian Hierarchical Variational Autoencoder, where three key assumptions enable tractable computation and scalable optimization of the ELBO. We then prove that optimizing a VDM boils down to learning a neural network to predict one of three potential objectives: the original source input from any arbitrary noisification of it, the original source noise from any arbitrarily noisified input, or the score function of a noisified input at any arbitrary noise level. We then dive deeper into what it means to learn the score function, and connect the variational perspective of a diffusion model explicitly with the Score-based Generative Modeling perspective through Tweedie's Formula. Lastly, we cover how to learn a conditional distribution using diffusion models via guidance.