Abstract
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute. To access our code and models, see https://scaling-diffusion-perception.github.io.