Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200$\times$ faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.