DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

Abstract

Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
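
To make the ray origin and endpoint parameterization concrete, the following minimal sketch computes, for every pixel of one view, a ray origin (the camera center) and a ray endpoint (the back-projected 3D point) in a shared world frame. It is an illustration only and is not taken from the paper: the pinhole model, the single focal length, and the names `pixel_rays`, `focal`, `cx`, `cy`, `R_c2w`, and `center` are assumptions made here, and the actual method may use different conventions (for example, coordinate normalization or patch-level rather than per-pixel prediction).

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_rays(
    depth: Sequence[Sequence[float]],
    focal: float,
    cx: float,
    cy: float,
    R_c2w: Sequence[Sequence[float]],
    center: Vec3,
) -> List[List[Tuple[Vec3, Vec3]]]:
    """Return, per pixel, (ray origin, ray endpoint) in the world frame.

    Assumes a pinhole camera (hypothetical setup): `R_c2w` is the 3x3
    camera-to-world rotation and `center` is the camera center in world
    coordinates. `depth[v][u]` is the z-depth of pixel (u, v).
    """
    rays = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) at depth d into the camera frame.
            p_cam = (d * (u - cx) / focal, d * (v - cy) / focal, d)
            # Rotate into the world frame and offset by the camera center.
            p_world = tuple(
                sum(R_c2w[i][j] * p_cam[j] for j in range(3)) + center[i]
                for i in range(3)
            )
            # Every ray of a view shares the same origin: the camera center.
            out_row.append((center, p_world))
        rays.append(out_row)
    return rays


# Toy example: a 2x2 depth map, identity rotation, camera at x = 1.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rays = pixel_rays(
    depth=[[2.0, 2.0], [2.0, 2.0]],
    focal=1.0,
    cx=0.5,
    cy=0.5,
    R_c2w=identity,
    center=(1.0, 0.0, 0.0),
)
```

Under these assumptions, the endpoints of a view describe its scene geometry, while the origins and ray directions together carry the information that determines the camera.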
