Abstract
Multi-view stereo reconstruction (MVS) in the wild requires first estimating the camera parameters, i.e. the intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain, yet they are mandatory to triangulate corresponding pixels in 3D space, which is the core of all best-performing MVS algorithms. In this work, we take the opposite stance and introduce DUSt3R, a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections, i.e. one that operates without prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of the usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. When more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information; interestingly, we can also seamlessly recover from it pixel matches and relative and absolute camera poses. Exhaustive experiments on all these tasks show that the proposed DUSt3R can unify various 3D vision tasks and sets a new state of the art on monocular and multi-view depth estimation as well as relative pose estimation. In summary, DUSt3R makes many geometric 3D vision tasks easy.
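To make concrete the claim that depth and camera information can be read off a pointmap, the following is a minimal sketch rather than the DUSt3R implementation. It assumes a pointmap is an H x W grid holding one 3D point (x, y, z) per pixel, expressed in the camera's own frame, and that the principal point (cx, cy) is known. The helper names and the least-squares focal fit are illustrative assumptions and may differ from the procedure the authors actually use.

```python
# Illustrative sketch only: the data layout and helper names are assumptions,
# not DUSt3R's actual code.

def make_pointmap(depth, focal, cx, cy):
    """Back-project a depth grid into an H x W grid of (x, y, z) points
    in the camera frame with a pinhole model. Used here only to
    synthesize a pointmap to experiment with."""
    return [
        [((u - cx) * z / focal, (v - cy) * z / focal, z)
         for u, z in enumerate(row)]
        for v, row in enumerate(depth)
    ]

def depth_from_pointmap(pointmap):
    """Depth is simply the z coordinate of each per-pixel 3D point."""
    return [[p[2] for p in row] for row in pointmap]

def focal_from_pointmap(pointmap, cx, cy):
    """Least-squares focal length f minimizing, over all pixels,
    (u - cx - f * x / z)^2 + (v - cy - f * y / z)^2."""
    num = den = 0.0
    for v, row in enumerate(pointmap):
        for u, (x, y, z) in enumerate(row):
            a, b = x / z, y / z
            num += a * (u - cx) + b * (v - cy)
            den += a * a + b * b
    return num / den

# Synthetic 6 x 8 scene: a slanted plane seen by a camera with focal 500.
depth = [[2.0 + 0.1 * u + 0.05 * v for u in range(8)] for v in range(6)]
pointmap = make_pointmap(depth, focal=500.0, cx=3.5, cy=2.5)

recovered_depth = depth_from_pointmap(pointmap)
recovered_focal = focal_from_pointmap(pointmap, cx=3.5, cy=2.5)
```

In this synthetic setting the pointmap is noise-free, so the fit returns the focal length used to generate it. With a regressed pointmap the same fit would only give an estimate, and a robust variant would likely be preferable.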