No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

Abstract

We introduce NoPoSplat, a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from unposed sparse multi-view images. Our model, trained exclusively with photometric loss, achieves real-time 3D Gaussian reconstruction during inference. To eliminate the need for accurate pose input during reconstruction, we anchor one input view's local camera coordinates as the canonical space and train the network to predict Gaussian primitives for all views within this space. This approach obviates the need to transform Gaussian primitives from local coordinates into a global coordinate system, thus avoiding errors associated with per-frame Gaussians and pose estimation. To resolve scale ambiguity, we design and compare various intrinsic embedding methods, ultimately opting to convert camera intrinsics into a token embedding and concatenate it with image tokens as input to the model, enabling accurate scene scale prediction. We utilize the reconstructed 3D Gaussians for novel view synthesis and pose estimation tasks and propose a two-stage coarse-to-fine pipeline for accurate pose estimation. Experimental results demonstrate that our pose-free approach can achieve superior novel view synthesis quality compared to pose-required methods, particularly in scenarios with limited input image overlap. For pose estimation, our method, trained without ground truth depth or explicit matching loss, significantly outperforms state-of-the-art methods. This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios. Code and trained models are available at https://noposplat.github.io/.
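
The intrinsic embedding mentioned in the abstract can be illustrated with a minimal sketch. It is not the released implementation: the function names, the choice of four normalized features (fx, fy, cx, cy), the single linear projection, and placing the token before the image tokens are all assumptions made here for illustration. The abstract only states that intrinsics become a token embedding that is concatenated with the image tokens.

```python
import numpy as np

def intrinsics_to_token(K, width, height, W_proj, b_proj):
    """Map a 3x3 pinhole intrinsics matrix to one token of size D (hypothetical helper)."""
    # Normalize focal lengths and principal point by the image size so the
    # features do not depend on resolution (an assumption of this sketch).
    feats = np.array([
        K[0, 0] / width,   # fx
        K[1, 1] / height,  # fy
        K[0, 2] / width,   # cx
        K[1, 2] / height,  # cy
    ])
    # A single linear layer stands in for whatever learned embedding is used.
    return feats @ W_proj + b_proj  # shape (D,)

def prepend_intrinsic_token(image_tokens, intrinsic_token):
    """Concatenate the intrinsic token with (N, D) image tokens, giving (N + 1, D)."""
    return np.concatenate([intrinsic_token[None, :], image_tokens], axis=0)

# Toy example with random weights and random image tokens.
rng = np.random.default_rng(0)
D, N = 8, 16
W_proj = rng.normal(size=(4, D))
b_proj = np.zeros(D)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
image_tokens = rng.normal(size=(N, D))

token = intrinsics_to_token(K, 640, 480, W_proj, b_proj)
tokens = prepend_intrinsic_token(image_tokens, token)
```

In this sketch the sequence passed to the model grows by one token per view, so the network can attend to the focal length when deciding how far away to place the Gaussians.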
