Abstract
While neural 3D reconstruction has advanced substantially, it typically requires densely captured multi-view data with carefully initialized poses (e.g., using COLMAP). This requirement limits its broader applicability, as Structure-from-Motion (SfM) is often unreliable in sparse-view scenarios where feature matches are limited, resulting in cumulative errors. In this paper, we introduce InstantSplat, a novel and lightning-fast neural reconstruction system that builds accurate 3D representations from as few as 2-3 images. InstantSplat adopts a self-supervised framework that bridges the gap between 2D images and 3D representations using Gaussian Bundle Adjustment (GauBA) and can be optimized in an end-to-end manner. InstantSplat integrates dense stereo priors and co-visibility relationships between frames to initialize pixel-aligned geometry by progressively expanding the scene while avoiding redundancy. Gaussian Bundle Adjustment then quickly adapts both the scene representation and the camera parameters via gradient-based minimization of photometric error. Overall, InstantSplat achieves large-scale 3D reconstruction in mere seconds by reducing the required number of input views. It accelerates reconstruction by over 20 times, improves visual quality (SSIM) from 0.3755 to 0.7624 compared with COLMAP with 3D-GS, and is compatible with multiple 3D representations (3D-GS, 2D-GS, and Mip-Splatting).