Abstract
Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to $\textit{specialize}$ the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a $\textbf{single standard GPU within just 5 minutes}$. Each low-rank adapter requires only $\textbf{18MB}$ of storage. We evaluated our method on $\textbf{more than 160 scenes}$ from the Replica, TUM and Waymo Open datasets, achieving up to $\textbf{88% performance improvement}$ on 3D reconstruction, multi-view pose estimation and novel-view rendering.
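
For readers unfamiliar with low-rank adaptation, the general LoRA formulation can be sketched as follows. The symbols $W_0$, $A$, $B$, $r$, $d$, $k$ and $\alpha$ are notation assumed here for illustration and do not appear in the abstract; which layers LoRA3D adapts and which rank it uses are not stated above.

$$
W = W_0 + \frac{\alpha}{r} B A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k)
$$

Only $A$ and $B$ are trained while the pre-trained weight $W_0$ stays frozen, so each adapted layer adds $r(d + k)$ parameters instead of $dk$. This is the general reason a per-scene adapter can be stored in far less space than the full model.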