Abstract
This paper addresses metric 3D reconstruction of indoor scenes by exploiting their inherent geometric regularities with compact representations. Using planar 3D primitives, a representation well suited to man-made environments, we introduce PLANA3R, a pose-free framework for metric Planar 3D Reconstruction from unposed two-view images. Our approach employs Vision Transformers to extract a set of sparse planar primitives and estimate relative camera poses, and it supervises geometry learning via planar splatting, in which gradients are propagated through high-resolution rendered depth and normal maps of the primitives. Unlike prior feedforward methods that require 3D plane annotations during training, PLANA3R learns planar 3D structures without explicit plane supervision, enabling scalable training on large-scale stereo datasets using only depth and normal annotations. We validate PLANA3R on multiple indoor-scene datasets with metric supervision and demonstrate strong generalization to out-of-domain indoor environments across diverse tasks under metric evaluation protocols, including 3D surface reconstruction, depth estimation, and relative pose estimation. Furthermore, because it is formulated on a planar 3D representation, our method exhibits an emergent ability to segment planes accurately. The project page is available at https://lck666666.github.io/plana3r
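
To make the idea of a rendered depth map of a planar primitive concrete, the following minimal sketch computes the per-pixel depth of a single infinite plane under a pinhole camera by ray-plane intersection. It is an illustration under stated assumptions, not PLANA3R's planar splatting renderer: the function names (`plane_depth`, `render_depth`), the plane parameterization `normal . X = offset` in the camera frame, and the intrinsics `fx, fy, cx, cy` are all hypothetical choices made for this example, and the sketch handles neither bounded primitives nor gradient propagation.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3-vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def plane_depth(u: float, v: float,
                fx: float, fy: float, cx: float, cy: float,
                normal: Vec3, offset: float) -> Optional[float]:
    """Depth (camera-frame z) at which the ray through pixel (u, v)
    meets the plane normal . X = offset, or None if there is no
    intersection in front of the camera.

    Assumption: an infinite plane expressed in the camera frame and a
    pinhole camera without lens distortion.
    """
    # Back-project the pixel to a ray with unit z, so the ray parameter
    # at the intersection is exactly the depth.
    ray: Vec3 = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = dot(normal, ray)
    if abs(denom) < 1e-9:
        return None  # ray is (numerically) parallel to the plane
    z = offset / denom
    return z if z > 0.0 else None  # discard hits behind the camera


def render_depth(width: int, height: int,
                 fx: float, fy: float, cx: float, cy: float,
                 normal: Vec3, offset: float) -> List[List[Optional[float]]]:
    """Depth map as rows of per-pixel depths (None where nothing is hit)."""
    return [
        [plane_depth(float(u), float(v), fx, fy, cx, cy, normal, offset)
         for u in range(width)]
        for v in range(height)
    ]
```

For a fronto-parallel plane (`normal = (0, 0, 1)`), every pixel receives the same depth `offset`; tilting the normal makes depth vary linearly in inverse depth across the image, which is the regularity that a planar representation exploits.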