Abstract
Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D structure of objects and background from incomplete observations. We learn this skill not via labeled examples, but simply by observing objects move. In this work, we propose an approach that observes unlabeled multi-view videos at training time and learns to map a single image observation of a complex scene, such as a street with cars, to a 3D neural scene representation that is disentangled into movable and immovable parts while plausibly completing its 3D structure. We separately parameterize movable and immovable scene parts via 2D neural ground plans. These ground plans are 2D grids of features aligned with the ground plane that can be locally decoded into 3D neural radiance fields. Our model is trained self-supervised via neural rendering. We demonstrate that the structure inherent to our disentangled 3D representation enables a variety of downstream tasks in street-scale 3D scenes using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction, highlighting its value as a backbone for data-efficient 3D scene understanding models. This disentanglement further enables scene editing via object manipulation such as deletion, insertion, and rigid-body motion.
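To make the ground-plan idea concrete, the following is a minimal sketch of how a 3D point might be queried against two ground plans, one for immovable and one for movable content. It is an illustration under stated assumptions, not the model described above: the names `bilinear_sample`, `decode`, `query`, and `extent` are hypothetical, the decoder is a hand-written stand-in for a learned network, and the composition rule (summing densities and weighting colors by density) is a common choice for combining radiance fields that we assume here rather than take from the text.

```python
import math

def bilinear_sample(plan, x, z, extent):
    """Sample a ground plan at world coordinates (x, z).

    `plan` is an H x W grid (rows along z, columns along x) of feature lists.
    `extent` (hypothetical parameter) is the half-width of the square ground
    region the plan covers, so [-extent, extent] maps onto the grid.
    """
    h, w = len(plan), len(plan[0])
    # Map world coordinates to continuous grid coordinates, clamped to the grid.
    u = min(max((x + extent) / (2 * extent) * (w - 1), 0.0), w - 1)
    v = min(max((z + extent) / (2 * extent) * (h - 1), 0.0), h - 1)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    # Blend the four neighbouring feature vectors channel by channel.
    return [
        (1 - du) * (1 - dv) * a + du * (1 - dv) * b
        + (1 - du) * dv * c + du * dv * d
        for a, b, c, d in zip(plan[v0][u0], plan[v0][u1],
                              plan[v1][u0], plan[v1][u1])
    ]

def decode(feature, y):
    """Toy stand-in for a learned decoder (assumption, not the paper's network).

    Maps a sampled feature and the point's height y to (density, rgb).
    """
    # Softplus keeps density non-negative; subtracting y lowers it with height.
    density = math.log1p(math.exp(feature[0] - y))
    # Sigmoid squashes the next three channels into [0, 1] colors.
    rgb = [1.0 / (1.0 + math.exp(-f)) for f in feature[1:4]]
    return density, rgb

def query(static_plan, dynamic_plan, point, extent=1.0):
    """Query both ground plans at a 3D point and compose the two fields."""
    x, y, z = point
    s_sigma, s_rgb = decode(bilinear_sample(static_plan, x, z, extent), y)
    d_sigma, d_rgb = decode(bilinear_sample(dynamic_plan, x, z, extent), y)
    # Assumed composition: densities add, colors are density-weighted.
    sigma = s_sigma + d_sigma
    rgb = [(s_sigma * s + d_sigma * d) / max(sigma, 1e-8)
           for s, d in zip(s_rgb, d_rgb)]
    return sigma, rgb
```

In this sketch, keeping the two plans separate is what makes editing simple: overwriting a region of `dynamic_plan` changes only the movable content at those ground locations, while `static_plan` continues to describe the background.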