BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image

Abstract

Understanding and modeling a 3D scene from a single image is a practical problem. A recent advance proposes the panoptic 3D scene reconstruction task, which performs both 3D reconstruction and 3D panoptic segmentation from a single image. Despite substantial progress, recent works focus only on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, and their performance is hindered by two ambiguities. (1) Instance-channel ambiguity: the variable ids of instances in each scene lead to ambiguity when filling voxel channels with 2D information, confusing the subsequent 3D refinement. (2) Voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single-view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity when reconstructing regions behind the frontal-view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting, to address these two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on the synthetic dataset 3D-Front and the real-world dataset Matterport3D. Code and models are available at https://github.com/chtsy/buol.
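
The difference between depth-only lifting and occupancy-aware lifting can be illustrated with a toy example. The sketch below is not the BUOL implementation: it assumes an orthographic camera, a tiny voxel grid indexed as (row, column, depth plane), and hand-written inputs. The function names `lift_depth_only` and `lift_with_occupancy`, and the arrays `sem`, `depth_idx`, and `occupancy`, are hypothetical and exist only for illustration.

```python
import numpy as np

def lift_depth_only(sem, depth_idx, num_planes):
    """Place each pixel's semantic label at a single voxel: the visible surface."""
    h, w = sem.shape
    vox = np.zeros((h, w, num_planes), dtype=sem.dtype)
    rows, cols = np.indices((h, w))
    vox[rows, cols, depth_idx] = sem  # one voxel per pixel ray
    return vox

def lift_with_occupancy(sem, occupancy):
    """Copy each pixel's semantic label to every occupied plane along its ray."""
    # sem[:, :, None] broadcasts the 2D labels across the depth-plane axis.
    return np.where(occupancy, sem[:, :, None], 0).astype(sem.dtype)

# Toy 2x2 semantic map (0 means empty) and per-pixel surface depth plane.
sem = np.array([[1, 2],
                [0, 3]])
depth_idx = np.array([[0, 1],
                      [0, 2]])

# Assumed multi-plane occupancy: which of 4 depth planes each pixel ray occupies.
occupancy = np.zeros((2, 2, 4), dtype=bool)
occupancy[0, 0, 0:2] = True   # object spans planes 0..1
occupancy[0, 1, 1:4] = True   # object spans planes 1..3
occupancy[1, 1, 2:3] = True   # thin object on plane 2 only

surface = lift_depth_only(sem, depth_idx, num_planes=4)
filled = lift_with_occupancy(sem, occupancy)

print(np.count_nonzero(surface), np.count_nonzero(filled))
```

In this toy setting, depth-only lifting labels just the front surface voxel of each ray, while the occupancy-aware variant also labels voxels behind the surface, which is the region the abstract describes as ambiguous. Because the voxel values here are semantic labels rather than per-scene instance ids, the same label always means the same class, mirroring the deterministic semantic assignment described above.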
