Abstract
Advances in deep learning techniques have allowed recent work to reconstruct the shape of a single object given only one RGB image as input. Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction in which all objects live in a single consistent 3D coordinate frame relative to the camera and do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume. We validate the impact of our contributions experimentally on both synthetic data from ShapeNet and real images from Pix3D. Our method improves over the state-of-the-art single-object methods on both datasets. Finally, we evaluate performance quantitatively on multiple object reconstruction with synthetic scenes assembled from ShapeNet objects.
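
To make the idea behind extension (1) concrete, the following is a minimal sketch of how 2D features might be propagated to a camera-aligned 3D volume along viewing rays. It is an illustration under stated assumptions, not the architecture used in this work: the function name `ray_traced_skip`, the pinhole intrinsics `focal`, `cx`, `cy`, and the nearest-pixel lookup are all hypothetical choices made for brevity.

```python
import numpy as np

def ray_traced_skip(feat2d, voxel_centers, focal, cx, cy):
    """Copy 2D features to voxels along camera rays (nearest-pixel lookup).

    feat2d:        (H, W, C) image-space feature map.
    voxel_centers: (D, Hv, Wv, 3) voxel centers (x, y, z) in the camera frame.
    focal, cx, cy: assumed pinhole intrinsics in pixel units (hypothetical).
    Returns a (D, Hv, Wv, C) volume; voxels that project outside the image
    or lie behind the camera receive zeros.
    """
    H, W, C = feat2d.shape
    x = voxel_centers[..., 0]
    y = voxel_centers[..., 1]
    z = voxel_centers[..., 2]
    # Avoid division by zero for voxels at or behind the camera plane;
    # those voxels are masked out below.
    safe_z = np.where(z > 0, z, 1.0)
    # Perspective projection: every voxel on one camera ray maps to the same pixel.
    u = np.rint(focal * x / safe_z + cx).astype(int)
    v = np.rint(focal * y / safe_z + cy).astype(int)
    inside = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros(voxel_centers.shape[:-1] + (C,), dtype=feat2d.dtype)
    out[inside] = feat2d[v[inside], u[inside]]
    return out

# Two voxels on the same ray at different depths, plus one outside the image.
feat = np.arange(32, dtype=float).reshape(4, 4, 2)
centers = np.array([[[[0.5, 0.5, 1.0]]],
                    [[[1.0, 1.0, 2.0]]],
                    [[[10.0, 0.0, 1.0]]]])
vol = ray_traced_skip(feat, centers, focal=2.0, cx=1.0, cy=1.0)
```

In this sketch the first two voxels receive the same feature vector because they lie on one ray through the camera center, which is the sense in which the lookup respects perspective geometry. A learned model would more plausibly use differentiable bilinear sampling rather than rounding to the nearest pixel.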