Learning to Predict the 3D Layout of a Scene

Abstract

While 2D object detection has improved significantly in recent years, real-world applications of computer vision often require an understanding of the 3D layout of a scene. Many recent approaches to 3D detection use LiDAR point clouds for prediction. We propose a method that uses only a single RGB image, thus enabling applications in devices or vehicles that do not have LiDAR sensors. By using an RGB image, we can leverage the maturity and success of recent 2D object detectors by extending a 2D detector with a 3D detection head. In this paper we discuss different approaches and experiments, including both regression and classification methods, for designing this 3D detection head. Furthermore, we evaluate how subproblems and implementation details impact the overall prediction result. We use the KITTI dataset for training, which consists of street traffic scenes with class labels, 2D bounding boxes and 3D annotations with seven degrees of freedom. Our final architecture is based on Faster R-CNN. The outputs of the convolutional backbone are fixed-size feature maps for every region of interest. Fully connected layers within the network head then propose an object class and perform 2D bounding box regression. We extend the network head with a 3D detection head, which predicts every degree of freedom of a 3D bounding box via classification. We achieve a mean average precision of 47.3% for moderately difficult data, measured at a 3D intersection over union threshold of 70%, as required by the official KITTI benchmark, outperforming previous state-of-the-art methods that use only a single RGB image by a large margin.
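
As a rough illustration of what predicting a degree of freedom "via classification" can mean, the following minimal sketch discretizes a continuous box parameter into bins and decodes a vector of class scores back into a value. The value ranges, bin counts, and the names DOF_BINS, encode and decode are hypothetical and chosen only for illustration; the abstract does not state the binning actually used.

```python
import math

# Hypothetical (low, high, num_bins) per parameter; not taken from the paper.
DOF_BINS = {
    "x": (-40.0, 40.0, 80),        # lateral position in metres (assumed range)
    "yaw": (-math.pi, math.pi, 24) # rotation around the vertical axis
}


def encode(value, low, high, num_bins):
    """Map a continuous value to a class index, clamping to the valid range."""
    width = (high - low) / num_bins
    index = int((value - low) // width)
    return min(max(index, 0), num_bins - 1)


def decode(scores, low, high):
    """Return the centre of the highest-scoring bin."""
    num_bins = len(scores)
    width = (high - low) / num_bins
    best = max(range(num_bins), key=lambda i: scores[i])
    return low + (best + 0.5) * width


def one_hot(index, num_bins):
    """Stand-in for a classifier output that is certain about one bin."""
    return [1.0 if i == index else 0.0 for i in range(num_bins)]


if __name__ == "__main__":
    low, high, n = DOF_BINS["yaw"]
    target = 1.0  # ground-truth yaw in radians
    label = encode(target, low, high, n)
    print(label, decode(one_hot(label, n), low, high))
```

With this kind of scheme the decoded value can differ from the target by up to half a bin width, so the choice of bin count trades resolution against the number of classes the head has to predict.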
