Abstract
Autonomous driving has attracted remarkable attention from both industry and academia. An important task is to estimate the 3D properties (e.g., translation, rotation, and shape) of a moving or parked vehicle on the road. This task, while critical, is still under-researched in the computer vision community, partially owing to the lack of a large-scale, fully annotated 3D car database suitable for autonomous driving research. In this paper, we contribute the first large-scale database suitable for 3D car instance understanding, ApolloCar3D. The dataset contains 5,277 driving images and over 60K car instances, where each car is fitted with an industry-grade 3D CAD model with absolute model size and semantically labelled keypoints. This dataset is more than 20 times larger than PASCAL3D+ and KITTI, the current state of the art. To enable efficient labelling in 3D, we build a pipeline that considers 2D-3D keypoint correspondences for a single instance and the 3D relationships among multiple instances. Equipped with this dataset, we build various baseline algorithms with state-of-the-art deep convolutional neural networks. Specifically, we first segment each car with a pre-trained Mask R-CNN, and then regress its 3D pose and shape based on a deformable 3D car model, with or without using semantic keypoints. We show that using keypoints significantly improves fitting performance. Finally, we develop a new 3D metric that jointly considers 3D pose and 3D shape, allowing for comprehensive evaluation and ablation study. By comparing with human performance, we suggest several future directions for further improvements.