Abstract
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending, as cameras are cost-effective for mass production in the autonomous driving industry. However, there exists a distinct performance gap between multi-camera BEV and LiDAR-based 3D object detection. One key reason is that LiDAR captures accurate depth and other geometry measurements, while it is notoriously challenging to infer such 3D information from image input alone. In this work, we propose to boost the representation learning of a multi-camera BEV-based student detector by training it to imitate the features of a well-trained LiDAR-based teacher detector. We propose an effective balancing strategy that makes the student focus on learning the crucial features from the teacher, and generalize knowledge transfer to multi-scale layers with temporal fusion. We conduct extensive evaluations on multiple representative multi-camera BEV models. Experiments reveal that our approach yields significant improvements over the student models, leading to state-of-the-art performance on the popular nuScenes benchmark.