Abstract
Recent camera-based 3D object detection methods have introduced sequential frames to improve detection performance, hoping that multiple frames would mitigate the large depth estimation error. Despite improved detection performance, prior works rely on naive fusion methods (e.g., concatenation) or are limited to static scenes (e.g., temporal stereo), neglecting the importance of the motion cue of objects. These approaches do not fully exploit the potential of sequential images and show limited performance improvements. To address this limitation, we propose a novel 3D object detection model, P2D (Predict to Detect), that integrates a prediction scheme into a detection framework to explicitly extract and leverage motion features. P2D predicts object information in the current frame using solely past frames to learn temporal motion features. We then introduce a novel temporal feature aggregation method that attentively exploits Bird's-Eye-View (BEV) features based on predicted object information, resulting in accurate 3D object detection. Experimental results demonstrate that P2D improves mAP and NDS by 3.0% and 3.7%, respectively, compared to the sequential image-based baseline, illustrating that incorporating a prediction scheme can significantly improve detection accuracy.