Self-Attention Based Context-Aware 3D Object Detection

Abstract

Most existing point-cloud based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention for 2D vision has shown that explicitly modeling global context and long-range interactions between positions can lead to more robust and competitive models. In this paper, we explore two variants of self-attention for contextual modeling in 3D object detection by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors and show consistent improvement over strong baseline models while simultaneously significantly reducing their parameter footprint and computational cost. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point-clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors with increased accuracy and parameter and compute efficiency. We achieve new state-of-the-art detection performance on KITTI and nuScenes datasets. Code is available at https://github.com/AutoVision-cloud/SA-Det3D.
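
The idea of augmenting convolutional features with self-attention features can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attend`, `augment_with_context`), the single-head scaled dot-product form, the random projection matrices, and the concatenation step are all assumptions made for illustration. The subset variant here draws context locations uniformly at random and does not model the learned deformations described in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query_feats, context_feats, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (an assumed form).
    # Every query location attends to every context location.
    q = query_feats @ w_q            # (n, d)
    k = context_feats @ w_k          # (m, d)
    v = context_feats @ w_v          # (m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, m)
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ v, attn

def augment_with_context(feats, w_q, w_k, w_v, context_idx=None):
    # Concatenate local (convolution-like) features with attention features.
    # context_idx=None: full pairwise attention over all n locations.
    # context_idx given: attend only to that subset of locations.
    context = feats if context_idx is None else feats[context_idx]
    ctx_feats, attn = attend(feats, context, w_q, w_k, w_v)
    return np.concatenate([feats, ctx_feats], axis=-1), attn

# Hypothetical sizes: 6 locations, 8 input channels, 4 attention channels.
rng = np.random.default_rng(0)
n, c, d = 6, 8, 4
feats = rng.standard_normal((n, c))
w_q, w_k, w_v = (rng.standard_normal((c, d)) for _ in range(3))

full_out, full_attn = augment_with_context(feats, w_q, w_k, w_v)

# Subset variant: 3 randomly sampled context locations (no learned offsets).
subset = rng.choice(n, size=3, replace=False)
sub_out, sub_attn = augment_with_context(feats, w_q, w_k, w_v, subset)
```

In this sketch the full variant builds an n-by-n attention matrix, while the subset variant builds an n-by-m matrix with m < n, which is the reason sampling a subset of locations lets global context scale to larger point-clouds.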
