Abstract
Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5% mAP and 59.2% NDS on the test set. Code and supplementary materials are available at https://github.com/jichengyuan/Collaborative-Perceiver.