Abstract
Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation are typically small, and the associated point cloud context is inherently sparse, which makes reliable 3D perception challenging. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures Instance- and Scene-level contextual information. Unlike existing approaches that focus only on BEV scene-level fusion, IS-Fusion explicitly incorporates instance-level multimodal information, which benefits instance-centric tasks such as 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
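
To make the instance-guided idea concrete, the IGF steps named above (mine instance candidates, relate them to one another, then let them guide the scene feature) can be illustrated with a toy NumPy sketch. This is not the released implementation: the function name `instance_guided_fusion`, the heatmap-based top-k candidate selection, and the plain dot-product attention without learned projections are all assumptions made for illustration, and the per-instance aggregation of local multimodal context is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def instance_guided_fusion(bev, heatmap, k=4):
    """Toy instance-guided enhancement of a BEV feature map (hypothetical).

    bev:     (H, W, C) scene-level BEV feature, assumed already fused.
    heatmap: (H, W) per-cell instance score (assumed to be given).
    Returns the enhanced (H, W, C) feature and the (k, 2) candidate cells.
    """
    H, W, C = bev.shape

    # 1) Mine instance candidates: take the k highest-scoring BEV cells.
    top = np.argsort(heatmap.ravel())[::-1][:k]
    ys, xs = np.unravel_index(top, (H, W))
    inst = bev[ys, xs]                                   # (k, C)

    # 2) Explore instance relationships: self-attention among candidates.
    attn = softmax(inst @ inst.T / np.sqrt(C))           # (k, k)
    inst = inst + attn @ inst

    # 3) Instances guide the scene: every BEV cell attends to the instances.
    scene = bev.reshape(-1, C)                           # (H*W, C)
    guide = softmax(scene @ inst.T / np.sqrt(C)) @ inst  # (H*W, C)
    enhanced = (scene + guide).reshape(H, W, C)

    return enhanced, np.stack([ys, xs], axis=1)
```

In this sketch the output keeps the shape of the input BEV map, which mirrors the statement that instances enhance the scene feature rather than replace it; a real model would use learned query, key, and value projections and a trained candidate heatmap.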