Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes

Abstract

Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further introduce modality-specific feature adapters that align diverse sensor inputs into a shared latent space, enabling efficient integration with a single, shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. We set the new state of the art with CAFuser on the MUSES dataset with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, ranking first on the public benchmarks.
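To make the idea of condition-guided fusion concrete, the following is a minimal, illustrative sketch and not the CAFuser implementation. It assumes toy per-modality feature vectors, randomly initialized linear adapters, a linear condition classifier on the camera features, and a simple dot-product weighting of modalities by the Condition Token. The modality names, condition labels, dimensions, and all function names are hypothetical choices made for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# All names and sizes below are assumptions for this sketch.
SHARED_DIM = 4
CONDITIONS = ["clear", "fog", "rain", "snow"]
RAW_DIMS = {"camera": 6, "lidar": 3, "radar": 2}

rng = random.Random(0)
# One adapter per modality: maps raw features into the shared latent space.
adapters = {name: random_matrix(SHARED_DIM, dim, rng) for name, dim in RAW_DIMS.items()}
# Condition classifier operating on camera features only.
classifier = random_matrix(len(CONDITIONS), RAW_DIMS["camera"], rng)
# One learnable embedding per condition (random here).
condition_embeddings = random_matrix(len(CONDITIONS), SHARED_DIM, rng)

def condition_token(camera_features):
    """Classify the condition from camera features and build a token
    as the probability-weighted mix of condition embeddings."""
    probs = softmax(matvec(classifier, camera_features))
    token = [
        sum(p * emb[i] for p, emb in zip(probs, condition_embeddings))
        for i in range(SHARED_DIM)
    ]
    return probs, token

def fuse(raw_features):
    """Align each modality with its adapter, then weight modalities
    by their similarity to the condition token."""
    _, token = condition_token(raw_features["camera"])
    aligned = {name: matvec(adapters[name], feats) for name, feats in raw_features.items()}
    names = sorted(aligned)
    scores = [sum(t * a for t, a in zip(token, aligned[n])) for n in names]
    weights = dict(zip(names, softmax(scores)))
    fused = [sum(weights[n] * aligned[n][i] for n in names) for i in range(SHARED_DIM)]
    return fused, weights

sample = {name: [rng.uniform(-1, 1) for _ in range(dim)] for name, dim in RAW_DIMS.items()}
fused, weights = fuse(sample)
print(weights)
```

In this sketch the fusion weights change whenever the camera-derived token changes, which is the sense in which the fusion is condition-aware. A real model would learn the adapters, classifier, and embeddings end to end and would operate on spatial feature maps rather than single vectors.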
