Abstract
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.