Abstract
Event cameras are gaining traction in traffic monitoring applications due to their low latency, high temporal resolution, and energy efficiency, which make them well-suited for real-time object detection at traffic intersections. However, the development of robust event-based detection models is hindered by the limited availability of annotated real-world datasets. To address this, several simulation tools have been developed to generate synthetic event data. Among these, the CARLA driving simulator includes a built-in dynamic vision sensor (DVS) module that emulates event camera output. Despite its potential, the sim-to-real gap for event-based object detection remains insufficiently studied. In this work, we present a systematic evaluation of this gap by training a recurrent vision transformer model exclusively on synthetic data generated using CARLA's DVS and testing it on varying combinations of synthetic and real-world event streams. Our experiments show that models trained solely on synthetic data perform well on synthetic-heavy test sets but suffer significant performance degradation as the proportion of real-world data increases. In contrast, models trained on real-world data demonstrate stronger generalization across domains. This study offers the first quantifiable analysis of the sim-to-real gap in event-based object detection using CARLA's DVS. Our findings highlight limitations in current DVS simulation fidelity and underscore the need for improved domain adaptation techniques in neuromorphic vision for traffic monitoring.