Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Abstract

Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, the benchmark provides over 30,000 validated referring expressions, each enriched with four grounding attributes -- appearance, status, relation to viewer, and relation to other objects -- bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
