LaSe-E2V: Towards Language-guided Semantic-Aware Event-to-Video Reconstruction

Abstract

Event cameras offer advantages over standard cameras such as low latency, high temporal resolution, and high dynamic range (HDR). Because their imaging paradigm differs fundamentally, a dominant line of research focuses on event-to-video (E2V) reconstruction to bridge event-based and standard computer vision. However, this task remains challenging due to its inherently ill-posed nature: event cameras only detect edge and motion information locally. Consequently, the reconstructed videos are often plagued by artifacts and regional blur, primarily caused by the ambiguous semantics of event data. In this paper, we find that language naturally conveys abundant semantic information, rendering it stunningly superior in ensuring semantic consistency for E2V reconstruction. Accordingly, we propose a novel framework, called LaSe-E2V, that can achieve semantic-aware high-quality E2V reconstruction from a language-guided perspective, buttressed by text-conditional diffusion models. However, due to diffusion models' inherent diversity and randomness, it is hardly possible to directly apply them to achieve spatial and temporal consistency for E2V reconstruction. Thus, we first propose an Event-guided Spatiotemporal Attention (ESA) module to condition the denoising pipeline on the event data effectively. We then introduce an event-aware mask loss to ensure temporal coherence and a noise initialization strategy to enhance spatial consistency. Given the absence of event-text-video paired data, we aggregate existing E2V datasets and generate textual descriptions using tagging models for training and evaluation. Extensive experiments on three datasets covering diverse challenging scenarios (e.g., fast motion, low light) demonstrate the superiority of our method.
