Abstract
In this paper, we address the challenging problem of cross-modal (image-to-events) adaptation for event-based recognition without access to any labeled source image data. This task is arduous due to the substantial modality gap between images and events. With only a pre-trained source model available, the key challenge lies in extracting knowledge from this model and effectively transferring it to the event-based domain. Inspired by the natural ability of language to convey semantics across different modalities, we propose EventDance++, a novel framework that tackles this unsupervised source-free cross-modal adaptation problem from a language-guided perspective. We introduce a language-guided reconstruction-based modality bridging (L-RMB) module, which reconstructs intensity frames from events in a self-supervised manner. Importantly, it leverages a vision-language model to provide further supervision, enriching the surrogate images and enhancing modality bridging. This enables the creation of surrogate images from which knowledge (i.e., labels) can be extracted using the source model. On top of this, we propose a multi-representation knowledge adaptation (MKA) module to transfer knowledge to target models, utilizing multiple event representations to fully capture the spatiotemporal characteristics of events. The L-RMB and MKA modules are jointly optimized to bridge the modality gap. Experiments on three benchmark datasets demonstrate that EventDance++ performs on par with methods that utilize source data, validating the effectiveness of our language-guided approach in event-based recognition.