Abstract
In this work, we tackle the problem of video class-incremental learning (VCIL). Many existing VCIL methods mitigate catastrophic forgetting by rehearsal training with a few temporally dense samples stored in episodic memory, which is memory-inefficient. Alternatively, some methods store temporally sparse samples, sacrificing essential temporal information and thereby resulting in inferior performance. To address this trade-off between memory efficiency and performance, we propose EpiSodic and SEmaNTIc memory integrAtion for video class-incremental Learning (ESSENTIAL). ESSENTIAL consists of episodic memory for storing temporally sparse features and semantic memory for storing general knowledge represented by learnable prompts. We introduce a novel memory retrieval (MR) module that integrates episodic memory and semantic prompts through cross-attention, enabling the retrieval of temporally dense features from temporally sparse features. We rigorously validate ESSENTIAL on diverse datasets: UCF-101, HMDB51, and Something-Something-V2 from the TCD benchmark and UCF-101, ActivityNet, and Kinetics-400 from the vCLIMB benchmark. Remarkably, with significantly reduced memory, ESSENTIAL achieves favorable performance on the benchmarks.
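As a rough illustration of the idea behind the MR module, the sketch below applies plain single-head scaled dot-product cross-attention to expand a few stored features into a longer sequence. It is not the ESSENTIAL implementation. The abstract does not specify which input acts as query or key/value, so the choice here (prompts as queries, stored sparse features as keys and values) is an assumption, as are all names, sizes, and the omission of learned projection layers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention on lists of vectors.

    Returns one output vector per query; each output is a weighted
    average of the value vectors.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


# Hypothetical sizes: 2 stored (sparse) features, 8 retrieved (dense) time steps.
random.seed(0)
FEATURE_DIM, NUM_SPARSE, NUM_DENSE = 4, 2, 8

# Stand-in for the episodic memory of one video: temporally sparse features.
sparse_features = [
    [random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_SPARSE)
]
# Stand-in for learnable semantic prompts (randomly initialised, not trained).
semantic_prompts = [
    [random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_DENSE)
]

# Assumption: prompts act as queries, sparse features as keys and values.
dense_features = cross_attention(semantic_prompts, sparse_features, sparse_features)
print(len(dense_features), len(dense_features[0]))  # 8 4
```

In this toy version each retrieved vector is only a convex combination of the stored features. A trained module would presumably add learned projections and prompts so that the output carries information beyond the stored frames.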