CDFSL-V: Cross-Domain Few-Shot Learning for Videos

Abstract

Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic, as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. Specifically, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class-discriminative features from the source data. As training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class-discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. Our code is available at https://github.com/Sarinda251/CDFSL-V
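
The progressive curriculum described in the abstract can be pictured as a weighting that moves from the supervised source objective toward the target-domain objective over training. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear ramp, the function names `curriculum_weights` and `combined_loss`, and the use of plain scalar losses are all hypothetical choices made for illustration. The abstract states only that training begins with supervised learning on source data and transitions to target-domain-specific features.

```python
def curriculum_weights(step, total_steps):
    """Return (source_weight, target_weight) for a given training step.

    Assumption: a linear ramp. The abstract only says the emphasis shifts
    progressively from source supervision to target-domain features; the
    actual schedule may differ.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress, progress


def combined_loss(source_supervised_loss, target_loss, step, total_steps):
    """Blend the two (hypothetical, scalar) losses with the curriculum weights."""
    w_src, w_tgt = curriculum_weights(step, total_steps)
    return w_src * source_supervised_loss + w_tgt * target_loss
```

Under this assumed schedule, the source loss dominates at step 0, the two losses are weighted equally at the midpoint, and only the target loss contributes at the final step.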
