Abstract
We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features, respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on the action-oriented video features in relation to the appearance features via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. Our action-appearance alignment and explicit few-shot learner condition the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods, which are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, we need to train MetaUVFS just once to perform competitively with, and sometimes even outperform, state-of-the-art supervised methods on the popular HMDB51, UCF101, and Kinetics100 few-shot datasets.
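
For readers unfamiliar with contrastive learning, the following is a minimal sketch of one common formulation, an InfoNCE-style loss over paired embeddings. The abstract does not state which contrastive objective MetaUVFS uses, so the loss form, the function name `info_nce_loss`, and the `temperature` default are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative; not taken from the paper).

    anchors[i] and positives[i] are embeddings of two views of the same
    unlabeled video; every other row in the batch acts as a negative.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, sharpened by the temperature.
    logits = a @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

Under this sketch, the loss is small when each embedding is most similar to its own second view and grows when the pairs are mismatched, which is the signal that lets an encoder learn from videos without labels.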