REST: REtrieve & Self-Train for generative action recognition

Abstract

This work addresses training a generative action/video recognition model whose output is a free-form, action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages, such as producing more fine-grained and human-readable output and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition. While there have recently been a few attempts to adapt V&L models trained with contrastive learning (e.g. CLIP) for video/action, to the best of our knowledge, we propose the very first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and Self-training, i.e. without using any action-specific labels; (b) a Retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition, where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.
