SITAR: Semi-supervised Image Transformer for Action Recognition

Abstract

Recognizing actions from a limited set of labeled videos remains a challenge, as annotating visual data is not only tedious but can also be expensive due to its classified nature. Moreover, handling spatio-temporal data with deep 3D transformers can introduce significant computational complexity. In this paper, our objective is to address video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos along with a collection of unlabeled videos in a compute-efficient manner. Specifically, we rearrange multiple frames from the input videos in row-column form to construct super images. Subsequently, we capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images. Our proposed approach employs two pathways to generate representations for temporally augmented super images originating from the same video. Specifically, we utilize a 2D image transformer to generate representations and apply a contrastive loss function that minimizes the similarity between representations from different videos while maximizing the similarity between representations of the same video. Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition across various benchmark datasets, all while significantly reducing computational costs.
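
The two ideas in the abstract, tiling frames into a super image and contrasting two views of the same video against other videos, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `make_super_image` and `info_nce_loss`, the row-major frame ordering, the temperature value, and the use of an InfoNCE-style loss as a stand-in for the unspecified contrastive loss are all assumptions made for illustration.

```python
import numpy as np


def make_super_image(frames, rows, cols):
    """Tile rows*cols frames of shape (H, W, C) into one (rows*H, cols*W, C) image.

    Frames are placed in row-major order (an assumption; the abstract only
    says frames are rearranged "in row-column form").
    """
    frames = np.asarray(frames)
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    grid = frames.reshape(rows, cols, h, w, c)
    # Reorder to (rows, H, cols, W, C) so that merging axes yields a 2D tiling.
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * h, cols * w, c)


def info_nce_loss(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss (hypothetical stand-in for the paper's loss).

    z_a[i] and z_b[i] are embeddings of two temporally augmented super
    images of video i; every other pairing in the batch is a negative.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature              # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_prob)))
```

In a full pipeline under these assumptions, each super image would be passed through a 2D image transformer to obtain `z_a` and `z_b`. Because the input is an ordinary 2D image, no 3D attention over space and time is needed, which is the source of the compute savings the abstract claims.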
