Abstract
Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, while recent vision transformer models remain less explored. In this paper, we investigate the use of transformer models under the semi-supervised learning (SSL) setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown to be effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data, in which video clips are mixed via a mask whose masked tokens are consistent along the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets (Kinetics-400, UCF-101, and HMDB-51) verify the advantage of SVFormer. In particular, SVFormer outperforms the state of the art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. We hope our method can serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
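
To make the Tube TokenMix idea concrete, the following is a minimal NumPy sketch of mixing two clips with a mask that is identical in every frame. It is an illustration under stated assumptions, not the SVFormer implementation: the function names `tube_mask` and `tube_token_mix`, the token layout `(T, H, W, D)`, and the uniform random choice of masked positions are all hypothetical choices made for this example.

```python
import numpy as np


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, rng):
    """Sample one spatial token mask and repeat it over all frames.

    Hypothetical helper: masked positions are drawn uniformly at random,
    which is an assumption of this sketch.
    """
    num_tokens = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_tokens))
    flat = np.zeros(num_tokens, dtype=bool)
    flat[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    spatial = flat.reshape(grid_h, grid_w)
    # The same spatial mask in every frame yields "tubes" along time.
    return np.broadcast_to(spatial, (num_frames, grid_h, grid_w)).copy()


def tube_token_mix(tokens_a, tokens_b, mask):
    """Take masked tokens from clip b and all remaining tokens from clip a.

    tokens_a, tokens_b: arrays of shape (T, H, W, D)
    mask:               boolean array of shape (T, H, W)
    """
    return np.where(mask[..., None], tokens_b, tokens_a)


rng = np.random.default_rng(0)
T, H, W, D = 8, 14, 14, 16  # assumed token grid, for illustration only
clip_a = rng.normal(size=(T, H, W, D))
clip_b = rng.normal(size=(T, H, W, D))

mask = tube_mask(T, H, W, mask_ratio=0.5, rng=rng)
mixed = tube_token_mix(clip_a, clip_b, mask)

# Fraction of tokens in the mixed clip that come from clip_a.
lam = 1.0 - mask.mean()
```

Because the mask is sampled once in the spatial grid and then repeated, each masked position is replaced in all frames, so the two clips are never interleaved frame by frame at a given location.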