Self-supervised Video Transformer

Abstract

In this paper, we propose self-supervised training for video transformers using unlabelled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, to be invariant to spatiotemporal variations in actions. To the best of our knowledge, the proposed approach is the first to alleviate the dependency on negative samples or dedicated memory banks in Self-supervised Video Transformer (SVT). Further, owing to the flexibility of Transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encodings and supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes. Code: https://git.io/J1juJ
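
The view construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `ViewSpec`, `sample_view`, `make_views`, and `matching_pairs` are hypothetical, and so are the frame counts, crop sizes, and temporal spans, which are illustrative values only. The pairing rule, in which every view is matched against each global view, is one plausible reading of "match the features of these different views" and is an assumption as well.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ViewSpec:
    kind: str                    # "global" or "local"
    frame_indices: List[int]     # which frames of the source clip to keep
    crop: Tuple[int, int, int]   # (top, left, size) of a square spatial crop


def sample_view(num_frames, height, width, kind, n_frames, crop_size,
                span_frac, rng):
    """Sample one view: a temporal window subsampled to n_frames, plus a crop."""
    assert num_frames >= n_frames >= 2
    assert crop_size <= min(height, width)
    # Temporal extent covered by the view. A wider span with the same
    # n_frames corresponds to a lower effective frame rate.
    span = min(num_frames, max(n_frames, int(num_frames * span_frac)))
    start = rng.randint(0, num_frames - span)
    # Evenly spaced frame indices inside [start, start + span - 1].
    frame_indices = [start + (i * (span - 1)) // (n_frames - 1)
                     for i in range(n_frames)]
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return ViewSpec(kind, frame_indices, (top, left, crop_size))


def make_views(num_frames, height, width, n_local=4, seed=None):
    """Build two global views and n_local local views of one video.

    All numeric settings below are assumed for illustration.
    """
    rng = random.Random(seed)
    views = []
    # Global views: full temporal span, large crop, two frame rates.
    for n_frames in (8, 16):
        views.append(sample_view(num_frames, height, width, "global",
                                 n_frames, 224, 1.0, rng))
    # Local views: short temporal span, small crop, varying frame counts.
    for _ in range(n_local):
        n_frames = rng.choice((2, 4, 8))
        views.append(sample_view(num_frames, height, width, "local",
                                 n_frames, 96, 0.25, rng))
    return views


def matching_pairs(views):
    """Index pairs (source, target) whose features would be matched.

    Assumption: targets are the global views, and a view is never
    paired with itself.
    """
    return [(i, j)
            for j, target in enumerate(views) if target.kind == "global"
            for i in range(len(views)) if i != j]


if __name__ == "__main__":
    views = make_views(num_frames=64, height=256, width=320, seed=0)
    for v in views:
        print(v.kind, len(v.frame_indices), v.crop)
    print(len(matching_pairs(views)), "feature-matching pairs")
```

Because each view is only a list of frame indices and a crop box, views with different frame counts and spatial sizes can all be cut from the same decoded clip. A model that accepts variable-length token sequences could then process them with one set of weights, which is the property the abstract attributes to dynamically adjusted positional encodings.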
