Efficient Video Transformers with Spatial-Temporal Token Selection

Abstract

Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions, conditioned on input video samples. Specifically, we formulate token selection as a ranking problem: a lightweight selection network estimates the importance of each token, and only the tokens with top scores are used for downstream evaluation. In the temporal dimension, we keep the frames that are most relevant for recognizing action categories, while in the spatial dimension, we identify the most discriminative region in feature maps without affecting the spatial context used in a hierarchical way in most video transformers. Since the decision of token selection is non-differentiable, we employ a perturbed-maximum based differentiable Top-K operator for end-to-end training. We conduct extensive experiments on Kinetics-400 with a recently introduced video transformer backbone, MViT. Our framework achieves similar results while requiring 20% less computation. We also demonstrate that our approach is compatible with other transformer architectures.
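To make the ranking formulation concrete, the following is a minimal sketch, not the authors' implementation. It scores tokens with a stand-in linear scorer, keeps the top-scoring ones, and estimates a smoothed selection probability by perturbing the scores with Gaussian noise, which is the general idea behind perturbed-maximum Top-K. All names (`score_tokens`, `top_k_indices`, `perturbed_top_k`), the linear scorer, and the values of `sigma` and `num_samples` are assumptions made for illustration; only the forward smoothing is shown, not the gradient computation.

```python
import random

def score_tokens(tokens, weights):
    """Hypothetical linear scorer: one importance score per token."""
    return [sum(w * x for w, x in zip(weights, tok)) for tok in tokens]

def top_k_indices(scores, k):
    """Hard Top-K: indices of the k highest scores, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def perturbed_top_k(scores, k, sigma=0.05, num_samples=500, seed=0):
    """Monte Carlo estimate of each token's probability of landing in the
    Top-K when Gaussian noise is added to the scores. This is a smooth
    function of the scores, unlike the hard selection above."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(num_samples):
        noisy = [s + sigma * rng.gauss(0.0, 1.0) for s in scores]
        for i in top_k_indices(noisy, k):
            counts[i] += 1
    return [c / num_samples for c in counts]

# Toy input: four tokens with two features each (made-up values).
tokens = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.1], [0.7, 0.9]]
weights = [1.0, 1.0]  # assumed scorer parameters

scores = score_tokens(tokens, weights)
kept = top_k_indices(scores, k=2)    # indices of tokens passed downstream
soft = perturbed_top_k(scores, k=2)  # smoothed selection indicators
```

Each noisy sample selects exactly `k` tokens, so the smoothed indicators always sum to `k`. In a trainable setting, the scorer would be a small neural network and the smoothed indicators would supply gradients to it.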
