We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically different design compared to the prominent paradigm of 3D convolutional architectures for video, TimeSformer achieves state-of-the-art results on several major action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Furthermore, our model is faster to train and has higher test-time efficiency compared to competing architectures. Code and pretrained models will be made publicly available.
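
To make the "divided attention" idea concrete, the following minimal sketch lists which patches a single query patch attends to in the temporal step and in the spatial step, and compares the number of query-key comparisons per patch against attending jointly over all space-time patches. It is an illustration only, not the released implementation: the function names, the choice of 8 frames, and the 196 patches per frame are assumptions made for the example, and the classification token is ignored.

```python
# Illustrative sketch only; names and sizes are hypothetical, not from the paper's code.

def divided_attention_neighbors(frame, patch, num_frames, num_patches):
    """Return the (frame, patch) keys one query attends to in each step.

    Temporal step: the same patch location in every frame.
    Spatial step: every patch location within the query's own frame.
    """
    temporal = [(t, patch) for t in range(num_frames)]
    spatial = [(frame, p) for p in range(num_patches)]
    return temporal, spatial


def comparisons_per_patch(num_frames, num_patches):
    """Count query-key comparisons for one patch (classification token ignored)."""
    joint = num_frames * num_patches    # one attention over all space-time patches
    divided = num_frames + num_patches  # temporal step followed by spatial step
    return joint, divided


if __name__ == "__main__":
    # Assumed example sizes: 8 frames, 14 x 14 = 196 patches per frame.
    temporal, spatial = divided_attention_neighbors(2, 5, num_frames=8, num_patches=196)
    print(len(temporal), len(spatial))
    print(comparisons_per_patch(8, 196))
```

Under these assumed sizes, the per-patch cost of the divided scheme grows with the sum of the frame count and the patch count rather than their product, which is one plausible reading of why separating the two attention steps helps efficiency.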