SlowFast Networks for Video Recognition

Abstract

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report 79.0% accuracy on the Kinetics dataset without using any pre-training, largely surpassing the previous best results of this kind. On AVA action detection we achieve a new state-of-the-art of 28.3 mAP. Code will be made publicly available.
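
As a rough illustration of the two-pathway idea, the sketch below shows how one clip might be sampled at two frame rates and how the Fast pathway's channel width might be scaled down. The function names and the values `slow_stride=16`, `alpha=8`, and `beta=0.125` are assumptions chosen for illustration and are not stated in the abstract; this is not the released implementation.

```python
def sample_pathway_indices(num_frames, slow_stride=16, alpha=8):
    """Return (slow_indices, fast_indices) into a clip of raw frames.

    slow_stride: hypothetical temporal stride of the Slow pathway.
    alpha: hypothetical frame-rate ratio; Fast sees alpha times more frames.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    # Slow pathway: sparse frames (low frame rate).
    slow = list(range(0, num_frames, slow_stride))
    # Fast pathway: dense frames (high frame rate).
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


def fast_channels(slow_channels, beta=0.125):
    """Channel width of the Fast pathway as a fraction beta of the Slow one.

    beta is a hypothetical channel ratio; a small value keeps Fast lightweight.
    """
    return max(1, int(slow_channels * beta))


slow_idx, fast_idx = sample_pathway_indices(num_frames=64)
print(len(slow_idx), len(fast_idx), fast_channels(64))
```

With these assumed settings, the Fast pathway processes eight times as many frames as the Slow pathway but with one eighth of the channels, which is the sense in which it can be lightweight while still operating at fine temporal resolution.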
