Collaboratively Self-supervised Video Representation Learning for Action Recognition

Abstract

Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly using generative pose prediction and discriminative context matching as pretext tasks. Specifically, our CSVR consists of three branches: a generative pose prediction branch, a discriminative context matching branch, and a video generating branch. The first branch encodes dynamic motion features by using a Conditional-GAN to predict the human poses of future frames, and the second branch extracts static context features by contrasting positive and negative pairs of video features and I-frame features. The third branch is designed to generate both current and future video frames, with the aim of collaboratively improving the dynamic motion features and the static context features. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple popular video datasets.
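
As an illustration of the context matching idea only, the following minimal sketch contrasts video features against I-frame features with an InfoNCE-style loss. The abstract does not state the exact objective, so the loss form, the temperature value, the function name `info_nce`, and the toy feature vectors are all assumptions made for this example and not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(video_feats, iframe_feats, temperature=0.1):
    """Mean InfoNCE-style loss (assumed objective, hypothetical helper).

    Row i of video_feats and row i of iframe_feats are assumed to come from
    the same clip (positive pair); every other row serves as a negative.
    """
    vs = [l2_normalize(v) for v in video_feats]
    fs = [l2_normalize(f) for f in iframe_feats]
    total = 0.0
    for i, v in enumerate(vs):
        logits = [dot(v, f) / temperature for f in fs]
        # Numerically stable log-sum-exp over all candidate I-frame features.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matching I-frame feature.
        total += log_sum - logits[i]
    return total / len(vs)


# Toy features, made up purely for illustration.
video = [[1.0, 0.0], [0.0, 1.0]]
iframe_matched = [[0.9, 0.1], [0.1, 0.9]]   # aligned with the video rows
iframe_swapped = [[0.1, 0.9], [0.9, 0.1]]   # deliberately misaligned

loss_matched = info_nce(video, iframe_matched)
loss_swapped = info_nce(video, iframe_swapped)
print(loss_matched < loss_swapped)
```

Under this assumed objective, the loss is small when each video feature is closest to the I-frame feature of its own clip and grows when the pairs are misaligned, which is the behavior a context matching pretext task would reward.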
