Language Supervised Training for Skeleton-based Action Recognition

Abstract

Skeleton-based action recognition has drawn considerable attention for its computational efficiency and robustness to lighting conditions. Existing skeleton-based action recognition methods are typically formulated as a one-hot classification task and do not fully utilize the semantic relations between actions. For example, "make victory sign" and "thumb up" are both hand gestures, whose major difference lies in the movement of the hands. This information is absent from the categorical one-hot encoding of action classes but can be recovered from language descriptions of the actions. Utilizing action language descriptions in training could therefore benefit representation learning. In this work, we propose a Language Supervised Training (LST) approach for skeleton-based action recognition. More specifically, we employ a large-scale language model as a knowledge engine to provide text descriptions of the body-part movements of actions, and propose a multi-modal training scheme that uses the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning. Experiments show that our proposed LST method achieves noticeable improvements over various baseline models without extra computation cost at inference. LST achieves new state-of-the-art results on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The code can be found at https://github.com/MartinXM/LST.
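
The multi-modal training scheme described above can be illustrated with a small sketch. The abstract does not state the loss function, so the following assumes a cross-entropy over cosine similarities between per-part skeleton features and per-part text features; it is not the authors' implementation. The names `part_alignment_loss`, `skel_feats`, `text_feats` and `temperature` are hypothetical, and the text features here are stand-ins for the output of a text encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def part_alignment_loss(skel_feats, text_feats, labels, temperature=0.1):
    """Assumed part-wise alignment loss (illustrative only).

    skel_feats: (B, P, D) skeleton-encoder features, one per sample and body part
    text_feats: (C, P, D) text features, one per action class and body part
    labels:     (B,) integer class index of each sample
    """
    s = l2_normalize(skel_feats)
    t = l2_normalize(text_feats)
    # logits[b, p, c]: cosine similarity between part p of sample b and the
    # text feature for part p of class c, scaled by the temperature.
    logits = np.einsum("bpd,cpd->bpc", s, t) / temperature
    logp = log_softmax(logits, axis=-1)
    batch, parts, _ = logits.shape
    # Select the log-probability of the ground-truth class for every part.
    picked = logp[np.arange(batch)[:, None], np.arange(parts)[None, :], labels[:, None]]
    return -picked.mean()


# Toy usage with random stand-in features: 5 classes, 4 body parts, 16 dims.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(5, 4, 16))
labels = np.array([0, 3, 2])
skel_feats = text_feats[labels]  # perfectly aligned skeleton features
loss = part_alignment_loss(skel_feats, text_feats, labels)
```

In a sketch like this the text branch only contributes a training signal, so it can be dropped once training is done, which is consistent with the claim of no extra computation cost at inference.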
