Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Abstract

Supervised and self-supervised learning are the two main training paradigms for skeleton-based human action recognition. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish a vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal contrastive process to progressively control and guide how closely vision-language knowledge prompts and their corresponding skeletons are pulled together. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments show that our method achieves state-of-the-art results on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The code will be available in the future.
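
The abstract does not give the loss in closed form, so the following is only an illustrative sketch of how softened targets could enter a cross-modal contrastive objective. Everything specific here is an assumption rather than the authors' formulation: the function names, the equal-weight average of the intra-modal and inter-modal similarity distributions, the temperature `tau`, and the mixing weight `alpha` (which a training schedule might raise gradually to make the distillation progressive).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def log_softmax(logits):
    """Row-wise log-softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def soft_targets(skel, prompt, alpha, tau=0.1):
    """Blend one-hot instance targets with similarity-based soft targets.

    ASSUMPTION: the soft part is the plain average of
      - intra-modal self-similarity (prompt vs. prompt), and
      - inter-modal cross-consistency (skeleton vs. prompt).
    alpha = 0 gives ordinary one-hot instance discrimination.
    Inputs are expected to be L2-normalized, shape (N, D).
    """
    n = skel.shape[0]
    intra = softmax(prompt @ prompt.T / tau)
    inter = softmax(skel @ prompt.T / tau)
    soft = 0.5 * (intra + inter)
    return (1.0 - alpha) * np.eye(n) + alpha * soft

def soft_contrastive_loss(skel, prompt, alpha, tau=0.1):
    """Cross-entropy between the softened targets and skeleton-to-prompt logits.

    In a real training loop the targets would be treated as constants
    (no gradient), as is usual in self-distillation; NumPy has no autograd,
    so that detail is only noted here.
    """
    skel, prompt = l2_normalize(skel), l2_normalize(prompt)
    targets = soft_targets(skel, prompt, alpha, tau)
    log_probs = log_softmax(skel @ prompt.T / tau)
    return float(-(targets * log_probs).sum(axis=1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skel_emb = rng.normal(size=(8, 16))    # hypothetical skeleton embeddings
    prompt_emb = rng.normal(size=(8, 16))  # hypothetical prompt embeddings
    for alpha in (0.0, 0.25, 0.5):         # hypothetical progressive schedule
        print(alpha, soft_contrastive_loss(skel_emb, prompt_emb, alpha))
```

With `alpha = 0` this reduces to a standard one-directional InfoNCE loss over skeleton-prompt pairs; larger values of `alpha` relax the one-hot target, which is one plausible way to tolerate noisy pairs.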
