Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning

Abstract

Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding skeleton sequences, and can be categorized into two primary training paradigms: supervised learning and self-supervised learning. However, the former relies on one-hot classification, which requires labor-intensive annotation of predefined action categories, while the latter involves skeleton transformations (e.g., cropping) in its pretext tasks that may impair the skeleton structure. To address these challenges, we introduce a novel skeleton-based training framework (C$^2$VL) based on Cross-modal Contrastive learning that uses progressive distillation to learn task-agnostic human skeleton action representations from Vision-Language knowledge prompts. Specifically, we establish the vision-language action concept space through vision-language knowledge prompts generated by pre-trained large multimodal models (LMMs), which enrich the fine-grained details that the skeleton action space lacks. Moreover, we propose intra-modal self-similarity and inter-modal cross-consistency softened targets in the cross-modal representation learning process to progressively control and guide the degree to which vision-language knowledge prompts and their corresponding skeletons are pulled closer. These soft instance discrimination and self-knowledge distillation strategies contribute to learning better skeleton-based action representations from noisy skeleton-vision-language pairs. During the inference phase, our method requires only skeleton data as input for action recognition and no longer needs vision-language prompts. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our method outperforms previous methods and achieves state-of-the-art results. Code is available at: https://github.com/cseeyangchen/C2VL.
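
To make the idea of softened targets concrete, the following is a minimal NumPy sketch of a cross-modal contrastive loss whose one-hot targets are blended with an intra-modal self-similarity distribution. It is an illustration under stated assumptions, not the formulation used in C$^2$VL: the function names, the blending weight `alpha`, the temperature `tau`, and the choice to derive the soft part from prompt-to-prompt similarity are all hypothetical, and the abstract does not specify the actual equations.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_targets(prompt_emb, alpha, tau):
    # ASSUMPTION: blend the one-hot (instance discrimination) target with a
    # distribution built from prompt-to-prompt similarity. alpha=0 recovers
    # the usual hard targets; larger alpha softens them.
    n = prompt_emb.shape[0]
    self_sim = np.exp(log_softmax(prompt_emb @ prompt_emb.T / tau))
    return (1.0 - alpha) * np.eye(n) + alpha * self_sim

def soft_contrastive_loss(skel_emb, prompt_emb, alpha=0.5, tau=0.1):
    # Cross-entropy between the softened targets and the skeleton-to-prompt
    # similarity distribution (hypothetical loss, for illustration only).
    s = l2_normalize(skel_emb)
    p = l2_normalize(prompt_emb)
    logits = s @ p.T / tau
    targets = soft_targets(p, alpha, tau)
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())

# Toy batch: 8 paired embeddings of dimension 16 (random stand-ins for the
# outputs of a skeleton encoder and a vision-language prompt encoder).
rng = np.random.default_rng(0)
prompts = rng.normal(size=(8, 16))
skeletons = prompts + 0.05 * rng.normal(size=(8, 16))
loss_value = soft_contrastive_loss(skeletons, prompts, alpha=0.3)
```

In a sketch like this, increasing `alpha` over the course of training would be one way to loosen the pull between each skeleton and its own prompt step by step; whether the paper schedules its targets this way is not stated in the abstract.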
