Abstract
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.