Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
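
As a rough illustration of the general zero-shot setup the abstract refers to (matching a video embedding against class embeddings), the following minimal sketch scores one video against each class by cosine similarity and picks the best match. It is not the model proposed here: the function names, the three-dimensional toy vectors, and the class labels are all hypothetical, and real embeddings would come from a trained video encoder and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_predict(video_emb, class_embs):
    """Return the best-matching label and the per-class similarity scores.

    class_embs maps a class label to its (hypothetical) text embedding.
    """
    scores = {label: cosine_similarity(video_emb, emb)
              for label, emb in class_embs.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings, made up for illustration only.
class_embs = {
    "archery": [0.9, 0.1, 0.0],
    "bowling": [0.1, 0.9, 0.1],
}
video_emb = [0.8, 0.2, 0.1]

label, scores = zero_shot_predict(video_emb, class_embs)
print(label, scores)
```

In this framing, an ambiguous class name yields a poorly placed class embedding, which is the problem the description attributes are meant to mitigate.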
