Abstract
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
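To make the alignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame and sub-action embeddings have already been computed by some image-language model and are passed in as NumPy arrays. The function names (`cosine_similarity_matrix`, `dtw_alignment_score`, `classify`) are hypothetical. The sketch also assumes a simplified monotonic DTW variant in which every frame is matched to exactly one sub-action and the path score is averaged over frames; the abstract does not specify the exact recurrence or normalization.

```python
import numpy as np


def cosine_similarity_matrix(frames: np.ndarray, steps: np.ndarray) -> np.ndarray:
    """Return a (T, K) matrix of cosine similarities between T frames and K sub-actions."""
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    return f @ s.T


def dtw_alignment_score(sim: np.ndarray) -> float:
    """Best average similarity over monotonic alignments of frames to sub-actions.

    Assumption: the path starts at (first frame, first sub-action), ends at
    (last frame, last sub-action), and at each new frame either stays on the
    current sub-action or advances to the next one.
    """
    T, K = sim.shape
    if T < K:
        raise ValueError("need at least as many frames as sub-actions")
    dp = np.full((T, K), -np.inf)
    dp[0, 0] = sim[0, 0]
    for t in range(1, T):
        for k in range(K):
            stay = dp[t - 1, k]
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf
            best = max(stay, advance)
            if best > -np.inf:  # cell is reachable
                dp[t, k] = best + sim[t, k]
    return float(dp[T - 1, K - 1] / T)


def classify(frames: np.ndarray, class_steps: dict) -> tuple:
    """Score every candidate class and return (best_class, all_scores)."""
    scores = {
        name: dtw_alignment_score(cosine_similarity_matrix(frames, steps))
        for name, steps in class_steps.items()
    }
    return max(scores, key=scores.get), scores
```

In this sketch, a class whose LLM-generated sub-actions occur in the video in the stated order receives a higher score than a class with the same sub-actions in a different order, which is the property that frame-wise averaging of image-text similarities cannot capture.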