From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding

Abstract

Selecting informative keyframes is critical for efficient video understanding, yet existing approaches often rely on heuristics, ignore semantics, or produce redundant frames. We propose KeyScore, a caption-aware frame scoring method that combines three complementary signals: semantic similarity to captions, temporal representativeness, and contextual drop impact. Applied to large-scale video-caption datasets, KeyScore generates frame-level importance scores that enable training keyframe extractors or guiding video-language models. To support this, we also propose STACFP, a Spatio-Temporal Adaptive Clustering method that generates diverse and compact frame proposals across long videos. Together, KeyScore and STACFP reduce uninformative frames while preserving critical content, resulting in faster and more accurate inference. Our experiments on three standard video-language benchmarks (MSRVTT, MSVD, DiDeMo) show that combining STACFP and KeyScore enables up to 99% frame reduction compared to full-frame processing, while outperforming uniform 8-frame encoders in video-text retrieval, keyframe extraction, and action recognition tasks. By focusing on semantically relevant frames, our method enhances both efficiency and performance, enabling scalable and caption-grounded video understanding.
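
To make the three-signal idea concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract names the signals but does not give their formulas, so every definition here is an assumption: semantic similarity is taken as cosine similarity between a frame embedding and the caption embedding, temporal representativeness as a frame's mean cosine similarity to the other frames, and contextual drop impact as the loss in video-caption similarity when the frame is removed from a mean-pooled video embedding. The function name `keyscore_sketch`, the min-max normalization, and the equal weights are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def min_max(values):
    """Rescale values to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def keyscore_sketch(frame_embs, caption_emb, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical per-frame score combining three assumed signals."""
    n = len(frame_embs)
    if n < 2:
        raise ValueError("need at least two frames")

    # Assumed signal 1: similarity of each frame to the caption.
    semantic = [cosine(f, caption_emb) for f in frame_embs]

    # Assumed signal 2: how well each frame represents the other frames.
    temporal = [
        sum(cosine(f, g) for j, g in enumerate(frame_embs) if j != i) / (n - 1)
        for i, f in enumerate(frame_embs)
    ]

    # Assumed signal 3: drop in video-caption similarity when frame i is removed.
    full = cosine(mean_vec(frame_embs), caption_emb)
    drop = [
        full - cosine(mean_vec(frame_embs[:i] + frame_embs[i + 1:]), caption_emb)
        for i in range(n)
    ]

    # Normalize each signal, then take a weighted sum (weights are an assumption).
    signals = [min_max(s) for s in (semantic, temporal, drop)]
    return [sum(w * s[i] for w, s in zip(weights, signals)) for i in range(n)]


# Toy usage: two frames aligned with the caption, one unrelated frame.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
caption = [1.0, 0.0]
scores = keyscore_sketch(frames, caption)
```

In this toy input the third frame is orthogonal to the caption and receives the lowest score on all three assumed signals, so it would be the first candidate for removal. In practice the embeddings would come from a video-language encoder, which is outside the scope of this sketch.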
