Abstract
With video exploding across social media, surveillance, and education, compressing long footage into concise yet faithful surrogates is crucial. Supervised methods learn frame/shot importance from dense labels and excel in-domain, but are costly and brittle across datasets; unsupervised methods avoid labels but often miss high-level semantics and narrative cues. Recent zero-shot pipelines use LLMs for training-free summarization, yet remain sensitive to handcrafted prompts and dataset-specific normalization. We propose a rubric-guided, pseudo-labeled prompting framework. A small subset of human annotations is converted into high-confidence pseudo labels and aggregated into structured, dataset-adaptive scoring rubrics for interpretable scene evaluation. At inference, boundary scenes (first/last) are scored from their own descriptions, while intermediate scenes include brief summaries of adjacent segments to assess progression and redundancy, enabling the LLM to balance local salience with global coherence without parameter tuning. Across three benchmarks, our method is consistently effective. On SumMe and TVSum it achieves F1 of 57.58 and 63.05, surpassing a zero-shot baseline (56.73, 62.21) by +0.85 and +0.84 and approaching supervised performance. On the query-focused QFVS benchmark it attains 53.79 F1, beating 53.42 by +0.37 and remaining stable across validation videos. These results show that rubric-guided pseudo labeling, coupled with contextual prompting, stabilizes LLM-based scoring and yields a general, interpretable zero-shot paradigm for both generic and query-focused video summarization.
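The contextual prompting step described above can be illustrated with a minimal sketch. It assumes each scene already has a text description and a brief summary, and that the rubric is available as plain text. The helper name `build_scene_prompt`, its parameters, and the prompt wording are all hypothetical and are not taken from the paper; the sketch only shows the boundary versus intermediate distinction.

```python
from typing import List, Optional


def build_scene_prompt(
    rubric: str,
    descriptions: List[str],
    summaries: List[str],
    index: int,
    query: Optional[str] = None,
) -> str:
    """Assemble a scoring prompt for the scene at `index` (hypothetical helper)."""
    if len(descriptions) != len(summaries):
        raise ValueError("descriptions and summaries must have the same length")
    if not 0 <= index < len(descriptions):
        raise IndexError("scene index out of range")

    parts = [f"Scoring rubric:\n{rubric}"]
    # Optional query for the query-focused setting (assumed prompt wording).
    if query is not None:
        parts.append(f"User query: {query}")

    last = len(descriptions) - 1
    is_intermediate = 0 < index < last

    # Boundary scenes (first/last) are scored from their own description only;
    # intermediate scenes also see brief summaries of both neighbours.
    if is_intermediate:
        parts.append(f"Previous scene summary: {summaries[index - 1]}")
    parts.append(f"Current scene description: {descriptions[index]}")
    if is_intermediate:
        parts.append(f"Next scene summary: {summaries[index + 1]}")

    parts.append("Return a single importance score that follows the rubric.")
    return "\n\n".join(parts)
```

The resulting string would be sent to whichever LLM is used for scoring; the model call and the score parsing are left out because the abstract does not specify them.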