Beyond Coarse-Grained Matching in Video-Text Retrieval

Abstract

Video-text retrieval has seen significant advancements, yet the ability of models to discern subtle differences in captions still requires verification. In this paper, we introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions with subtle single-word variations across nouns, verbs, adjectives, adverbs, and prepositions. We perform comprehensive experiments using four state-of-the-art models across two standard benchmarks (MSR-VTT and VATEX) and two specially curated datasets enriched with detailed descriptions (VLN-UVO and VLN-OOPS), resulting in a number of novel insights: 1) our analyses show that the current evaluation benchmarks fall short in detecting a model's ability to perceive subtle single-word differences, 2) our fine-grained evaluation highlights the difficulty models face in distinguishing such subtle variations. To enhance fine-grained understanding, we propose a new baseline that can be easily combined with current methods. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
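To make the idea of single-word hard negatives concrete, the following is a minimal sketch and not the paper's actual generation procedure, which the abstract does not describe. It assumes a small hand-written substitution table (`SWAPS`), whitespace tokenization, and a caller-supplied `score(video, caption)` similarity function; all of these names are hypothetical and introduced only for illustration.

```python
# Hypothetical substitution table: part of speech -> {word: replacement}.
# A real pipeline would need a tagger and a larger vocabulary; this is an assumption.
SWAPS = {
    "noun": {"dog": "cat"},
    "verb": {"opens": "closes"},
    "adjective": {"red": "blue"},
    "adverb": {"slowly": "quickly"},
    "preposition": {"into": "onto"},
}


def hard_negatives(caption):
    """Return (part_of_speech, negative) pairs; each negative differs from
    the caption in exactly one word."""
    words = caption.split()
    negatives = []
    for i, word in enumerate(words):
        for pos, table in SWAPS.items():
            if word in table:
                changed = words[:i] + [table[word]] + words[i + 1:]
                negatives.append((pos, " ".join(changed)))
    return negatives


def prefers_original(score, video, caption):
    """True if the original caption scores strictly higher than every
    single-word hard negative under the given similarity function."""
    original = score(video, caption)
    return all(original > score(video, neg) for _, neg in hard_negatives(caption))


if __name__ == "__main__":
    for pos, neg in hard_negatives("a red dog slowly walks into a room"):
        print(pos, "|", neg)
```

Under these assumptions, a model passes the check for one video only if it ranks the true caption above every one-word variant, which is a stricter condition than retrieving the caption from a pool of unrelated captions.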
