Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives

  • 2025-08-20 16:03:56
  • Haoyu Zhao, Jiaxi Gu, Shicong Wang, Xing Zhang, Hang Xu, Zuxuan Wu, Yu-Gang Jiang

Abstract

The explosive growth of video streaming presents challenges in achieving high accuracy and low training costs for video-language retrieval. However, existing methods rely on large-scale pre-training to improve video retrieval performance, resulting in significant computational demands. Additionally, the fine-grained information in videos and texts remains underexplored. To alleviate these problems, we propose a novel framework to learn fine-grained features for better alignment and introduce an inference pipeline to improve performance without additional training. Specifically, we employ coarse-to-fine objectives to understand the semantic information of video-text pairs, including contrastive and matching learning. The fine-grained data used for training is obtained through the Granularity-Aware Representation module, which is designed based on similarity analysis between video frames and words in captions. Furthermore, we observe that the repetition of keywords in the original captions, referred to as "Repetition", can enhance retrieval performance and improve alignment between video and text. Based on this insight, we propose a novel and effective inference pipeline that incorporates a voting mechanism and a new Matching Entropy metric to achieve better retrieval performance without requiring additional pre-training. Experimental results on four benchmarks demonstrate that the proposed method outperforms previous approaches. Additionally, our inference pipeline achieves significant performance improvements, with a 2.1% increase in Recall@1 on the MSR-VTT dataset and a 1.6% increase on the DiDeMo dataset.

 
