Abstract
We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Webpage: https://wolfv0.github.io/.