AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Abstract

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, closed-set queries, a lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve models' counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.
