Abstract
We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in various physical and procedural domains. Dataset and code are available at https://texaser.github.io/exact_project_page/