MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Abstract

As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in daily life is increasingly important. However, current works in AI social reasoning all rely on language-only or language-dominant approaches to benchmark and train models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions: mime videos. Mime is the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing 8 hours of video clips from YouTube and developing a comprehensive video question-answering benchmark comprising 806 carefully annotated and verified question-answer pairs, designed to probe nonverbal social reasoning capabilities. Using MimeQA, we evaluate state-of-the-art video large language models (vLLMs) and find that they achieve low overall accuracy, ranging from 20% to 30%, while humans score 86%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. We hope to inspire future work on AI models that embody true social intelligence and are capable of interpreting nonverbal human interactions.
