BQA: Body Language Question Answering Dataset for Video Large Language Models

Abstract

A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, so interpreting it requires complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as unconscious human actions can easily cause a model to misinterpret a person's intent. To address this, we propose BQA, a body language question answering dataset, to validate whether a model can correctly interpret emotions from short video clips of body language covering 26 emotion labels. We evaluated various VideoLLMs on BQA and found that understanding body language is challenging. Our analysis of the VideoLLMs' wrong answers shows that certain VideoLLMs gave significantly biased answers depending on the age group and ethnicity of the individuals in the video. The dataset is available.
