BloomVQA: Assessing Hierarchical Multi-modal Comprehension

Abstract

We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of large vision-language models on comprehension tasks. Unlike current benchmarks that often focus on fact-based memorization and simple reasoning tasks without theoretical grounding, we collect multiple-choice samples based on picture stories that reflect different levels of comprehension, as laid out in Bloom's Taxonomy, a classic framework for learning assessment widely adopted in education research. Our data maps to a novel hierarchical graph representation which enables automatic data augmentation and novel measures characterizing model consistency. We perform graded evaluation and reliability analysis on recent multi-modal models. In comparison to low-level tasks, we observe decreased performance on tasks requiring advanced comprehension and cognitive skills, with up to a 38.0% drop in VQA accuracy. In comparison to earlier models, GPT-4V demonstrates improved accuracy over all comprehension levels and shows a tendency to bypass visual inputs, especially for higher-level tasks. Current models also show consistency patterns misaligned with human comprehension in various scenarios, demonstrating the need for improvement based on theoretically grounded criteria.
