ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

  • 2025-05-19 18:59:27
  • Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett

Abstract

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks, where frontier models perform similarly and near saturation, our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

 
