Abstract
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.