Abstract
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/