Abstract
Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating. We then propose a comprehensive benchmark, LongDocURL, which integrates these three primary tasks and comprises 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly exceeding the scale of existing benchmarks. Finally, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.