LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

  • 2024-12-27 08:33:31
  • Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu

Abstract

Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly exceeding existing benchmarks in scale. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

 
