Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Abstract

Visual Document Understanding has become essential with the growth of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short of providing the detailed contextual information needed for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.
