Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

Abstract

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of images. Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M training instances, all of which are fully open-sourced, demonstrating both high efficiency and effectiveness compared to models trained on large-scale in-house data. Our code and data are available at https://github.com/tencent-ailab/Leopard.
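The abstract does not spell out how the adaptive encoding module allocates visual sequence length, so the following is only an illustrative sketch of the general idea: split each image into fixed-size tiles according to its own resolution and aspect ratio, then shrink the largest grids until the total number of tiles fits a shared budget. The function name `allocate_tiles`, the tile size of 336 pixels, the budget of 16 tiles, and the greedy shrinking rule are all assumptions made for this example and are not taken from the Leopard implementation.

```python
import math


def allocate_tiles(sizes, tile=336, budget=16):
    """Assign a (cols, rows) tile grid to each image under a shared tile budget.

    sizes:  list of (width, height) in pixels, one entry per input image.
    tile:   side length of one square tile (hypothetical value).
    budget: maximum total number of tiles across all images (hypothetical value).
    """
    # Start from the grid that covers each image at its native resolution.
    grids = [
        [max(1, math.ceil(w / tile)), max(1, math.ceil(h / tile))]
        for w, h in sizes
    ]

    # Greedily shrink the image that currently uses the most tiles.
    while sum(c * r for c, r in grids) > budget:
        i = max(range(len(grids)), key=lambda k: grids[k][0] * grids[k][1])
        cols, rows = grids[i]
        if cols * rows == 1:
            break  # every image is already down to a single tile
        # Trim the longer side so the grid stays close to the aspect ratio.
        if cols >= rows:
            grids[i][0] -= 1
        else:
            grids[i][1] -= 1

    return [tuple(g) for g in grids]


if __name__ == "__main__":
    # A wide slide and a small square figure sharing a budget of 3 tiles.
    print(allocate_tiles([(1344, 336), (336, 336)], budget=3))
```

In this sketch, a wide image keeps a wide grid and a small image is never reduced below one tile, so the total visual sequence length (tiles multiplied by a fixed number of tokens per tile) stays bounded whenever the budget is at least the number of images.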
