Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Abstract

The interaction between humans and artificial intelligence (AI) is a crucial factor that reflects the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, thereby constraining their flexibility in usage and depth of response. In this paper, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose SPHINX-V, a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM for various visual prompts (points, bounding boxes, and free-form shape) and language understanding. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench. MDVP-Data features a multi-domain dataset containing 1.6M unique image-visual prompt-text instruction-following samples, including natural images, document images, OCR images, mobile screenshots, web screenshots, and multi-panel images. Furthermore, we present MDVP-Bench, a comprehensive and challenging benchmark to assess a model's capability in understanding visual prompting instructions. Our experiments demonstrate SPHINX-V's impressive multimodal interaction capabilities through visual prompting, revealing significant improvements in detailed pixel-level description and question-answering abilities.
