Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.
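The marking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the region masks have already been produced by a segmentation model and are available as boolean NumPy arrays, and the function names `assign_marks` and `overlay_masks`, the largest-region-first ordering, and the centroid-snapping placement rule are all hypothetical choices made for illustration.

```python
import numpy as np


def assign_marks(masks):
    """Return (label, (row, col)) pairs, one per non-empty boolean mask.

    Assumption: labels are numbered from 1 in order of decreasing region
    area. The abstract does not specify an ordering or placement rule.
    """
    order = sorted(range(len(masks)), key=lambda i: -int(masks[i].sum()))
    marks = []
    for i in order:
        rows, cols = np.nonzero(masks[i])
        if rows.size == 0:
            continue  # skip empty regions
        # The centroid of a non-convex region can fall outside it, so snap
        # to the region pixel closest to the centroid.
        cy, cx = rows.mean(), cols.mean()
        nearest = np.argmin((rows - cy) ** 2 + (cols - cx) ** 2)
        position = (int(rows[nearest]), int(cols[nearest]))
        marks.append((len(marks) + 1, position))
    return marks


def overlay_masks(image, masks, alpha=0.5, seed=0):
    """Blend a random color into each masked region of an RGB uint8 image."""
    rng = np.random.default_rng(seed)
    out = image.astype(np.float64)  # astype returns a copy
    for mask in masks:
        color = rng.integers(0, 256, size=3)
        out[mask] = (1 - alpha) * out[mask] + alpha * color
    return out.astype(np.uint8)


# Toy example: a 4x6 image with one large and one small region.
image = np.zeros((4, 6, 3), dtype=np.uint8)
large = np.zeros((4, 6), dtype=bool)
large[:, 0:3] = True
small = np.zeros((4, 6), dtype=bool)
small[0:2, 4:6] = True

marks = assign_marks([small, large])
marked_image = overlay_masks(image, [small, large])
```

In a full pipeline, each numeric label would then be drawn at its position with an image library, and the marked image passed to the multimodal model together with the question, so that answers can refer to regions by their marks.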
