GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Abstract

Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces a reference to the region-of-interest (RoI) in the instruction. Before being sent to the LLM, the reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model, GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawn bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, action, etc. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching the human-level performance of 85.0%. The code and model can be found at https://github.com/jshilong/GPT4RoI.
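
The replacement step described in the abstract, where a region reference in the instruction is swapped for RoI features and interleaved with language embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the placeholder name `<region1>`, the toy `embed_token` function, the 4-dimensional vectors, and the `interleave` helper are all assumptions chosen for illustration. In a real model the RoI features would presumably come from a visual encoder and the token embeddings from the LLM's embedding table.

```python
from typing import Dict, List

EMBED_DIM = 4  # toy size (assumption); real models use far larger dimensions


def embed_token(token: str) -> List[float]:
    """Toy stand-in for a language embedding table (hypothetical)."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(EMBED_DIM)]


def interleave(
    tokens: List[str], roi_features: Dict[str, List[float]]
) -> List[List[float]]:
    """Build the input sequence: region placeholders are replaced by their
    RoI feature vectors, all other tokens by their language embeddings."""
    sequence = []
    for token in tokens:
        if token in roi_features:
            # Region reference: substitute the RoI feature vector.
            sequence.append(list(roi_features[token]))
        else:
            # Ordinary word: use its language embedding.
            sequence.append(embed_token(token))
    return sequence


# Hypothetical instruction that refers to one user-drawn box.
tokens = ["What", "is", "<region1>", "holding", "?"]
roi_features = {"<region1>": [0.1, 0.2, 0.3, 0.4]}  # made-up feature values
sequence = interleave(tokens, roi_features)
```

The resulting `sequence` has one vector per instruction token, with the third position holding the region feature rather than a word embedding, so the language model would see text and region information in a single ordered sequence.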
