How Well Can Vision Language Models See Image Details?

Abstract

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore "How Well Can Vision Language Models See Image Details?" and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks requiring detailed image perception, such as referring image segmentation (with an average +10.19 cIoU improvement) and video game decision making (with average score improvements of +80.34 and +70.54 on two games, respectively).
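
To make the pixel value prediction idea concrete, the following is a minimal sketch of how one training sample for such a task might be built. The abstract does not specify the prompt wording, the coordinate convention, or the answer format, so the question template, the `make_pvp_sample` and `mean_abs_error` helpers, and the nested-list image representation are all hypothetical choices made for illustration only.

```python
import random


def make_pvp_sample(image, rng=None):
    """Build one hypothetical pixel-value-prediction question/answer pair.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    The prompt and answer formats are assumptions, not taken from the paper.
    """
    rng = rng or random.Random()
    height = len(image)
    width = len(image[0])
    # Pick a random pixel location inside the image.
    y = rng.randrange(height)
    x = rng.randrange(width)
    r, g, b = image[y][x]
    question = f"What is the RGB value of the pixel at (x={x}, y={y})?"
    answer = f"[{r}, {g}, {b}]"
    return {"x": x, "y": y, "question": question, "answer": answer}


def mean_abs_error(predicted, target):
    """Mean absolute error between two equal-length sequences of channel values."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    # A tiny 2x2 image: black, red / green, blue.
    image = [[(0, 0, 0), (255, 0, 0)], [(0, 255, 0), (0, 0, 255)]]
    sample = make_pvp_sample(image, random.Random(0))
    print(sample["question"])
    print(sample["answer"])
```

A per-channel error such as `mean_abs_error` is one simple way to score how close a predicted pixel is to the ground truth; the metric the authors actually use is not stated in the abstract.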
