MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
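
To make the idea of attention-guided visual intervention concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a 2D attention map over image patches has already been extracted from an MLLM (how to obtain it is model-specific and not shown), and it picks the fixed-size window of patches with the highest total attention as a crop region. The function name `attention_crop_box`, the `window` parameter, and the fixed-window selection rule are all hypothetical choices made for illustration.

```python
import numpy as np


def attention_crop_box(attn, image_size, window=(2, 2)):
    """Return a (left, top, right, bottom) pixel box around the window of
    patches whose summed attention is largest.

    attn       : 2D array-like of per-patch attention, shape (rows, cols).
                 Assumed to be precomputed from the model (hypothetical input).
    image_size : (width, height) of the original image in pixels.
    window     : (height, width) of the crop window, measured in patches.
    """
    attn = np.asarray(attn, dtype=float)
    rows, cols = attn.shape
    win_h, win_w = window
    if not (1 <= win_h <= rows and 1 <= win_w <= cols):
        raise ValueError("window must fit inside the attention map")
    width, height = image_size

    # Integral image: integral[i, j] holds the sum of attn[:i, :j].
    integral = np.zeros((rows + 1, cols + 1))
    integral[1:, 1:] = attn.cumsum(axis=0).cumsum(axis=1)

    # sums[i, j] is the total attention in attn[i:i+win_h, j:j+win_w].
    sums = (
        integral[win_h:, win_w:]
        - integral[:-win_h, win_w:]
        - integral[win_h:, :-win_w]
        + integral[:-win_h, :-win_w]
    )
    top, left = np.unravel_index(np.argmax(sums), sums.shape)

    # Convert patch indices to pixel coordinates.
    patch_w = width / cols
    patch_h = height / rows
    return (
        int(round(left * patch_w)),
        int(round(top * patch_h)),
        int(round((left + win_w) * patch_w)),
        int(round((top + win_h) * patch_h)),
    )
```

Under these assumptions, the returned box could be used to crop the image, and the crop could then be given to the model alongside the original image so that the small subject occupies more of the visual input. The abstract also mentions gradient maps as a source of localization; the same window selection would apply to any 2D importance map.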
