Vision language models have difficulty recognizing virtual objects

Abstract

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects (objects that are not visually represented in an image) can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.
