Vision language models are unreliable at trivial spatial cognition

Abstract

Vision language models (VLMs) are designed to extract relevant visuospatial information from images. Some research suggests that VLMs can exhibit humanlike scene understanding, while other investigations reveal difficulties in their ability to process relational information. To achieve widespread applicability, VLMs must perform reliably, yielding comparable competence across a wide variety of related tasks. We sought to test how reliable these architectures are at engaging in trivial spatial cognition, e.g., recognizing whether one object is left of another in an uncluttered scene. We developed a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and used it to evaluate state-of-the-art VLMs. Results show that performance could be degraded by minor variations of prompts that use logically equivalent descriptions. These analyses suggest limitations in how VLMs may reason about spatial relations in real-world applications. They also reveal novel opportunities for bolstering image caption corpora for more efficient training and testing.
