Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models

Abstract

Recent Vision-Language Models (VLMs) have demonstrated impressive multimodal comprehension and reasoning capabilities, yet they often struggle with trivially simple visual tasks. In this work, we focus on the domain of basic 2D Euclidean geometry and systematically categorize the fundamental, indivisible visual perception skills, which we refer to as atomic visual skills. We then introduce the Atomic Visual Skills Dataset (AVSD) for evaluating VLMs on the atomic visual skills. Using AVSD, we benchmark state-of-the-art VLMs and find that they struggle with these tasks, even though the tasks are trivial for adult humans. Our findings highlight the need for purpose-built datasets to train and evaluate VLMs on atomic, rather than composite, visual perception tasks.
