Analyzing the Roles of Language and Vision in Learning from Limited Data

Abstract

Does language help make sense of the visual world? How important is it to actually see the world rather than having it described with words? These basic questions about the nature of intelligence have been difficult to answer because we only had one example of an intelligent system -- humans -- and limited access to cases that isolated language or vision. However, the development of sophisticated Vision-Language Models (VLMs) by artificial intelligence researchers offers us new opportunities to explore the contributions that language and vision make to learning about the world. We ablate components from the cognitive architecture of these models to identify their contributions to learning new tasks from limited data. We find that a language model leveraging all components recovers a majority of a VLM's performance, despite its lack of visual input, and that language seems to allow this by providing access to prior knowledge and reasoning.
