Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

Abstract

Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input -- but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model's output by providing a means of tracing the output back through the verbalised image content.
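
The pipeline described above (verbalise the image, then prompt a language-only model with a few solved examples) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `VerbalisedExample`, `format_example`, and `build_few_shot_prompt`, the `Image:` / `Question:` / `Answer:` layout, and the example strings are hypothetical and are not taken from the paper. The image descriptions stand in for the output of whatever verbalisation model is used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerbalisedExample:
    """One task instance whose image has already been converted to text."""
    image_description: str        # assumed output of a separate verbalisation model
    question: str
    answer: Optional[str] = None  # None marks the query the model should complete


def format_example(ex: VerbalisedExample) -> str:
    """Render one instance as plain text; the layout is a hypothetical choice."""
    lines = [f"Image: {ex.image_description}", f"Question: {ex.question}"]
    # Solved examples carry their answer; the query ends with an open slot.
    lines.append(f"Answer: {ex.answer}" if ex.answer is not None else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(
    instruction: str,
    shots: List[VerbalisedExample],
    query: VerbalisedExample,
) -> str:
    """Concatenate instruction, solved examples, and the open query."""
    blocks = [instruction]
    blocks += [format_example(s) for s in shots]
    blocks.append(format_example(query))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    shots = [
        VerbalisedExample("a dog lying on a sofa", "What animal is shown?", "dog"),
    ]
    query = VerbalisedExample("two cats on a windowsill", "How many animals are there?")
    print(build_few_shot_prompt(
        "Answer the question using the image description.", shots, query
    ))
```

Because the prompt is plain text, the image description that the language model saw can be inspected directly, which is the kind of traceability the abstract refers to. The resulting string would be passed to a language model of choice; that call is deliberately left out of the sketch.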
