Revisiting the Role of Language Priors in Vision-Language Models

Abstract

Vision-language models (VLMs) are impactful in part because they can be applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We study $\textit{generative VLMs}$ that are trained for next-word generation given an image. We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks. Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image. We call this probabilistic score the $\textit{Visual Generative Pre-Training Score}$ (VisualGPTScore). While the VisualGPTScore produces near-perfect accuracy on some retrieval benchmarks, it yields poor accuracy on others. We analyze this behavior through a probabilistic lens, pointing out that some benchmarks inadvertently capture unnatural language distributions by creating adversarial but unlikely text captions. In fact, we demonstrate that even a "blind" language model that ignores any image evidence can sometimes outperform all prior art, reminiscent of similar challenges faced by the visual-question answering (VQA) community many years ago. We derive a probabilistic post-processing scheme that controls for the amount of linguistic bias in generative VLMs at test time without having to retrain or fine-tune the model. We show that the VisualGPTScore, when appropriately debiased, is a strong zero-shot baseline for vision-language understanding, oftentimes producing state-of-the-art accuracy.
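
The scoring idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the form used here is an assumption: the match score is the sum of per-token log-probabilities of a caption given the image, and debiasing subtracts a scaled log-probability of the caption under an image-free ("blind") language model. The function names, the `alpha` parameter, and all numbers are hypothetical and chosen only for illustration.

```python
def match_score(cond_logprobs):
    """Log-likelihood of a caption given an image.

    cond_logprobs: per-token log P(token | previous tokens, image),
    as a generative VLM would produce (values here are made up).
    """
    return sum(cond_logprobs)


def debiased_score(cond_logprobs, prior_logprobs, alpha):
    """Assumed debiasing form: log P(text | image) - alpha * log P(text).

    prior_logprobs: per-token log-probabilities from a "blind" model that
    sees no image. alpha = 0 leaves the raw match score unchanged.
    """
    return match_score(cond_logprobs) - alpha * sum(prior_logprobs)


def retrieve(candidates, alpha):
    """Return the caption with the highest (debiased) score.

    candidates: list of (caption, cond_logprobs, prior_logprobs) tuples.
    """
    best = max(candidates, key=lambda c: debiased_score(c[1], c[2], alpha))
    return best[0]


# Toy example with invented log-probabilities: the correct caption is
# linguistically unusual, while the wrong one is a very common phrase.
candidates = [
    ("correct but unusual caption", [-1.0, -2.0, -1.5], [-3.0, -4.0, -3.0]),
    ("wrong but common caption", [-1.0, -1.5, -1.5], [-1.0, -2.0, -2.0]),
]

raw_choice = retrieve(candidates, alpha=0.0)
debiased_choice = retrieve(candidates, alpha=1.0)
```

In this toy setting the raw score prefers the common caption, while the debiased score with `alpha=1.0` prefers the correct one. How much debiasing helps on a real benchmark depends on how that benchmark's captions were constructed, which is the point the abstract raises.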
