Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities

Abstract

We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels for representing concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open vocabulary, meaning they do not need a fixed set of labels, by using text embeddings to represent concepts. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection, and semantic segmentation? We evaluate different vision-language models with multiple datasets across a set of concepts and find that (i) all models evaluated show distinct performance differences based on the perceived gender of the person co-occurring with a given concept in the image, and aggregating analyses over all concepts can mask these concerns; (ii) model calibration (i.e., the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can also contribute to social biases in zero-shot vision settings. Furthermore, biases can propagate when foundational models like CLIP are used by other models to enable zero-shot capabilities.
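To make finding (ii) concrete, the following is a minimal sketch of how accuracy and calibration might be compared across groups for a single concept. It is not the paper's evaluation code. It assumes calibration is summarized by expected calibration error (ECE) with equal-width confidence bins, which the abstract does not specify, and the function names, group labels, and toy records are all hypothetical.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins.values():
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(accuracy - mean_conf)
    return ece


def per_group_report(records, n_bins=10):
    """records: iterable of (group, confidence, is_correct) tuples."""
    by_group = defaultdict(list)
    for group, conf, ok in records:
        by_group[group].append((conf, ok))
    report = {}
    for group, items in by_group.items():
        confs = [c for c, _ in items]
        oks = [ok for _, ok in items]
        report[group] = {
            "accuracy": sum(oks) / len(oks),
            "ece": expected_calibration_error(confs, oks, n_bins),
        }
    return report


# Toy, made-up predictions for one concept (assumption: not data from the paper).
# Both groups receive identical confidences but differ in how often the
# prediction is correct.
toy_records = [
    ("group_a", 0.95, True), ("group_a", 0.85, True),
    ("group_a", 0.75, True), ("group_a", 0.65, False),
    ("group_b", 0.95, True), ("group_b", 0.85, False),
    ("group_b", 0.75, False), ("group_b", 0.65, False),
]
report = per_group_report(toy_records)
```

Computing the report per concept, rather than pooling all concepts first, matters here: the abstract notes that aggregating over concepts can mask group-level differences.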
