Who's in and who's out? A case study of multimodal CLIP-filtering in DataComp

Abstract

As training datasets become increasingly drawn from unstructured, uncontrolled environments such as the web, researchers and industry practitioners have increasingly relied upon data filtering techniques to "filter out the noise" of web-scraped data. While datasets have been widely shown to reflect the biases and values of their creators, in this paper we contribute to an emerging body of research that assesses the filters used to create these datasets. We show that image-text data filtering also has biases and is value-laden, encoding specific notions of what is counted as "high-quality" data. In our work, we audit a standard approach of image-text CLIP-filtering on the academic benchmark DataComp's CommonPool by analyzing discrepancies of filtering through various annotation techniques across multiple modalities of image, text, and website source. We find that data relating to several imputed demographic groups -- such as LGBTQ+ people, older women, and younger men -- are associated with higher rates of exclusion. Moreover, we demonstrate cases of exclusion amplification: not only are certain marginalized groups already underrepresented in the unfiltered data, but CLIP-filtering excludes data from these groups at higher rates. The data-filtering step in the machine learning pipeline can therefore exacerbate representation disparities already present in the data-gathering step, especially when existing filters are designed to optimize a specifically-chosen downstream performance metric like zero-shot image classification accuracy. Finally, we show that the NSFW filter fails to remove sexually-explicit content from CommonPool, and that CLIP-filtering includes several categories of copyrighted content at high rates. Our conclusions point to a need for fundamental changes in dataset creation and filtering practices.
