Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Abstract

Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles, written by students from across the country, we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.
