Leveraging Large Language Models to Measure Gender Representation Bias in Gendered Language Corpora

Abstract

Gender bias in text corpora that are used for a variety of natural language processing (NLP) tasks, such as for training large language models (LLMs), can lead to the perpetuation and amplification of societal inequalities. This phenomenon is particularly pronounced in gendered languages like Spanish or French, where grammatical structures inherently encode gender, making the bias analysis more challenging. A first step in quantifying gender bias in text entails computing biases in gender representation, i.e., differences in the prevalence of words referring to males vs. females. Existing methods to measure gender representation bias in text corpora have mainly been proposed for English and do not generalize to gendered languages due to the intrinsic linguistic differences between English and gendered languages. This paper introduces a novel methodology that leverages the contextual understanding capabilities of LLMs to quantitatively measure gender representation bias in Spanish corpora. By utilizing LLMs to identify and classify gendered nouns and pronouns in relation to their reference to human entities, our approach provides a robust analysis of gender representation bias in gendered languages. We empirically validate our method on four widely-used benchmark datasets, uncovering significant gender prevalence disparities with a male-to-female ratio ranging from 4:1 to 6:1. These findings demonstrate the value of our methodology for bias quantification in gendered language corpora and suggest its application in NLP, contributing to the development of more equitable language technologies.
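
To make the male-to-female ratio concrete, the following minimal sketch shows how such a ratio could be computed once each gendered noun or pronoun has been labeled. It is an illustration under stated assumptions, not the paper's implementation: the `(gender, refers_to_person)` label format and the function name `male_to_female_ratio` are hypothetical, and the labels are assumed to come from an upstream LLM classification step that is not shown here.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical label format (an assumption, not taken from the paper):
# each gendered noun or pronoun is annotated as (gender, refers_to_person),
# where gender is "M" or "F" and refers_to_person says whether the word
# denotes a human entity rather than an object with grammatical gender.
Annotation = Tuple[str, bool]


def male_to_female_ratio(annotations: Iterable[Annotation]) -> float:
    """Return the ratio of male to female person-referring words."""
    # Count only words that refer to people; grammatical gender on
    # inanimate nouns (e.g. Spanish "la mesa") is ignored.
    counts = Counter(gender for gender, is_person in annotations if is_person)
    if counts["F"] == 0:
        raise ValueError("no female person-referring words; ratio is undefined")
    return counts["M"] / counts["F"]


if __name__ == "__main__":
    # Toy labels: four male and one female person reference, plus one
    # masculine noun that does not refer to a person.
    toy = [("M", True), ("M", True), ("F", True),
           ("M", False), ("M", True), ("M", True)]
    print(male_to_female_ratio(toy))  # 4.0, i.e. a 4:1 ratio
```

Filtering on the person-reference flag is the step that separates this kind of count from a plain tally of grammatical gender, which would be dominated by inanimate nouns in a language like Spanish.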
