Local Contrastive Editing of Gender Stereotypes

Abstract

Stereotypical bias encoded in language models (LMs) poses a threat to safe language technology, yet our understanding of how bias manifests in the parameters of LMs remains incomplete. We introduce local contrastive editing that enables the localization and editing of a subset of weights in a target model in relation to a reference model. We deploy this approach to identify and modify subsets of weights that are associated with gender stereotypes in LMs. Through a series of experiments, we demonstrate that local contrastive editing can precisely localize and control a small subset (< 0.5%) of weights that encode gender bias. Our work (i) advances our understanding of how stereotypical biases can manifest in the parameter space of LMs and (ii) opens up new avenues for developing parameter-efficient strategies for controlling model properties in a contrastive manner.
