BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Abstract

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a debiasing loss that guides the editor networks to conduct local edits on a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.
