BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during token-based fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being 'reckless drivers') and in probing fictional associations (e.g., people from a fictional country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
