Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Abstract

Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment, because of the sharing of important components with these tasks.
