Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs

Abstract

Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs -- so-called "circuits" -- which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). All in all, we gather these techniques as a uniform holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
