Pruning Strategies for Backdoor Defense in LLMs

Abstract

Backdoor attacks are a significant threat to the performance and integrity of pre-trained language models. Although such models are routinely fine-tuned for downstream NLP tasks, recent work shows they remain vulnerable to backdoor attacks that survive vanilla fine-tuning. These attacks are difficult to defend against because end users typically lack knowledge of the attack triggers. The triggers are stealthy and are introduced through subtle syntactic or stylistic manipulations, which can bypass traditional detection and persist in the model, making post-hoc purification essential. In this study, we explore whether attention-head pruning can mitigate these threats without any knowledge of the trigger or access to a clean reference model. To this end, we design and implement six pruning-based strategies: (i) gradient-based pruning, (ii) layer-wise variance pruning, (iii) gradient-based pruning with structured L1/L2 sparsification, (iv) randomized ensemble pruning, (v) reinforcement-learning-guided pruning, and (vi) Bayesian uncertainty pruning. Each method iteratively removes the least informative heads while monitoring validation accuracy to avoid over-pruning. Experimental evaluation shows that gradient-based pruning performs best against syntactic triggers, whereas reinforcement-learning-guided and Bayesian pruning better withstand stylistic attacks.
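
The loop shared by all six strategies (remove the least informative heads, check validation accuracy, stop before over-pruning) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `iterative_head_pruning`, the `importance` dictionary, the `evaluate` callback, the `max_accuracy_drop` threshold, and the toy scores are all hypothetical. Each strategy would supply its own importance scores (gradient magnitude, layer-wise variance, and so on), which this sketch does not compute.

```python
from typing import Callable, Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def iterative_head_pruning(
    importance: Dict[Head, float],
    evaluate: Callable[[Set[Head]], float],
    max_accuracy_drop: float = 0.02,
    heads_per_step: int = 1,
) -> Set[Head]:
    """Greedily prune the lowest-scoring heads.

    Stops as soon as validation accuracy falls more than
    `max_accuracy_drop` below the unpruned baseline (an assumed
    stopping rule; the paper only says accuracy is monitored).
    """
    baseline = evaluate(set())  # accuracy with no heads pruned
    pruned: Set[Head] = set()
    # Rank heads from least to most informative.
    ranked = sorted(importance, key=lambda head: importance[head])
    for start in range(0, len(ranked), heads_per_step):
        candidate = pruned | set(ranked[start:start + heads_per_step])
        if baseline - evaluate(candidate) > max_accuracy_drop:
            break  # this step would over-prune; keep the previous mask
        pruned = candidate
    return pruned


# Toy stand-ins (hypothetical): four heads and a fake validation
# accuracy that degrades in proportion to the importance removed.
toy_scores: Dict[Head, float] = {
    (0, 0): 0.1,
    (0, 1): 0.9,
    (1, 0): 0.2,
    (1, 1): 0.8,
}


def toy_evaluate(pruned: Set[Head]) -> float:
    return 0.9 - 0.05 * sum(toy_scores[head] for head in pruned)


if __name__ == "__main__":
    print(sorted(iterative_head_pruning(toy_scores, toy_evaluate)))
```

In a real setting, `evaluate` would apply the candidate head mask to the model and measure accuracy on a held-out clean validation set, and the scores would typically be recomputed after each pruning step rather than fixed up front as they are here.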
