Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets

Abstract

Recent advancements in large language models (LLMs) have underscored their vulnerability to safety alignment jailbreaks, particularly when subjected to downstream fine-tuning. However, existing mitigation strategies primarily focus on reactively addressing jailbreak incidents after safety guardrails have been compromised, removing harmful gradients during fine-tuning, or continuously reinforcing safety alignment throughout fine-tuning. As such, they tend to overlook a critical upstream factor: the role of the original safety-alignment data. This paper therefore investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. Our experiments demonstrate that high similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Conversely, low similarity between these two types of datasets yields substantially more robust models and thus reduces the harmfulness score by up to 10.33%. By highlighting the importance of upstream dataset design in building durable safety guardrails and reducing real-world vulnerability to jailbreak attacks, these findings offer actionable insights for fine-tuning service providers.
