Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Abstract

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
