Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models

Abstract

Safety alignment is key to guiding the behavior of large language models (LLMs) so that it is in line with human preferences and to restricting harmful behaviors at inference time, but recent studies show that it can be easily compromised by finetuning with only a few adversarially designed training examples. We aim to measure the risks in finetuning LLMs by navigating the LLM safety landscape. We discover a new phenomenon observed universally in the model parameter space of popular open-source LLMs, termed the "safety basin": randomly perturbing model weights maintains the safety level of the original aligned model in its local neighborhood. This discovery inspires us to propose the new VISAGE safety metric, which measures the safety in LLM finetuning by probing its safety landscape. Visualizing the safety landscape of the aligned model enables us to understand how finetuning compromises safety by dragging the model away from the safety basin. The LLM safety landscape also highlights the system prompt's critical role in protecting a model, and shows that such protection transfers to its perturbed variants within the safety basin. These observations from our safety landscape research provide new insights for future work in the LLM safety community.
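
The idea of probing a safety landscape by random weight perturbation can be illustrated with a minimal sketch. This is not the paper's implementation: the weights are a toy NumPy vector rather than an LLM, the direction normalization is an assumption, and `toy_safety` is a hypothetical stand-in for a real safety evaluation (for example, a refusal rate on harmful prompts). All function names below are illustrative.

```python
import numpy as np


def random_direction(weights, rng):
    """Sample a Gaussian direction rescaled to the norm of the weights.

    The rescaling is an assumed normalization chosen for this sketch.
    """
    d = rng.standard_normal(weights.shape)
    return d * (np.linalg.norm(weights) / np.linalg.norm(d))


def safety_landscape_1d(weights, safety_fn, alphas, rng):
    """Evaluate safety_fn at weights + alpha * d for one random direction d."""
    d = random_direction(weights, rng)
    return np.array([safety_fn(weights + a * d) for a in alphas])


def toy_safety(w, center, radius):
    """Hypothetical safety score: 1.0 inside a ball around `center`, else 0.0.

    This mimics a basin-shaped landscape; a real score would come from
    evaluating the perturbed model on a safety benchmark.
    """
    return 1.0 if np.linalg.norm(w - center) <= radius else 0.0


rng = np.random.default_rng(0)
aligned = rng.standard_normal(16)          # toy "aligned model" weights
alphas = np.linspace(-1.0, 1.0, 21)        # perturbation magnitudes
radius = 0.55 * np.linalg.norm(aligned)    # assumed basin width

scores = safety_landscape_1d(
    aligned, lambda w: toy_safety(w, aligned, radius), alphas, rng
)
for a, s in zip(alphas, scores):
    print(f"alpha={a:+.1f}  safety={s:.0f}")
```

In this toy setup the score stays flat near `alpha = 0` and drops once the perturbation leaves the assumed basin, which is the qualitative shape the abstract describes. With a real model, `safety_fn` would load the perturbed weights and run the safety evaluation, and finetuned checkpoints could be compared by where they fall relative to the flat region.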
