Jailbroken: How Does LLM Safety Training Fail?

Abstract

Large language models trained for safety and harmlessness remain susceptible to adversarial misuse, as evidenced by the prevalence of "jailbreak" attacks on early releases of ChatGPT that elicit undesired behavior. Going beyond recognition of the issue, we investigate why such attacks succeed and how they can be created. We hypothesize two failure modes of safety training: competing objectives and mismatched generalization. Competing objectives arise when a model's capabilities and safety goals conflict, while mismatched generalization occurs when safety training fails to generalize to a domain for which capabilities exist. We use these failure modes to guide jailbreak design and then evaluate state-of-the-art models, including OpenAI's GPT-4 and Anthropic's Claude v1.3, against both existing and newly designed attacks. We find that vulnerabilities persist despite the extensive red-teaming and safety-training efforts behind these models. Notably, new attacks utilizing our failure modes succeed on every prompt in a collection of unsafe requests from the models' red-teaming evaluation sets and outperform existing ad hoc jailbreaks. Our analysis emphasizes the need for safety-capability parity -- that safety mechanisms should be as sophisticated as the underlying model -- and argues against the idea that scaling alone can resolve these safety failure modes.
