A Realistic Threat Model for Large Language Model Jailbreaks

Abstract

A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g., those specific to code datasets.
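
To make the perplexity constraint concrete, the following is a minimal illustrative sketch of N-gram perplexity, not the implementation used in this work. It assumes a bigram model (N = 2), whitespace tokenization, add-one smoothing, and a toy corpus; the function names `train_bigram` and `bigram_perplexity` are hypothetical. It shows only the general idea that text built from rare or unseen N-grams receives a higher perplexity than natural text.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigrams and their left-hand contexts in a token list."""
    contexts = Counter(tokens[:-1])             # how often each token starts a bigram
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs
    vocab = set(tokens)
    return contexts, bigrams, vocab


def bigram_perplexity(tokens, contexts, bigrams, vocab_size):
    """Perplexity of `tokens` under an add-one-smoothed bigram model (an assumption)."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, cur in pairs:
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    # Perplexity is the exponential of the average negative log-probability.
    return math.exp(-log_prob / len(pairs))


# Toy corpus standing in for the large reference corpus (assumption).
corpus = "the cat sat on the mat and the dog sat on the rug".split()
contexts, bigrams, vocab = train_bigram(corpus)

natural = "the cat sat on the rug".split()
unnatural = "rug the on sat cat the".split()  # same words, only unseen bigrams

ppl_natural = bigram_perplexity(natural, contexts, bigrams, len(vocab))
ppl_unnatural = bigram_perplexity(unnatural, contexts, bigrams, len(vocab))
print(ppl_natural < ppl_unnatural)
```

Under a threat model of this kind, a candidate jailbreak would be admissible only if its perplexity stays below a chosen threshold; the threshold value and the smoothing scheme here are placeholders rather than the paper's settings.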
