Best-of-N Jailbreaking

Abstract

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
