Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Abstract

We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
