AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

Abstract

Safety alignment of Large Language Models (LLMs) can be compromised by manual jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block; adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we show that these solutions may be too optimistic. We propose an interpretable adversarial attack, AutoDAN, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks, and transfer better than their non-readable counterparts when using limited training data or a single proxy model. We also customize AutoDAN's objective to leak system prompts, another jailbreak application not addressed in the adversarial attack literature. Our work provides a new way to red-team LLMs and to understand the mechanism of jailbreak attacks.
