AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

  • 2023-10-23 18:46:07
  • Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun

Abstract

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block; adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we show that these solutions may be too optimistic. We propose an interpretable adversarial attack, AutoDAN, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks, and transfer better than their non-readable counterparts when using limited training data or a single proxy model. We also customize AutoDAN's objective to leak system prompts, another jailbreak application not addressed in the adversarial attack literature, demonstrating the versatility of the approach. Our work provides a new way to red-team LLMs and to understand the mechanism of jailbreak attacks.
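
For readers unfamiliar with the perplexity-based filters mentioned in the abstract, the following is a minimal sketch of the general idea, not the filter evaluated in the paper. It assumes per-token log-probabilities have already been obtained from some language model; the function names, the probability values, and the threshold of 100 are all hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs, threshold):
    """Accept a prompt only if its perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical log-probabilities (assumed values, not measured from any model).
# A readable prompt: every token is fairly predictable (p = 0.25 each).
readable = [math.log(0.25)] * 8
# A gibberish prompt: every token is highly surprising (p = 0.001 each).
gibberish = [math.log(0.001)] * 8

THRESHOLD = 100.0  # assumed cutoff, for illustration only

print(passes_filter(readable, THRESHOLD))   # readable prompt is accepted
print(passes_filter(gibberish, THRESHOLD))  # gibberish prompt is rejected
```

Under this sketch, a prompt made of low-probability tokens is rejected while a fluent one passes, which is why an attack that produces readable prompts can evade this kind of filter.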

 
