Abstract
This paper studies the vulnerabilities of transformer-based Large Language Models (LLMs) to jailbreaking attacks, focusing specifically on the optimization-based Greedy Coordinate Gradient (GCG) strategy. We first observe a positive correlation between the effectiveness of attacks and the internal behaviors of the models. For instance, attacks tend to be less effective when models pay more attention to system prompts designed to ensure LLM safety alignment. Building on this discovery, we introduce an enhanced method that manipulates models' attention scores to facilitate LLM jailbreaking, which we term AttnGCG. Empirically, AttnGCG shows consistent improvements in attack efficacy across diverse LLMs, achieving an average increase of ~7% in the Llama-2 series and ~10% in the Gemma series. Our strategy also demonstrates robust attack transferability against both unseen harmful goals and black-box LLMs like GPT-3.5 and GPT-4. Moreover, we note our attention-score visualization is more interpretable, allowing us to gain better insights into how our targeted attention manipulation facilitates more effective jailbreaking. We release the code at https://github.com/UCSC-VLAA/AttnGCG-attack.