Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Abstract

Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
