Gandalf the Red: Adaptive Security for LLMs

  • 2025-08-04 16:37:00
  • Niklas Pfister, Václav Volhejn, Manuel Knott, Santiago Arias, Julia Bazińska, Mykhailo Bichurin, Alan Commike, Janet Darling, Peter Dienes, Matthew Fiedler, David Haber, Matthias Kraft, Marco Lancini, Max Mathys, Damián Pascual-Ortiz, Jakub Podolak, Adrià Romero-López, Kyriacos Shiarlis, Andreas Signer, Zsolt Terek, Athanasios Theocharis, Daniel Timbrell, Samuel Trautwein, Samuel Watts, Yun-Han Wu, Mateo Rojas-Carulla

Abstract

Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and expresses the security-utility trade-off in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attacks. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications.

 
