AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Abstract

The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find that (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. We publicly release AgentHarm at https://huggingface.co/ai-safety-institute/AgentHarm to enable simple and reliable evaluation of attacks and defenses for LLM-based agents.
