AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

Abstract

We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, a structured workflow of autonomous agents (orchestrator, deflector, responder, and evaluator) collaborates to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling the agentic reasoning system at test time, both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy), substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with a false refusal rate of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at https://github.com/zikuicai/aegisllm
