Abstract
Despite extensive pre-training in moral alignment to prevent the generation of harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division of tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.