Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors

Abstract

Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intent. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3 mini as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.
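
The idea of cumulatively evaluating each subtask can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `SequentialMonitor`, `first_flagged_index`, and `toy_judge` are hypothetical, and the judge is a keyword stand-in for what would in practice be a call to a lightweight, prompt-engineered LLM. The sketch only shows the control flow: every new subtask is appended to the history, and the judge sees the whole history rather than the latest prompt alone.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical judge signature (an assumption, not from the paper):
# it receives every subtask seen so far and returns True when the
# accumulated sequence looks harmful.
Judge = Callable[[List[str]], bool]


@dataclass
class SequentialMonitor:
    """Keeps the running history and re-evaluates it on every new subtask."""
    judge: Judge
    history: List[str] = field(default_factory=list)

    def observe(self, subtask: str) -> bool:
        # Append first, so the judge evaluates the cumulative context.
        self.history.append(subtask)
        return self.judge(self.history)


def first_flagged_index(monitor: SequentialMonitor,
                        subtasks: List[str]) -> Optional[int]:
    """Return the index of the first subtask at which the monitor flags,
    or None if the whole sequence passes."""
    for i, subtask in enumerate(subtasks):
        if monitor.observe(subtask):
            return i
    return None


def toy_judge(history: List[str]) -> bool:
    # Placeholder for an LLM call: flags only once all three placeholder
    # fragments have appeared somewhere in the accumulated history.
    required = ("part-a", "part-b", "part-c")
    text = " ".join(history).lower()
    return all(fragment in text for fragment in required)


if __name__ == "__main__":
    subtasks = [
        "part-a of plan",
        "what is the weather",  # unrelated task injected as noise
        "part-b of plan",
        "part-c of plan",
    ]
    monitor = SequentialMonitor(judge=toy_judge)
    print(first_flagged_index(monitor, subtasks))
```

In this toy setup no single subtask triggers the judge on its own; the flag is raised only when the history as a whole is considered, which is the property a per-prompt check lacks.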
