Abstract
Current LLM safety defenses fail under decomposition attacks, in which a malicious goal is decomposed into benign subtasks that circumvent refusals. The challenge lies in existing shallow safety alignment techniques: they detect harm only in the immediate prompt and do not reason about long-range intent, leaving them blind to malicious intent that emerges over a sequence of seemingly benign instructions. We therefore propose adding an external monitor that observes the conversation at a higher granularity. To facilitate our study of monitoring decomposition attacks, we curate the largest and most diverse dataset to date, including question-answering, text-to-image, and agentic tasks. We verify our datasets by testing them on frontier LLMs and show an 87% attack success rate on average on GPT-4o. This confirms that decomposition attacks are broadly effective. Additionally, we find that random tasks can be injected into the decomposed subtasks to further obfuscate malicious intent. To defend in real time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each subtask. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models such as o3-mini used as a monitor. Moreover, it remains robust against random task injection and cuts cost by 90% and latency by 50%. Our findings suggest that lightweight sequential monitors are highly effective in mitigating decomposition attacks and are viable in deployment.
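To make the idea of cumulative evaluation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical judge function that receives the full subtask history rather than only the latest prompt; here the judge is a toy keyword check standing in for a prompted lightweight LLM, and the names `SequentialMonitor`, `toy_judge`, and `first_flag` are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical judge signature (an assumption, not from the paper): it sees
# the whole subtask history and returns True if the accumulated intent
# should be flagged.
Judge = Callable[[List[str]], bool]


@dataclass
class SequentialMonitor:
    """Keeps the conversation history and re-judges it after every subtask."""
    judge: Judge
    history: List[str] = field(default_factory=list)

    def observe(self, subtask: str) -> bool:
        # Cumulative evaluation: judge the history so far, not the prompt alone.
        self.history.append(subtask)
        return self.judge(self.history)


def toy_judge(history: List[str]) -> bool:
    # Placeholder for an LLM call. It flags only when three individually
    # harmless "parts" have all appeared somewhere in the conversation.
    text = " ".join(history).lower()
    return all(part in text for part in ("part a", "part b", "part c"))


def first_flag(monitor: SequentialMonitor, subtasks: List[str]) -> Optional[int]:
    """Return the index of the first subtask at which the monitor flags, if any."""
    for i, subtask in enumerate(subtasks):
        if monitor.observe(subtask):
            return i
    return None


subtasks = [
    "Explain part A.",
    "What is the weather today?",  # unrelated task injected as a distractor
    "Explain part B.",
    "Explain part C.",
]
flag_index = first_flag(SequentialMonitor(judge=toy_judge), subtasks)
flagged_in_isolation = any(toy_judge([s]) for s in subtasks)
```

In this toy setting no subtask is flagged when judged in isolation, while the cumulative monitor flags at the final subtask despite the injected unrelated task. In practice the judge would be replaced by a model call, and its prompt and decision threshold would determine the actual defense rate.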