Abstract
As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a substantial improvement in attack strength, reducing safety score from a baseline of 0.87 to 0.41 using our scaffold.