Improving Alignment and Robustness with Circuit Breakers

Abstract

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that uses "circuit breakers" to interrupt models as they respond with harmful outputs. Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards against harmful behavior and adversarial attacks.
