Distillation Robustifies Unlearning

Abstract

Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.
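
As a rough illustration of the "noise" and "distill on outputs" steps named above, the sketch below shows one way they could look on toy data. It is a minimal sketch under stated assumptions, not the method as implemented in the paper: the abstract does not specify the noising formula or the distillation loss, so the linear interpolation toward Gaussian noise, the KL objective, and the names `partially_noise`, `distill_loss`, `alpha`, and `scale` are all hypothetical choices made for illustration.

```python
import math
import random


def partially_noise(weights, alpha, rng, scale=1.0):
    """Interpolate each weight toward fresh Gaussian noise.

    ASSUMPTION: this interpolation form is illustrative only.
    alpha=0 keeps the weights unchanged; alpha=1 replaces them with noise,
    which corresponds to a randomly initialized student.
    """
    return [(1.0 - alpha) * w + alpha * rng.gauss(0.0, scale) for w in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) on output distributions.

    ASSUMPTION: KL divergence is a common distillation objective and is
    used here as a stand-in for whatever loss the authors use.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy usage: the "unlearned model" is just a short weight vector here.
rng = random.Random(0)
unlearned = [0.5, -1.2, 2.0]
student_init = partially_noise(unlearned, alpha=0.25, rng=rng)
```

In this sketch, `alpha` plays the role of the tunable knob mentioned in the abstract: a larger value moves the student closer to random initialization, which would require more distillation compute to recover the desired behaviors.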
