Abstract
A backdoor data poisoning attack is an adversarial attack wherein theattacker injects several watermarked, mislabeled training examples into atraining set. The watermark does not impact the test-time performance of themodel on typical data; however, the model reliably errs on watermarkedexamples. To gain a better foundational understanding of backdoor data poisoningattacks, we present a formal theoretical framework within which one can discussbackdoor data poisoning attacks for classification problems. We then use thisto analyze important statistical and computational issues surrounding theseattacks. On the statistical front, we identify a parameter we call the memorizationcapacity that captures the intrinsic vulnerability of a learning problem to abackdoor attack. This allows us to argue about the robustness of severalnatural learning problems to backdoor attacks. Our results favoring theattacker involve presenting explicit constructions of backdoor attacks, and ourrobustness results show that some natural problem settings cannot yieldsuccessful backdoor attacks. From a computational standpoint, we show that under certain assumptions,adversarial training can detect the presence of backdoors in a training set. Wethen show that under similar assumptions, two closely related problems we callbackdoor filtering and robust generalization are nearly equivalent. Thisimplies that it is both asymptotically necessary and sufficient to designalgorithms that can identify watermarked examples in the training set in orderto obtain a learning algorithm that both generalizes well to unseen data and isrobust to backdoors.