Abstract
As society grows more reliant on machine learning, ensuring the security ofmachine learning systems against sophisticated attacks becomes a pressingconcern. A recent result of Goldwasser, Kim, Vaikuntanathan, and Zamir (2022)shows that an adversary can plant undetectable backdoors in machine learningmodels, allowing the adversary to covertly control the model's behavior.Backdoors can be planted in such a way that the backdoored machine learningmodel is computationally indistinguishable from an honest model withoutbackdoors. In this paper, we present strategies for defending against backdoors in MLmodels, even if they are undetectable. The key observation is that it issometimes possible to provably mitigate or even remove backdoors withoutneeding to detect them, using techniques inspired by the notion of randomself-reducibility. This depends on properties of the ground-truth labels(chosen by nature), and not of the proposed ML model (which may be chosen by anattacker). We give formal definitions for secure backdoor mitigation, and proceed toshow two types of results. First, we show a "global mitigation" technique,which removes all backdoors from a machine learning model under the assumptionthat the ground-truth labels are close to a Fourier-heavy function. Second, weconsider distributions where the ground-truth labels are close to a linear orpolynomial function in $\mathbb{R}^n$. Here, we show "local mitigation"techniques, which remove backdoors with high probability for every inputs ofinterest, and are computationally cheaper than global mitigation. All of ourconstructions are black-box, so our techniques work without needing access tothe model's representation (i.e., its code or parameters). Along the way weprove a simple result for robust mean estimation.