Model Reconstruction from Model Explanations

Abstract

We show through theory and experiment that gradient-based explanations of a model quickly reveal the model itself. Our results speak to a tension between the desire to keep a proprietary model secret and the ability to offer model explanations. On the theoretical side, we give an algorithm that provably learns a two-layer ReLU network in a setting where the algorithm may query the gradient of the model with respect to chosen inputs. The number of queries is independent of the dimension and nearly optimal in its dependence on the model size. Of interest not only from a learning-theoretic perspective, this result highlights the power of gradients rather than labels as a learning primitive. Complementing our theory, we give effective heuristics for reconstructing models from gradient explanations that are orders of magnitude more query-efficient than reconstruction attacks relying on prediction interfaces.
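
To illustrate why gradient queries can leak so much, here is a minimal toy sketch. It is not the algorithm from the paper; it only relies on the standard fact that the input gradient of a two-layer ReLU network f(x) = sum_i a_i * relu(w_i . x + b_i) is piecewise constant, and that it jumps by plus or minus a_i * w_i whenever a hidden unit switches on or off. The network sizes, the search interval, and the names `grad_query` and `find_jumps` are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 20, 5                      # toy input dimension and hidden width (assumed)
W = rng.normal(size=(h, d))       # hidden weights, unknown to the attacker
b = rng.normal(size=h)            # hidden biases
a = rng.normal(size=h)            # output weights

def grad_query(x):
    """Gradient 'explanation' of f(x) = sum_i a_i * relu(w_i . x + b_i)."""
    active = (W @ x + b) > 0      # which hidden units are on at x
    return (a * active) @ W       # sum of a_i * w_i over active units

def find_jumps(u, v, lo, hi, tol=1e-9):
    """Bisect along x(t) = u + t*v and return the gradient jump at each kink."""
    g_lo = grad_query(u + lo * v)
    g_hi = grad_query(u + hi * v)
    if np.allclose(g_lo, g_hi):   # same linear region at both ends (generic weights)
        return []
    if hi - lo < tol:             # interval is tight around a single kink
        return [g_hi - g_lo]
    mid = 0.5 * (lo + hi)
    return find_jumps(u, v, lo, mid, tol) + find_jumps(u, v, mid, hi, tol)

# Probe one random line; each jump found equals +/- a_i * w_i for some unit i.
u, v = rng.normal(size=d), rng.normal(size=d)
jumps = find_jumps(u, v, -1e3, 1e3)
print(f"recovered {len(jumps)} of {h} hidden directions from one line")
```

Each recovered jump exposes a full d-dimensional weight direction, whereas a label or prediction query returns only a single number. In this sketch the number of gradient queries grows with the number of hidden units and the bisection depth, not with d, which is in the spirit of the dimension-independence claimed above, though the sketch comes with no guarantees of its own.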
