Efficiently Verifiable Proofs of Data Attribution

Abstract

Data attribution methods aim to answer useful counterfactual questions like"what would a ML model's prediction be if it were trained on a differentdataset?" However, estimation of data attribution models through techniqueslike empirical influence or "datamodeling" remains very computationallyexpensive. This causes a critical trust issue: if only a few computationallyrich parties can obtain data attributions, how can resource-constrained partiestrust that the provided attributions are indeed "good," especially when theyare used for important downstream applications (e.g., data pricing)? In thispaper, we address this trust issue by proposing an interactive verificationparadigm for data attribution. An untrusted and computationally powerful Proverlearns data attributions, and then engages in an interactive proof with aresource-constrained Verifier. Our main result is a protocol that providesformal completeness, soundness, and efficiency guarantees in the sense ofProbably-Approximately-Correct (PAC) verification. Specifically, if both Proverand Verifier follow the protocol, the Verifier accepts data attributions thatare {\epsilon}-close to the optimal data attributions (in terms of the MeanSquared Error) with probability 1-{\delta}. Conversely, if the Proverarbitrarily deviates from the protocol, even with infinite compute, then thisis detected (or it still yields data attributions to the Verifier) except withprobability {\delta}. Importantly, our protocol ensures the Verifier'sworkload, measured by the number of independent model retrainings it mustperform, scales only as O(1/{\epsilon}); i.e., independently of the datasetsize. At a technical level, our results apply to efficiently verifying anylinear function over the boolean hypercube computed by the Prover, making thembroadly applicable to various attribution tasks.

Quick Read (beta)

loading the full paper ...