Vamsa: Automated Provenance Tracking in Data Science Scripts

Abstract

There has recently been a lot of research in the areas of fairness, bias, and explainability of machine learning (ML) models, driven by the self-evident or regulatory requirements of various ML applications. We make the following observation: all of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1% and that its latency is on the order of milliseconds for average-size scripts. Drawing from our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.
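To make the ML provenance tracking problem concrete, the following is a minimal sketch of the kind of question being asked: given an unmodified Python script, which dataset columns feed the features and labels passed to a model's training call? This is an illustration only and is not Vamsa's implementation. The example script, the file name heart.csv, the column names, and the helper functions columns_assigned and fit_provenance are all hypothetical, and the analysis handles only direct string-literal column selections using Python's standard ast module.

```python
import ast

# Hypothetical data science script (an assumption for illustration).
# It is only parsed, never executed, so pandas/sklearn need not be installed.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("heart.csv")
X = df[["age", "chol"]]
y = df["target"]
model = LogisticRegression().fit(X, y)
'''


def columns_assigned(source):
    """Map each assigned variable to the string literals used in
    subscripts on the right-hand side of its assignment."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Assign)
                and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            cols = []
            for sub in ast.walk(node.value):
                if isinstance(sub, ast.Subscript):
                    for c in ast.walk(sub.slice):
                        if isinstance(c, ast.Constant) and isinstance(c.value, str):
                            cols.append(c.value)
            if cols:
                result[node.targets[0].id] = cols
    return result


def fit_provenance(source):
    """Find the first `.fit(a, b)` call and report the columns behind
    its first (features) and second (labels) arguments."""
    cols = columns_assigned(source)
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "fit"
                and len(node.args) >= 2):
            names = [a.id if isinstance(a, ast.Name) else None
                     for a in node.args[:2]]
            return {"features": cols.get(names[0], []),
                    "labels": cols.get(names[1], [])}
    return None


provenance = fit_provenance(SCRIPT)
print(provenance)
```

A purely syntactic pass like this breaks down quickly on real scripts (aliased variables, columns dropped rather than selected, transformations applied between loading and training), which hints at why capturing such information in Python is challenging.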
