Diffprivlib: The IBM Differential Privacy Library

  • 2019-07-04 15:08:38
  • Naoise Holohan, Stefano Braghin, Pól Mac Aonghusa, Killian Levacher

Abstract

Since its conception in 2006, differential privacy has emerged as the de-facto standard in data privacy, owing to its robust mathematical guarantees, generalised applicability and rich body of literature. Over the years, researchers have studied differential privacy and its applicability to an ever-widening field of topics. Mechanisms have been created to optimise the process of achieving differential privacy, for various data types and scenarios. Until this work however, all previous work on differential privacy has been conducted on an ad-hoc basis, without a single, unifying codebase to implement results. In this work, we present the IBM Differential Privacy Library, a general purpose, open source library for investigating, experimenting and developing differential privacy applications in the Python programming language. The library includes a host of mechanisms, the building blocks of differential privacy, alongside a number of applications to machine learning and other data analytics tasks. Simplicity and accessibility have been prioritised in developing the library, making it suitable to a wide audience of users, from those using the library for their first investigations in data privacy, to the privacy experts looking to contribute their own models and mechanisms for others to use.

 

A general purpose, open source Python library for differential privacy
Naoise Holohan (ORCID 0000-0003-2222-9394), Stefano Braghin, Pól Mac Aonghusa and Killian Levacher, IBM Research – Ireland

Differential privacy, Python, open source, machine learning, data analysis.

1. Introduction

For more than a decade, differential privacy has been a focal point for the research and development of new methods for data privacy. Buoyed by the general applicability of differential privacy, and the rigorous mathematical privacy guarantees it possesses, researchers have contributed to a comprehensive body of literature proposing new applications for differential privacy in a wide array of fields. In the early days of differential privacy, simple applications included privacy-preserving histograms (Dwo06), while in more recent times differential privacy has been applied to deep learning (ACG16). In between, seemingly countless adaptations of machine learning models have been presented that satisfy differential privacy, presenting a genuine opportunity to perform privacy-preserving exploration or analysis of data.

However, these applications of differential privacy currently exist in a fragmented development environment, encompassing different programming languages, diverse coding styles and incomplete, unmaintained, model-by-model repositories that make the implementation of differential privacy a difficult task, even for experts in the field. (Examples of valuable contributions to differential privacy include https://github.com/uber/sql-differential-privacy, https://github.com/brubinstein/diffpriv and https://gitlab.com/dp-stats/dp-stats/, but all use different programming languages.) With the work presented herein, we are bringing together a number of differential privacy applications and fundamentals that will be accessible and useful to all levels of expertise and knowledge on data privacy.

The IBM Differential Privacy Library, also known as diffprivlib, is written in Python 3, a popular programming language for machine learning and data analysis. Diffprivlib leverages the functionality and familiarity of the NumPy (WCV11) and Scikit-learn (PVG11) packages, meaning functions and models are instantly recognisable, with default parameters ensuring accessibility for all. Released under the MIT Open Source license, diffprivlib is free to use and modify, and contributions of its users are welcomed to help expand the functionality and features of the library.

Diffprivlib provides a large collection of mechanisms, the fundamental building blocks of differential privacy that handle the addition of noise. These mechanisms are used under the hood in the machine learning models and other tools to satisfy differential privacy, allowing the complex task of achieving differential privacy to be hidden from view. As with Scikit-learn, machine learning models can be trained in just two lines of code with diffprivlib: one import statement, and one line to create and fit the model (e.g., LogisticRegression().fit(X, y)).

In this paper we provide a brief overview of the library (Section 2), give details on the library’s main modules (Section 3) and provide a worked example (Section 4) to showcase its functionality.

2. Overview

Diffprivlib is written in Python, and requires Python 3.4 or later to run. The library is not compatible with Python 2, which is due to reach end-of-life in 2020. Python was selected due to its accessibility, wealth of machine learning capabilities, and active user community. The library is written to comply with the PEP8 coding standard (PEP8), as much as is practicable.

Diffprivlib can be installed using the pip command:

$ pip install diffprivlib

A major pillar of diffprivlib is accessibility to researchers and programmers of all levels of privacy expertise, from those investigating differential privacy for the first time, to the more advanced users looking to develop their own applications. To this end, we have integrated diffprivlib with the popular machine learning library Scikit-learn, allowing users to run machine learning instances in a near-identical way to how they do so already. Similarly, we have mirrored and leveraged the familiarity and functionality of NumPy in developing easy-to-use differential privacy tools.

The library consists of three main modules:

  • mechanisms – a collection of differential privacy mechanisms, the building blocks for developing differential privacy applications;

  • models – a collection of differentially private machine learning models;

  • tools – a collection of tools and utilities for simple data analytics with differential privacy.

Diffprivlib is hosted on GitHub (https://github.com/IBM/differential-privacy-library), and cloning the library from there provides users with additional resources. A tests/ directory includes unit tests for library functionality, and can be run using pytest from the root directory after installation. A collection of Jupyter notebook examples is also included in the notebooks/ directory that showcase the functionality of the library. Documentation is hosted on Read the Docs (https://diffprivlib.readthedocs.io/).

3. Library Contents

3.1. Mechanisms

The mechanisms that have been implemented in Version 0.1.1 of diffprivlib are listed in Table 1. These mechanisms have been built primarily for their inclusion in more complex applications and models, but can be executed and experimented with in isolation, as shown in Listing 1.

For simplicity, similar mechanisms are grouped together in the same sub-module (e.g., all Laplace-like mechanisms are contained in mechanisms.laplace), but all can be imported directly from mechanisms.

Mechanisms are interacted with using function-specific methods, reducing the need for code and documentation duplication. For example, each mechanism has a set_epsilon() (and/or set_epsilon_delta()) method to set the ϵ (and δ) parameters of the mechanism, and many mechanisms have a set_sensitivity() method. Similarly, each mechanism has a randomise() method, which takes an input value and returns a differentially private output value, assuming the mechanism has been correctly configured. A mechanism’s methods can be checked in the documentation (e.g., help(Binary)), or by using Python’s dir() function.

>>> from diffprivlib.mechanisms import Laplace
>>> laplace = Laplace()
>>> laplace.set_epsilon(0.5)
>>> laplace.set_sensitivity(1)
>>> laplace.randomise(3)
5.835104866820303
Listing 1: Example usage of mechanisms
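To make the role of ϵ and the sensitivity in Listing 1 more concrete, the following is a minimal, standard-library sketch of the textbook Laplace mechanism, in which noise of scale sensitivity/ϵ is added to the input. The function name laplace_randomise and its rng parameter are hypothetical and are not part of diffprivlib; the sketch is illustrative only and makes no claim to match the library's internal implementation or to be hardened against floating-point attacks.

```python
import random


def laplace_randomise(value, epsilon, sensitivity, rng=random):
    """Return `value` plus Laplace noise of scale sensitivity / epsilon.

    Hypothetical helper for illustration; not diffprivlib's implementation.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    if sensitivity == 0:
        # A query that never changes between neighbouring datasets needs no noise.
        return float(value)

    scale = sensitivity / epsilon
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centred at zero with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise


# Same configuration as Listing 1: epsilon = 0.5, sensitivity = 1, input 3.
print(laplace_randomise(3, epsilon=0.5, sensitivity=1))
```

A smaller ϵ (stronger privacy) or a larger sensitivity increases the noise scale, which is why the output in Listing 1 can land some distance from the input value of 3.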
Mechanism name             Notes
DPMechanism                Base class for all mechanisms
TruncationAndFoldingMixin  Mixin for truncating or folding numeric outputs of a mechanism
Binary                     (HLM17)
Exponential                (MT07)
ExponentialHierarchical    Exponential, with hierarchical utility function
Gaussian                   (DR14)
GaussianAnalytic           (BW18)
Geometric                  (GRS12)
GeometricTruncated         Geometric, with post-processing
GeometricFolded            Geometric, with post-processing and support for half-integer bounds
Laplace                    (DMN06), with support for δ>0 from (HLM15)
LaplaceTruncated           Laplace, with post-processing
LaplaceFolded              Laplace, with post-processing
LaplaceBoundedDomain       (HAB18)
LaplaceBoundedNoise        (GDG18)
Staircase                  (GKO15)
Uniform                    Special case of (GDG18) when ϵ=0
Vector                     (CMS11)
Table 1. The list of mechanisms included in diffprivlib.mechanisms.
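The "post-processing" noted for the truncated and folded variants in Table 1 can be sketched as follows. This is an illustrative reading of truncation (clamping an output to the bounds) and folding (reflecting an output about the nearest bound until it lies inside them); the function names truncate and fold are hypothetical, and the library's own mixin may differ in detail. Because both operations act only on the already-randomised output, they do not consume additional privacy budget.

```python
def truncate(value, lower, upper):
    """Clamp a noisy output to the interval [lower, upper]."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return min(max(value, lower), upper)


def fold(value, lower, upper):
    """Reflect a noisy output about the nearest bound until it is in range."""
    if lower >= upper:
        raise ValueError("lower bound must be strictly below upper bound")
    while value < lower or value > upper:
        if value < lower:
            value = 2 * lower - value  # reflect about the lower bound
        else:
            value = 2 * upper - value  # reflect about the upper bound
    return value


noisy = 5.8  # e.g. a Laplace output for a true value of 3
print(truncate(noisy, 0, 5))  # clamped onto the upper bound
print(fold(noisy, 0, 5))      # reflected back inside the bounds
```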

3.2. Machine Learning Models

The models developed in diffprivlib have been engineered to mirror the behaviour and syntax of the privacy-agnostic versions present in Scikit-learn. This allows for a simple one-step change to switch from a vanilla Scikit-learn machine learning model to one which implements differential privacy with diffprivlib.

For example, when implementing a Gaussian naïve Bayes classifier with differential privacy, a user need only replace the import statement from Scikit-learn with the corresponding import from diffprivlib and run their code and analysis in the same way. Diffprivlib supports the same pre-processing pipelines that are present in Scikit-learn, and, in some cases, their use is encouraged to optimise noise-addition and model sensitivity. The ϵ value is specified as a parameter when the model is initialised (e.g., GaussianNB(epsilon=0.1)), otherwise a default of ϵ=1 is used.

>>> from diffprivlib.models import GaussianNB
>>> nb = GaussianNB()
>>> X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
>>> y = [1, 1, 1, 2, 2, 2]
>>> nb.fit(X, y)
PrivacyLeakWarning: Bounds have not been specified ...
>>> nb.predict([[-0.5, -2], [3,2]])
array([1, 2])
Listing 2: Example usage of models

For more advanced users, the models can be parameterised with information about the data to ensure no privacy leakage (and the accompanying PrivacyLeakWarning) and optimal noise-addition. This can include specifying the range of values for each feature (column) of the data, or the maximum norm of each sample in the data. If these parameters are not specified in advance of training the model, the user is warned about a possible privacy leak. However, the model continues to execute, with the parameters computed on the data. For the example given in Listing 2, the warning could have been suppressed by initialising with GaussianNB(bounds=[(-3, 3), (-2, 2)]), as the values in each column of X fall within those bounds.

The differentially private machine learning models that have been implemented in Version 0.1.1 of diffprivlib are listed in Table 2.

Model                 Notes
Supervised learning
LogisticRegression    Implements the logistic regression classifier in (CMS11) (using the Vector mechanism), with minor changes to allow for non-unity data norm and to allow integration with the corresponding classifier in Scikit-learn.
GaussianNB            Implementation of the Gaussian naïve Bayes classifier presented in (VSB13) (using the Laplace and LaplaceBoundedDomain mechanisms) with minor amendments (the paper omitted splitting ϵ between the randomisation operations acting on the mean and variance of each feature, which we account for with a uniform split).
Unsupervised learning
KMeans                Implementation of the k-means algorithm presented in (SCL16) (using the GeometricFolded and LaplaceBoundedDomain mechanisms)
Table 2. The list of machine learning models included in diffprivlib.models.

3.3. Tools

The tools module of diffprivlib contains a number of basic tools and utilities that include differential privacy. Presently, the module includes histogram functions to compute differentially private histograms on data, and statistical functions for computing the differentially private mean, variance and standard deviation of an array of numbers. All of these tools mirror and leverage the functionality of the corresponding functions in NumPy, ensuring broad applicability and ease of use.
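As a rough illustration of how a differentially private histogram can be built, the sketch below adds two-sided geometric noise to each bin count and truncates negative results to zero. It assumes that the bin edges are fixed independently of the data and that neighbouring datasets differ by adding or removing one record, so that each count has sensitivity 1. The names two_sided_geometric and dp_histogram are hypothetical; this is not the library's code.

```python
import math
import random


def two_sided_geometric(epsilon, rng=random):
    """Sample integer noise with P(k) proportional to exp(-epsilon * |k|)."""
    def geometric():
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is finite
        return int(math.floor(-math.log(u) / epsilon))

    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


def dp_histogram(values, edges, epsilon, rng=random):
    """Noisy counts for bins [edges[i], edges[i+1]); the last bin is closed."""
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # Truncation at zero is post-processing of the noisy counts.
    return [max(0, c + two_sided_geometric(epsilon, rng)) for c in counts]


print(dp_histogram([0.1, 0.2, 0.6, 1.0], edges=[0.0, 0.5, 1.0], epsilon=1.0))
```

Integer-valued noise keeps the counts integral, which is one reason a geometric mechanism is a natural fit for histograms.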

Further details are given in Table 3.

Tools                                Notes
histogram, histogram2d, histogramdd  Histogram functions mirroring and leveraging the functionality of their NumPy counterparts, with differential privacy (using the GeometricTruncated mechanism)
mean, var, std                       Simple statistical functions mirroring and leveraging their NumPy counterparts, with differential privacy (using the Laplace and LaplaceBoundedDomain mechanisms)
Table 3. The list of differential privacy tools included in diffprivlib.tools.
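For the statistical functions, a differentially private mean can be sketched with the Laplace mechanism as below. The sketch assumes that the bounds are supplied from domain knowledge, that the number of samples is public, and that neighbouring datasets differ in the value of one record, giving the mean a sensitivity of (upper - lower)/n. The function dp_mean is hypothetical and is not the signature of the library's mean function.

```python
import random


def dp_mean(values, epsilon, bounds, rng=random):
    """Laplace-noised mean of `values`, each clipped to `bounds`."""
    lower, upper = bounds
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if not values:
        raise ValueError("values must not be empty")

    # Clipping enforces the bounds, so the sensitivity below always holds.
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)

    # Changing one record moves the mean by at most (upper - lower) / n.
    scale = (upper - lower) / (len(clipped) * epsilon)
    if scale == 0:
        return true_mean
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_mean + noise


print(dp_mean([0.2, 0.4, 0.9, 0.5], epsilon=1.0, bounds=(0.0, 1.0)))
```

The noise scale shrinks as the number of samples grows, so larger datasets tolerate smaller values of ϵ for the same accuracy.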

3.4. Troubleshooting

Two library-specific warnings can be triggered by diffprivlib:

  1. PrivacyLeakWarning: Triggered when a differential privacy application (in models or tools) has been incorrectly configured and consequently will not strictly satisfy differential privacy. In such circumstances, the execution of the application continues, but the user is warned about the potential for additional privacy leakage (beyond that controlled by differential privacy). Examples of a PrivacyLeakWarning include: (i) not specifying the bounds/norm of the data when calling a model (requiring the model to compute the bounds/norm on the data, thereby leaking privacy); or (ii) where the bounds/norm supplied to the model do(es) not cover all the data, resulting in a potential violation of differential privacy.

  2. DiffprivlibCompatibilityWarning: Triggered when a parameter, typically present in the parent of the function/class (i.e., in NumPy or Scikit-learn), is specified to a diffprivlib component which does not use the parameter. The user is warned of this incompatibility to ensure full understanding of the results, and execution continues with the parameter amended/deleted as required by diffprivlib.
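The two situations described for PrivacyLeakWarning can be mimicked with Python's standard warnings module. In the sketch below, PrivacyLeakWarningSketch and check_bounds are hypothetical stand-ins written for illustration; they are not the library's classes or functions, and the library's actual checks may be implemented differently.

```python
import warnings


class PrivacyLeakWarningSketch(RuntimeWarning):
    """Hypothetical stand-in for a library-specific privacy warning."""


def check_bounds(rows, bounds=None):
    """Return per-column (min, max) bounds, warning on possible leakage."""
    if bounds is None:
        # Case (i): the bounds must be computed on the data itself.
        warnings.warn(
            "Bounds have not been specified and will be computed on the data",
            PrivacyLeakWarningSketch,
        )
        return [(min(column), max(column)) for column in zip(*rows)]

    for row in rows:
        for value, (lower, upper) in zip(row, bounds):
            if not lower <= value <= upper:
                # Case (ii): the supplied bounds do not cover all the data.
                warnings.warn(
                    "Supplied bounds do not cover all the data",
                    PrivacyLeakWarningSketch,
                )
                return list(bounds)
    return list(bounds)
```

Because such warnings are ordinary Python warnings, a user who would rather fail than leak can promote them to exceptions, for example with warnings.simplefilter("error", PrivacyLeakWarningSketch) in this sketch.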

4. Worked Example

We now present an example use case of diffprivlib. In this example we use the naïve Bayes classifier GaussianNB, implemented as a child class of Scikit-learn’s GaussianNB(). Naïve Bayes is a probabilistic classifier that learns the means and variances of each feature (assumed independent) for each label, allowing Bayes’ theorem to be applied to classify unseen examples. Differential privacy is applied by adding noise to the means and variances that the model learns, ensuring a decoupling of the model from the data upon which it was trained.
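The classification step described above can be sketched in a few lines of standard-library Python. Given per-label priors, means and variances (which, in the differentially private setting, would be the noisy values learned by the model), the label with the highest log-posterior is returned. The function names and the rounded statistics below are illustrative assumptions derived from the toy data of Listing 2; no noise is added in this sketch.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(sample, priors, means, variances):
    """Return the label maximising log prior + sum of feature log-likelihoods."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        score = math.log(prior)
        # Features are assumed independent, so their log-likelihoods add up.
        for x, m, v in zip(sample, means[label], variances[label]):
            score += gaussian_log_pdf(x, m, v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Rounded per-label statistics of the toy data in Listing 2 (no noise added).
priors = {1: 0.5, 2: 0.5}
means = {1: [-2.0, -1.33], 2: [2.0, 1.33]}
variances = {1: [0.67, 0.22], 2: [0.67, 0.22]}

print(predict([-0.5, -2], priors, means, variances))  # prints 1
print(predict([3, 2], priors, means, variances))      # prints 2
```

Only the learned means and variances depend on the training data, which is why perturbing them is sufficient to decouple the predictions from individual training records.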

The first step in using the model is to import the training and test data we wish to use. For this simple example, we use the Iris flower dataset, and implement an 80/20 train/test split, using sklearn.

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> dataset = datasets.load_iris()
>>> X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.2)

We can now use X_train and y_train to train the differentially private model. To do so, we import GaussianNB() from diffprivlib and call its fit() method. The model will fall back on default parameters if none are specified at initialisation.

>>> from diffprivlib.models import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X_train, y_train)
PrivacyLeakWarning: Bounds have not been specified ...

The PrivacyLeakWarning is triggered as a result of not specifying the bounds of the data on initialisation. The model must therefore compute the bounds on the given data, resulting in additional privacy leakage. To prevent this, the bounds should be determined independently of the data (i.e., using domain knowledge), and specified as a parameter at initialisation.

We can now classify unseen examples, knowing the model now satisfies differential privacy and protects the privacy of the individuals in the training dataset. The resulting array is the predicted class of the corresponding row in X_test.

>>> clf.predict(X_test)
array([1, 0, 2, 1, 2, 1, 2, 1, 0, 0, 1, 2, 2, 0, 0, 0, 1, 1, 1, 1, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1])

We can check the accuracy of the prediction using the corresponding y_test array.

>>> (clf.predict(X_test) == y_test).sum() / y_test.shape[0]
0.9333333333333333

We can loop through values of ϵ to plot the accuracy of the model for a range of privacy guarantees, as plotted in Figure 1. The small size of the Iris flower dataset (150 samples) results in a high threshold for ϵ in order to maintain accuracy. Similar analysis on the larger UCI Adult dataset (48 842 samples) shows significantly superior accuracy, reaching 95% of the non-private accuracy at ϵ=0.05.

Figure 1. Comparison of accuracy versus ϵ for a differentially private naïve Bayes classifier on the Iris dataset. For each ϵ, the average accuracy over 30 simulations is shown.
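One way to build intuition for why a small dataset needs a larger ϵ is to look at the noise added to a simple statistic. The sketch below computes the standard deviation of the Laplace noise added to the mean of n values in a bounded range, under the assumption that the mean has sensitivity (upper - lower)/n. This is a simplified illustration for a single mean, not the exact analysis of the naïve Bayes classifier, and the function name laplace_noise_std is hypothetical.

```python
import math


def laplace_noise_std(n_samples, epsilon, lower=0.0, upper=1.0):
    """Standard deviation of Laplace noise added to a bounded mean."""
    sensitivity = (upper - lower) / n_samples
    scale = sensitivity / epsilon
    # A Laplace distribution with scale b has standard deviation sqrt(2) * b.
    return math.sqrt(2) * scale


# Dataset sizes quoted in the text: Iris (150) and UCI Adult (48 842).
for n in (150, 48842):
    for eps in (0.05, 1.0):
        print(f"n={n:>6}, epsilon={eps}: noise std = {laplace_noise_std(n, eps):.6f}")
```

For a fixed ϵ the noise is inversely proportional to n, so under these assumptions the larger dataset sees noise smaller by a factor of roughly 48842/150.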

5. Conclusion

In this paper we have presented the IBM Differential Privacy Library, a general purpose, open source Python library for the simulation, experimentation and development of applications and tools for differential privacy. We have presented the functionality of Version 0.1.1 of the library, but also its user-friendly development ethos, smooth integration with the popular NumPy and Scikit-learn packages and wider development potential which will be built upon in the forthcoming months and years. We encourage readers to install the library (with pip install diffprivlib), and start running their own user-friendly, differentially private machine learning tasks. As this is an open source project, we welcome and encourage contributions in all their forms to develop and enhance the library.

Acknowledgements

A special thanks goes to our former colleagues Maria-Irina Nicolae and Spiros Antonatos for their instrumental help in getting diffprivlib off the ground, and to our colleague Joao Bettencourt-Silva for his help in testing the library.

References

  • (1) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (New York, NY, USA, 2016), CCS ’16, ACM, pp. 308–318.
  • (2) Balle, B., and Wang, Y. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 (2018), pp. 403–412.
  • (3) Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (2011), 1069–1109.
  • (4) Dwork, C. Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 1–12.
  • (5) Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography. Springer, 2006, pp. 265–284.
  • (6) Dwork, C., and Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2014), 211–407.
  • (7) Geng, Q., Ding, W., Guo, R., and Kumar, S. Privacy and utility tradeoff in approximate differential privacy. cs.CR abs/1810.00877 (2018).
  • (8) Geng, Q., Kairouz, P., Oh, S., and Viswanath, P. The staircase mechanism in differential privacy. Selected Topics in Signal Processing, IEEE Journal of 9, 7 (2015), 1176–1184.
  • (9) Ghosh, A., Roughgarden, T., and Sundararajan, M. Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing 41, 6 (2012), 1673–1693.
  • (10) Holohan, N., Antonatos, S., Braghin, S., and Mac Aonghusa, P. The bounded Laplace mechanism in differential privacy. CoRR abs/1808.10410 (2018).
  • (11) Holohan, N., Leith, D. J., and Mason, O. Differential privacy in metric spaces: Numerical, categorical and functional data under the one roof. Information Sciences 305 (2015), 256–268.
  • (12) Holohan, N., Leith, D. J., and Mason, O. Optimal differentially private mechanisms for randomised response. IEEE Transactions on Information Forensics and Security 12, 11 (Nov 2017), 2726–2735.
  • (13) McSherry, F., and Talwar, K. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on (2007), IEEE, pp. 94–103.
  • (14) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • (15) Su, D., Cao, J., Li, N., Bertino, E., and Jin, H. Differentially private k-means clustering. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy (New York, NY, USA, 2016), CODASPY ’16, ACM, pp. 26–37.
  • (16) Vaidya, J., Shafiq, B., Basu, A., and Hong, Y. Differentially private naïve Bayes classification. In Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 01 (2013), WI-IAT ’13, IEEE Computer Society, pp. 571–576.
  • (17) van der Walt, S., Colbert, S. C., and Varoquaux, G. The NumPy array: A structure for efficient numerical computation. Computing in Science and Engineering 13, 2 (2011), 22–30.
  • (18) van Rossum, G., Warsaw, B., and Coghlan, N. PEP 8 – style guide for Python code. https://www.python.org/dev/peps/pep-0008/ [Accessed: 2019-06-21], 2001.