For a dataset of label-count pairs, an anonymized histogram is the multisetof counts. Anonymized histograms appear in various potentially sensitivecontexts such as password-frequency lists, degree distribution in socialnetworks, and estimation of symmetric properties of discrete distributions.Motivated by these applications, we propose the first differentially privatemechanism to release anonymized histograms that achieves near-optimal privacyutility trade-off both in terms of number of items and the privacy parameter.Further, if the underlying histogram is given in a compact format, the proposedalgorithm runs in time sub-linear in the number of items. For anonymizedhistograms generated from unknown discrete distributions, we show that thereleased histogram can be directly used for estimating symmetric properties ofthe underlying distribution.
Quick Read (beta)
Differentially private anonymized histograms
For a dataset of label-count pairs, an anonymized histogram is the multiset of counts. Anonymized histograms appear in various potentially sensitive contexts such as password-frequency lists, degree distribution in social networks, and estimation of symmetric properties of discrete distributions. Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves near-optimal privacy utility trade-off both in terms of number of items and the privacy parameter. Further, if the underlying histogram is given in a compact format, the proposed algorithm runs in time sub-linear in the number of items. For anonymized histograms generated from unknown discrete distributions, we show that the released histogram can be directly used for estimating symmetric properties of the underlying distribution.
Differentially private anonymized histograms
Ananda Theertha Suresh Google Research New York [email protected]
noticebox[b]Preprint. Under review.\[email protected]
Given a set of labels , a dataset is a collection of labels and counts, . An anonymized histogram of such a dataset is the unordered multiset of all non-zero counts without any label information,
For example, if , , then 11 1 is a multiset and not a set and duplicates are allowed.. Anonymized histograms do not contain any information about the labels, including the cardinality of . Furthermore, we only consider histograms with positive counts. The results can be extended to histograms that include zero counts. A histogram can also be represented succinctly using prevalences. For a histogram , the prevalence is the number of elements in the histogram with count ,
In the above example, , , and for . Anonymized histograms are also referred to as histogram of histograms [batu2000testing], histogram order statistics [paninski2003estimation], profiles [orlitsky2004modeling], unattributed histograms [hay2010boosting], fingerprints [valiant2011estimating], and frequency lists [blocki2016differentially].
Anonymized histograms appear in several potentially sensitive contexts ranging from password frequency lists to social networks. Before we proceed to the problem formulation and results, we first provide an overview of the various contexts where anonymized histograms have been studied under differential privacy and their motivation.
Password frequency lists: Anonymized password histograms are useful to security researchers who wish to understand the underlying password distribution in order to estimate the security risks or evaluate various password defenses [bonneau2012science, blocki2016differentially]. For example, if is the most frequent password, then is the number of accounts that an adversary could compromise with guesses per user. Hence, if the server changes the -strikes policy from to , the frequency distribution can be used to evaluate the security implications of this change. We refer readers to [blocki2016differentially, blocki2018economics] for more uses of password frequency lists. Despite their usefulness, organizations may be wary of publishing these lists due to privacy concerns. This is further justified as it is not unreasonable to expect that an adversary will have some side information based on attacks against other organizations. Motivated by this, [blocki2016differentially] studied the problem of releasing anonymized password histograms.
Degree-distributions in social networks: Degree distributions is one of the most widely studied properties of a graph as it influences the structure of the graph. Degree distribution can also be used to estimate linear queries on degree distributions such as number of -stars. However, some graphs may have unique degree distributions and releasing exact degree distributions is no more safer than naive anonymization, which can leave social network participants vulnerable to a variety of attacks [backstrom2007wherefore, hay2008resisting, narayanan2009anonymizing]. Thus releasing them exactly can be revealing. Hence, [hay2009accurate, hay2010boosting, karwa2012differentially, kasiviswanathan2013analyzing, raskhodnikova2015efficient, blockidifferentially2, day2016publishing] considered the problem of releasing degree distributions of graphs with differential privacy. Degree distributions are anonymized histograms over the graph node degrees.
Estimating symmetric properties of discrete distributions: Let . A discrete distribution is a mapping from a domain to such that . Given a discrete distribution over symbols, a symmetric property is a property that depends only on the multiset of probabilities [valiant2011power, acharya2017unified], e.g., entropy ( ). Other symmetric properties include support size, Rényi entropy, distance to uniformity, and support coverage. Given independent samples from an unknown , the goal of property estimation is to estimate the value of the symmetric property of interest for . Estimating symmetric properties from unknown distributions has received a wide attention in the recent past e.g., [valiant2011estimating, valiant2011power, jiao2015minimax, wu2016minimax, orlitsky2016optimal, acharya2017unified] and has applications in various fields from neuro-science [paninski2003estimation] to genetics [zou2016quantifying]. Recently, [acharya2018inspectre] proposed algorithms to estimate support size, support coverage and entropy with differential privacy. Optimal estimators for symmetric properties only depend on the anonymized histograms of the samples [batu2000testing, acharya2017unified]. Hence, releasing anonymous histograms with differential privacy would simultaneously yield differentially-private plug-in estimators for all symmetric properties.
2 Differential privacy
Before we outline our results, we first define the privacy and utility aspects of anonymized histograms. Privacy has been studied extensively in statistics and computer science [dalenius1977towards, adam1989security, agrawal2001design, dwork2008differential] and references therein. Perhaps the most studied form of privacy is differential privacy (DP) [dwork2006calibrating, dwork2014algorithmic], where the objective is to ensure that an adversary would not infer whether a user is present in the dataset or not.
We study the problem of releasing anonymized histograms via the lens of global-DP. We begin by defining the notion of DP. Formally, given a set of datasets and a notion of neighboring datasets , and a query function , for some domain , then a mechanism is said to be -DP, if for any two neighboring datasets , and all ,
Broadly-speaking -DP ensures that given the output, an attacker would not be able to differentiate between any two neighboring datasets. -DP is also called pure-DP and provides stricter guarantees than the approximate -DP, where equation (1) needs to hold with probability .
Since introduction, DP has been studied extensively in various applications from dataset release to learning machine learning models [abadi2016deep]. It has also been adapted by industry [erlingsson2014rappor]. There are two models of DP: server or global or output DP, where a centralized entity has access to the entire dataset and answers the queries in a DP manner. The second model is local DP, where -DP is guaranteed for each individual user’s data [warner1965randomized, kasiviswanathan2011can, duchi2013local, kairouz2014extremal, acharya2018communication]. We study the problem of releasing anonymized histograms under global-DP. Here is the set of anonymized histograms, is the identity mapping, and .
2.2 Distance measure
For DP, a general notion of neighbors is as follows. Two datasets are neighbors if and only if one can be obtained from another by adding or removing a user [dwork2008differential]. Since, anonymized histograms do not contain explicit user information, we need few definitions to apply the above notion. We first define a notion of distance between label-count datasets. A natural notion of distance between datasets and over is the distance, , where is the count of in dataset . Since anonymized histograms do not contain any information about labels, we define distance between two histograms as
The following simple lemma characterizes the above distance in terms of counts.
Lemma 1 (Appendix B).
For an anonymized histogram , let be the highest count in the dataset.22 2 For larger than number of counts in , . For any two anonymized histograms ,
The above distance is also referred to as sorted distance or earth-mover’s distance. With the above definition of distance, we can define neighbors as follows.
Two anonymized histograms and are neighbors if and only if .
The above definition of neighboring histograms is same as the definition of neighbors in the previous works on anonymized histograms [hay2010boosting, blocki2016differentially].
3 Previous and new results
3.1 Anonymized histogram estimation
Similar to previous works [blocki2016differentially], we measure the utility of the algorithm in terms of the number of items in the anonymized histogram, .
Previous results: The problem of releasing anonymized histograms was first studied by [hay2009accurate, hay2010boosting] in the context of degree distributions of graphs. They showed that adding Laplace noise to each count, followed by a post-processing isotonic regression step results in a histogram with expected sorted- error of
Their algorithm runs in time . The problem was also considered in the context of password frequency lists by [blocki2016differentially]. They observed that an exponential mechanism over integer partitions yields an -DP algorithm. Based on this, for , they proposed a dynamic programming based relaxation of the exponential mechanism that runs in time and returns a histogram such that
with probability . Furthermore, the relaxed mechanism is -DP.
The best information-theoretic lower bound for the utility of any -DP mechanism is due to [alda2018lower], who showed that for , any -DP mechanism has expected error of for some dataset.
New results: Following [blocki2016differentially], we study the problem in metric. We propose a new DP mechanism PrivHist that satisfies the following:
Given a histogram in the prevalence form , PrivHist returns a histogram and a sum count that is -DP. Furthermore, if , then
for some constant and has an expected run time of . If then,
and has an expected run time of .
Together with the lower bound of [alda2018lower], this settles the optimal privacy utility trade-off for up to a multiplicative factor of . We also show that PrivHist is near-optimal for , by showing the following lower bound.
Theorem 2 (Appendix E).
For a given , let . For any -DP mechanism , there exists a histogram , such that
Theorems 1 and 2 together with [alda2018lower] show that the the proposed mechanism has near-optimal utility for all . We can infer the number of items in the dataset by . However, this estimate is very noisy. Hence, we also return the sum of counts as it is useful for applications in symmetric property estimation for distributions. Apart from the near-optimal privacy-utility trade-off, we also show that PrivHist has several other useful properties.
Time complexity: By the Hardy-Ramanujan integer partition theorem [hardy1918asymptotic], the number of anonymized histograms with items is . Hence, we can succinctly represent them using space. Recall that any anonymized histogram can be written as , where is the number of symbols with count . Let be the number of distinct counts and let be the distinct counts with non-zero prevalences. Then and
and hence there are at most non-zero prevalences and can be represented as using count-prevalence pairs. Histograms are often stored in this format for space efficiency e.g., password frequency lists in [bonneau_2015]. PrivHist takes advantage of this succinct representation. Hence, given such a succinct representation, it runs time as opposed to the running time of [hay2009accurate] and running time of [blocki2016differentially]. This is highly advantageous for large datasets such as password frequency lists with data points [blocki2016differentially].
Pure vs approximate differential privacy: The only previous known algorithm with utility of is that of [blocki2016differentially] and it runs in time . However, their algorithm is -approximate DP which is strictly weaker than PrivHist, whose output is -DP. For applications in social networks it is desirable to have group privacy for large groups [dwork2014algorithmic]. For groups of size , approximate DP, scales as -DP, which can be prohibitive for large values of . Hence -DP is preferable.
Applications to symmetric property estimation: We show that the output of PrivHist can be directly applied to obtain near-optimal sample complexity algorithms for discrete distribution symmetric property estimation.
3.2 Symmetric property estimation of discrete distributions
For a symmetric property and an estimator that uses samples, let be an upper bound on the worst expected error over all distributions with support at most , . Let sample complexity denote the minimum number of samples such that ,
Given samples , let denote the corresponding anonymous histogram. For a symmetric property , linear estimators of the form
are shown to be sample-optimal for symmetric properties such as entropy [wu2016minimax], support size [valiant2011power, jiao2015minimax], support coverage [orlitsky2016optimal], and Rényi entropy [acharya2014complexity], where s are some distribution-independent coefficients that depend on the property . Recently, [acharya2018inspectre] showed that for any given property such as entropy or support size, one can construct DP estimators by adding Laplace noise to the non-private estimator. They further showed that this approach is information theoretically near-optimal.
Instead of just computing a DP estimate for a given property, the output of PrivHist can be directly used to estimate any symmetric property. By the post-processing lemma [dwork2014algorithmic], since the output of PrivHist is DP, the estimate is also DP. For an estimator , let be the Lipschitz constant given by If instead of , a DP histogram and the sum of counts is available, then can be modified as
which is differentially private. Using Theorem 1, we show that:
Corollary 1 (Appendix F).
Let satisfy , for a . Further, let there exists such that . Let . If is the sample complexity of estimator , then for
for some constant . For ,
Further, by the post-processing lemma, is also -DP.
For entropy (), normalized support size (), and normalized support coverage, there exists sample-optimal linear estimators with and have the property [acharya2017unified, acharya2018inspectre]. Hence the sample complexity of the proposed algorithm increases at most by a polynomial in . Furthermore, the increase is dependent on the maximum value of the function for distributions of interest and it does not explicitly depend on the support size. This result is slightly worse than the property specific results of [acharya2018inspectre] in terms of dependence on and . In particular, for entropy estimation, the main term in our privacy cost is and the bound of [acharya2018inspectre] is . Thus for , our dependence on and is slightly worse. However, we note that our results are more general in that can be used with any linear estimator. For example, our algorithm implies DP algorithms for estimating distance to uniformity, which have been not been studied before. Furthermore, PrivHist can also be combined with the maximum likelihood estimators of [orlitsky2004modeling, orlitsky2016optimal] and linear programming estimators of [valiant2011estimating], however we do not provide any theoretical guarantees for these combined algorithms.
In the algorithm description and analysis, let denote the vector and let denote the cumulative prevalences. Since, anonymized histograms are multisets, we can define the sum of two anonymized histograms as follows: for two histograms , the sum is given by Furthermore, since there is a one-to-one mapping between histograms in count form and in prevalence form , we use both interchangeably. For the ease of analysis, we also use the notation of improper histogram, where the ’s can be negative or non-integers. Finally, for a histogram indexed by super-script , we define for the ease of notation.
Instead of describing the technicalities involved in the algorithm directly, we first motivate the algorithm with few incorrect or high-error algorithms. Before we proceed recall that histograms can be written either in terms of prevalences or in terms of sorted counts .
An incorrect algorithm: A naive tendency would be to just add noise only to non-zero prevalences or counts. However, this is not differentially private. For example, consider two neighboring histograms in prevalence format, and . The resulting outputs for the above two inputs would be very different as the output of never produces a non-zero , whereas the output of produces non-zero with high probability. Similar counter examples can be shown for sorted counts.
A high-error algorithm: Instead of adding noise to non-zero counts or prevalences, one can add noise to all the counts or prevalences. It can be shown that adding noise to all the counts (including those appeared zero times), yields a error , whereas adding noise to prevalences can yield an error of , if we naively use the utility bound in terms of prevalences (3). We note that [hay2009accurate] showed that a post-processing step after adding noise to sorted-counts and improves the utility. A naive application of Cauchy-Schwarz inequality yields an error of for that algorithm. While it might be possible to improve the dependence on by a tighter analysis, it is not clear if the dependence on can be improved.
The algorithm is given in PrivHist. After some computation, it calls two sub-routines PrivHist-LowPrivacy and PrivHist-HighPrivacy depending on the value of . PrivHist has two main new ideas: (i) splitting around and using prevalences in one regime and counts in another and (ii) the smoothing technique used to zero out the prevalence vector. Of the two (i) is crucial for the computational complexity of the algorithm and (ii) is crucial in improving the -dependence from to in the high privacy regime (). There are more subtle differences such as using cumulative prevalences instead of actual prevalences. We highlight them in the next section. We now overview our algorithm for low and high privacy regimes separately.
4.2 Low privacy regime
We first consider the problem in the low privacy regime when . We make few observations.
Geometric mechanism vs Laplace mechanism: For obtaining DP output of integer data, one can add either Laplace noise or geometric noise [ghosh2012universally]. For -DP, the expected noise added by Laplace mechanism is , which strictly larger than that of the geometric mechanism () (see Appendix A). For , we use the geometric mechanism to obtain optimal trade off in terms of .
Prevalences vs counts: If we add noise to each coordinate of a -dimensional vector, the total amount of noise in norm scales linearly in , hence it is better to add noise to a small dimensional vector. In the worst case, both prevalences and counts can be an -dimensional vector. Hence, we propose to use prevalences for small values of , and use counts for . This ensures that the dimensionality of vectors to which we add noise is at most .
Cumulative prevalences vs prevalences: The error can be bounded in terms of prevalences as follows. See Appendix B for a proof.
If we add noise to prevalences, the error can be very high as noise is multiplied with the corresponding count (3) . The bound in terms of cumulative prevalences is much tighter. Hence, for small values of , we use cumulative prevalences instead of prevalences themselves.
The above observations provide an algorithm for the low privacy regime. However, there are few technical difficulties. For example, if we split the counts at a threshold naively, then it is not differentially private. We now describe each of the steps in high-privacy regime and how we overcome these technical difficulties.
(1) Find : To divide the histogram into two smaller histograms, we need to know , which may not be available. Hence, we allot privacy cost to find a DP value of .
(2) Sensitivity preserving histogram split: If we divide the histogram into two parts based on counts naively and analyze the privacy costs independently for the higher and smaller parts separately, then the sensitivity would be lot higher for certain neighboring histograms. For example, consider two neighboring histograms and . If we divide in to two parts based on threshold , say and and and , then . Thus, the distance between neighboring separated histograms , would be much higher compared to and we need to add a lot of noise. Therefore, we perturb and using geometric noise. This ensures DP in instances where the neighboring histograms differ at and , and doesn’t change the privacy analysis for other types of histograms. However, adding noise may make the histogram improper as can become negative. To this end, we add fake counts at and to ensure that the histogram is proper with high probability. We remove them later in (L4). We refer readers to Appendix C.1.1 for details about this step.
(3,4) Add noise: Let (small counts) and (large counts) be the split-histograms. We add noise to cumulative prevalences in and counts in as described in the overview of the proposed algorithm.
(L1, L2) Post-processing: The noisy versions of may not satisfy the properties satisfied by the histograms i.e., . To overcome this, we run isotonic regression over noisy subject to the monotonicity constraints i.e., given noisy counts , find that minimizes , subject to the constraint that , for all . Isotonic regression in one dimension can be run linear in the number of inputs using Pool Adjacent Violators Algorithm (PAVA) or its variants [barlow1972statistical, mair2009isotone]. Hence, the time complexity of this algorithm is . We then round the prevalences to the nearest non-negative integers. We similarly post-process large counts and remove the fake counts that we introduced in step (2).
Since we used succinct representation of histograms, used prevalences for smaller than and counts otherwise, the expected run time of the algorithm is for .
4.3 High privacy regime
For the high-privacy regime, when , all known previous algorithms achieve an error of . To reduce the error from to , we use smoothing techniques to reduce the sensitivity and hence reduce the amount of added noise.
Smoothing method: Recall that the amount of noise added to a vector depends on its dimensionality. Since prevalences have length , the amount of noise would be . To improve on this, we first smooth the input prevalence vector such that it is non-zero only for few values of and show that the smoothing method reduces the sensitivity of cumulative prevalences and hence reduces the amount of noise added.
While applying smoothing is the core idea, two questions remain: how to select the location of non-zero values and how to smooth to reduce the sensitivity? We now describe these technical details.
(H1) Approximate high prevalences: Recall that was obtained by adding geometric noise to . In the rare case that this geometric noise is very negative, then there can be prevalences much larger than . This can affect the smoothing step. To overcome this, we move all counts above to . Since this changes the histogram with low probability, it does not affect the error.
(H2) Compute boundaries: We find a set of boundaries and smooth counts to elements in . Ideally we would like to ensure that there is a boundary that exists close to every count. For small values of , we ensure this by adding all the counts and hence there is no smoothing. If , we use boundaries that are uniform in the log-count space. However, using this technique for all values of , results in an additional factor. To overcome this, for , we use the noisy large counts in step (4) to find the boundaries and ensure that there is a boundary close to every count.
(H3) Smooth prevalences: For a that lies between two boundaries and , we divide into and as follows. We assign fraction of to and the remaining to . If two neighboring histograms differ in and , then after smoothing, and differ by . Hence the sensitivity goes down to and we can add less noise. We refer readers to Appendix C.1.2 for details about this step.
(H4) Add small noise: Since the prevalences are smoothed, we add small amount of noise to the corresponding cumulative prevalences. For , we add to obtain DP.
(H5) Post-processing: Finally, we post-process the prevalences similar to (L1) to impose monotonicity and ensure that the resulting prevalences are positive and non-negative integers.
Since we used succinct histogram representation, ensured that the size of is small, and used counts larger than to find boundaries, the expected run time is for .
Authors thank Jayadev Acharya and Alex Kulesza for helpful comments and discussions.
Algorithm PrivHist-LowPrivacy Input: low-count histogram , high-count histogram and Output: a histogram . L1. Post processing of : (a) Find that minimizes with . (b) Find such that for all , . L2. Post processing of : Compute . L3. Let . L4. Compute by removing elements closest to from and then removing elements closest to and output it.
Appendix: Differentially private anonymized histograms
Appendix A Geometric mechanism
The mostly popular mechanism for -DP is the Laplace mechanism, which is defined as follows.
Definition 2 (Laplace mechanism () [dwork2014algorithmic]).
When the true query result is , the mechanism outputs where Z is a random variable distributed as a Laplace distribution distribution: for every . If output of has sensitivity , then to achieve -DP add .
Since, we have integer inputs, we use the geometric mechanism:
Definition 3 (Geometric mechanism () [ghosh2012universally]).
When the true query result is , the mechanism outputs where Z is a random variable distributed as a two-sided geometric distribution: for every integer . If output of is integers and has sensitivity (an integer), then to achieve -DP add .
[ghosh2012universally] showed that geometric mechanism is universally optimal for a general class of functions under a Bayesian framework. Geometric mechanism is beneficial over Laplace mechanism in two ways: The output space of the mechanism is discrete. Since we have integer inputs, this removes the necessity of adding rounding off the outputs. For -DP, the expected noise added by the Laplace mechanism is , which strictly larger than that of the geometric mechanism () (see below). For moderate values of , this difference is a constant. We now state few properties of geometric distribution which are used in the rest of the paper.
We find the following set of equations useful in the rest of the paper. In the following let be a geometric random variable and be a Laplace random variable.
The next lemma bounds moments of when is a zero mean random variable.
Let be a random variable and . If , then
To prove the first inequality, observe that
Taking expectation yields the first equation. For the second term,
Combining the above two equations and using the fact that yields the second equation in the lemma. ∎
Appendix B Properties of the distance metric
Proof of Lemma 1.
Recall that the distance between to histograms is given by
Let and be the datasets that achieve the minimum above. Consider any two labels such that . Let be the dataset obtained as follows: and and for all other , . Since is the optimum,
Expanding both sides and canceling common terms, we get,
and thus if , then . Hence, the label of the highest count in both the datasets should be the same and
The distance measure satisfies triangle inequality, i.e., for any three histograms , and ,
The proof of the above equation is a simple consequence of Lemma 1 and is omitted. We now show that dividing histograms only increases the distance.
If and , then
Since the elements in are same as elements in and elements in are same as elements in , there exists a permutation such that
Similar to proof of Lemma 1, it can be shown that the that minimizes the above sum is the one that matches highest count in to highest count in and hence
It is useful to have few upper bounds on the distance over histograms.
For any two histograms ,
where is the maximum such that .
We prove the first inequality by induction on . Suppose , then the inequality holds trivially as
Now suppose it holds for all . For . Let and be two datasets obtained as follows:
This mapping preserves the ordering of s up to ties and . Thus,
Combining the above equation with Equation (6) yields the first inequality. For the second inequality, observe that
where the inequality follows by triangle inequality and the last equality follows by observing that each term corresponding to index appears exactly times. ∎
We now show a simple property of rounding off integers.
Let be integers. Let be real numbers. Let be the nearest integer to . Then,
For any ,
where the second inequality follows from the observation that is the nearest integer to . Summing over all indices yields the lemma. ∎
We need the next auxilllary lemma, which we use in the proofs.
For a histogram , let be the histogram obtained by adding elements of value to . Let be another histogram and let is obtained by removing elements that are closest to . Then
Let be the histogram obtained by adding elements of value to . Since adding same number of elements to two datasets do not decrease the distance,
where the second inequality follows by triangle inequality. Consider the set of all histograms that have . Both and belong to this set. It can be shown that of all histograms in that set is closest to and hence
and hence the lemma. ∎
Appendix C Privacy analysis of PrivHist
C.1 Overview of privacy analysis
We break the analysis of PrivHist step by step. We will show that release of (step ) is -DP. Then, we show that and are -DP. Observe that PrivHist-LowPrivacy is just a post-processing step and by the post processing lemma does not need any differentialy privacy analysis. Finally we show that PrivHist-HighPrivacy is differentially private. By the composition theorem [dwork2014algorithmic], it follows that the total privacy cost is and hence the privacy cost in Theorem 1. Of the above steps, proving is -DP is straightforward and a sketch is in Lemma 7. Proving and is -DP is more involved and is in Lemma 8. The main intuition behind Lemma 8 is a sensitivity preserving histogram split, which we describe below.
C.1.1 Sensitivity preserving histogram split
Any two neighboring datasets and can fall into one of three categories:
They differ in and .
They differ in and for some .
They differ in and for some .
For cases and above, it suffices to add noise to cumulative prevalences and counts as in (3) and (4). However, if they differ in and , the analysis is more involved. For example, consider the following simple example. and . and have distance of one and are neighbors. If we divide in to two parts based on threshold , say and and and , then . Thus, if we naively add noise to cumulative prevalences for and to counts , then we need to add noise , which makes the utility of the algorithm much worse. To overcome this, we preprocess by moving mass from to , where is a Geometric random variable. This provides the required privacy without increasing the utility considerably. Finally, moving mass can make the histogram to have negative prevalences. To overcome this, we add fake counts to and .
C.1.2 Smoothing method
The main ingredient in proving better utility in the high privacy regime is the smoothing technique, which we describe now with an example. Suppose we know that all histograms have non-zero prevalences only between counts and and further suppose we have two neighboring histograms and as follows. and , for all and and , for all . If we want to release the prevalences or cumulative prevalences, we add noise of for each prevalence in . Thus the total amount of noise would be . We propose to reduce this noise by smoothing prevalences.
For a , we divide into and as follows. We assign fraction of to and the remaining to . After this transformation, the first histogram becomes given by, and and all other prevalences are zero. Similarly, the second histogram becomes, and and all other prevalences are zero. Thus the prevalences after smoothing differ only in two locations and and they differ by at most . Thus the total amount of noise needed for a DP release is to these two prevalences. However, note that we also incur a loss as due to smoothing, which can be shown to be . Hence, the total amount of error would be . Choosing , yields a total error of . The above analysis is for a toy example and extending it to general histograms requires additional work. In particular, we need to find the smoothing boundaries that give the best utility. As described in Section 4.3, we choose boundaries based on logarithmic partition of counts and also by private values of counts. The utility analysis with these choice of boundaries is in Appendix D.2.
C.1.3 Choise of
We note that there is no particular reason for , , and to be equal and we chose those values for simplicity and easy readability. For example, since is just used to estimate , the analysis of the algorithm shows that affects utility more than . Hence, we can set and to get better practical results. Furthermore, for low privacy regime, the algorithm only uses a privacy budget of . Hence, we can set , , and .
C.2 Technical details
We first prove a dataset depending composition theorem that helps us decompose differential privacy analysis depending on the dataset.
Theorem 3 (Dataset dependent composition theorem).
Let be a set of independent random variables. Let be a deterministic function. Similarly let be deterministic functions for . If for any two neighboring data sets and ,
then is an -DP output.
For any two datasets, since is a Markov chain,
Hence for any two datasets , any , for all ,
Similarly for any ,
Hence for any two datasets ,
and hence for every pair of datasets if the right hand side is smaller than , then is diffferentially private. ∎
C.3 Privacy analysis
We start with proving is -DP.
The proof follows from Definition 3 and the fact that for any two neighboring datasets , , and hence the sensitivity of this query is . ∎
We now show that release of and is DP.
Release of and is -DP.
is a Markov chain. We use Theorem 3 to show that the output of this Markov chain is DP for all datasets. For any two neighboring datasets and can fall into one of three categories:
They differ in and .
They differ in and for some .
They differ in and for some .
We prove that release is -DP for each of the above three cases.
Case 1: We show that the process satisfies (7). Observe that
follows by observing that there is a one to one correspondence between the histogram and the prevalences, follows from the fact that noise is added only to and , and follows from the fact that the noise added to is a deterministic function of , and always.
Case 2: We show that satisfies (7). Let and be the values of for inputs and respectively. For any , since difference of maximums is at most maximum of differences,
The first equality follows from the observation that there is a one to one correspondence between s and s. The second equation follows from the fact that noise added to various s are independent of each other. The rest of the proof follows from the definition of Geometric mechanism and the fact that and are neighbors.
Case 3: We show that satisfies (7). Let and are the ’s corresponding to and respectively. Conditioned on the value of , for any two neighboring histograms that differ in two consecutive ’s that are larger than ,
Since their sums are equal, conditioned on the value of , it can be shown that are proper histograms and both contain elements and they differ in at most two consecutive values of . Thus and contain same number of counts and differ in at most one count denoted by . With these observations we now bound the ratio of probabilities for differential privacy.
Since there is a one to one correspondence between sorted counts and the histograms, we get
where the last set of inequalities follow from the definition of Geometric mechanism and the fact that the noise added to s are independent of each other, and is the only count in which the two histograms differ. ∎
We now show that PrivHist-HigPrivacy is -DP.
PrivHist-HigPrivacy is -DP.
Observe that is a Markov chain, hence it suffices to prove is -DP.
Let and are two neighboring datasets. Without loss of generality, let they differ in and . Let and be the two histograms obtained after quanitzation step . Since and differ only at and , and differ only in and for some . Furthermore,
For all , . Hence for only one value of , let be this value.
Appendix D Utility analysis of PrivHist
Let be the output of either PrivHist-LowPrivacy or PrivHist-HighPrivacy. In both the low and high privacy regimes, the output error can be bounded as
where the last inequality, follows by breaking it in to cases and . The bound on follows from the fact that , Lemma 2, and the moments of the geometric distribution. In the next two sections, we bound for both low privacy and high privacy regimes.
D.1 Low privacy regime
In the following analysis, let be a geometric random variable distributed as . Let . For the bounds on the histogram, by Lemma 6 and triangle inequality,
For the second term, observe that since is obtained by moving terms between and and then majorizing it. If , then majorization does not change the histogram. Hence,
Taking expectation on both sides
For the first term, by Lemma 3,
We now bound both the terms above. For the large counts, let be the index of the noisy count .
Since the number of terms above is at most , in expectation,
For the smaller counts observe that
where follows from the fact that rounding off increases the error at most by (Lemma 5), follows by the Cauchy-Schwarz inequality, and follows from the fact that are monotonic and hence monotonic projection only decreases the error. Hence in expectation,
Substituting and , yields
Taking expectation w.r.t. and using Lemma 2 yields
where the last inequality follows from moments of geometric distribution.
D.2 High privacy utility
For the high privacy regime, by the triangle inequality,
For the first term, since we are only reducing counts of certain elements,
The second term can be bounded as
where follows by Lemma 5 and the last inequality follows by the triangle inequality. The first term in the last equation corresponds to the smoothing error and we analyze it now.
Since for ,
For , that lies between and in ,
If , the analysis depends on the value of . If , , and , then there exists a that is at most away for from . If not, then the error is at most . Hence, the smoothing error in expectation can be bounded by
For the second part,
where follows by observing that for , follows from the Cauchy-Schwarz inequality. follows from the fact that projecting on to the simplex only increases the error. By the second moments of , taking expectation on both sides yields
We now bound the size of the set . By step of the algorithm, number of elements in can be bounded by
where is the number of elements less than in , that upon adding noise increases to . Expectation of conditioned on is
where the second inequality follows by substituting and third inequality follows by algebraic manipulation. Combining (14), (15), (16), (17), and (18), and using the fact that quantization error is at most yields
where the last equation follows by substituting the value of . Taking expectation with respect to and using Lemma 2 and the fact that yields,
D.3 Time complexity
In this section, we provide a proof sketch of the time complexity. Suppose . The number of prevalences in is . Similarly, the number of counts in is . Further, the number of additional fake counts added is and the isotonic regression using PAVA takes linear in the size of the input. Hence, the expected run time, conditioned on is at most , which upon taking expectation w.r.t. yields an expected run time of .
If , steps (1)-(4) in PrivHist takes time . The time complexity to find boundaries, smooth prevalences, add noise, and perform isotonic regression is . By the utility analysis, . Summing all the time complexities and taking expectation w.r.t. similar to the utility analysis, yields a total time complexity of .
Appendix E Lower bound in the low privacy regime
Let . Suppose two vectors are neighbors if . If be an -DP estimate of , then
Let be the uniform distribution over . For a vector , let denote the two vectors such that and . Then
For any and , vectors in are neighbors and hence by the definition of DP,
Substituting the above lower bound in (19) yields the lemma. ∎
We now use the above bound to prove Theorem 2. We first define a of histograms to show the lower bound. Let . For a given vector , let and , for , and otherwise. Observe that and are valid histograms. If two vectors and have hamming distance , then the corresponding distance between anonymized histograms is .
Consider a slightly different definition of neighboring datasets over histograms, where two datasets are neighboring if the neighboring datasets are distance apart. If a mechanism is -DP in the previous definition of neighboring datasets, then the mechanism is -DP in the new notion of neighbors.
One mechanism for releasing the vectors with -DP is to encode it as the histograms as mentioned above and release them. Since such a mechanism has hamming distance , the anonymized histograms cannot be estimated with accuracy with -DP under new definition of neighbors and hence it cannot be estimated with accuracy with -DP under the old definition of neighbors.
Appendix F Proof of Corollary 1
Recall that is the DP estimate of and is the DP estimate of . In the following Let be the initial set of samples. Let be samples obtained from as follows. If , obtain by removing samples uniformly from . If , add samples from to . Note that are i.i.d. samples from . We bound the error of the estimator as follows.
Taking expectation on the second term,
for some constant . For the case , we bound as follows.
We bound each of the three (difference) terms above. First observe that by the assumptions in the theorem:
Since and differ in at most terms, by the properties of sorted distance,
Further by the properties of sorted distance,
Summing all the three terms, we get the error is at most
For , difference between expected error and is at most
where the last inequality uses the fact that . Combining the result with the case , we get that
The result for follows if
For , by (20), the difference between expected error and is at most
Combining with the result with the case, , we get
The result for follows if