Differentially private anonymized histograms

  • 2019-10-08 17:24:32
  • Ananda Theertha Suresh
  • 13

Abstract

For a dataset of label-count pairs, an anonymized histogram is the multisetof counts. Anonymized histograms appear in various potentially sensitivecontexts such as password-frequency lists, degree distribution in socialnetworks, and estimation of symmetric properties of discrete distributions.Motivated by these applications, we propose the first differentially privatemechanism to release anonymized histograms that achieves near-optimal privacyutility trade-off both in terms of number of items and the privacy parameter.Further, if the underlying histogram is given in a compact format, the proposedalgorithm runs in time sub-linear in the number of items. For anonymizedhistograms generated from unknown discrete distributions, we show that thereleased histogram can be directly used for estimating symmetric properties ofthe underlying distribution.

 

Quick Read (beta)

Differentially private anonymized histograms

Ananda Theertha Suresh
Google Research New York
[email protected]
Abstract

For a dataset of label-count pairs, an anonymized histogram is the multiset of counts. Anonymized histograms appear in various potentially sensitive contexts such as password-frequency lists, degree distribution in social networks, and estimation of symmetric properties of discrete distributions. Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves near-optimal privacy utility trade-off both in terms of number of items and the privacy parameter. Further, if the underlying histogram is given in a compact format, the proposed algorithm runs in time sub-linear in the number of items. For anonymized histograms generated from unknown discrete distributions, we show that the released histogram can be directly used for estimating symmetric properties of the underlying distribution.

 

Differentially private anonymized histograms


  Ananda Theertha Suresh Google Research New York [email protected]

\@float

noticebox[b]Preprint. Under review.\[email protected]

1 Introduction

Given a set of labels 𝒳, a dataset D is a collection of labels and counts, D=def{(x,nx):x𝒳}. An anonymized histogram of such a dataset is the unordered multiset of all non-zero counts without any label information,

h(D)=def{nx:x𝒳 and nx>0}.

For example, if 𝒳={a,b,c,d}, D={(a,8),(b,0),(c,8),(d,3)}, then h(D)={3,8,8}11 1 h(D) is a multiset and not a set and duplicates are allowed.. Anonymized histograms do not contain any information about the labels, including the cardinality of 𝒳. Furthermore, we only consider histograms with positive counts. The results can be extended to histograms that include zero counts. A histogram can also be represented succinctly using prevalences. For a histogram h, the prevalence φr is the number of elements in the histogram with count r,

φr(h)=defnxh𝟏nx=r.

In the above example, φ3(h)=1, φ8(h)=2, and φr(h)=0 for r{3,8}. Anonymized histograms are also referred to as histogram of histograms [batu2000testing], histogram order statistics [paninski2003estimation], profiles [orlitsky2004modeling], unattributed histograms [hay2010boosting], fingerprints [valiant2011estimating], and frequency lists [blocki2016differentially].

Anonymized histograms appear in several potentially sensitive contexts ranging from password frequency lists to social networks. Before we proceed to the problem formulation and results, we first provide an overview of the various contexts where anonymized histograms have been studied under differential privacy and their motivation.

Password frequency lists: Anonymized password histograms are useful to security researchers who wish to understand the underlying password distribution in order to estimate the security risks or evaluate various password defenses [bonneau2012science, blocki2016differentially]. For example, if n(i) is the ith most frequent password, then λβ=i=1βn(i) is the number of accounts that an adversary could compromise with β guesses per user. Hence, if the server changes the k-strikes policy from 3 to 5, the frequency distribution can be used to evaluate the security implications of this change. We refer readers to [blocki2016differentially, blocki2018economics] for more uses of password frequency lists. Despite their usefulness, organizations may be wary of publishing these lists due to privacy concerns. This is further justified as it is not unreasonable to expect that an adversary will have some side information based on attacks against other organizations. Motivated by this, [blocki2016differentially] studied the problem of releasing anonymized password histograms.

Degree-distributions in social networks: Degree distributions is one of the most widely studied properties of a graph as it influences the structure of the graph. Degree distribution can also be used to estimate linear queries on degree distributions such as number of k-stars. However, some graphs may have unique degree distributions and releasing exact degree distributions is no more safer than naive anonymization, which can leave social network participants vulnerable to a variety of attacks [backstrom2007wherefore, hay2008resisting, narayanan2009anonymizing]. Thus releasing them exactly can be revealing. Hence, [hay2009accurate, hay2010boosting, karwa2012differentially, kasiviswanathan2013analyzing, raskhodnikova2015efficient, blockidifferentially2, day2016publishing] considered the problem of releasing degree distributions of graphs with differential privacy. Degree distributions are anonymized histograms over the graph node degrees.

Estimating symmetric properties of discrete distributions: Let k=def|𝒳|. A discrete distribution p is a mapping from a domain 𝒳 to [0,1]k such that xpx=1. Given a discrete distribution p over k symbols, a symmetric property is a property that depends only on the multiset of probabilities [valiant2011power, acharya2017unified], e.g., entropy ( x𝒳pxlog1px). Other symmetric properties include support size, Rényi entropy, distance to uniformity, and support coverage. Given independent samples from an unknown p, the goal of property estimation is to estimate the value of the symmetric property of interest for p. Estimating symmetric properties from unknown distributions has received a wide attention in the recent past e.g., [valiant2011estimating, valiant2011power, jiao2015minimax, wu2016minimax, orlitsky2016optimal, acharya2017unified] and has applications in various fields from neuro-science [paninski2003estimation] to genetics [zou2016quantifying]. Recently, [acharya2018inspectre] proposed algorithms to estimate support size, support coverage and entropy with differential privacy. Optimal estimators for symmetric properties only depend on the anonymized histograms of the samples [batu2000testing, acharya2017unified]. Hence, releasing anonymous histograms with differential privacy would simultaneously yield differentially-private plug-in estimators for all symmetric properties.

2 Differential privacy

2.1 Definitions

Before we outline our results, we first define the privacy and utility aspects of anonymized histograms. Privacy has been studied extensively in statistics and computer science [dalenius1977towards, adam1989security, agrawal2001design, dwork2008differential] and references therein. Perhaps the most studied form of privacy is differential privacy (DP) [dwork2006calibrating, dwork2014algorithmic], where the objective is to ensure that an adversary would not infer whether a user is present in the dataset or not.

We study the problem of releasing anonymized histograms via the lens of global-DP. We begin by defining the notion of DP. Formally, given a set of datasets and a notion of neighboring datasets 𝒩×, and a query function f:𝒴, for some domain 𝒴, then a mechanism :𝒴𝒪 is said to be ϵ-DP, if for any two neighboring datasets (h1,h2)𝒩, and all S𝒪,

Pr((f(h1))S)eϵPr((f(h2))S). (1)

Broadly-speaking ϵ-DP ensures that given the output, an attacker would not be able to differentiate between any two neighboring datasets. ϵ-DP is also called pure-DP and provides stricter guarantees than the approximate (ϵ,δ)-DP, where equation (1) needs to hold with probability 1-δ.

Since introduction, DP has been studied extensively in various applications from dataset release to learning machine learning models [abadi2016deep]. It has also been adapted by industry [erlingsson2014rappor]. There are two models of DP: server or global or output DP, where a centralized entity has access to the entire dataset and answers the queries in a DP manner. The second model is local DP, where ϵ-DP is guaranteed for each individual user’s data [warner1965randomized, kasiviswanathan2011can, duchi2013local, kairouz2014extremal, acharya2018communication]. We study the problem of releasing anonymized histograms under global-DP. Here is the set of anonymized histograms, f is the identity mapping, and 𝒪=.

2.2 Distance measure

For DP, a general notion of neighbors is as follows. Two datasets are neighbors if and only if one can be obtained from another by adding or removing a user [dwork2008differential]. Since, anonymized histograms do not contain explicit user information, we need few definitions to apply the above notion. We first define a notion of distance between label-count datasets. A natural notion of distance between datasets D1 and D2 over 𝒳 is the 1 distance, 1(D1,D2)=defx𝒳|nx(D1)-nx(D2)|, where nx(D) is the count of x in dataset D. Since anonymized histograms do not contain any information about labels, we define distance between two histograms h1,h2 as

1(h1,h2)=defminD1,D2:h(D1)=h1,h(D2)=h21(D1,D2). (2)

The following simple lemma characterizes the above distance in terms of counts.

Lemma 1 (Appendix B).

For an anonymized histogram h={nx}, let n(i) be the i𝑡ℎ highest count in the dataset.22 2 For i larger than number of counts in h, n(i)=0. For any two anonymized histograms h1,h2,

1(h1,h2)=i1|n(i)(h1)-n(i)(h2)|.

The above distance is also referred to as sorted 1 distance or earth-mover’s distance. With the above definition of distance, we can define neighbors as follows.

Definition 1.

Two anonymized histograms h and h are neighbors if and only if 1(h,h)=1.

The above definition of neighboring histograms is same as the definition of neighbors in the previous works on anonymized histograms [hay2010boosting, blocki2016differentially].

3 Previous and new results

3.1 Anonymized histogram estimation

Similar to previous works [blocki2016differentially], we measure the utility of the algorithm in terms of the number of items in the anonymized histogram, n=defnxhnx=r1φr(h)r.

Previous results: The problem of releasing anonymized histograms was first studied by [hay2009accurate, hay2010boosting] in the context of degree distributions of graphs. They showed that adding Laplace noise to each count, followed by a post-processing isotonic regression step results in a histogram H with expected sorted-22 error of

𝔼[22(h,H)]=𝔼[i1(n(i)(h1)-n(i)(h2))2]=r0𝒪(log3max(φr,1)ϵ2)=𝒪(nϵ2).

Their algorithm runs in time 𝒪(n). The problem was also considered in the context of password frequency lists by [blocki2016differentially]. They observed that an exponential mechanism over integer partitions yields an ϵ-DP algorithm. Based on this, for ϵ=Ω(1/n), they proposed a dynamic programming based relaxation of the exponential mechanism that runs in time 𝒪(n3/2ϵ+nlog1δ) and returns a histogram H such that

1(h,H)=𝒪(n+log1δϵ),

with probability 1-δ. Furthermore, the relaxed mechanism is (ϵ,δ)-DP.

The best information-theoretic lower bound for the 1 utility of any ϵ-DP mechanism is due to [alda2018lower], who showed that for ϵΩ(1/n), any ϵ-DP mechanism has expected 1 error of Ω(n/ϵ) for some dataset.

New results: Following [blocki2016differentially], we study the problem in 1 metric. We propose a new DP mechanism PrivHist that satisfies the following:

Theorem 1.

Given a histogram in the prevalence form h={(r,φr):φr>0}, PrivHist returns a histogram H and a sum count N that is ϵ-DP. Furthermore, if ϵ>1, then

𝔼[1(h,H)]=𝒪(ne-cϵ) and 𝔼[|N-n|]e-cϵ

for some constant c>0 and has an expected run time of O~(n). If 1ϵ=Ω(1/n) then,

𝔼[1(h,H)]=𝒪(nϵlog2ϵ) and 𝔼[|N-n|]𝒪(1ϵ),

and has an expected run time of O~(nϵ+1ϵ).

Together with the lower bound of [alda2018lower], this settles the optimal privacy utility trade-off for ϵ[Ω(1/n),1] up to a multiplicative factor of 𝒪(log(2/ϵ)). We also show that PrivHist is near-optimal for ϵ>1, by showing the following lower bound.

Theorem 2 (Appendix E).

For a given n, let H={h:nrrφr(h)n+1}. For any ϵ-DP mechanism M, there exists a histogram hH, such that

𝔼[1(h,(h))]Ω(ne-2ϵ).

Theorems 1 and 2 together with [alda2018lower] show that the the proposed mechanism has near-optimal utility for all ϵ=Ω(1/n). We can infer the number of items in the dataset by rrφr(H). However, this estimate is very noisy. Hence, we also return the sum of counts N as it is useful for applications in symmetric property estimation for distributions. Apart from the near-optimal privacy-utility trade-off, we also show that PrivHist has several other useful properties.

Time complexity: By the Hardy-Ramanujan integer partition theorem [hardy1918asymptotic], the number of anonymized histograms with n items is eΘ(n). Hence, we can succinctly represent them using 𝒪(n) space. Recall that any anonymized histogram can be written as {(r,φr):φr>0}, where φr is the number of symbols with count r. Let t be the number of distinct counts and let r1,r2,,rt be the distinct counts with non-zero prevalences. Then rii and

n=i=1triφrii=1trii=1tit22,

and hence there are at most t2n non-zero prevalences and h can be represented as {(r,φr):φr>0} using 𝒪(n) count-prevalence pairs. Histograms are often stored in this format for space efficiency e.g., password frequency lists in [bonneau_2015]. PrivHist takes advantage of this succinct representation. Hence, given such a succinct representation, it runs time 𝒪(n) as opposed to the 𝒪(n) running time of [hay2009accurate] and 𝒪(n3/2) running time of  [blocki2016differentially]. This is highly advantageous for large datasets such as password frequency lists with n=70M data points [blocki2016differentially].

Pure vs approximate differential privacy: The only previous known algorithm with 1 utility of 𝒪(n) is that of [blocki2016differentially] and it runs in time 𝒪(n3/2). However, their algorithm is (ϵ,δ)-approximate DP which is strictly weaker than PrivHist, whose output is ϵ-DP. For applications in social networks it is desirable to have group privacy for large groups [dwork2014algorithmic]. For groups of size k, (ϵ,δ) approximate DP, scales as (kϵ,ekϵδ)-DP, which can be prohibitive for large values of k. Hence ϵ-DP is preferable.

Applications to symmetric property estimation: We show that the output of PrivHist can be directly applied to obtain near-optimal sample complexity algorithms for discrete distribution symmetric property estimation.

3.2 Symmetric property estimation of discrete distributions

For a symmetric property f and an estimator f^ that uses n samples, let (f^,n) be an upper bound on the worst expected error over all distributions p with support at most k, (f^,n)=defmaxpΔk𝔼[|f(p)-f^(Xn)|] . Let sample complexity n(f,α) denote the minimum number of samples such that (f^,n)α,

n(f,α)=defmin{n:(f^,n)α}.

Given samples Xn=defX1,X2,,Xn, let h(Xn) denote the corresponding anonymous histogram. For a symmetric property f, linear estimators of the form

f^(h(X))=defr1f(r,n)φr(h(Xn),

are shown to be sample-optimal for symmetric properties such as entropy [wu2016minimax], support size [valiant2011power, jiao2015minimax], support coverage [orlitsky2016optimal], and Rényi entropy [acharya2014complexity], where f(r,n)s are some distribution-independent coefficients that depend on the property f. Recently, [acharya2018inspectre] showed that for any given property such as entropy or support size, one can construct DP estimators by adding Laplace noise to the non-private estimator. They further showed that this approach is information theoretically near-optimal.

Instead of just computing a DP estimate for a given property, the output of PrivHist can be directly used to estimate any symmetric property. By the post-processing lemma [dwork2014algorithmic], since the output of PrivHist is DP, the estimate is also DP. For an estimator f^, let Lf^n be the Lipschitz constant given by Lf^n=defmax(f(1,n),maxr1|f(r,n)-f(r+1,n)|). If instead of h(Xn), a DP histogram H and the sum of counts N is available, then f^ can be modified as

f^dp=defr1f(r,N)φr(H),

which is differentially private. Using Theorem 1, we show that:

Corollary 1 (Appendix F).

Let f^ satisfy Lf^nnβ-1, for a β0.5. Further, let there exists E such that |E(f^,n)-E(f^,n+1)|nβ-1. Let fmax=maxpΔkf(p). If n(f^,α) is the sample complexity of estimator f^, then for ϵ>1

n(f^dp,2α)max(n(f^,α),𝒪((1αecϵ)21-2β+1ϵlogfmaxα)).

for some constant c>0. For Ω(1/n)ϵ1,

n(f^dp,2α)max(n(f^,α),𝒪((log(2/ϵ)α2ϵ)11-2β+1ϵlogfmaxαϵ)).

Further, by the post-processing lemma, f^dp is also ϵ-DP.

For entropy (-xpxlogpx), normalized support size (x𝟏px>1/k/k), and normalized support coverage, there exists sample-optimal linear estimators with β<0.1 and have the property |(f^,n)-(f^,n+1)|(f^,n)nβ-1 [acharya2017unified, acharya2018inspectre]. Hence the sample complexity of the proposed algorithm increases at most by a polynomial in 1/ϵα. Furthermore, the increase is dependent on the maximum value of the function for distributions of interest and it does not explicitly depend on the support size. This result is slightly worse than the property specific results of [acharya2018inspectre] in terms of dependence on ϵ and α. In particular, for entropy estimation, the main term in our privacy cost is 𝒪~((1/α2ϵ)11-2β) and the bound of [acharya2018inspectre] is 𝒪(1/(αϵ)1+β). Thus for β=0.1, our dependence on ϵ and α is slightly worse. However, we note that our results are more general in that H can be used with any linear estimator. For example, our algorithm implies DP algorithms for estimating distance to uniformity, which have been not been studied before. Furthermore, PrivHist can also be combined with the maximum likelihood estimators of [orlitsky2004modeling, orlitsky2016optimal] and linear programming estimators of [valiant2011estimating], however we do not provide any theoretical guarantees for these combined algorithms.

4 PrivHist

In the algorithm description and analysis, let x¯ denote the vector x and let φr+(h)=defsrφs(h) denote the cumulative prevalences. Since, anonymized histograms are multisets, we can define the sum of two anonymized histograms as follows: for two histograms h1,h2, the sum h=h1+h2 is given by φr(h)=φr(h1)+φr(h2),r. Furthermore, since there is a one-to-one mapping between histograms in count form h={n(i)} and in prevalence form h={(r,φr):φr>0}, we use both interchangeably. For the ease of analysis, we also use the notation of improper histogram, where the φr’s can be negative or non-integers. Finally, for a histogram ha indexed by super-script a , we define φa=defφ(ha) for the ease of notation.

4.1 Approach

Instead of describing the technicalities involved in the algorithm directly, we first motivate the algorithm with few incorrect or high-error algorithms. Before we proceed recall that histograms can be written either in terms of prevalences φr or in terms of sorted counts n(i).

An incorrect algorithm: A naive tendency would be to just add noise only to non-zero prevalences or counts. However, this is not differentially private. For example, consider two neighboring histograms in prevalence format, h={φ1=2} and h={φ1=1,φ2=1}. The resulting outputs for the above two inputs would be very different as the output of h never produces a non-zero φ2, whereas the output of h produces non-zero φ2 with high probability. Similar counter examples can be shown for sorted counts.

A high-error algorithm: Instead of adding noise to non-zero counts or prevalences, one can add noise to all the counts or prevalences. It can be shown that adding noise to all the counts (including those appeared zero times), yields a 1 error 𝒪(n/ϵ), whereas adding noise to prevalences can yield an 1 error of 𝒪(n2/ϵ), if we naively use the utility bound in terms of prevalences (3). We note that [hay2009accurate] showed that a post-processing step after adding noise to sorted-counts and improves the 2 utility. A naive application of Cauchy-Schwarz inequality yields an 1 error of n3/4/ϵ for that algorithm. While it might be possible to improve the dependence on n by a tighter analysis, it is not clear if the dependence on ϵ can be improved.

The algorithm is given in PrivHist. After some computation, it calls two sub-routines PrivHist-LowPrivacy and PrivHist-HighPrivacy depending on the value of ϵ. PrivHist has two main new ideas: (i) splitting r around n and using prevalences in one regime and counts in another and (ii) the smoothing technique used to zero out the prevalence vector. Of the two (i) is crucial for the computational complexity of the algorithm and (ii) is crucial in improving the ϵ-dependence from 1/ϵ to 1/ϵ in the high privacy regime (ϵ1). There are more subtle differences such as using cumulative prevalences instead of actual prevalences. We highlight them in the next section. We now overview our algorithm for low and high privacy regimes separately.

4.2 Low privacy regime

We first consider the problem in the low privacy regime when ϵ>1. We make few observations.

Geometric mechanism vs Laplace mechanism: For obtaining DP output of integer data, one can add either Laplace noise or geometric noise [ghosh2012universally]. For ϵ-DP, the expected 1 noise added by Laplace mechanism is 1/ϵ, which strictly larger than that of the geometric mechanism (2e-ϵ/(1-e-2ϵ)) (see Appendix A). For ϵ>1, we use the geometric mechanism to obtain optimal trade off in terms of ϵ.

Prevalences vs counts: If we add noise to each coordinate of a d-dimensional vector, the total amount of noise in 1 norm scales linearly in d, hence it is better to add noise to a small dimensional vector. In the worst case, both prevalences and counts can be an n-dimensional vector. Hence, we propose to use prevalences for small values of rn, and use counts for r>n. This ensures that the dimensionality of vectors to which we add noise is at most 2n.

Cumulative prevalences vs prevalences: The 1 error can be bounded in terms of prevalences as follows. See Appendix B for a proof.

1(h1,h2)r1|φr+(h1)-φr+(h2)|r1r|φr(h1)-φr(h2)|, (3)

If we add noise to prevalences, the 1 error can be very high as noise is multiplied with the corresponding count r (3) . The bound in terms of cumulative prevalences φ+ is much tighter. Hence, for small values of r, we use cumulative prevalences instead of prevalences themselves.

The above observations provide an algorithm for the low privacy regime. However, there are few technical difficulties. For example, if we split the counts at a threshold naively, then it is not differentially private. We now describe each of the steps in high-privacy regime and how we overcome these technical difficulties.

(1) Find n: To divide the histogram into two smaller histograms, we need to know n, which may not be available. Hence, we allot ϵ1 privacy cost to find a DP value of n.

(2) Sensitivity preserving histogram split: If we divide the histogram into two parts based on counts naively and analyze the privacy costs independently for the higher and smaller parts separately, then the sensitivity would be lot higher for certain neighboring histograms. For example, consider two neighboring histograms h1={φT=1,φn-T=1} and h2={φT+1=1,φn-T-1=1}. If we divide h1 in to two parts based on threshold T, say h1s={φT=1} and h1={φn-T=1} and h2s={} and h2={φT+1=1,φn-T-1=1}, then 1(h1,h2)=T+2. Thus, the 1 distance between neighboring separated histograms 1(h1,h2), 1(h1s,h2s) would be much higher compared to 1(h2,h2) and we need to add a lot of noise. Therefore, we perturb φT and φT+1 using geometric noise. This ensures DP in instances where the neighboring histograms differ at φT and φT+1, and doesn’t change the privacy analysis for other types of histograms. However, adding noise may make the histogram improper as φT can become negative. To this end, we add M fake counts at T and T+1 to ensure that the histogram is proper with high probability. We remove them later in (L4). We refer readers to Appendix C.1.1 for details about this step.

(3,4) Add noise: Let Hbs (small counts) and Hb (large counts) be the split-histograms. We add noise to cumulative prevalences in Hbs and counts in Hb as described in the overview of the proposed algorithm.

(L1, L2) Post-processing: The noisy versions of φr+ may not satisfy the properties satisfied by the histograms i.e., φr+φ(r+1)+. To overcome this, we run isotonic regression over noisy φr+ subject to the monotonicity constraints i.e., given noisy counts φr+, find φr+mon that minimizes rT(φr+-φr+mon)2, subject to the constraint that φr+monφ(r+1)+mon, for all rT. Isotonic regression in one dimension can be run linear in the number of inputs using Pool Adjacent Violators Algorithm (PAVA) or its variants [barlow1972statistical, mair2009isotone]. Hence, the time complexity of this algorithm is 𝒪(T)n. We then round the prevalences to the nearest non-negative integers. We similarly post-process large counts and remove the fake counts that we introduced in step (2).

Since we used succinct representation of histograms, used prevalences for r smaller than 𝒪(n) and counts otherwise, the expected run time of the algorithm is 𝒪(n) for ϵ>1.

4.3 High privacy regime

For the high-privacy regime, when ϵ1, all known previous algorithms achieve an error of 1/ϵ. To reduce the error from 1/ϵ to 1/ϵ, we use smoothing techniques to reduce the sensitivity and hence reduce the amount of added noise.

Smoothing method: Recall that the amount of noise added to a vector depends on its dimensionality. Since prevalences have length n, the amount of 1 noise would be 𝒪(n/ϵ). To improve on this, we first smooth the input prevalence vector such that it is non-zero only for few values of r and show that the smoothing method reduces the sensitivity of cumulative prevalences and hence reduces the amount of noise added.

While applying smoothing is the core idea, two questions remain: how to select the location of non-zero values and how to smooth to reduce the sensitivity? We now describe these technical details.

(H1) Approximate high prevalences: Recall that N was obtained by adding geometric noise to n. In the rare case that this geometric noise is very negative, then there can be prevalences much larger than 2N. This can affect the smoothing step. To overcome this, we move all counts above 2N to N. Since this changes the histogram with low probability, it does not affect the 1 error.

(H2) Compute boundaries: We find a set of boundaries S and smooth counts to elements in S. Ideally we would like to ensure that there is a boundary that exists close to every count. For small values of r, we ensure this by adding all the counts and hence there is no smoothing. If rn, we use boundaries that are uniform in the log-count space. However, using this technique for all values of r, results in an additional logn factor. To overcome this, for rn, we use the noisy large counts in step (4) to find the boundaries and ensure that there is a boundary close to every count.

(H3) Smooth prevalences: For a r that lies between two boundaries si and si+1, we divide φr into φsi and φsi+1 as follows. We assign si+1-rsi+1-si fraction of φr to φsi and the remaining to φsi+1. If two neighboring histograms differ in φr and φr+1, then after smoothing, φsi and φsi+1 differ by 1/(si+1-si). Hence the sensitivity goes down to 1/(si+1-si) and we can add less noise. We refer readers to Appendix C.1.2 for details about this step.

(H4) Add small noise: Since the prevalences are smoothed, we add small amount of noise to the corresponding cumulative prevalences. For φsi+, we add L(1/(si-si-1)ϵ) to obtain DP.

(H5) Post-processing: Finally, we post-process the prevalences similar to (L1) to impose monotonicity and ensure that the resulting prevalences are positive and non-negative integers.

Since we used succinct histogram representation, ensured that the size of S is small, and used counts larger than 𝒪(nϵ) to find boundaries, the expected run time is 𝒪(nϵ+lognϵ) for ϵ1.

5 Acknowledgments

Authors thank Jayadev Acharya and Alex Kulesza for helpful comments and discussions.

Algorithm PrivHist Input: anonymized histogram h in terms of prevalences i.e., {(r,φr):φr>0}, privacy cost ϵ.
Parameters: ϵ1=ϵ2=ϵ3=ϵ/3.
Output: DP anonymized histogram H and N (an estimate of n).
1. DP value of the total sum: N=max(nxhnx+Za,0), where ZaG(e-ϵ1). If N=0, output empty histogram and N. Otherwise continue. 2. Split h: Let T=Nmin(ϵ,1) and M=max(2logNeϵ2,1)ϵ2. (a) Ha:φTa=φT+M,φT+1a=φT+1+M and r{T,T+1}, φra=φr. (b) Hb:φT+1b=φT+1a+Zb,φTb=φTb-Zb and r{T,T+1} φrb=φra, where ZbG(e-ϵ2). (c) Divide Hb into two histograms Hbs and Hb. For all rT+1, φrb=max(0,s=T+1rφrb-s=T+1r-1φrb) for all rT φrbs=max(0,s=rTφrb-s=r+1Tφrb). 3. DP value of Hbs. Let ZrcsG(e-ϵ2) i.i.d. and Hcs be φr+cs=φr+bs+Zrcs for rT. 4. DP value of Hb: Let ZicG(e-ϵ2) i.i.d. and Hc be Nic=N(i)b+Zic for N(i)Hb. 5. If ϵ>1, output PrivHist-LowPrivacy(Hcs,Hc,T,M) and N. 6. If ϵ1, output PrivHist-HighPrivacy(Hc,h,N,ϵ3) and N.

Algorithm PrivHist-LowPrivacy Input: low-count histogram Hcs, high-count histogram Hc,T,M and Output: a histogram H. L1. Post processing of Hcs: (a) Find φ¯mon that minimizes r1(φr+mon-φr+(Hcs))2. with φr+monφ(r+1)+mon,r. (b) Find Hds such that for all r, φr+ds=round(max(φr+mon,0)). L2. Post processing of Hc: Compute Hd={max(Ni(Hc),T),i}. L3. Let Hd=Hds+Hd. L4. Compute He by removing M elements closest to T+1 from Hd and then removing M elements closest to T and output it.

Algorithm PrivHist-HighPrivacy Input: non-private histogram h, high-count histogram H,T,N,ϵ3 and Output: a histogram H. H1. Approximate higher prevalences: for r<2N, φru=φr(h) and φ2Nu=φ2N+(h). H2. Compute boundaries: Let the set S be defined as follows: (a) T=10N/ϵ33, q=log(1/ϵ3)/Nϵ3 (b) S={1,2,,T}{T(1+q)i:ilog1+q(T/T)}{Nx:NxH,NxT}{2N}. H3. Smooth prevalences: Let si denote the ith smallest element in S. (a) φsiv=j=sisi+1φjusi+1-jsi+1-si+j=si-1si-1φjuj-si-1si-si-1 and if jS, φjv=0. H4. DP value of Hv: for each siS, let φsi+w=φsi+v+Zsi, where ZsiL(1ϵ3(si-si-1)). H5. Find Hx that minimizes siS(φsi+x-φsi+w)2(si-si-1)2 such that φsi+xφsi+1+xi. H6. Return Hy given by, φr+y=round(max(φr+x,0))r.

References

Appendix: Differentially private anonymized histograms

Appendix A Geometric mechanism

The mostly popular mechanism for ϵ-DP is the Laplace mechanism, which is defined as follows.

Definition 2 (Laplace mechanism (L(b)[dwork2014algorithmic]).

When the true query result is f, the mechanism outputs f+Z where Z is a random variable distributed as a Laplace distribution distribution: Pr(Z=z)=12bexp(-|x|b) for every zR. If output of f has sensitivity Δ, then to achieve ϵ-DP add ZL(Δ/ϵ).

Since, we have integer inputs, we use the geometric mechanism:

Definition 3 (Geometric mechanism (G(α)[ghosh2012universally]).

When the true query result is f, the mechanism outputs f+Z where Z is a random variable distributed as a two-sided geometric distribution: Pr(Z=z)=1-α1+αα|z| for every integer z. If output of f is integers and has sensitivity Δ (an integer), then to achieve ϵ-DP add ZG(eϵ/Δ).

[ghosh2012universally] showed that geometric mechanism is universally optimal for a general class of functions under a Bayesian framework. Geometric mechanism is beneficial over Laplace mechanism in two ways: The output space of the mechanism is discrete. Since we have integer inputs, this removes the necessity of adding rounding off the outputs. For ϵ-DP, the expected 1 noise added by the Laplace mechanism is 1/ϵ, which strictly larger than that of the geometric mechanism (2e-ϵ/(1-e-2ϵ)) (see below). For moderate values of ϵ, this difference is a constant. We now state few properties of geometric distribution which are used in the rest of the paper.

We find the following set of equations useful in the rest of the paper. In the following let ZGG(e-ϵ) be a geometric random variable and ZLL(1/ϵ) be a Laplace random variable.

𝔼[ZG] =0=𝔼[ZL].
𝔼[|ZG|] =2e-ϵ1-e-2ϵ1ϵ=𝔼[|ZL|].
𝔼[ZG2] =2e-ϵ(1-e-ϵ)22ϵ2=𝔼[ZL2].

The next lemma bounds moments of max(n+Z,0) when Z is a zero mean random variable.

Lemma 2.

Let Z be a random variable and n0. If Y=max(n+Z,0), then

𝔼[|Y-n|]𝔼|Z|,

and

𝔼[𝟏Y>0Y]1n+𝔼[Z2]2n2.
Proof.

To prove the first inequality, observe that

|Y-n|=|max(Z,-n)||Z|.

Taking expectation yields the first equation. For the second term,

1Y=1n+n-YYn=1n+n-Yn2+(n-Y)2n2Y1n+n-Yn2+(n-Y)22n2. (4)

Furthermore,

(n-Y)𝟏Y>0=-Z𝟏Y>0=-Z𝟏-Z<n-Z.

Combining the above two equations and using the fact that |Y-n||Z| yields the second equation in the lemma. ∎

Appendix B Properties of the distance metric

Proof of Lemma 1.

Recall that the distance between to histograms is given by

1(h1,h2) =defminD1,D2:h(D1)=h1,h(D2)=h21(D1,D2)=minD1,D2:h(D1)=h1,h(D2)=h2x𝒳|nx(D1)-nx(D2)|.

Let D1* and D2* be the datasets that achieve the minimum above. Consider any two labels x,y such that nx(D1*)ny(D1*). Let D2 be the dataset obtained as follows: ny(D2)=nx(D2*) and nx(D2)=ny(D2*) and for all other z{x,y}, nz(D2)=nz(D2*). Since D2* is the optimum,

1(D1*,D2*)1(D1*,D2).

Expanding both sides and canceling common terms, we get,

|nx(D1*)-nx(D2*)|+|ny(D1*)-ny(D2*)| |nx(D1*)-nx(D2)|+|ny(D1*)-ny(D2)|
|nx(D1*)-ny(D2*)|+|ny(D1*)-nx(D2*)|,

and thus if nx(D1*)ny(D1*), then nx(D2*)ny(D2*). Hence, the label of the ith highest count in both the datasets should be the same and

1(D1*,D2*)=x𝒳|nx(D1*)-nx(D2*)|=i1|n(i)(h1)-n(i)(h2)|.

The distance measure satisfies triangle inequality, i.e., for any three histograms h1,h2, and h3,

1(h1,h2)1(h1,h3)+1(h2,h3).

The proof of the above equation is a simple consequence of Lemma 1 and is omitted. We now show that dividing histograms only increases the distance.

Lemma 3.

If h=h1+h2 and h=h1+h2, then

1(h,h)1(h1,h1)+1(h2,h2).
Proof.

Since the elements in h1+h2 are same as elements in h and elements in h1+h2 are same as elements in h2, there exists a permutation σ such that

1(h1,h1)+1(h2,h2) =i1|n(i)(h1)-n(i)(h1)|+i1|n(i)(h2)-n(i)(h2)|
=i1|n(i)(h)-n(σi)(h)|.

Similar to proof of Lemma 1, it can be shown that the σ that minimizes the above sum is the one that matches ith highest count in h to ith highest count in h and hence

1(h,h)=i1|n(i)(h)-n(i)(h)|i1|n(i)(h)-n(σi)|.

It is useful to have few upper bounds on the 1 distance over histograms.

Lemma 4.

For any two histograms h1,h2,

1(h1,h2)r1rmax(h1,h2)|φr+(h1)-φr+(h2)|r1r|φr(h1)-φr(h2)|, (5)

where rmax(h1,h2) is the maximum r such that φr(h1)+φr(h2)0.

Proof.

We prove the first inequality by induction on rmax(h1,h2). Suppose rmax(h1,h2)=1, then the inequality holds trivially as

i|n(i)(h1)-n(i)(h2)|=|φ1(h1)-φ1(h2)|=r=1rmax(h1,h2)|φr+(h1)-φr+(h2)|.

Now suppose it holds for all rmax(h1,h2)<r0. For r0=defrmax(h1,h2). Let h1 and h2 be two datasets obtained as follows:

hi=max(nx,r0-1):nxhi}.

This mapping preserves the ordering of n(i)s up to ties and rmax(h1,h2)=r0-1. Thus,

1(h1,h2)r=1rmax(h1,h2)|φr+(h1)-φr+(h2)|=r=1rmax(h1,h2)-1|φr+(h1)-φr+(h2)|. (6)

Hence,

i|n(i)(h1)-n(i)(h2)|
=i:max(n(i)(h1),n(i)(h2))<r0|n(i)(h1)-n(i)(h2)|+i:max(n(i)(h1),n(i)(h2))r0|n(i)(h1)-n(i)(h2)|
=i:max(n(i)(h1),n(i)(h2))<r0|n(i)(h1)-n(i)(h2)|
+i:max(n(i)(h1),n(i)(h2))r0|n(i)(h1)-n(i)(h2)+𝟏n(i)(h1)=r0-𝟏n(i)(h2)=r0|
i:max(n(i)(h1),n(i)(h2))<r0|n(i)(h1)-n(i)(h2)|+i:max(n(i)(h1),n(i)(h2))r0|n(i)(h1)-n(i)(h2)|
+i:max(n(i)(h1),n(i)(h2))r0|𝟏n(i)(h1)=r0-𝟏n(i)(h2)=r0|
=1(h1,h2)+|φr0(h1)-φr0(h2)|

Combining the above equation with Equation (6) yields the first inequality. For the second inequality, observe that

r1rmax(h1,h2)|φr+(h1)-φr+(h2)| =r1rmax(h1,h2)|srφs(h1)-φs(h2)|
r1rmax(h1,h2)sr|φs(h1)-φs(h2)|
=s1s|φs(h1)-φs(h2)|,

where the inequality follows by triangle inequality and the last equality follows by observing that each term corresponding to index s appears exactly s times. ∎

We now show a simple property of rounding off integers.

Lemma 5.

Let x1,x2,,xn be integers. Let y1,y2,,yn be real numbers. Let y^i be the nearest integer to yi. Then,

i=1n|xi-y^i|2i=1n|xi-yi|.
Proof.

For any i,

|xi-y^i||xi-yi|+|yi-y^i|2|xi-yi|,

where the second inequality follows from the observation that y^i is the nearest integer to yi. Summing over all indices i yields the lemma. ∎

We need the next auxilllary lemma, which we use in the proofs.

Lemma 6.

For a histogram h1, let h1 be the histogram obtained by adding k elements of value t to h. Let h2 be another histogram and let h2 is obtained by removing k elements that are closest to t. Then

1(h1,h2)21(h1,h2).
Proof.

Let h2′′ be the histogram obtained by adding k elements of value t to h2. Since adding same number of elements to two datasets do not decrease the 1 distance,

1(h1,h2)=1(h1,h2′′)1(h1,h2)+1(h2,h2′′),

where the second inequality follows by triangle inequality. Consider the set of all histograms that have φtk. Both h2′′ and h1 belong to this set. It can be shown that of all histograms in that set h2′′ is closest to h2 and hence

1(h2,h2′′)1(h1,h2),

and hence the lemma. ∎

Appendix C Privacy analysis of PrivHist

C.1 Overview of privacy analysis

We break the analysis of PrivHist step by step. We will show that release of N (step (1)) is ϵ1-DP. Then, we show that Hcs and Hc are ϵ2-DP. Observe that PrivHist-LowPrivacy is just a post-processing step and by the post processing lemma does not need any differentialy privacy analysis. Finally we show that PrivHist-HighPrivacy is ϵ3 differentially private. By the composition theorem [dwork2014algorithmic], it follows that the total privacy cost is ϵ1+ϵ2+ϵ3=ϵ and hence the privacy cost in Theorem 1. Of the above steps, proving N is ϵ1-DP is straightforward and a sketch is in Lemma 7. Proving Hcs and Hc is ϵ2-DP is more involved and is in Lemma 8. The main intuition behind Lemma 8 is a sensitivity preserving histogram split, which we describe below.

C.1.1 Sensitivity preserving histogram split

Any two neighboring datasets h1 and h2 can fall into one of three categories:

  1. 1.

    They differ in φT and φT+1.

  2. 2.

    They differ in φr and φr+1 for some 0r<T-1.

  3. 3.

    They differ in φr and φr+1 for some r>T.

For cases 2 and 3 above, it suffices to add noise to cumulative prevalences and counts as in (3) and (4). However, if they differ in φT and φT+1, the analysis is more involved. For example, consider the following simple example. h1={φT=1,φn-T=1} and h2={φT+1=1,φn-T-1=1}. h1 and h2 have 1 distance of one and are neighbors. If we divide h1 in to two parts based on threshold T, say h1s={φT=1} and h1={φn-T=1} and h2s={} and h2={φT+1=1,φn-T-1=1}, then 1(h1,h2)=T+2. Thus, if we naively add noise to cumulative prevalences for rT and to counts r>T, then we need to add noise L(T/ϵ), which makes the utility of the algorithm much worse. To overcome this, we preprocess h by moving Zb mass from φT to φT+1, where Zb is a Geometric random variable. This provides the required privacy without increasing the utility considerably. Finally, moving mass Zb can make the histogram to have negative prevalences. To overcome this, we add M fake counts to φT and φT+1.

C.1.2 Smoothing method

The main ingredient in proving better utility in the high privacy regime is the smoothing technique, which we describe now with an example. Suppose we know that all histograms have non-zero prevalences only between counts s and s+t and further suppose we have two neighboring histograms h1 and h2 as follows. φr1=1 and φi1=0, for all i[s,s+t]{r} and φr+12=1 and φi1, for all i[s,s+t]{r+1}. If we want to release the prevalences or cumulative prevalences, we add 1 noise of 1/ϵ for each prevalence in [s,s+t]. Thus the total amount of noise would be 𝒪(t/ϵ). We propose to reduce this noise by smoothing prevalences.

For a r[s,s+t] , we divide φr into φs and φs+t as follows. We assign s+t-rt fraction of φr to φs and the remaining to φs+t. After this transformation, the first histogram becomes ht1 given by, φst1=t+s-rt and φst1=rt and all other prevalences are zero. Similarly, the second histogram becomes, ht2 φst2=t+s-r-1t and φst1=r+1t and all other prevalences are zero. Thus the prevalences after smoothing differ only in two locations s and s+t and they differ by at most 1/t. Thus the total amount of noise needed for a DP release is 𝒪(1/tϵ) to these two prevalences. However, note that we also incur a loss as due to smoothing, which can be shown to be 𝒪(t). Hence, the total amount of error would be 𝒪(1/(tϵ)+t). Choosing t=1/ϵ, yields a total error of 𝒪(1/ϵ). The above analysis is for a toy example and extending it to general histograms requires additional work. In particular, we need to find the smoothing boundaries that give the best utility. As described in Section 4.3, we choose boundaries based on logarithmic partition of counts and also by private values of counts. The utility analysis with these choice of boundaries is in Appendix D.2.

C.1.3 Choise of ϵ1,ϵ2,ϵ3

We note that there is no particular reason for ϵ1, ϵ2, and ϵ3 to be equal and we chose those values for simplicity and easy readability. For example, since ϵ1 is just used to estimate n, the analysis of the algorithm shows that ϵ2,ϵ3 affects utility more than ϵ1. Hence, we can set ϵ2=ϵ3=ϵ(1-o(1))/2 and ϵ1=o(ϵ) to get better practical results. Furthermore, for low privacy regime, the algorithm only uses a privacy budget of ϵ1+ϵ2. Hence, we can set ϵ1=o(1), ϵ2=ϵ(1-o(1)), and ϵ3=0.

C.2 Technical details

We first prove a dataset depending composition theorem that helps us decompose differential privacy analysis depending on the dataset.

Theorem 3 (Dataset dependent composition theorem).

Let Z1,Z2,,Zn be a set of independent random variables. Let X1=f1(x,Z1) be a deterministic function. Similarly let Xi=fi(Xi-1,Zi) be deterministic functions for 2in. If for any two neighboring data sets x and x,

mini1maxz1,z2,zi-1,xiPr(Xi=xi|x,z1,z2,zi-1)Pr(Xi=xi|x,z1,z2,zi-1)eϵ,33 3 For notational simplicity, let = z 0 1 . (7)

then Xn is an ϵ-DP output.

Proof.

For any two datasets, since xX1Xn is a Markov chain,

Pr(Xn=xn|x)=xiPr(Xn=xn|Xi=xi)Pr(Xi=xi|x).

Hence for any two datasets x,x, any xn, for all i,

Pr(Xn=xn|x)Pr(Xn=xn|x)=x1Pr(Xn=xn|Xi=xi)Pr(Xi=xi|x)xiPr(Xn=xn|Xi=xi)Pr(Xi=xi|x)maxxiPr(Xi=xi|x)Pr(Xi=xi|x).

Similarly for any i,

Pr(Xi=xi|x)=z1,z2,,zj-1Pr(Xj=xj|x,z1,z2,,zi-1)Pr(z1,z2,zi-1).

Hence for any two datasets x,x,

Pr(Xi=xi|x)Pr(Xi=xi|x) =z1,z2,,zi-1Pr(Xi=xi|x,z1,z2,,zi-1)Pr(z1,z2,zi-1)z1,z2,,zi-1Pr(Xi=xi|x,z1,z2,,zi-1)Pr(z1,z2,zi-1)
maxz1,z2,zi-1Pr(Xi=xi|x,z1,z2,,zi-1)Pr(Xi=xi|x,z1,z2,,zi-1).

Hence,

maxxnPr(Xn=xn|x)Pr(Xn=xn|x) mini1maxxiPr(Xi=xi|x)Pr(Xi=xi|x)
mini1maxximaxz1,z2,zi-1Pr(Xi=xi|x,z1,z2,,zi-1)Pr(Xi=xi|x,z1,z2,,zi-1),

and hence for every pair of datasets if the right hand side is smaller than eϵ, then Xn is diffferentially private. ∎

C.3 Privacy analysis

We start with proving N is ϵ1-DP.

Lemma 7.

N is ϵ1-DP.

Proof sketch.

The proof follows from Definition 3 and the fact that for any two neighboring datasets h1,h2, n(h1)-n(h2)=|nxh1nx-nxh2nx|1(h1,h2)=1, and hence the sensitivity of this query is 1. ∎

We now show that release of Hcs and Hc is DP.

Lemma 8.

Release of Hcs and Hc is ϵ2-DP.

Proof.

hHb(Hbs,Hb)(Hcs,Hb)(Hcs,Hc) is a Markov chain. We use Theorem 3 to show that the output of this Markov chain is DP for all datasets. For any two neighboring datasets h1 and h2 can fall into one of three categories:

  1. 1.

    They differ in φT and φT+1.

  2. 2.

    They differ in φr and φr+1 for some 0r<T-1.

  3. 3.

    They differ in φr and φr+1 for some r>T.

We prove that (Hcs,Hc) release is ϵ2-DP for each of the above three cases.

Case 1: We show that the process hHb satisfies (7). Observe that

maxhbPr(Hb=hb|h2)Pr(Hb=bb|h1) =(a)Pr(φ¯(Hb)=φ¯b|φ¯(h2))Pr(φ¯(Hb)=φ¯b|φ¯(h1))
=(b)Pr(φT(Hb)=φTb,φT+1(Hb)=φT+1b|φT(h2))Pr(φT(Hb)=φTb,φT+1(Hb)=φT+1b|φ¯T(h1))
=(c)Pr(Zb=φTb-φT(h2))Pr(Zb=φTb-φT(h1))
=e-ϵ2|φTb-φT(h2)|e-ϵ2|φTb-φT(h1)|
eϵ2|φT(h2)-φT(h1)|
eϵ2.

(a) follows by observing that there is a one to one correspondence between the histogram and the prevalences, (b) follows from the fact that noise is added only to φT and φT+1, and (c) follows from the fact that the noise added to φT+1 is a deterministic function of φT, and φTb+φT+1b=φTa+φT+1a always.

Case 2: We show that (Hbs,Hb)(Hcs,Hb) satisfies (7). Let H1bs and H2bs be the values of Hbs for inputs h1 and h2 respectively. For any Zb=zb, since difference of maximums is at most maximum of differences,

r|φr+(H2bs)-φr+(H1bs)|r|φr+(h2)-zb-φr+(h1)+zb|=r|φr+(h2)-φr+(h1)|1.

Hence,

Pr((Hcs,Hb)=(hcs,hb)|H2bs,Hb)Pr(Hcs,Hb)=(hcs,hb)|H1bs,Hb) =Pr(Hcs=hcs|H2bs)Pr(Hcs=hcs|H1bs)
=Pr(φ¯r+(Hcs)=φ¯r+(hcs)|φ¯r+(H2bs))Pr(φ¯r+(Hcs)=φ¯r+(hcs)|φ¯r+(H1bs))
=r1Pr(φr+(Hcs)=φr+(hcs)|φ¯r+(H2bs))Pr(φr+(Hcs)=φr+(hcs)|φ¯r+(H1bs))
=r1e-ϵ2|φr+(hcs)-φr+(H2bs)|e-ϵ2|φr+(hcs)-φr+(H1bs)|
r1eϵ2|φr+(H2bs)-φr+(H1bs)|eϵ2.

The first equality follows from the observation that there is a one to one correspondence between φrs and φr+s. The second equation follows from the fact that noise added to various  φr+s are independent of each other. The rest of the proof follows from the definition of Geometric mechanism and the fact that h1 and h2 are neighbors.

Case 3: We show that (Hcs,Hb)(Hcs,Hc) satisfies (7). Let H1b and H2b are the Hb’s corresponding to h1 and h2 respectively. Conditioned on the value of Zb, for any two neighboring histograms that differ in two consecutive r’s that are larger than T,

r>Tφr(H2b)-r>Tφr(H2b)=r>Tφr(h2)-φr(h1)=0.

Since their sums are equal, conditioned on the value of Zb, it can be shown that H1b,H2b are proper histograms and both contain max(0,M+Zb+r>Tφr(h1)) elements and they differ in at most two consecutive values of r. Thus H1b and H2b contain same number of counts and differ in at most one count denoted by i*. With these observations we now bound the ratio of probabilities for differential privacy.

Since there is a one to one correspondence between sorted counts and the histograms, we get

Pr((Hcs,Hc)=(hcs,hc)|H2b,Hcs)Pr(Hcs,Hc)=(hcs,hc)|H1b,Hcs) =Pr(Hc=hc|H2b)Pr(Hc=hc|H1b)
=iPr(Ni=ni|n(i)(H2b))iPr(Ni=ni|n(i)(H2b))
=Pr(Ni*=ni*|n(i*)(H2b))Pr(Ni*=ni*|n(i*)(H1b))
=e-ϵ2|ni*-n(i*)(H2b)|e-ϵ2|ni*-n(i*)(H1b)|eϵ2|n(i*)(H2b)-n(i*)(H1b)|eϵ2,

where the last set of inequalities follow from the definition of Geometric mechanism and the fact that the noise added to N(i)s are independent of each other, and n(i*) is the only count in which the two histograms differ. ∎

We now show that PrivHist-HigPrivacy is ϵ3-DP.

Lemma 9.

PrivHist-HigPrivacy is ϵ3-DP.

Proof.

Observe that hHvHwHxHy is a Markov chain, hence it suffices to prove HvHw is ϵ3-DP.

Let h1 and h2 are two neighboring datasets. Without loss of generality, let they differ in φk and φk+1. Let h1v and h2v be the two histograms obtained after quanitzation step (3). Since h1 and h2 differ only at k and k+1, h1v and h2v differ only in φsi-1v and φsiv for some i. Furthermore,

|φsi-1v(h1v)-φsi-1v(h2v)|=|φsiv(h1v)-φsiv(h2v)|1si-si-1.

For all j{si-1,si}, φjv(h1v)=φjv(h2v). Hence φsi+v(h1v)φsi+v(h2v) for only one value of i, let i* be this value.

Pr(Hw=hw|h1v)Pr(Hw=hw|h2v) =iPr(φsi+(Hw)=φsi+(hw)|φsi+(h1v))Pr(φsi+(Hw)=φsi+(hw)|φsi+(h2v))
=Pr(φsi*+(Hw)=φsi*+(hw)|φsi*+(h1v))Pr(φsi*+(Hw)=φsi*+(hw)|φsi*+(h2v))
=Pr(Zsi*=φsi*+(hw)-φsi*+(h1v))Pr(Zsi*=φsi*+(hw)-φsi*+(h2v))
exp(ϵ3(si*-si*-1)si*-si*-1)eϵ3.

Appendix D Utility analysis of PrivHist

Let Ho be the output of either PrivHist-LowPrivacy or PrivHist-HighPrivacy. In both the low and high privacy regimes, the output error can be bounded as

1(h,Ho)𝟏N>0+n𝟏N0.

Furthermore,

𝔼[n𝟏N0]ne-nϵ1e-ϵ1/2,44 4 We use instead of O notation and instead of Ω notation for compactness.

where the last inequality, follows by breaking it in to cases n=0 and n>0. The bound on 𝔼[|N-n|] follows from the fact that N=n+G(e-ϵ1), Lemma 2, and the moments of the geometric distribution. In the next two sections, we bound 1(h,Ho)𝟏N>0 for both low privacy and high privacy regimes.

D.1 Low privacy regime

In the following analysis, let Z be a geometric random variable distributed as G(e-ϵ2). Let Hb=Hbs+Hb. For the bounds on the histogram, by Lemma 6 and triangle inequality,

1(h,He)41(Ha,Hd)41(Hb,Hd)+41(Hb,Ha). (8)

For the second term, observe that since Hb is obtained by moving Zb terms between T and T+1 and then majorizing it. If |Zb|M, then majorization does not change the histogram. Hence,

1(Hb,Ha)=1(Hb,Ha)𝟏|Zb|M+1(Hb,Ha)𝟏|Zb|>M|Zb|+(n+2M(T+1))𝟏|Zb|>M.

Taking expectation on both sides

𝔼N[1(Hb,Ha)]𝔼[|Zb|+(n+2M(T+1))𝟏|Zb|>M]𝔼[Z]+n+NN2e-2ϵ2.55 5 E N denotes conditional expectation w.r.t. N (9)

For the first term, by Lemma 3,

1(Hb,Hd)1(Hbs,Hds)+1(Hb,Hd). (10)

We now bound both the terms above. For the large counts, let ij be the index of the noisy count N(i)(Hb).

1(Hb,Hd) i|N(i)(Hb)-N(i)(Hd)|
i|N(i)(Hb)-Nij(Hd)|
i|N(i)(Hb)-Nij(Hc)|.

Since the number of terms above T is at most n/T+M+Zb, in expectation,

𝔼N[1(Hb,Hd)]𝔼[i|Zic|](nT+M+𝔼|Z|)𝔼[|Z|]. (11)

For the smaller counts observe that

1(Hbs,Hds) r0|φr+bs-φr+ds|
(a)2r0|φr+bs-φr+mon|
(b)2Tr0(φr+bs-φr+mon)2
(c)2Tr0(φr+bs-φr+cs)2
2Tr0(Zrcs-Zb)2, (12)

where (a) follows from the fact that rounding off increases the error at most by 2 (Lemma 5), (b) follows by the Cauchy-Schwarz inequality, and (c) follows from the fact that φr+bs are monotonic and hence monotonic projection only decreases the error. Hence in expectation,

𝔼N[1(Hbs,Hds)]T𝔼[Z2]. (13)

Combining (8), (9), (10), (11), (12), and (13),

𝔼N[1(h,He)]|Z|+4T𝔼[Z2]+2(nT+M+𝔼|Z|)𝔼[|Z|]+n+NN2e-2ϵ2.

Substituting T=N and M=2logNeϵ2ϵ2, yields

𝔼N[1(h,He)]|Z|+N𝔼[Z2]+2(1+nN+logNϵ2ϵ2+𝔼|Z|)𝔼[|Z|]+n+NNe-2ϵ2.

Taking expectation w.r.t. N and using Lemma 2 yields

𝔼[1(h,He)𝟏N>0]n𝔼[Z2]+e-2ϵ2ne-ϵ/6,

where the last inequality follows from moments of geometric distribution.

D.2 High privacy utility

For the high privacy regime, by the triangle inequality,

1(h,Hy)1(h,Hu)+1(Hu,Hy). (14)

For the first term, since we are only reducing counts of certain elements,

𝔼[1(h,Hu)]𝔼[𝟏N<n/2n]e-nϵ1/2n1ϵ1. (15)

The second term can be bounded as

𝔼[1(Hu,Hy)] r>0|φr+(Hu)-φr+(Hy)|
(a)2r>0|φr+(Hu)-φr+(Hx)|
2r>0|φr+(Hu)-φr+(Hv)|+|φr+(Hx)-φr+(Hv)|, (16)

where (a) follows by Lemma 5 and the last inequality follows by the triangle inequality. The first term in the last equation corresponds to the smoothing error and we analyze it now.

φsi+1+v=r=sisi+1φru(r-si)si+1-si+r>si+1φru.

Since φjv=0 for jS,

j=sisi+1-1|φj+v-φj+u| =j=sisi+1-1|r=sisi+1φru(r-si)si+1-si-r=jsi+1φru|
j=sisi+1r=sisi+1φru|(r-si)si+1-si-𝟏rj|
=r=sisi+12φru(r-si)(si+1-r)si+1-si
2r=sisi+1φrumin(si+1-r,r-si).

For rT, that lies between si and si+1 in S,

|r-si|min(T(1+q)i+1-r,r-T(1+q)i)rq.

If rT, the analysis depends on the value of Zb. If Zb-M, rT, and φru>0, then there exists a si that is at most Zicl away for from r. If not, then the error is at most 2n. Hence, the smoothing error in expectation can be bounded by

rTTφruqr+rTφru𝔼N[|Z|𝟏Zb-M]+𝔼N[2n𝟏Zb<-M]nq+n𝔼[|Z|]T+nN. (17)

For the second part,

r1|φr+v-φr+x| (a)siS|φsi+v-φsi+x|(si-si-1)
(b)|S|siS(φsi+v-φsi+x)2(si-si-1)2
(c)|S|siS(φsi+w-φsi+x)2(si-si-1)2,

where (a) follows by observing that φrx=0 for rS, (b) follows from the Cauchy-Schwarz inequality. (c) follows from the fact that projecting on to the simplex only increases the error. By the second moments of Zis, taking expectation on both sides yields

r1𝔼N|φr+v-φr+x||S|1ϵ.

We now bound the size of the set S. By step (2) of the algorithm, number of elements in S can be bounded by

T+1+log1+qTT+10nT+U,

where U is the number of elements less than T/10 in Hb, that upon adding noise increases to T. Expectation of U conditioned on N is

𝔼N[U](nT+M+𝔼[|Z|])e-9Tϵ2/10(nNϵ+1+logNϵ)e-3N/ϵnϵN+1, (18)

where the second inequality follows by substituting T=10N/ϵ33 and third inequality follows by algebraic manipulation. Combining (14), (15), (16), (17), and (18), and using the fact that quantization error is at most 𝒪(n) yields

nq𝟏N>n/2+n𝟏Nn/2+n𝔼[|Z|]T+(T+1+log1+qTT+nT+nϵN)1ϵ+nN
nq+(Nϵ+1qlog2ϵ+nϵ3N+nϵN)1ϵ+nN+n𝟏Nn/2
(N+1+nN)1ϵlog2ϵ+nϵN+nN+n𝟏Nn/2,

where the last equation follows by substituting the value of q=1Nϵ3log1ϵ3. Taking expectation with respect to N and using Lemma 2 and the fact that nϵ1 yields,

𝔼[1(h,Hz)𝟏N>0]nϵlog2ϵ+1ϵ.

D.3 Time complexity

In this section, we provide a proof sketch of the time complexity. Suppose ϵ>1. The number of prevalences in Hbs is TN. Similarly, the number of counts in Hb is n/Tn/N. Further, the number of additional fake counts added is MlogN and the isotonic regression using PAVA takes linear in the size of the input. Hence, the expected run time, conditioned on N is at most N+nN+M, which upon taking expectation w.r.t. N yields an expected run time of n.

If ϵ1, steps (1)-(4) in PrivHist takes time N/ϵ+logN/ϵ. The time complexity to find boundaries, smooth prevalences, add noise, and perform isotonic regression is |S|+n. By the utility analysis, |S|Nlog1ϵ+nϵN. Summing all the time complexities and taking expectation w.r.t. N similar to the utility analysis, yields a total time complexity of 𝒪~(nϵ+lognϵ).

Appendix E Lower bound in the low privacy regime

Lemma 10.

Let x{0,1}k. Suppose two vectors x,y are neighbors if ||x-y||11. If X^=M(x) be an ϵ-DP estimate of x, then

maxx{0,1}k𝔼[||x-X^||1]ke-ϵ.
Proof.

Let p(x) be the uniform distribution over {0,1}k. For a vector v{0,1}k-1, let 𝒳vi denote the two vectors such that x1i-1=v1i-1 and xi+1k=vik-1. Then

maxx{0,1}k𝔼[||x-X^||1] 12kx{0,1}k𝔼[||x-X^||1]
=12kx{0,1}ki=1k𝔼|xi-X^i|
=i=1k12kx{0,1}k𝔼|xi-X^i|
=i=1k12k-1v{0,1}k-112x𝒳vi𝔼|xi-X^i|. (19)

For any v and i, vectors in 𝒳vi are neighbors and hence by the definition of DP,

x𝒳vi𝔼|xi-X^i| =x^iPr(Xi^=x^i|xi=1)|1-x^i|+Pr(Xi^=x^i|xi=0)|0-x^i|
x^iPr(Xi^=x^i|xi=1)|1-x^i|+e-ϵPr(Xi^=x^i|xi=1)|0-x^i|
e-ϵx^iPr(Xi^=x^i|xi=1)
e-ϵ.

Substituting the above lower bound in (19) yields the lemma. ∎

We now use the above bound to prove Theorem 2. We first define a of histograms to show the lower bound. Let k=n/10. For a given vector x{0,1}k, let φ4i=φ4i+3=xi-1 and φ4i+1=φ4i+2=1-xi-1, φr=1 for r=n-i=14kφrr, and φr=0 otherwise. Observe that r1φrr=n and are valid histograms. If two vectors x and y have hamming distance d, then the corresponding distance between anonymized histograms is 2d.

Consider a slightly different definition of neighboring datasets over histograms, where two datasets are neighboring if the neighboring datasets are distance 2 apart. If a mechanism is ϵ-DP in the previous definition of neighboring datasets, then the mechanism is 2ϵ-DP in the new notion of neighbors.

One mechanism for releasing the vectors x{0,1}k with 2ϵ-DP is to encode it as the histograms as mentioned above and release them. Since such a mechanism has hamming distance ke-2ϵ, the anonymized histograms cannot be estimated with accuracy ke-2ϵ with 2ϵ-DP under new definition of neighbors and hence it cannot be estimated with accuracy ke-2ϵ with ϵ-DP under the old definition of neighbors.

Appendix F Proof of Corollary 1

Recall that N is the DP estimate of n and H is the DP estimate of h. In the following Let Xn be the initial set of n samples. Let XN be N samples obtained from p,Xn as follows. If N<n, obtain XN by removing n-N samples uniformly from Xn. If Nn, add N-n samples from p to Xn. Note that XN are N i.i.d. samples from p. We bound the error of the estimator as follows.

f(p)-f^p=(f(p)-f^p)𝟏N>0+(f(p)-f^p)𝟏N=0.

Taking expectation on the second term,

𝔼[(f(p)-f^p)𝟏N=0]=𝔼[f(p)𝟏N=0]f(p)e-ncϵ,

for some constant c. For the case N>0, we bound as follows.

f(p)-f^p
=f(p)-r1f(r,N)φr(H)
=f(p)-r1f(r,N)φr(h(XN))+r1f(r,N)φr(h(XN))-r1f(r,N)φr(h(Xn))
+r1f(r,N)φr(h(Xn))-r1f(r,N)φr(H).

We bound each of the three (difference) terms above. First observe that by the assumptions in the theorem:

𝔼N[|f(p)-r1f(r,N)φr(h(XN))|](f^,N)(f^,n)+|N-n|NβN.

Since XN and Xn differ in at most N-n terms, by the properties of sorted 1 distance,

r1f(r,N)φr(h(XN))-r1f(r,N)φr(h(Xn))|N-n|maxr|f(r,N)||N-n|NβN.

Further by the properties of sorted 1 distance,

𝔼N[|r1f(r,N)φr(h(Xn))-r1f(r,N)φr(H)|] maxr|f(r,N)|𝔼N[1(h,H)]
NβN𝔼N[1(h,H)].

Summing all the three terms, we get the error is at most

(f^,n)+|N-n|NβN+NβN𝔼N[1(h,H)]. (20)

For ϵ>1, difference between expected error and (f^,n) is at most

𝔼[NβN|N-n|𝟏N>0+NβN(N+nN)𝟏N>0]e-cϵ
𝔼[NβN|N-n|𝟏N>n/2+NβN(N+nN)𝟏N>n/2]e-cϵ+𝔼[n𝟏n/2N>0]e-cϵ
𝔼[nβn|N-n|+nβn]e-cϵ+𝔼[n𝟏n/2N>0]e-cϵ
nβ-1/2e-cϵ+ne-ncϵ
nβ-1/2e-cϵ,

where the last inequality uses the fact that ϵ>1. Combining the result with the case N=0, we get that

𝔼[|f(p)-f^p|](f^,n)+nβ-1/2e-cϵ+f(p)e-nϵ.

The result for ϵ>1 follows if

nmax(n(f^,α),𝒪((1αecϵ)21-2β+1ϵlogfmaxα)).

For Ω(1/n)ϵ1, by (20), the difference between expected error and (f^,n) is at most

𝔼[NβN(|N-n|+N1ϵlog2ϵ+nϵN+nN)𝟏N>0]
𝔼[NβN(|N-n|+N1ϵlog2ϵ+nϵN+nN)𝟏N>n/2]+𝔼[(1ϵlog2ϵ+n)𝟏n/2N>0]
𝔼[nβn|N-n|+nβn(n1ϵlog2ϵ)]+𝔼[(1ϵlog2ϵ+n)𝟏n/2N>0]
nβnϵ+nβn1ϵlog2ϵ+(1ϵlog2ϵ+n)e-ncϵ
nβn1ϵlog2ϵ+ne-ncϵ.

Combining with the result with the case, N=0, we get

𝔼[|f(p)-f^p|](f^,n)+nβ-1/21ϵlog2ϵ+(n+f(p))e-ncϵ.

The result for Ω(1/n)ϵ<1 follows if

nmax(n(f^,α),𝒪((log(2/ϵ)αϵ)21-2β+1ϵlogfmaxαϵ)).