Communication and Memory Efficient Testing of Discrete Distributions

  • 2019-06-11 17:26:21
  • Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, Sankeerth Rao
  • 1

Abstract

We study distribution testing with communication and memory constraints inthe following computational models: (1) The {\em one-pass streaming model}where the goal is to minimize the sample complexity of the protocol subject toa memory constraint, and (2) A {\em distributed model} where the data samplesreside at multiple machines and the goal is to minimize the communication costof the protocol. In both these models, we provide efficient algorithms foruniformity/identity testing (goodness of fit) and closeness testing (two sampletesting). Moreover, we show nearly-tight lower bounds on (1) the samplecomplexity of any one-pass streaming tester for uniformity, subject to thememory constraint, and (2) the communication cost of any uniformity testingprotocol, in a restricted `one-pass' model of communication.

 

Quick Read (beta)

[

\NameIlias Diakonikolas \Email[email protected]
\addr
University of Southern California \AND\NameThemis Gouleakis \Email[email protected]
\addr
Max Planck Institute for Informatics \AND\NameDaniel M. Kane \Email[email protected]
\addr
University of California
   San Diego \AND\NameSankeerth Rao \Email[email protected]
\addr
University of California
   San Diego
Abstract

We study distribution testing with communication and memory constraints in the following computational models: (1) The one-pass streaming model where the goal is to minimize the sample complexity of the protocol subject to a memory constraint, and (2) A distributed model where the data samples reside at multiple machines and the goal is to minimize the communication cost of the protocol. In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). Moreover, we show nearly-tight lower bounds on (1) the sample complexity of any one-pass streaming tester for uniformity, subject to the memory constraint, and (2) the communication cost of any uniformity testing protocol, in a restricted “one-pass” model of communication.

d

Communication and Memory Efficient Testing of Discrete Distributions]Communication and Memory Efficient Testing
of Discrete Distributions

istribution testing, identity testing, closeness testing, communication complexity, streaming

1 Introduction

1.1 Background

Classical statistics theory focuses on characterizing the inherent sample complexity of inference tasks, typically formalized via minimax rates of convergence. Research in this field has primarily focused on understanding the sample complexity of inference in the centralized setting, where all the samples are available to a single machine that performs the computation. We now have a rich theory (see, e.g., [Devroye and Györfi(1985), Devroye and Lugosi(2001), Tsybakov(2008)] for a few books on the topic) that has led to characterizing the sample complexity of a wide range of statistical tasks in this regime.

In modern data analysis, one may have additional constraints on data collection and storage. Modern datasets are often too large to be stored on a single computer, and so it is natural to consider methods that either impose upper bounds on the available memory or involve multiple machines, each containing a small subset of the dataset. Typical examples include anomaly detection in various settings (e.g., inference based on distributed sensor measurements, fraud detection based on different transactions of a customer, deciding whether a region of the sky is interesting based on astronomical data from multiple telescopes, etc.)

In this paper, we study distribution property testing [Batu et al.(2000)Batu, Fortnow, Rubinfeld, Smith, and White] in the following computational models: (1) The one-pass streaming model where the goal is to minimize the sample complexity of the protocol subject to a memory constraint, and (2) A distributed model where the data samples reside at multiple machines, and the goal is to minimize the communication cost of the protocol. In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). Moreover, we show lower bounds (in some cases, nearly-tight) on (1) the sample complexity of any one-pass streaming tester, subject to the memory constraint and (2) the communication cost of any protocol performing the testing task (in a restricted “one-pass” model of communication, described below).

Computational Models

In the one-pass streaming model, the data samples are revealed online in a stream and the algorithm is allowed a single pass over the data. Moreover, there is an upper bound, which we will typically denote by m, on the number of bits the algorithm can store at any point of its execution. In our setting, the goal is to minimize the sample complexity of testing subject to the memory constraint.

Our distributed communication model uses a blackboard (broadcast) model of communication in the sense that each message sent by each machine (player) is visible to all machines. There is an arbitrarily large number of machines, each holding independent samples from the unknown distribution(s). Additionally, there is a referee (arbitrator) who holds no samples. In each round, the referee either returns an answer or asks a one-bit question to one of the players about their input and receives a response. The goal is for the referee to return the correct answer to our testing problem (with at least 2/3 probability) in as few rounds of communication as possible. Notice that this model only costs the communication needed to answer the referee’s questions and not the information encoded by the questions themselves or by which player the referee chooses to ask. This is natural in a broadcast communication model, as this information would be implicitly determined by the communication transcript up to this point.

Unfortunately, we do not know how to prove lower bounds in the above general model, and will instead work in the one-pass version of this model. In the one-pass version, the referee is not allowed to go back to querying a player after they have moved on. In particular, the referee cannot ask a question to player A, subsequently ask a question to player BA, and then ask a question to player A again. Our communication lower bounds hold in this one-pass model. We note that our algorithms work in the one-pass model as well.

1.2 Our Contributions

We give algorithms and lower bounds for uniformity/identity testing11 1 The reduction of identity to uniformity in [Goldreich(2016)] immediately translates our upper bounds from uniformity to identity. and closeness testing in the presence of communication and memory constraints. More specifically, we obtain the following results:

  1. 1.

    A one-pass streaming algorithm for uniformity testing on [n] with memory upper bound of m bits that has sample complexity O(nlog(n)/(mϵ4)) for a broad range of parameters. Moreover, we show that this sample upper bound is tight for a fairly wide range of m, and tight up to a log(n) factor when ϵ is constant for all values of m.

  2. 2.

    A distributed uniformity tester for samples per machine with communication complexity O(nlog(n)//ϵ2). (Note that when =1, this beats the trivial algorithm by a log(n) factor.) We also give a matching lower bound when is not too big and ϵ is not too small. The former of these constraints on the lower bound is necessary. Indeed, we give a different algorithm with communication O(nlog(n)/(2ϵ4)), which beats the previous bound when is sufficiently large.

  3. 3.

    A one-pass streaming algorithm for closeness testing that uses O(nlog(n)/m/ϵ2) samples for a wide range of values of m, and a distributed closeness tester with communication complexity O(n2/3log1/3(n)/(2/3ϵ4/3)). (Note that for =1, this improves by a log2/3(n) factor over the naive algorithm.)

Our results are summarized in Tables 1 and 2.

Sample Complexity Bounds with Memory Constraints
Property Upper Bound Lower Bound 1 Lower Bound 2
Uniformity O(nlognmϵ4) Ω(nlognmϵ4) Ω(nmϵ2)
Conditions n0.9mlog(n)/ϵ2 m=Ω~(n0.34ϵ8/3+n0.1ϵ4) Unconditional
Closeness O(nlog(n)/(mϵ2))
Conditions Θ~(min(n,n2/3/ϵ4/3))mlog(n)
Table 1: Summary of our Streaming Upper and Lower Bounds.
Communication Complexity Bounds
Property UB 1 UB 2 LB 1 LB 2 LB 3
Uniformity O(nlog(n)/ϵ2) O(nlog(n)2ϵ4) Ω(nlog(n)/ϵ2) Ω(n/ϵ) Ω(n2ϵ2logn)
Conditions ϵ8nlognϵ-4n0.9 nϵ2 ϵ4/3n0.3 =O~(n1/3ϵ4/3) =Ω~(n1/3ϵ4/3)
Closeness O(n2/3log1/3(n)2/3ϵ4/3) - - - -
Conditions nϵ4/log(n) - - - -
Table 2: Summary of our Distributed Upper and Lower Bounds.

1.3 Overview of our Approach

In this section, we provide a detailed sketch of our algorithms and lower bounds.

Uniformity Testing Algorithms

We start by describing the main ideas underlying our efficient uniformity testing protocols. We note that in both the distributed and the streaming settings, our uniformity testers rely on a unified idea that we term bipartite collision testing. We remind the reader that testing uniformity based on the number of pairwise collisions is the oldest algorithm in distribution testing [Goldreich and Ron(2000)], which is now known to be sample-optimal in the centralized setting [Diakonikolas et al.(2016)Diakonikolas, Gouleakis, Peebles, and Price]. (In fact, it turns out that all known efficient uniformity testers rely to some extent on counting pairwise collisions.) Recall that the collisions-based uniformity tester [Goldreich and Ron(2000)] takes N samples from the unknown distribution p on [n] and counts the total number of pairwise collisions among them. If p is the uniform distribution, the expected number of collisions will be (1/n)(N2). On the other hand, if p is ϵ-far from the uniform distribution, the expected number of collisions will be at least (1/n)(N2)(1+Θ(ϵ2)). A now standard analysis — involving carefully bounding the variance of this estimator followed by an application of Chebyshev’s inequality — shows that for N=Ω(n/ϵ2) we can distinguish between the two cases with high constant probability.

Unfortunately, the standard collisions-based tester described above involves computing the total number of collisions among all pairs of samples, and it is unclear if it can be implemented with non-trivial communication or memory. 22 2 We note that the trivial protocol based on a total of N samples uses Nlogn bits of memory and Nlogn bits of communication. To improve on these naive bounds, we propose a modified uniformity tester (in the collocated/centralized setting) that we can implement in the memory and communication restricted settings we study. In particular, we consider a bipartite collision tester that works as follows: We draw two independent sets of samples S1 and S2 from the unknown distribution p and count the number of pairs of an element of S1 and an element of S2 that collide. Importantly, we will use this scheme in such a way so that the first set of samples S1 will be substantially smaller than the second set of samples S2. Roughly speaking, our algorithm will store the set S1 exactly, while for each element of the set S2, it will only need to know the number of collisions with elements of S1. This last step will allow us to save on space or communication. An important technical condition that is required for our bipartite tester to succeed is that |S1||S2|n/ϵ4.

We now provide an overview of the analysis. As is standard, we need to show that the number of collisions is much larger in the non-uniform case than in the uniform case. To achieve that, we consider for fixed S1 the sum p(S1)=sS1ps. We note that the expected number of collisions is just p(S1)|S2|, and by standard concentration bounds one can show that it will likely be close to this value. However, the average size of p(S1) is |S1|p22, which is somewhat larger for non-uniform p than for p uniform. The detailed analysis is given in Section 3.1.

We note that our bipartite collision tester can be easily implemented in the memory and communication bounded settings. In the former setting (Section 3.2), it leads to a uniformity tester with sample complexity O(nlogn/(mϵ4)), where m is the bits of memory used. In the distributed setting (Section 3.3), when each machine stores samples, it leads to a tester with sample complexity O(nlogn/ϵ2) and communication cost O((nlogn)//ϵ2) bits. (It should be noted that the above memory and communication upper bounds match our lower bounds in some regimes of parameters. We show that the tradeoff between sample complexity and memory/communication is inherent. See Theorem 4.2 and Corollary 4.3.)

It should be noted that when is sufficiently large, we can design a uniformity tester whose communication beats the above (Section 3.4). This fact should not be surprising since if n/ϵ2, a single machine could run a uniformity tester and simply return the answer. If is somewhat smaller than this value, we can still take advantage of the large number of samples per machine. The basic idea is for each player to communicate the number of pairwise collisions among their own samples. As there are O(2) pairs of samples per machine, this will require roughly n/2 machines before we start seeing any collisions, and this will give our approximate complexity for large ϵ, which can be seen to be better than our aforementioned bound, when is sufficiently large.

Information-Theoretic Lower Bounds.

We start by describing our memory lower bounds followed by our communication lower bounds.

Memory Lower Bound. We define a family of distributions on [n]×[2], i.e., supported on 2n elements that will make uniformity testing difficult. In particular, we think of these distributions as consisting of n pairs of bins where the probability of landing in each pair is exactly 1/n. Equivalently, a distribution from this family can be thought of as having n (possibly biased) coins. The distribution picks a uniform random coin, flips it, and returns the pair of the coin and the result. If all coins are fair, we have the uniform distribution over [n]×[2]. On the other hand, if each coin is ϵ-biased in a randomly chosen direction, we have a distribution that is ϵ-far from uniform. We show that these two cases are hard to distinguish from each other without expending a substantial amount of our computational resources. Note that the standard sample complexity lower bounds for uniformity testing [Paninski(2008)] rely on essentially the same hard instances.

We consider running a testing algorithm on a distribution that is randomly either uniform or a random ϵ-biased distribution as described above. Let X be the bit that describes whether or not the distribution is uniform or not. We consider the shared information between X and the memory of our algorithm after seeing k samples. We will attempt to show that this increases by at most O(ϵ2m/n) per step, where m is the upper bound on the memory, which implies that it will take n/(mϵ2) steps in order to reliably determine X. The idea of the proof is fairly simple: The choice of which of the n coins is flipped is uniform no matter what, and so this choice does not tell us anything about X. What might tell us something is the result of flipping this coin, but two things make this difficult for us. On the one hand, the coin is at most ϵ-biased, so knowing the result of the flip can only provide O(ϵ2) information about this bias. More critically, a biased coin is equally likely to be biased in either direction, so knowing the result by itself still tells us nothing unless we have some prior information about the direction of the possible bias. However, since our streaming algorithm can store only m bits of memory, on average it has at most O(m/n) bits of information about the bias of the (randomly chosen) coin of the next flip. This argument can be formalized to show that the next flip can only contribute O(m/n) bits to the shared information between X and our memory. See Section 4.2.

It may seem that the above argument is tight. With m bits of information, one could know the biases (if they exist) of m of the coins, and then each flip will (with probability m/n) give us ϵ2 information to accept or reject our hypothesis that these biases are real. However, we would need to be extraordinarily lucky to have this information. The only way that we could know the biases of these m coins is if in our previous set of samples we had seen each of these coins m times. This will prove difficult for two reasons: First, unless the number of samples is quite large, we would not expect to see any coin show up in more than, say, 10 samples. This will limit the amount of information we could expect to know about the bias of any given coin to O(ϵ2). Additionally, we will have information about the biases of all of these particular m coins only if all of them have shown up in our set of samples. But if our total number of samples is substantially sub-linear, this will only happen with probability n-Ω(m). On the other hand, since there are only 2m possible memory states, with high probability, we must be in one that occurs with probability only as small as 2-Ω(m), so we should only hope to have this information about m/log(n) coins (which we could get by, say, recording the coins and the results of the first m/log(n) samples). To formalize this intuition, we let r be the vector whose ith entry is the expected number of times we have seen the ith coin based on the transcript, and note that the information gained is proportional to r22. This can only be large if there is some unit vector w with wr large. But this would in turn mean that, conditioned on our (reasonably high probability) transcript, the average of wC over our samples C is big. We show that this cannot happen with high probability by a Chernoff bound. The details are deferred to Section 4.2.

Communication Complexity Lower Bound. To make this argument work for our communication complexity setting (Section 4.3), we will need to restrict the model, in particular requiring that the referee sees the players in sequence, never returning to one after it has moved onto the next. In order to get our lower bounds for many samples per player, we can proceed by immediate reduction to the streaming model. Our algorithm stores the transcript of the communication thus far. When the referee talks to a new player, we record the samples that this player has, and then we compress this down to the extended transcript involving the answers to the questions asked of this new player. Note that at any time the amount of memory used is at most O(|T|+slog(n)), where T is the final communication transcript.

Closeness Testing Algorithms

We now describe our algorithms for closeness testing (Section 5). In the communication model, a reasonable approach would be to follow the methodology of  [Diakonikolas and Kane(2016)], by first using a few samples to flatten and then applying some collision-based tester similar to that used for our uniformity testers. We instead give an algorithm that turns out to be slightly more communication efficient. The essential idea is to pick a random subset W of the domain, and test whether or not p conditioned on W is close to q conditioned on W. If we take |W| to be about n/(slog(n)), we note that each machine can send either one bit (encoding that they have no samples in W) or log(n) bits giving one such sample. The former outcome happens about log(n) times more often, and so in total we need to send about log(n) bits per element of S received. However, for constant ϵ, we should need only about |W|2/3 such samples to run a standard closeness tester.

There is one substantial issue with this plan however. We need it to be the case that 𝐏𝐫[W] is approximately |W|/n and that dTV(p|W,q|W)dTV(p,q). We note that both of these happen with high probability so long as none of p,q or p-q are dominated by a small number of bins. In particular, for ϵ large, it would be sufficient to know that p22,q221/(slog(n)). Although this might not be the case for the given p and q we can obtain it by flattening. The details of the analysis are given in Section 5.2.

For the streaming algorithm for closeness (Section 5.1), we take a somewhat different approach. We select a random hash function h:[n][m] and instead compare the distributions h(p) and h(q). It is not hard to show that if p=q then h(p)=h(q), and that if p and q are far in variation distance that h(p) and h(q) are likely somewhat far in variation distance. The algorithm can record the counts of the number of samples from each distribution in each bin in O(mlog(n)) samples and these counts are enough to run standard closeness testing algorithms.

1.4 Related work

Distribution Testing.

The field of distribution property testing [Batu et al.(2000)Batu, Fortnow, Rubinfeld, Smith, and White] has been extensively investigated in the past couple of decades, see, e.g., the surveys [Rubinfeld(2012), Canonne(2015), Goldreich(2017)]. A large body of the literature has focused on characterizing the sample size needed to test properties of arbitrary discrete distributions in the centralized setting. This broad inference task originates from the field of statistics and has been extensively studied in hypothesis testing [Neyman and Pearson(1933), Lehmann and Romano(2005)] with somewhat different formalism. The centralized regime is fairly well understood in a variety of settings: for many properties of interest there exist sample-efficient testers  [Paninski(2008), Chan et al.(2014)Chan, Diakonikolas, Valiant, and Valiant, Valiant and Valiant(2014), Diakonikolas et al.(2015a)Diakonikolas, Kane, and Nikishkin, Acharya et al.(2015)Acharya, Daskalakis, and Kamath, Canonne et al.(2016)Canonne, Diakonikolas, Gouleakis, and Rubinfeld, Diakonikolas and Kane(2016), Diakonikolas et al.(2016)Diakonikolas, Gouleakis, Peebles, and Price, Canonne et al.(2018)Canonne, Diakonikolas, and Stewart, Goldreich(2017), Diakonikolas et al.(2017a)Diakonikolas, Gouleakis, Peebles, and Price, Batu and Canonne(2017), Diakonikolas et al.(2018)Diakonikolas, Kane, and Stewart, Canonne et al.(2017b)Canonne, Diakonikolas, Kane, and Stewart]. More recently, an emerging body of work has focused on leveraging a priori structure of the underlying distributions to obtain significantly improved sample complexities [Batu et al.(2004)Batu, Kumar, and Rubinfeld, Daskalakis et al.(2013)Daskalakis, Diakonikolas, Servedio, Valiant, and Valiant, Diakonikolas et al.(2015a)Diakonikolas, Kane, and Nikishkin, Diakonikolas et al.(2015b)Diakonikolas, Kane, and Nikishkin, Canonne et al.(2017a)Canonne, Diakonikolas, Kane, and Stewart, Daskalakis and Pan(2017), Daskalakis et al.(2018)Daskalakis, Dikkala, and Kamath, Diakonikolas et al.(2017c)Diakonikolas, Kane, and Nikishkin].

Distributed Statistical Inference.

There has been substantial recent interest in distributed estimation with communication constraints. A series of works [Zhang et al.(2013)Zhang, Duchi, Jordan, and Wainwright, Garg et al.(2014)Garg, Ma, and Nguyen, Braverman et al.(2016)Braverman, Garg, Ma, Nguyen, and Woodruff, Jordan et al.(2016)Jordan, Lee, and Yang, Diakonikolas et al.(2017b)Diakonikolas, Grigorescu, Li, Natarajan, Onak, and Schmidt, Han et al.(2018b)Han, Özgür, and Weissman, Han et al.(2018a)Han, Mukherjee, Özgür, and Weissman] studied distributed learning in both parametric and nonparametric settings obtaining (nearly-) matching upper and lower bounds. Other learning tasks has been studied as well in the distributed setting, including regression [Zhu and Lafferty(2018)] and PCA [Kannan et al.(2014)Kannan, Vempala, and Woodruff, Liang et al.(2014)Liang, Balcan, Kanchanapally, and Woodruff].

Classical work in information theory [Cover(1969), Ahlswede and Csiszar(1986)] studies simple hypothesis testing problems with communication constraints (as opposed to the composite testing problems studied in our work). Their results and techniques appear to be orthogonal to ours. Two recent works [Acharya et al.(2018a)Acharya, Canonne, and Tyagi, Acharya et al.(2018b)Acharya, Canonne, and Tyagi] give algorithms and lower bounds for distribution testing in a distributed model where each machine holds a single sample and is allowed to communicate logn bits of information. We note that this communication model is very different from ours: First, our focus is on the multiple samples per machine setting, where significant savings over the naive protocols is attainable. Moreover, we do not impose any restrictions on the amount of information communicated by any individual machine, and our goal is to minimize the total communication complexity of the protocol. Another recent work  [Andoni et al.(2018)Andoni, Malkin, and Nosatzki] studies distributed closeness testing in a two party setting where each party has access to samples from one of the two distributions. This model is different than ours for two reasons: First, we consider a large number of machines (parties). Second, we do not make the assumption that each machine holds samples from different distributions. Finally, recent work [Fischer et al.(2018)Fischer, Meir, and Oshman] studied uniformity testing in the standard LOCAL and CONGEST models when each machine holds a single sample. We note that their results and techniques seem incomparable to ours, and in particular have no implications on the communication complexity of uniformity testing in our broadcast model. In the aforementioned models, there is an underlying graph whose nodes are the machines (players). In each round, every player can send information to all its neighbors. The objective is to minimize the number of rounds of the protocol, as opposed to the number of bits communicated. In particular, for the special case that the underlying graph is a clique, their algorithm does not achieve non-trivial communication complexity.

The work of [Chien et al.(2010)Chien, Ligett, and McGregor] studied distribution testing with limited memory focusing on streaming algorithms in the framework of the canonical tester [Valiant(2011)]. In particular, they show that for testing problems in this framework with sample complexity f, there are streaming algorithms with memory m and sample complexity f22O(log(n))/m. We note that this is competitive with our uniformity tester up to the 2O(log(n)) factor, but does substantially worse than our closeness tester for small m. They also consider another class of streaming algorithms, when the number of samples is large enough to learn the distribution in question, and when the problem of computing empirical distances has an efficient streaming algorithm. This does not compare favorably to our algorithms, which always use sub-learning sample complexity.

1.5 Organization

The structure of this paper is as follows: After some preliminaries in Section 2, in Section 3 we present our bipartite collision tester and its implications to memory efficient and distributed uniformity testing. Section 4 presents our memory and communication lower bounds for uniformity testing. In Section 5, we give our upper bounds for closeness testing. Finally, we conclude in Section 6 with a set of open problems for future work.

2 Preliminaries

In this section, we introduce the mathematical notation and background necessary to state and prove our results in the following sections.

Notation We write [n] to denote the set {1,,n}. We consider discrete distributions over [n], which are functions p:[n][0,1] such that i=1npi=1. We use the notation pi to denote the probability of element i in distribution p. The 1 (resp. 2) norm of a distribution is identified with the 1 (resp. 2) norm of the corresponding vector, i.e., p1=i=1n|pi| and p2=i=1npi2. The 1 (resp. 2) distance between (pseudo-)distributions p and q is defined as the the 1 (resp. 2) norm of the vector of their difference, i.e., p-q1=i=1n|pi-qi| and p-q2=i=1n(pi-qi)2.

For a random variable X, we denote by 𝐄[X] its expectation, 𝐕𝐚𝐫[X] its variance, and 𝐏𝐫[] will denote the probability of event . Sometimes we will use the notation 𝐄D[],𝐏𝐫D[] to make the underlying distribution explicit.

Distribution Testing Distribution property testing studies the following family of problems: given sample access to one or more unknown distributions, determine whether they satisfy some global property or are “far” from satisfying the property. In this work, we study the properties of identity testing (goodness of fit) and closeness testing (two-sample testing) between two distributions.

{definition}

[Identity Testing/Uniformity Testing] The identity testing problem is the following: Given samples from an unknown distribution p on [n], a known distribution q on [n], and a parameter 0<ϵ<1, we want to distinguish, with probability at least 2/3, between the cases that p=q versus p-q1ϵ. The special case corresponding to q=Un, the uniform distribution on [n], is the problem of uniformity testing.

{definition}

[Closeness Testing] The closeness testing problem is the following: Given samples from two unknown distributions p,q on [n], and a parameter 0<ϵ<1, we want to distinguish, with probability at least 2/3, between the cases that p=q versus p-q1ϵ.

3 Communication and Memory Efficient Uniformity Testing

In this section, we design our protocols for uniformity testing. In Section 3.1, we start with our bipartite uniformity tester. In Section 3.2, we give our streaming uniformity tester, which follows as an application of our bipartite uniformity tester. In Section 3.3, we give our first distributed uniformity tester, which is an instantiation of our bipartite collision tester and performs well when the number of samples per machine is not very large. For the complementary setting that is large, we give a different protocol in Section 3.4.

3.1 Testing Uniformity via Bipartite Collisions

Our bipartite collision-based tester is described in pseudo-code below:

Algorithm Bipartite-Collision-Uniformity(p,n,ϵ) Input: Sample access to distribution p over [n] and ϵ>0. Output: “YES” if p=Un; “NO” if p-Un1ϵ. 1. Draw a multiset S1 of N1 i.i.d. samples from p. 2. For all j[n] compute aj=|{sS1:s=j}|, i.e., the multiplicity of j[n] in S1. 3. If maxj[n]aj>10, return “NO”. Otherwise, continue with next step. 4. Draw a multiset S2 of N2 i.i.d. samples from p. For k[N2], let bk be the number of times that the k-th sample in S2 appears in S1. 5. Let Z=(1/N2)k=1N2bk and T=defN1n(1+ϵ250). 6. If ZT return “NO”; otherwise, return “YES”.


The main result of this section is the following theorem:

{theorem}

Suppose that N1,N2 satisfy the following conditions: (i) Ω(ϵ-6)=N1n9/10 and N1N2=Ω(n/ϵ4), where the implied constants in the Ω() are sufficiently large. Then the algorithm Bipartite-Collision-Uniformity distinguishes between the cases that p=Un versus p-Un1ϵ with probability at least 2/3.

This section is devoted to the proof of Theorem 3.1. We start with the following definition: {definition}[Probability Mass of Multiset] Let p be a discrete distribution over [n] and S be a multiset of elements in [n]. We define the probability mass of the multiset S as follows: p(S)=defj=1najpj, where aj is the number of occurrences of j[n] in S.

It should be noted that we will use the above quantity for S being a multiset of i.i.d. samples from the distribution p. In this case, p(S) is a random variable satisfying the following:

Claim \thetheorem

Let S be a multiset of m i.i.d. samples from the distribution p on [n]. Then, we have that (i) ES[p(S)]=mp22 and (ii) VarS[p(S)]=m(p33-p24).

{proof}

By definition, p(S)=defj=1najpj, where ajBinomial(m,pj), j[n]. For the expected value, we can write: 𝐄S[p(S)]=j=1n𝐄S[aj]pj=j=1n(mpj)pj=mj=1npj2=mp22. To calculate the variance, we note that p(S) can be equivalently expressed as p(S)=i=1mri, where for i[m] we have that ri=pj with probability pj, for j[n]. Note that the ri’s are i.i.d. random variables and that 𝐄S[ri]=p22, 𝐄S[ri2]=p33. Therefore, we obtain that 𝐕𝐚𝐫S[p(S)]=i=1m𝐕𝐚𝐫S[ri]=m(p33-p24). This completes the proof of Claim 3.1.

We now proceed to establish the completeness and soundness of our tester.

Completeness Analysis

We will show that if p=Un, then the tester outputs “YES” with probability at least 2/3. We start by noting that it is very unlikely that the tester rejects in Step 3.

Claim \thetheorem

Let E=def{S1:maxj[n]aj10}. For N1n9/10, we have that PrS1[E]19/20.

{proof}

The probability that there exists some domain element in [n] that appears at least k times in S1, where |S1|=N1, is at most n(N1k)n-knN1kk!nk. By our assumption that N1n9/10, for k=10 the above probability is at most 1/k!, which gives the claim.

We proceed to analyze the behavior of the random variable Z defining our test statistic in Step 5. Note that Z is the empirical estimate of p(S1) based on the multiset of samples S2. Indeed, denoting by (p^j)S2 the empirical probability of j[n] based on S2, then we have that Z=(1/N2)k=1N2bk=j=1naj(p^j)S2. We have the following simple claim:

Claim \thetheorem

Let p be any distribution over [n]. Then we have that: (i) ES2[Z]=p(S1) and (ii) VarS2[Z]=(1/N2)(j=1naj2pj-p2(S1)).

{proof}

Since Z=j=1naj(p^j)S2 and 𝐄S2[(p^j)S2]=pj, we get that 𝐄S2[Z]=j=1najpj=p(S1). To calculate the variance, we use the equivalent definition of Z=(1/N2)k=1N2bk, where for each k[N2] the i.i.d. random variables bk are defined as follows: bk=aj with probability pj, for j[n]. It follows that 𝐄S2[bk]=j=1najpj=p(S1) and 𝐄S2[bk2]=j=1naj2pj. Therefore, we get that 𝐕𝐚𝐫S2[Z]=(1/N22)N2𝐕𝐚𝐫S2[b1]=(1/N2)(j=1naj2pj-p2(S1)). This completes the proof.

We note that Claim 3.1 will be useful both in the completeness and the soundness cases.

To establish the correctness of the tester in the completeness case, it suffices to show that 𝐏𝐫S1,S2[Z>T]1/3. Since the event occurs with high constant probability over S1, by a union bound, it suffices to show that 𝐏𝐫S1,S2[Z>T]1/10. An application of Chebyshev’s inequality yields that

𝐏𝐫S1,S2[|Z-𝐄S1,S2[Z]|>γ]𝐕𝐚𝐫S1,S2[Z]/γ2. (1)

Note that in the completeness case (p=Un) we have that p(S1)=N1/n deterministically over S1. Therefore, by Claim 3.1 (i), we get that 𝐄S1,S2[Z]=𝐄S2[Z]=p(S1)=N1/n. Similarly, we obtain that

𝐕𝐚𝐫S1,S2[Z] = 𝐄S1[𝐕𝐚𝐫S2[Z|S1]]+𝐕𝐚𝐫S1[𝐄S2[Z]]
(1/N2)maxS1j=1naj2pj+0
(10/N2)p(S1)+0=10N1/(N2n),

where the second line uses Claim 3.1 (ii) and the fact that 𝐕𝐚𝐫S1[𝐄S2[Z]]=0, and the third line follows from the definition of . By setting γ=def(ϵ2/100)𝐄S1,S2[Z]=(ϵ2/100)p(S1), the RHS of (1) is at most O(n/(ϵ4N1N2)). Recalling our assumption that N1N2=Ω(n/ϵ4) for a sufficiently large constant in the Ω(), we get that with probability at least 9/10 we have that Z<(1+ϵ2/100)(N1/n)<T. This proves the completeness of our tester.

Soundness Analysis

We will show that if p-Un1ϵ for ϵ satisfying N1=Ω(1/ϵ6), where the constant in Ω() is sufficiently large, then the tester outputs “NO” with probability at least 2/3. The technical part of the soundness proof involves showing that 𝐄S2[Z]=p(S1) will be significantly larger than its value of m/n in the completenesss case. Specifically, we show the following:

{lemma}

Let =def{S1:p(S1)(1+ϵ2/40)(N1/n)}. If p-Un1ϵ and N1=Ω(1/ϵ6), then 𝐏𝐫S1[]9/10. {proof} The proof proceeds by case analysis. First, we consider the case that the distribution p assigns sufficient probability on heavy elements. This case turns out to be fairly easy to handle. The complementary case that p has a small amount of mass on heavy elements requires a more elaborate argument. Let θ=def104/(ϵ2n) and let H=H(p)=def{i[n]piθ} be the set of heavy bins. We consider the following cases:

[Case I: p(H)=iHpiϵ2/1000.] Let W=|S1H|. Note that 𝐄S1[W]=N1p(H)N1ϵ2/1000. By a multiplicative Chernoff bound, for δ=1/3, we get that

𝐏𝐫S1[W(1-δ)N1ϵ2/1000]e-N1ϵ2δ2/200<1/10

where we used the fact that N1=ω(1/ϵ2). That is, with probability at least 9/10, the multiset S1 will contain at least N1ϵ2/150 samples from the set H. If this happens, then we have that

p(S1)>(N1ϵ2/1500)θ>(6N1)/n>(1+ϵ2/40)(N1/n)

which proves Case I.

[Case II: p(H)=iHpiϵ2/100.] Let H¯=def[n]H and S1=S1H¯, i.e., S1 contains the samples in S1 that correspond to light elements of p. Clearly, we have that p(S1)p(S1). We will show that PrS1[p(S1)(1+ϵ2/40)(N1/n)]9/10.

Since p(H)ϵ2/1000, with probability at least 1-ϵ2/1000, a given sample in S1 will also be in S1. So, if N1=|S1| we have that 𝐄S1[N1]N1(1-ϵ2/1000). By Markov’s inequality applied to N1-N1, with probability at least 19/20 over S1, we have that N1N1(1-ϵ2/50). We will henceforth condition on this event.

To bound p(S1) from below, we consider the normalized distribution, p, over [n] obtained from p after removing its heavy elements. That is, for iH¯, pi=pi/(1-p(H)); and for iH, pi=0. Note that S1 consists of N1 elements drawn i.i.d. from p. By definition, we have that p(S1)=(1-p(H))p(S1). We will bound p(S1) by applying Chebyshev’s inequality.

We start by noting that p is also far from being uniform:

Claim \thetheorem

We have that p-Un1ϵ/3.

{proof}

Since p-Un1ϵ, we have that j[n]:pj>1/n|pj-1/n|>ϵ/2. Since p(H)<ϵ2/100, it follows that j:1/n<pi<θ|pj-1/n|>ϵ/2-ϵ2/100>ϵ/3. Noting that if 1/n<pj<θ, then |pj-1/n|<|pj-1/n| gives the claim. By Chebyshev’s inequality, we can write:

𝐏𝐫S1[|p(S1)-𝐄S1[p(S1)]|>γ]𝐕𝐚𝐫S1[p(S1)]/γ2, (2)

where we will set γ=def(ϵ2/20)𝐄S1[p(S1)]. By Claim 3.1, we have that 𝐄S1[p(S1)]=N1p22, and 𝐕𝐚𝐫S1[p(S1)]=N1(p33-p24). In the following claim, we bound the variance from above:

Claim \thetheorem

We have that VarS1[p(S1)]<2000N1p22ϵ2n.

{proof}

We have that 𝐕𝐚𝐫S1[p(S1)]<N1p33N1pp22. We will show that p2θ from which the claim follows. Indeed, note that the non-zero probability values of p are pi=pi/(1-p(H)), for iH. Since piθ and p(H)ϵ2/1000<1/2, we get the desired bound on p and the claim follows. The RHS of (2) can be bounded by O(𝐕𝐚𝐫S1[p(S1)]/(ϵ4𝐄S1[p(S1)]2))=O(1/(N1ϵ6)), where we used our bounds on the first two moments and the fact that p221/n. Using the condition that N1=Ω(1/ϵ6) and the fact that N1N1(1-O(ϵ2)), the above probability is at most 1/20.

We now show that if |p(S1)-𝐄S1[p(S1)]|γ the event holds. By the definition of γ and the fact 𝐄S1[p(S1)]=N1p22, this is equivalent to p(S1)>(1-ϵ2/20)N1p22. Claim 3.1 implies that p22(1/n)(1+ϵ2/9), and therefore we get that

p(S1)>(1-ϵ2/20)(1-ϵ2/50)(N1/n)(1+ϵ2/9)>(N1/n)(1+ϵ2/35),

where we used our lower bound on N1. Finally, we have that

p(S1)p(S1)=(1-p(H))p(S1)(1-ϵ2/1000)(N1/n)(1+ϵ2/35)>(N1/n)(1+ϵ2/40),

and the proof of Lemma 3.1 is complete.

To establish correctness in the soundness case, it suffices to show that 𝐏𝐫S1,S2[ZT]1/3. To show this, we condition on any S1 such that the events and hold. Note that if does not occur, then our tester correctly rejects. By Lemma 3.1 above, holds with probability at least 9/10 over S1. Hence, by a union bound, it suffices to show that 𝐏𝐫S1,S2[ZT,]1/10.

An application of Chebyshev’s inequality for γ=(ϵ2/100)𝐄S1,S2[Z,] yields that

𝐏𝐫S1,S2[|Z-𝐄S2[Z]|>γ,]𝐕𝐚𝐫S1,S2[Z,]/γ2. (3)

By Claim 3.1, we have that 𝐄S2[Z]=p(S1) and 𝐕𝐚𝐫S2[Z]=(1/N2)(j=1naj2pj-p2(S1)). The same argument as in the completeness case gives that 𝐕𝐚𝐫S1,S2[Z,]<(10/N2)p(S1). By our choice of γ and our bound on the variance the right-hand side of (3) is at most O(1/(N2ϵ4p(S1))). Recalling our assumption that N1N2=Ω(n/ϵ4) and the fact that p(S1)N1/n (by Lemma 3.1, since occurs), it follows that this probability is at most 1/10. By Lemma 3.1 we have that p(S1)>(N1/n)(1+ϵ2/40). Therefore, with probability at least 8/10 (by a union bound on the two error events), we have that Z>(N1/n)(1+ϵ2/50). This establishes the soundness case and completes the proof of Theorem 3.1.

3.2 Memory Efficient Uniformity Testing

In this section, we show how our bipartite collision tester can be used to obtain a memory efficient single pass streaming algorithm. Our memory efficient tester is described in pseudo-code below:

Algorithm Streaming-Uniformity(p,n,m,ϵ) Input: Sample access to distribution p over [n], memory bound m, and ϵ>0. Output: “YES” if p=Un; “NO” if p-Un1ϵ. 1. Draw a multiset S1 of N1=defm/(2logn) i.i.d. samples from p. Store S1 in memory. 2. For all j[n] compute aj=|{sS1:s=j}|, i.e., the multiplicity of j[n] in S1. 3. If maxj[n]aj>10, return “NO”. Otherwise, continue with next step. 4. Draw a multiset S2 of N2=defΘ(nlogn/(mϵ4)) i.i.d. samples from p, for an appropriately large constant in Θ(). For k[N2], let bk be the number of times that the k-th sample in S2 appears in S1. Store the partial sum kbk in a single pass. 5. Let Z=(1/N2)k=1N2bk and T=defN1n(1+ϵ250). 6. If ZT return “NO”; otherwise, return “YES”.


The following theorem is essentially a corollary of Theorem 3.1 and characterizes the performance of the above algorithm:

{theorem}

Suppose that mΩ(logn/ϵ6) and mO~(n9/10). Algorithm Streaming-Uniformity is a single pass streaming algorithm using at most m bits of memory, and distinguishes between the cases that p=Un versus p-Un1ϵ with probability at least 2/3. {proof} The correctness of Streaming-Uniformity as a uniformity testing algorithm follows from Theorem 3.1 and our choice of parameters. It is straightforward to check that the assumptions of the latter theorem are satisfied for our choice of N1 and N2.

It is also easy to argue that the algorithm is implementable in the single pass streaming model with at most m bits of memory. Step 1 of the algorithm uses N1lognm/2 bits of memory. We claim that Step 4 can be implemented in a single pass with at most logN2+4 bits of memory. Indeed, since no element of [n] appears in S1 more than 10 times (since the algorithm did not reject in Step 3), each bk is at most 10. Therefore, k=1N2bk10N2, and thus the sum k=1N2bk can be stored with logN2+4 bits of memory. In summary, the total memory used by our algorithm is at most m/2+logN2+4. By our assumption that mΩ(logn/ϵ6), it follows that N2ϵ2n or logN2<logn. Therefore, logN2m which completes the proof of Theorem 3.2.

3.3 Distributed Uniformity Testing for Small Number of Samples per Machine

Let denote the total number of samples from p possessed by each machine. When =O~(n1/3/ϵ4/3), we use the following tester, which can be viewed as an instantiation of our bipartite collision tester.

Algorithm I Distributed-Bipartite-Uniformity(p,n,,ϵ) Input: Unbounded number of machines, each with i.i.d. samples from p and ϵ>0. Output: “YES” if p=Un; “NO” if p-Un1ϵ. 1. The referee asks m1=Θ((1/(ϵ23/2))n/logn) machines to reveal all their samples. Let S1 be the resulting multiset of samples from p. 2. For all j[n] the referee computes aj=|{sS1:s=j}|, i.e., the multiplicity of j[n] in S1. 3. If maxj[n]aj>10, the referee returns “NO”. Otherwise, we continue with next step. 4. The referee queries a new set of m2=Θ(nϵ42m1) machines, indexed by k[m2], in increasing order of k, to report the value bk=i=1asik corresponding to their sample set S2k={sik}i=1. Note that bk is the number of collisions of S2k with S1. 5. For t[m2], define Bt=k=1tbk. Let Z=Bm2/(m2) and T=defm1n(1+ϵ2/50). 6. The referee computes Bt in increasing order of t[m2]. If for some t[m2], Btm2T, we terminate and returns “NO”. Otherwise, we have that Z<T and we return “YES”.

The following theorem characterizes the performance of the above algorithm:

{theorem}

Suppose that Ω(1/ϵ6)m1O(n9/10). Algorithm Distributed-Bipartite-Uniformity draws a total of O((1/ϵ2)nlogn) samples from p, uses at most O((1/ϵ2)(n/)logn) bits of communication, and distinguishes between the cases that p=Un versus p-Un1ϵ with probability at least 2/3.

{proof}

The correctness of Distributed-Bipartite-Uniformity follows from Theorem 3.1 and our choice of parameters. It is straightforward to check that the assumptions of the latter theorem are satisfied for our choice of N1=m1 and N2=m2.

It is also clear that the sample complexity of our algorithm is

(m1+m2)=O((1/ϵ2)n/(3logn)+(1/ϵ2)(nlogn)/)=O((1/ϵ2)nlogn).

We now proceed to bound the communication complexity. Note that Step 1 uses m1logn bits of communication. We claim that Step 6 can be implemented with O(m2) bits of communication. To achieve this, we transmit each bk in unary by sending bk many 1’s followed by a 0 and terminate the algorithm rejecting if the partial sum Bt=k=1tbk, for some t[m2], exceeds the rescaled threshold m2T. Since we use bk+1 bits to encode each bk, the number of bits of communication is bounded by maxt(Bt+t), which is at most m2T+m2. We now show that m2T=O(m2), which proves the desired communication upper bound. Note that by our choice of m1,m2, we have that m2T=O(1/ϵ4) and moreover m2>m1=Ω(1/ϵ6). Therefore, m2Tm2, as desired. This completes the proof of Theorem 3.3.

3.4 Distributed Uniformity Testing for Large Number of Samples per Machine

In this section, we give our alternate distributed uniformity tester with improved communication complexity when the number of samples per machine is large.

When the number of samples per machine satisfies =Ω~(n1/3/ϵ4/3), we will use the following algorithm:

Algorithm II Distributed-Aggregate-Uniformity(p,n,,ϵ) Input: Each machine has samples from a distribution p over [n] and ϵ>0. Output: “YES” if p=Un; “NO” if p-Un1ϵ. 1. The referee asks the first m=12800n2ϵ4 machines to reveal the number of collisions ck, k[m], each of them see in their samples. 2. Compute the statistic Z=(1/m)k=1mck and the Threshold T=(2)1+ϵ2/2n. 3. If ZT return “NO”; otherwise, return “YES”.

The following theorem characterizes the performance of the above algorithm:

{theorem}

The algorithm Distributed-Aggregate-Uniformity draws a total of O(n/(ϵ4)) samples from p, uses O(nlogn2ϵ4) bits of communication, and distinguishes between the cases that p=Un versus p-Un1ϵ with probability at least 2/3.

In this rest of this section, we prove Theorem 3.4.

First note that the sample complexity of our algorithm is m=O(nϵ4). Moreover, the communication complexity of our tester is O(n2ϵ4log(2))=O(nlogn2ϵ4), since each of the Θ(n2ϵ4) players sends the number of their internal collisions, which can be at most (2).

We will compute the mean and variance of the statistic Z and show that the uniform and non-uniform cases are well-separated. Lemma 2.3 in [Diakonikolas et al.(2016)Diakonikolas, Gouleakis, Peebles, and Price] implies that:

𝐄[Z]=1mi=1m𝐄[ci]=𝐄[ci]=(2)p22

and

𝐕𝐚𝐫[Z]=1m2i=1t𝐕𝐚𝐫[ci] =2ϵ412800n[(2)(p22-p24)+(-1)(-2)(p33-p24)]
2ϵ412800n[2p22+3(p33-p24)].

The following lemma, which is an adaptation of an analogous lemma in [Diakonikolas et al.(2016)Diakonikolas, Gouleakis, Peebles, and Price]), will give us the minimum number of samples per player required for the tester to work given our choice of parameters:

{lemma}

Let α satisfy p22=1+αn and σ be the standard deviation of Z. The number of samples required by Distributed-Aggregate-Uniformity is at most

5σn|α-ϵ2/2|,

in order to get error probability at most 1/4.

{proof}

By Chebyshev’s inequality, we have that

𝐏𝐫[|Z-𝐄[Z]|kσ]=𝐏𝐫[|Z-(2)p22|kσ]1k2,

where σ𝐕𝐚𝐫[Z].

We want Z to be closer to its expected value than the threshold is to that expected value because when this occurs, the tester outputs the right answer. Furthermore, to achieve our desired probability of error of at most 1/4, we want this to happen with probability at least 3/4. So, we want

kσ|𝐄[Z]-T|=|(2)(p22-1+ϵ2/2n)|=(2)|α-ϵ2/2|/n.

For larger than some small constant and k=2, the following slightly stronger condition for suffices:

σ2|α-ϵ2/2|5n.

So, it suffices to have

5σn|α-ϵ2/2|. (4)

We might as well take the smallest number of samples per player for which the tester works, which implies the desired inequality.

To complete the proof, we need to show that given enough samples there is a clear separation between the completeness and soundness cases regarding the value of our statistic.

By Lemma 3.4, it suffices to bound from above the variance σ2. We proceed by case analysis based on whether the term 2p22 or 3(p33-p24) contributes more to the variance.

Case when 2p22 is Larger

We have the following lemma:

{lemma}

Let p22=(1+α)/n. Consider the completeness case when α=0 and the soundness case when αϵ2. If 2p22 contributes more to the variance, i.e., if

2p223(p33-p24),

then Distributed-Aggregate-Uniformity has error probability at most 1/4 for any value of .

{proof}

Suppose that 2p223(p33-p24). Then σ224ϵ412800np22=4ϵ46400n(1+α)/n. Substituting this into (4) gives:

5σn|α-ϵ2/2|ϵ(1+α)1/44|α-ϵ2/2|.

One can show the latter inequality, using calculus to maximize the ratio: ϵ(1+α)1/44|α-ϵ2/2| by varying α. One gets that α=ϵ2 maximizes the expression for α{0}[ϵ2,n-1], since it is decreasing in the interval [ϵ2,n-1] and also α=ϵ2 gives a slightly larger value than α=0. Thus, we get that:

5σn|α-ϵ2/2|ϵ(1+ϵ2)1/44ϵ2/2(1+ϵ2)1/42

for any ϵ<1. Therefore, this completes the proof since the requirements of Lemma 3.4 are satisfied.

Case when 3(p33-p24) is Larger

In this case, we show the following: {lemma} Let p22=(1+α)/n. Consider the completeness case when α=0 and the soundness case when αϵ2. If 3(p33-p24) contributes more to the variance, i.e., if

3(p33-p24)2p22,

then for any 16n3ϵ2 in the completeness case and any 16n3α in the soundness case, our tester Distributed-Aggregate-Uniformity achieves error probability at most 1/4.

{proof}

Suppose that 3(p33-p24)2p22. Then σ225ϵ412800n(p33-p24). Substituting this into (4) gives:

5σn|α-ϵ2/2|5/4ϵn1/4(p33-p24)1/44|α-ϵ2/2|
256|α-ϵ2/2|2ϵ4n(p33-p24)

Let us parameterize p as pi=1/n+ai for some vector a=(a1,,an). Then we have a22=α/n.

In the completeness case, the above sufficient condition always holds since the right hand side is infinite (i.e., p33=p24). In the soundness case, we get the following sufficient condition:

256(α/2)2ϵ4n(p33-p24)(since ϵ2α).

We also have that:

p33-p24 p33-1n2=[i=1n(1/n+ai)3]-1n2
=[1n2+3n2i=1nai+3ni=1nai2+i=1nai3]-1n2
=3n2i=1nai+3ni=1nai2+i=1nai3
=3ni=1nai2+i=1nai3
3na22+a333na22+a23=3n(α/n)+(α/n)3/2.

We thus get:

256(α/2)2ϵ4n(p33-p24) 256(α/2)2α2n(3n(α/n)+(α/n)3/2)
64n(3n(α/n)+(α/n)3/2)
643αn+α3/2n
64n3α+α3/2n
min{64n3α,64nα3/2}
64n3α.

Therefore, any 64n3α satisfies the conditions of Lemma 3.4. Combining the above, we can see that our tester works for any value of that is less than the sample complexity of the problem in the classical (non-distributed) model.

The correctness and error probability of the algorithm is established by Lemmas 3.4 and 3.4. This completes the proof of Theorem 3.4.

4 Communication and Memory Lower Bounds for Uniformity Testing

In this section, we prove our memory and communication lower bounds.

4.1 Background from Information Theory

For completeness, we start by recalling basic definitions from information theory. We will first define the entropy of a random variable: {definition} Let X be a discrete random variable supported on {x1,,xn} that has a probability mass function p=(p1,,pn) such that pi=𝐏𝐫[X=xi]. Then we define the entropy of X to be H(X)=i=1npilog(1/pi). For the special case of n=2, which corresponds to a Bernoulli random variable with parameter p[0,1], we define the binary entropy to be the following function: H2(p)=-plogp-(1-p)log(1-p).

The entropy is a measure of randomness for a random variable. In other words, it is the number of bits of information that we get on average by observing the outcome of the random variable.

In some cases, we would like to know how much excess information we get by observing the outcome of a random variable Y given that we know the outcome of another random variable X. This is usually called conditional entropy of Y given X, and is defined as follows:

{definition}

Let X,Y be a discrete random variables supported on the sets 𝒳 and 𝒴 respectively. Also let p(x,y) be the joint probability mass function of X,Y such that p(x,y)=𝐏𝐫[X=x,Y=y]. Then we define the conditional entropy of Y given X to be:

H(Y|X)=H(X,Y)-H(X)=-x𝒳,y𝒴p(x,y)logp(x,y)p(x).

Furthermore, the amount of information that is shared between two random variables is called mutual information and defined as follows: {definition} Let X,Y be a discrete random variables. The mutual information between X and Y is defined as I(X;Y)=H(X)-H(X|Y)=H(X)+H(Y)-H(X,Y). Note that this quantity is symmetric, i.e., I(X;Y)=I(Y;X). We also define the conditional shared information as I(X;Y|Z)=H(X|Z)-H(X|Y,Z).

The following well known lemma in information theory, intuitively implies that the mutual information between two random variables X and Y cannot be increased by transforming Y into a new variable Z either deterministically or by using randomness that is independent of X (i.e., without using any additional knowledge for X).

{lemma}

[Data Processing Inequality] Let X,Y,Z be random variables, such that XZ|Y. Then I(X;Z)I(X;Y). We also make use of another standard lemma known as the chain rule. {lemma}[Chain Rule] For variables X,Y and Z we have that I(X;Y,Z)=I(X;Z)+I(X;Y|Z).

Finally, we use the following well known Taylor series for the binary entropy function:

Fact \thetheorem
1-H2(12+a)=12ln2l=1(2a)2ll(2l-1)=O(a2).

4.2 Memory Lower Bounds for Uniformity Testing

In this section, we prove our memory lower bounds as described by the following theorem: {theorem} Let 𝒜 be an algorithm which tests if a distribution p is uniform versus ϵ-far from uniform with error probability 1/3, can access the samples in a single-pass streaming fashion using m bits of memory and k samples, then km=Ω(nϵ2). Furthermore, if k<n9/10 and mk2/n0.9, then km=Ω(nlognϵ4). {remark} This result should hold (with worse constants) if the above bounds on m and k are replaced by any bounds of the form mk2/n1-c and kn1-c for any constant c>0.

In this rest of this section, we prove Theorem 4.2. To do so, we use the adversary method. Let X be a uniformly random bit. Based on X, the adversary chooses the distribution p on [2n] bins as follows:

  • X=0: Pick p=U2n.

  • X=1: Pair the bins as {1,2},{3,4},,{2n-1,2n}. Now on each pair {2i-1,2i} pick a random Yi{±1} to pick:

    (p2i-1,p2i)={(1+ϵ2n,1-ϵ2n),Yi=1(1-ϵ2n,1+ϵ2n),Yi=-1

In either case, we can think of the output of p as being a pair (C,V), where C is an element of [n] is chosen uniformly, and V{0,1} is a fair coin if X=0 and has bias ϵYC if X=1.

Let s1,,sk be the observed samples from p. Let Mt denote the bits stored in the memory after the algorithm sees the t-th sample st.

Since the algorithm learns X with probability 2/3 after viewing k samples, we know that I(X;Mk)>Ω(1). On the other hand, Mt is computed from (Mt-1,st) without using any information about X 33 3 Note that we can use deterministic operations and possibly random bits, which however cannot be correlated with the random variable X since st is by definition the only sample from the distribution that is drawn between the memory states Mt-1 and Mt.. More formally, XMt|(Mt-1,st) and therefore we can use the data processing inequality (Lemma 4.1) to get:

I(X;Mt)I(X;Mt-1,st)=I(X;Mt-1)+I(X;st|Mt-1).

Our basic technique will be to bound I(X;st|Mt-1). This will give us an upper bound on I(X;Mk) via telescoping.

The sample s corresponds to which pair of bins was picked and within that pair which one of two bins was picked, that is s=(C,V). V is a random variable taking values in {0,1}.

Since irrespective of X, C is uniform over the pairs of bins, we note that C is independent of X even when conditioned on the memory M.

Thus,

I(X;st|Mt-1)=I(X;CtVt|Mt-1)=I(X;Vt|Mt-1Ct).

Let αt-1=𝐏𝐫[X=1|Mt-1Ct] and thus 𝐏𝐫[X=0|Mt-1Ct]=1-αt-1.

We have that

𝐏𝐫[Vt=0|X=0,Mt-1,Ct] =12
𝐏𝐫[Vt=0|X=1,Mt-1,Ct] =1+ϵ𝐄[YCt|Mt-1]2
𝐏𝐫[Vt=0|Mt-1,Ct] =(1-αt-1)12+αt-11+ϵ𝐄[YCt|Mt-1]2=12+αt-1ϵ𝐄[YCt|Mt-1]2.

We can calculate

I(X;Vt|Mt-1Ct)=H(Vt|Mt-1Ct)-H(Vt|Mt-1CtX)
=H2(𝐏𝐫[Vt=0|Mt-1,Ct])-{𝐏𝐫[X=1|Mt-1Ct]H2(𝐏𝐫[Vt=0|X=1,Mt-1,Ct])
+𝐏𝐫[X=0|Mt-1Ct]H2(𝐏𝐫[Vt=0|X=0,Mt-1,Ct])}
=H2(12+αt-1ϵ𝐄[YCt|Mt-1]2)-αt-1H2(12+ϵ𝐄[YCt|Mt-1]2)-(1-αt-1)H2(12)
=αt-1[1-H2(12+ϵ𝐄[YCt|Mt-1]2)]-[1-H2(12+αt-1ϵ𝐄[YCt|Mt-1]2)].

Thus, using Fact 4.1 we have,

I(X;Vt|Mt-1Ct) =Θ(1)αt-1(1-αt-1)ϵ2𝐄[YCt|Mt-1]2
O(1)ϵ2𝐄[YCt|Mt-1]2.

Since Ct is uniformly random, we have that

I(X;Vt|Mt-1Ct)=1nj=1nO(1)ϵ2𝐄[Yj|Mt-1]2. (5)

We begin by proving a relatively straightforward unconditional bound on this sum using the fact that Mt-1 has only m bits of information. {lemma} We have that j=1n𝐄[Yj|Mt-1]2=O(m). {proof} First we notice that since H(Mt-1)m that I(Y1Yn;Mt-1)m, and thus that H(Y1Yn|Mt-1)=H(Y1Yn)-I(Y1Yn;Mt-1)n-m. On the other hand, we have that

i=1nH(Yi|Mt-1)H(Y1Yn|Mt-1)n-m.

Thus,

mi=1n[1-H(Yi|Mt-1)]=Θ(i=1n𝐄[Yi|Mt-1]2),

where the equality comes from Fact 4.1 and the fact that if 𝐏𝐫[Yi=1|Mt-1]=12+α, then 𝐄[Yi|Mt-1]=𝐏𝐫[Yi=1|Mt-1](+1)+𝐏𝐫[Yi=-1|Mt-1](-1)=(12+α)-(12-α)=2α. This completes our proof.

Lemma 4.2 will be enough to prove our weaker lower bound. To get the stronger one we will need a more in depth analysis. In particular, we let rj=𝐄[#{1it-1:Ci=j}|Mt-1]. We first show that 𝐄[Yj|Mt-1]=O(ϵrj) (see Claim 4.2 below). This leaves us with the task of bounding r2. For this, we note that if w=r/r2, then r is only going to be large if, conditioned on Mt-1, many of the Ci will lie on components where w is large. However the sum of wCi is a sum of independent random variables, so by an appropriate Chernoff bound, we can show that it is likely not too much larger than its mean. However, since Mt-1 only encodes m bits of information, it can only correspond to an event whose likelihood is exponentially small in m, and this will give us our bound on r2.

We will now show the following claim:

Claim \thetheorem

We have that |E[Yj|Mt-1]|=O(ϵrj).

{proof}

It suffices to show that

|𝐄[Yj|s1,,st-1,X]|=O(ϵ#{1it-1:Ci=j}),

as our final result will follow by averaging over the si and X conditioned on Mt-1.

If X=0, this is trivial since in this case the si convey no information about Yj so the expectation of Yj is 0.

If X=1, it is not hard to see by Bayes’ Theorem that if n1 is the number of times when si=(j,0) and n2 the number of times si=(j,1), then the expectation of Yj conditioned on X and the si is

(1+ϵ)n1(1-ϵ)n2-(1-ϵ)n1(1+ϵ)n2(1+ϵ)n1(1-ϵ)n2+(1-ϵ)n1(1+ϵ)n2=O(ϵ(n1+n2)),

and our result follows.

Since Ct is uniform independent of Mt-1, we have that

𝐄[YCt|Mt-1]2=O(ϵ21nj=1nrj2)=O(ϵ2r22n).

Therefore, we get that:

I(X;Vt|Mt-1Ct)O(1)ϵ4r22n. (6)

Typical Memory States

Consider a fixed algorithm 𝒜. Call a memory state M typical for time step t if the following hold:

  • 𝐏𝐫(Mt=M)>e-m.

  • The corresponding vector r has r30.

We will need the first condition to ensure that Mt does not encode events that are too unlikely, and we will need the second to bound the maximum size of individual contributions for our Chernoff bound. Fortunately, both of these events happen with high probability as we see below:

Claim \thetheorem

Assuming tn9/10 and m is bigger than a sufficiently large multiple of log(n), we have that Mt is typical for time t with probability at least 1-1/n.

{proof}

First, we deal with the probability that Mt violates the first condition, in particular that it is a transcript that shows up with probability at most e-m. For this, we note that there are at most 2m such transcripts, each occurring with probability at most e-m, and so the total probability that any of them occur is at most (2/e)m<1/(2n).

Next, we deal with the probability that Mt violates the second condition. In particular, we bound the probability that there exists a j so that the expected number of Ci equal to j for 1it conditioned on Mt is at least 30. For this we, note that for any particular j, the expectation of max(0,#{1it:Ci=j}-20) over sets of samples is at most n-2. Therefore, the expectation over transcripts of max(0,rj-20) is at most n-2. Our result now follows by a Markov inequality and union bound over j.

We are now ready to prove an upper bound on N=r22 in the following lemma:

{lemma}

For the fixed transcript A typical for time t, with mt2/n0.9, we have

N=r2=O(mlogn).
{proof}

Let w=r/N. Note that r2=rw, and that w2=1.

Let X=wC1t. These are i.i.d random variables taking values in [0,30N] with mean 𝐄[X]=1ni=1nwi1n, since w2=1.

Define X:==1twC and note that μ=𝐄[X]tn. We have that

N =𝐄[=1twC|Mt=A]=1𝐏𝐫(Mt=A)a=0𝐏𝐫[X>a,Mt=A]da
=1𝐏𝐫[Mt=A]a=0N/2𝐏𝐫[X>a,Mt=A]da+1𝐏𝐫[Mt=A]a=N/2𝐏𝐫[X>a,Mt=A]da
1𝐏𝐫[Mt=A]a=0N/2𝐏𝐫[Mt=A]da+1𝐏𝐫[Mt=A]a=N/2𝐏𝐫[X>a]da
=N2+1𝐏𝐫[Mt=A]a=N/2𝐏𝐫[X>a]da.

Thus, we have

N2ema=N/2𝐏𝐫[X>a]da.

Now let us bound the tail 𝐏𝐫[X>a]. We have 𝐏𝐫[X>a]=𝐏𝐫[Nx>Na]. We would like to show that Nmlogn. Thus, we can assume that N>4tn0.49 else we already have Nmlogn. Hence, we can assume that a>2tn0.49 in the above integral. We write a=(1+δ)μ and apply the Chernoff bound on the random variable N30X (note that this is a sum of i.i.d random variables supported in [0,1]) to get:

𝐏𝐫[X>a]=𝐏𝐫[N30X>N30(1+δ)μ]<eδNμ30(1+δ)(1+δ)Nμ/30e-140Nalog(1+δ).

We have 1+δ=aμ>Nn30t>n1/200. Thus, for aN2 we have

Pr(X>a)e-αNalogn,

for some constant α>0. Substituting in the above integral gives:

N2 ema=N/2𝐏𝐫[X>a]daema=N/2e-αNalognda=1αNlognem-αN2logn/2.

Thus, we have for some constant α:

αN2logn2em-αN2logn/2.

Since m1, the equation θem-θ can have a solution only when θm. That is αN2logn2m, this gives us the required bound r2=N2/αmlogn=O(mlogn).

{proof}

[Proof of Theorem 4.2] Using Equation (5) and Lemma 4.2, we get that

I(X;Vt|Mt-1Ct)O(1)ϵ2mn.

However, we know that:

Ω(1)I(Mk;X) =t=0k-1I(Mt+1;X)-I(Mt;X)
=t=0k-1I(Mt,St+1;X)-I(Mt;X)
=t=0k-1I(St+1;X|Mt)
=t=0k-1I(Vt+1;X|Mt,Ct+1)
=O(1)kϵ2mn.

This implies that km=Ω(nϵ2).

Under our stronger assumptions, we can instead use Lemma 4.2 to similarly obtain:

Ω(1)I(Mk;X) =t=0k-1I(Vt+1;X|Mt,Ct+1)
=O(1)kϵ4mnlogn+O(k/n).

The last line here comes from the fact that I(Vt+1;X|Mt,Ct+1)=O(ϵ4m/nlog(n)) for typical transcripts Mt and the fact that Mt is typical except with probability 1/n.

Thus, equivalently, we have km=Ω(nlognϵ4), completing the proof.

4.3 Communication Lower Bounds for Uniformity Testing

In this section, we will show a communication lower bound for distributed uniformity testing algorithms in our one-pass communication model, via a reduction to the streaming/limited memory setting. In particular, we show the following theorem:

{theorem}

Suppose that there exists a communication protocol with a transcript of length |T| bits, for the setting where each machine holds samples, that can distinguish between a uniform distribution and one that is ϵ-far from uniform in total variation distance. Then there exists a streaming algorithm that uses at most m=|T|+logn bits of memory and has access to a stream of at most s=|T| samples.

{proof}

We will simulate the protocol by storing in memory the entire communication transcript up to any given point in the simulated protocol, while having some additional space in order to temporarily store the samples of a single player (machine) at a time.

In particular, we consider the stream of t samples, where t is the number of players that participate, that is created by taking the samples of the first player to speak and iteratively appending the samples of the next player to speak until there are no more players. We also use the memory to remember the communication transcript so far at any given point and the samples of the player speaking in that round so that the algorithm is able to compute the bits that the players send.

Therefore, during any given round i of the communication protocol, the partial transcript Ti-1 of the communication in the first i-1 rounds is stored in memory along with the samples of the player that is about to speak in round i. Note that, since the referee is not going to ask any questions again to that particular player, the exact samples of that player are no longer useful to the algorithm after the current round ends. Therefore, those logn bits of additional memory can be reused while simulating the next round. However, the algorithm will use the bits transmitted by that player along with the current partial transcript Ti-1, to create the new partial transcript Ti.

Observe that every player has to send at least 1 bit, since otherwise we can assume that they did not participate in the protocol. Therefore, we have that t|T| and consequently our stream will have at most s=|T| samples.

Furthermore, the transcript that is created after the last player speaks is the one that contains the most information among the partial transcripts which are all optimally compressed. Thus, we have that i:|Ti||T|, where T=Tt is the transcript of the entire communication. This means that no more than m=|T|+logn bits of memory are needed at any given point of the execution.

Using the above theorem, we can prove the following two corollaries:

{corollary}

Let π be a distributed communication protocol, for the setting where each machine holds samples, which tests if a distribution p is uniform versus ϵ-far from uniform with error probability 1/3, and the referee asks questions to each player only once. Then, π must involve Ω(n/ϵ) bits of communication for any =O(n1/3ϵ4/3(logn)1/3). Furthermore, π must involve Ω(n/ϵ2logn) bits of communication for any =O(ϵ4/3n0.3). {proof} Suppose, for the sake of contradiction, that there exists such a protocol that uses t=Θ(n/ϵ2logn) bits of communication with a sufficiently small implied constant. Then, using theorem 4.3, we conclude that there also exists an streaming algorithm that uses a stream of size st samples and a memory of size m=Θ(t+log(n))=Θ(t) bits. Furthermore, we have that

s2/(mn0.99)=Θ(t2/n0.99)(3/2/(n0.45ϵ2))1.

Therefore, Theorem 4.2 applies and we must have ms=Ω(nlog(n)/ϵ4), but ms=O(t2), which is a sufficiently small multiple of nlog(n)/ϵ4, that it yields a contradiction.

Note that for any =O(n1/3ϵ4/3(logn)1/3), it still holds that m=Θ(t+log(n))=Θ(t). We will combine now Theorem 4.3 with the weaker version of Theorem 4.2, and assume that there exists such a protocol that uses t=Θ(n/ϵ) bits of communication with a sufficiently small implied constant. In this case, we must have ms=Ω(n/ϵ2), but ms=O((t)2), which is a sufficiently small multiple of n/ϵ2, that it yields a contradiction.

Furthermore, if we have a restricted number of samples, we can get better communication lower bounds: {corollary} Let π be a distributed communication protocol, for the setting where each machine holds samples with a total of t machines, which tests if a distribution p is uniform versus ϵ-far from uniform with error probability 1/3, and the referee asks questions to each player only once. Then, if t=O(n/ϵ2logn) and t=O(n0.6/ϵ4/3), π must involve Ω(nlog(n)ϵ4t) bits of communication. {proof} Again we use Theorem 4.3. We now have a streaming algorithm using k=t samples and m=|π|+log(n) memory. We claim that this is impossible even if |π|=p=Θ(nlog(n)ϵ4t) with the implied constant sufficiently small. In fact, this in case we have that m=O(p). We have that mn0.9k=Θ(n1.9log(n)/ϵ4)>k3, and so the strong version of Theorem 4.2 applies. This means that mk=Ω(nlog(n)/ϵ4), when in reality it is too small a constant times this, yielding a contradiction.

A case of particular interest for the above is when t=O(n/ϵ2) is the information-theoretically optimal number of samples. In this case (so long as =O(log(n))) our communication must be at least Ω(nlog(n)/ϵ2), which is not much better than sending all of the samples directly.

Finally, the following corollary gives a communication complexity lower bound for all values of Ω(n1/3ϵ4/3(logn)1/3)O(nϵ2) using the weaker version of Theorem 4.2. {corollary} Let π be a distributed communication protocol, for the setting where each machine holds samples, which tests if a distribution p is uniform versus ϵ-far from uniform with error probability 1/3, and the referee asks questions to each player only once. Then, π must involve Ω(n2ϵ2logn) bits of communication for any Ω(n1/3ϵ4/3(logn)1/3)O(nϵ2). {proof} Suppose, for the sake of contradiction, that there exists such a protocol that uses t=Θ(n2ϵ2logn) bits of communication with a sufficiently small implied constant. Then, using Theorem 4.3, we conclude that there also exists an streaming algorithm that uses a stream of size s samples and a memory of size m bits, such that mk=Θ(t2logn), since k=t and m=Θ(logn) for this range of values for .

This means that mk=Ω(t2logn)=Ω(nϵ2), when in reality it is too small a constant times this, yielding a contradiction due to Theorem 4.2.

5 Communication and Memory Efficient Closeness Testing

In this section, we design our protocols for closeness testing. We start with the setting of memory and then give our communication efficient protocols.

5.1 Memory Efficient Closeness Testing

In this section, we provide an algorithm for closeness testing in the streaming model that uses O(nmϵ2) samples and O(mlog(n)) bits of memory where m is a parameter such that

min(n,n2/3/ϵ4/3)m1.

By reparametrizing, this implies an algorithm with min(nlog(n),n2/3log(n)/ϵ4/3)mlog(n) bits of memory and O(nlog(n)mϵ2) samples. However, we are going to use the former parametrization assuming an upper bound of O(m) words of memory (each of length O(logn) bits), as it will be more convenient for us, so we will use that.

The performance of the algorithm is described in the following theorem: {theorem} Let p,q be two discrete distributions on [n]. Suppose that min(n,n2/3/ϵ4/3)m1. Then there exists a single pass streaming algorithm that uses at most mlogn bits of memory and O(nmϵ2) samples from p and q, and distinguishes between the cases that p=q versus p-q1ϵ with probability at least 2/3.

The algorithm is given in pseudo-code below:

Algorithm Test-Closeness-Memory(p,q,n,m,ϵ) Input: Sample access to distributions p,q over [n], memory bound m, and ϵ>0. Output: “YES” if p=q; “NO” if p-q1ϵ. 1. Draw O(m) samples from p and q to flatten them to p,q such that p2,q2O(1m). Let [n] be the new domain. 2. Apply a hash map h to p,q. This hash map h:[n][m] approximately preserves p-q2 and p2 with constant probability. 3. Use a standard 2 tester to distinguish between h(p)=h(q) and h(p)-h(q)2ϵn.

This section is devoted to the proof of Theorem 5.1.

Flatten p, and q.

Our algorithm begins by taking (and storing) m samples from each of p and q. We use these samples to produce the split distributions p and q whereby the ith bin is split into ai equal sub-bins where ai is one more than the number of copies of i in this set of samples. We note the following important facts from [Diakonikolas and Kane(2016)]:

  • Given the list of samples, one can simulate a sample from p (resp. q) from a single sample of p (resp. q) and some additional randomness.

  • p-q1=p-q1.

  • p2,q2=O(1/m) with at least 9/10 probability.

Our analysis from here on out will be under the assumption that the last property holds.

Hash p and q.

Our next step is to pick a hash map h:[n][m] (where n is the size of the domain of p and q) from a 4-wise independent family (note that for an appropriate family we can store h using O(mlog(n)) bits). We claim that with at least 9/10 probability that h(p)2,h(q)2 are not too big and that h(p)-h(q)2h(p)-h(q)2.

In particular, we show the following lemma: {lemma} We have the following:

𝐄h[h(p)-h(q)22] =(1-1m)p-q22
𝐕𝐚𝐫h[h(p)-h(q)22] =1m(1-1m)[p-q24-p-q44]
𝐄h[h(p)22] =1m+(1-1m)p22
𝐕𝐚𝐫h[h(p)22] =1m(1-1m)[p24-p44]
{proof}

We can write:

h(p)-h(q)22 =i[m][j[n](pj-qj)I{h(j)=i}]2
=i[m]j1,j2[n](pj1-qj1)(pj2-qj2)I{h(j1)=h(j2)=i}
=j1,j2[n](pj1-qj1)(pj2-qj2)I{h(j1)=h(j2)}
=p-q22+j1j2[n](pj1-qj1)(pj2-qj2)I{h(j1)=h(j2)}.

We therefore have that:

𝐄h[h(p)-h(q)22] =p-q22+1mj1j2[n](pj1-qj1)(pj2-qj2)
=p-q22+1mj1[n][(pj1-qj1)j2j1[n](pj2-qj2)]
=p-q22-1mj[n](pj-qj)2=(1-1m)p-q22,

and

𝐕𝐚𝐫h[h(p)-h(q)22] =j1j2[n]𝐕𝐚𝐫h[(pj1-qj1)(pj2-qj2)I{h(j1)=h(j2)}]
=j1j2[n](pj1-qj1)2(pj2-qj2)21m(1-1m)
=1m(1-1m)j1[n](pj1-qj1)2[j2j1[n](pj2-qj2)2]
=1m(1-1m)j1[n](pj1-qj1)2[p-q22-(pj1-qj1)2]
=1m(1-1m)[p-q24-p-q44].

The other two identities follow similarly.

By Chebyshev’s inequality, we have that the following statements hold true with 90% probability:

h(p)-h(q)22 12p-q22
h(p)22 1m+32p22=O(1/m).

We now have to distinguish between a completeness case where p=q and thus h(p)=h(q) and a soundness case where p-q1=p-q1ϵ, and therefore h(p)-h(q)2p-q2ϵn.

We will also need the following lemma from [Chan et al.(2014)Chan, Diakonikolas, Valiant, and Valiant]: {lemma} Let p,q be distributions such that max(p2,q2)b, then there is an estimator that takes O(bϵ2) samples from p,q and estimates p-q2 up to an error of ϵ.

{proof}

[Proof of Theorem 5.1] Using the tester from Lemma 5.1 for ϵ=ϵn and given that b=1/m, we can distinguish between h(p)=h(q) and h(p)-h(q)2ϵn in O(nmϵ2) samples from h(p) and h(q) (which can be simulated given samples from p and q). We note also that this tester only needs to know the number of samples from each of h(p) and h(q) that landed in each bin, and can thus be simulated in a streaming algorithm with O(mlog(n)) memory by keeping a running total of the number of samples from each bin. This establishes the correctness of our algorithm.

As far as memory usage is concerned, the algorithm uses a total of O(mlog(n)) bits of memory for each of its steps. In particular, these steps are: recording a set of samples for flattening, storing the seed of the hash function h, and storing the counts of the number of samples from h(p),h(q) from each bin. Thus, the total memory usage is O(mlog(n)) bits.

5.2 Communication Efficient Algorithm for Distributed Closeness Testing

We use a somewhat different algorithm for the communication version of this problem. The basic idea is that while our memory algorithm compared h(p) to h(q) for some hash function h, our algorithm here will compare the conditional distribution (p|W) to (q|W), for some randomly chosen subset W of our domain. After applying some flattening, we can ensure that with high probability the difference between p and q on W approximates the difference of p and q on the whole domain. Since W is small, we will need fewer samples to test closeness on W. Of course, we cannot make W too small, as then we will need to query too many machines before even finding a sample from W. This is balanced our when |W|n/(log(n)), so that one out of every log(n) machines should have a sample (which it communicates in log(n) bits), while the other log(n) machines need one sample each to tell us that they have no samples from W.

{theorem}

Suppose that =O(nϵ4/log(n)). Then there exists an algorithm that given distributed sample access to two distributions p and q over [n] distinguishes with probability 2/3 between the cases p=q and p-q1>ϵ using O(n2/3log1/3(n)2/3ϵ4/3) bits of communication.

The algorithm is given in pseudo-code below:

Algorithm Test-Closeness-Distributed(p,q,n,ϵ) Input: Each player has samples from each of p and q over [n], ϵ>0. Output: “YES” if p=q; “NO” if p-q1ϵ. 1. Let C be a sufficiently large constant. Abort the following algorithm if more than C2(nlogn)2/32/3ϵ4/3 bits of communication are ever used. 2. Draw N=C(logn)ϵ samples from p and q to flatten them to p and q. Let [n] be the new domain. 3. The referee picks a random subset W of [n] by selecting each element with probability r=1logn and broadcasts this set W to all the machines. 4. The referee asks M=Clog(n)|W|2/3/ϵ4/3 machines first if they have any samples from W, and if so for the list of these samples along with which distribution they are from. 5. Let m1 of the above samples be from p and m2 be from q. Unless |m1-m2|<Cm1 and |m1|>|W|C2/3/ϵ4/3 return “NO”. 6. Use the above samples to test ϵ/C1/2-closeness of two distributions on W and return the result.

This section is devoted to the proof of Theorem 5.2.

Using the same analysis for flattening as in [Diakonikolas and Kane(2016)] (see also Section 5.1) gives us that p,q satisfy p2,q21N and that p-q1=p-q1 with probability at least 99%. Note that p|W,q|W are non-normalized pseudo-distributions given by restrictions of p,q to W, and (p|W) and (q|W) (the corresponding conditional distributions) are their normalized distributions.

We also note that

𝐄[p-q22] =i|pi-qi|2𝐄[1/(ai+1)]
=O(i|pi-qi|2/(N(pi+qi)))=O(1/N)i|pi-qi|=O(p-q1/N).

Therefore, p-q22=O(p-q1/N) with 99% probability. We assume this and the above bounds on p2 and q2 throughout the rest.

Next, we analyze the sizes of W,p(W),q(W) and p|W-q|W1. In particular, we note that since W selects each element independently with probability 1/(log(n)), |W| has mean n/(log(n)) with a similar variance, and so |W|=Θ(n/(log(n))) with 99% probability (note that n=n+N=Θ(n)). The mean of p(W)=1/(log(n)) and the variance is p22r, therefore with 99% probability p(W)=Θ(nr), and similarly for q(W). Finally, we have that p|W-q|W1 has mean p-q1r=p-q1r and variance p-q22r=O(p-q1r/N). So by Chebyshev’s inequality, with 99% probability we have that if either p=q or p-q1ϵ, then p|W-q|W1=Θ(p-q1r).

We next consider the completeness and soundness cases, ignoring the possibility of the early abort.

Completeness

If p=q, then p(W)=q(W) and (p|W)=(q|W). We note that m1 and m2 each have average values p(W)M=Θ(rM)=Θ(M/log(n)) and variances less than their means. This implies that with at least 99% probability it holds |m1-m2|<Cm1 and |m1|>|W|C2/3/ϵ4/3. Additionally, since (p|W)=(q|W), we will pass the closeness test for these distributions.

Soundness

If p-q1>ϵ, we have that p|W-q|W1=Ω(ϵr). We note that this implies that either |p(W)-q(W)|>ϵr/C1/3 or (p|W)-(q|W)1>ϵ/C1/2. This is because

p|W-q|W1 =(p|W)p(W)-(q|W)q(W)1
(p|W)p(W)-(q|W)p(W)1+(q|W)p(W)-(q|W)q(W)1
=p(W)(p|W)-(q|W)1+|p(W)-q(W)|.

First, consider what happens if |p(W)-q(W)|>ϵr/C1/3. We notice that m1 and m2 have means of Mp(W)=Θ(M/log(n)) and Mq(W)=Θ(M/log(n)), respectively, with variances on the order of their means. Now if |p(W)-q(W)|>ϵr/C1/3, the means of m1 and m2 will differ by Ω(Mϵ/log(n)C1/3)=Ω(C2/3n2/3/(2/3ϵ1/3)), while the variance is O(Cn2/3/(2/3ϵ4/3)). Since the difference of the means is much bigger than both the square root of the mean of m1 and the square root of the variance, |m1-m2| will be bigger than Cm1 with 99% probability.

On the other hand. if (p|W)-(q|W)1>ϵ/C1/2, our closeness tester in the last step will fail.

Communication Complexity

Here we show that the communication complexity of the algorithm is within the desired bounds, and that we have enough samples to perform the test in the last step. Firstly, we note that the N samples in the first step requires only Nlog(n) communication, which is well within our bounds. The other major step requires asking M machines. It takes only O(M) communication for each machine to report whether or not they have a sample, and we have an average of M(p(W)+q(W)) samples that take O(log(n)) bits each. This is at most O(Mlog(n)r)=O(C|W|2/3/ϵ4/3)=O(Cn2/3/(2/3log2/3(n)ϵ4/3)) samples, for an appropriate number of bits.

Finally, we note that for the last step since |W|=n/(log(n))ϵ-4, our tester only requires O(|W|2/3/ϵ4/3) samples, which are available. This completes the proof of Theorem 5.2.

6 Conclusions and Future Directions

This work gave algorithms and lower bounds, in some cases matching, for distribution testing with communication and memory constraints. Our work is a first step in these directions and suggests a host of interesting open problems. More concretely:

Communication Lower Bounds without One-pass Assumption

Our current techniques for proving communication lower bounds seem to depend fairly strongly on the one-pass assumption. In particular, when bounding the information learned by the tth sample, it is critical for us to know that the information that we have from the current transcript is independent of that sample. Unfortunately, it is not clear how to get around that obstacle, and without it we have only the trivial lower bound of Ω(n/(ϵ2)).

Multi-pass Streaming Models

Another interesting open problem would be to consider multiple pass streaming models. For the reasons outlined above, it seems like our lower bounds would be difficult to generalize to even a two-pass streaming model. This leads to the interesting question of whether or not there are better algorithms in this model. At the very least, it is easy to see that the standard uniformity and closeness testers can be implemented with optimal sample complexity, O(log(n/ϵ)) memory, and n passes over the data. What can be done with an intermediate number of passes?

Communication Lower Bounds for Closeness Testing

We would like to show communication lower bounds for closeness testing that are not implied by our uniformity testing lower bounds and the general sample complexity lower bounds. Our current adversary method is not sufficient for this task, as the testers that we have can distinguish our adversarial distributions from uniform in a small number of samples. In order to prove good closeness lower bounds, a more complicated adversary is necessary, and it is unclear how to combine this adversary with our information-theoretic arguments. It would even be interesting to make progress in this question for the case of constant ϵ.

Extending Ranges of Validity

An immediate open question is to extend the range of validity of many of our bounds. Both our algorithms and lower bounds only work for constrained ranges of parameters in ways that do not allow us to adequately cover the whole space of parameters. It would be interesting to see if this dependence could be removed. Another interesting parameter range would be to see if there are any streaming algorithms at all with o(log(n)) memory.

Instance-Optimal/Adaptive Testing

[Valiant and Valiant(2014)] showed that testing identity to distributions other than the uniform distribution can often be done with better sample complexity in the centralized setting. It would be interesting to see what sort of analogue of this result can be obtained in our models. An analogous question can be asked for the adaptive closeness tester of [Diakonikolas and Kane(2016)].

\acks

I.D. is supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship. This research was performed while T.G. was a postdoctoral researcher at USC, supported by I.D.’s startup grant. D.K. is supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship. S.R. would like to thank Dheeraj P.V. for helpful discussions.

References

  • [Acharya et al.(2015)Acharya, Daskalakis, and Kamath] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3591–3599, 2015.
  • [Acharya et al.(2018a)Acharya, Canonne, and Tyagi] J. Acharya, C. L. Canonne, and H. Tyagi. Distributed simulation and distributed inference. CoRR, abs/1804.06952, 2018a. URL http://arxiv.org/abs/1804.06952.
  • [Acharya et al.(2018b)Acharya, Canonne, and Tyagi] J. Acharya, C. L. Canonne, and H. Tyagi. Inference under information constraints I: lower bounds from chi-square contraction. CoRR, abs/1812.11476, 2018b. URL http://arxiv.org/abs/1812.11476.
  • [Ahlswede and Csiszar(1986)] R. Ahlswede and I. Csiszar. Hypothesis testing with communication constraints. IEEE Trans. Inf. Theor., 32(4):533–542, September 1986.
  • [Andoni et al.(2018)Andoni, Malkin, and Nosatzki] A. Andoni, T. Malkin, and N. S. Nosatzki. Two party distribution testing: Communication and security. CoRR, abs/1811.04065, 2018. URL http://arxiv.org/abs/1811.04065.
  • [Batu and Canonne(2017)] T. Batu and C. L. Canonne. Generalized uniformity testing. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 880–889, 2017.
  • [Batu et al.(2000)Batu, Fortnow, Rubinfeld, Smith, and White] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000. URL citeseer.ist.psu.edu/batu00testing.html.
  • [Batu et al.(2004)Batu, Kumar, and Rubinfeld] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381–390, 2004.
  • [Braverman et al.(2016)Braverman, Garg, Ma, Nguyen, and Woodruff] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing, STOC 2016, pages 1011–1020, 2016.
  • [Canonne(2015)] C. L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.
  • [Canonne et al.(2016)Canonne, Diakonikolas, Gouleakis, and Rubinfeld] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, pages 25:1–25:14, 2016.
  • [Canonne et al.(2017a)Canonne, Diakonikolas, Kane, and Stewart] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing bayesian networks. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 370–448, 2017a.
  • [Canonne et al.(2017b)Canonne, Diakonikolas, Kane, and Stewart] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing conditional independence of discrete distributions. CoRR, abs/1711.11560, 2017b. URL http://arxiv.org/abs/1711.11560. In STOC’18.
  • [Canonne et al.(2018)Canonne, Diakonikolas, and Stewart] C. L. Canonne, I. Diakonikolas, and A. Stewart. Testing for families of distributions via the fourier transform. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 10084–10095, 2018.
  • [Chan et al.(2014)Chan, Diakonikolas, Valiant, and Valiant] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193–1203, 2014.
  • [Chien et al.(2010)Chien, Ligett, and McGregor] S. Chien, K. Ligett, and A. McGregor. Space-efficient estimation of robust statistics and distribution testing. In Innovations in Computer Science - ICS 2010, pages 251–265, 2010.
  • [Cover(1969)] T. M. Cover. Hypothesis testing with finite statistics. Ann. Math. Statist., 40(3):828–835, 06 1969.
  • [Daskalakis and Pan(2017)] C. Daskalakis and Q. Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 697–703, 2017.
  • [Daskalakis et al.(2013)Daskalakis, Diakonikolas, Servedio, Valiant, and Valiant] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833–1852, 2013.
  • [Daskalakis et al.(2018)Daskalakis, Dikkala, and Kamath] C. Daskalakis, N. Dikkala, and G. Kamath. Testing Ising models. In SODA, 2018.
  • [Devroye and Györfi(1985)] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. John Wiley & Sons, 1985.
  • [Devroye and Lugosi(2001)] L. Devroye and G. Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics, Springer, 2001.
  • [Diakonikolas and Kane(2016)] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In FOCS, pages 685–694, 2016. Full version available at abs/1601.05557.
  • [Diakonikolas et al.(2015a)Diakonikolas, Kane, and Nikishkin] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841–1854, 2015a.
  • [Diakonikolas et al.(2015b)Diakonikolas, Kane, and Nikishkin] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183–1202, 2015b.
  • [Diakonikolas et al.(2016)Diakonikolas, Gouleakis, Peebles, and Price] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.
  • [Diakonikolas et al.(2017a)Diakonikolas, Gouleakis, Peebles, and Price] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. CoRR, abs/1708.02728, 2017a.
  • [Diakonikolas et al.(2017b)Diakonikolas, Grigorescu, Li, Natarajan, Onak, and Schmidt] I. Diakonikolas, E. Grigorescu, J. Li, A. Natarajan, K. Onak, and L. Schmidt. Communication-efficient distributed learning of discrete distributions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 6394–6404, 2017b.
  • [Diakonikolas et al.(2017c)Diakonikolas, Kane, and Nikishkin] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Near-optimal closeness testing of discrete histogram distributions. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, pages 8:1–8:15, 2017c.
  • [Diakonikolas et al.(2018)Diakonikolas, Kane, and Stewart] I. Diakonikolas, D. M. Kane, and A. Stewart. Sharp bounds for generalized uniformity testing. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 6204–6213, 2018.
  • [Fischer et al.(2018)Fischer, Meir, and Oshman] O. Fischer, U. Meir, and R. Oshman. Distributed uniformity testing. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, PODC ’18, pages 455–464, New York, NY, USA, 2018. ACM.
  • [Garg et al.(2014)Garg, Ma, and Nguyen] A. Garg, T. Ma, and H. Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems (NIPS), pages 2726–2734, 2014.
  • [Goldreich(2016)] O. Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. ECCC, 23, 2016.
  • [Goldreich(2017)] O. Goldreich. Introduction to Property Testing. Forthcoming, 2017. URL http://www.wisdom.weizmann.ac.il/~oded/pt-intro.html.
  • [Goldreich and Ron(2000)] O. Goldreich and D. Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity, 2000.
  • [Han et al.(2018a)Han, Mukherjee, Özgür, and Weissman] Y. Han, P. Mukherjee, A. Özgür, and T. Weissman. Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, pages 506–510, 2018a.
  • [Han et al.(2018b)Han, Özgür, and Weissman] Y. Han, A. Özgür, and T. Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, COLT 2018, pages 3163–3188, 2018b. URL http://proceedings.mlr.press/v75/han18a.html.
  • [Jordan et al.(2016)Jordan, Lee, and Yang] M. I. Jordan, J. D. Lee, and Y. Yang. Communication-efficient distributed statistical learning. CoRR, abs/1605.07689, 2016.
  • [Kannan et al.(2014)Kannan, Vempala, and Woodruff] R. Kannan, S. Vempala, and D. Woodruff. Principal component analysis and higher correlations for distributed data. In Conference on Learning Theory, pages 1040–1057, 2014.
  • [Lehmann and Romano(2005)] E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer Texts in Statistics. Springer, 2005.
  • [Liang et al.(2014)Liang, Balcan, Kanchanapally, and Woodruff] Y. Liang, M. F. Balcan, V. Kanchanapally, and D. Woodruff. Improved distributed principal component analysis. In Advances in Neural Information Processing Systems (NIPS), pages 3113–3121, 2014.
  • [Neyman and Pearson(1933)] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933. doi: 10.1098/rsta.1933.0009. URL http://rsta.royalsocietypublishing.org/content/231/694-706/289.short.
  • [Paninski(2008)] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750–4755, 2008.
  • [Rubinfeld(2012)] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.
  • [Tsybakov(2008)] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Publishing Company, Incorporated, 2008.
  • [Valiant and Valiant(2014)] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.
  • [Valiant(2011)] P. Valiant. Testing symmetric properties of distributions. SIAM J. Comput., 40(6):1927–1968, 2011.
  • [Zhang et al.(2013)Zhang, Duchi, Jordan, and Wainwright] Y. Zhang, J. Duchi, M. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems (NIPS), pages 2328–2336, 2013.
  • [Zhu and Lafferty(2018)] Y. Zhu and J. Lafferty. Distributed nonparametric regression under communication constraints. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 6004–6012, 2018.