Life After Bootstrap: Residual Randomization Inference in Regression Models

  • 2019-08-12 16:09:15
  • Panos Toulis
  • 8

Abstract

We develop a randomization-based method for inference in regression models.The basis of inference is an invariance assumption on the regression errors,such as invariance to permutations or random signs. To test significance, therandomization method repeatedly calculates a suitable test statistic overtransformations of the regression residuals according to the invariant.Inversion of the test can produce confidence intervals. We prove generalconditions for asymptotic validity of this residual randomization test andillustrate in many models, including clustered errors with one-way or two-wayclustering structure. We also show that finite-sample validity is possibleunder a suitable construction, and illustrate with an exact test for a case ofthe Behrens-Fisher problem. The proposed method offers four main advantagesover the bootstrap: (1) it addresses the inference problem in a unified way,while bootstrap typically needs to be adapted to the task; (2) it can be morepowerful by exploiting a richer and more flexible set of invariances thanexchangeability; (3) it does not rely on asymptotic normality; and (4) it canbe valid in finite samples. In extensive empirical evaluations, including highdimensional regression and autocorrelated errors, the proposed method performsfavorably against many alternatives, including bootstrap variants andasymptotic robust error methods.

 

Quick Read (beta)

Life After Bootstrap:
Residual Randomization Inference in Regression Models

Panos Toulis11 1 email: [email protected].

University of Chicago
Booth School of Business
Abstract

We develop a randomization-based method for inference in regression models. The basis of inference is an invariance assumption on the regression errors, such as invariance to permutations or random signs. To test significance, the randomization method repeatedly calculates a suitable test statistic over transformations of the regression residuals according to the invariant. Inversion of the test can produce confidence intervals. We prove general conditions for asymptotic validity of this residual randomization test and illustrate in many models, including clustered errors with one-way or two-way clustering structure. We also show that finite-sample validity is possible under a suitable construction, and illustrate with an exact test for a case of the Behrens–Fisher problem. The proposed method offers four main advantages over the bootstrap: (1) it addresses the inference problem in a unified way, while bootstrap typically needs to be adapted to the task; (2) it can be more powerful by exploiting a richer and more flexible set of invariances than exchangeability; (3) it does not rely on asymptotic normality; and (4) it can be valid in finite samples. In extensive empirical evaluations, including high-dimensional regression and autocorrelated errors, the proposed method performs favorably against many alternatives, including bootstrap variants and asymptotic robust error methods.

Keywords: bootstrap, randomization, invariance, inferential primitive, exact inference.

1 Introduction

The bootstrap (efron1979bootstrap; efron1981nonparametric) is a widely popular method for nonparametric estimation of standard errors. In a regression model,

yi=xiβ+εi,i=1,,n, (1)

the residual bootstrap (freedman1981bootstrapping; efron1986bootstrap) uses an estimate β^ (e.g., from OLS) to bootstrap datasets {(yi,xi)}, where yi=xiβ^+ε^i, and (ε^1,,ε^n) is a sample with replacement from the residuals, ε^i=yi-xiβ^, or the centered residuals, ε^i-ε^¯. The residual bootstrap is more flexible and generally adapts better to the problem than other bootstrap alternatives22 2 Another approach is the pairs bootstrap (freedman1981bootstrapping), which generates {(yi,xi)} by sampling with replacement from the original data. Because pairs bootstrap does not condition on x, it does not adapt to the problem geometry well, and generally performs poorly in small samples (flachaire2005bootstrapping; mackinnon2002bootstrap). because it conditions on covariates, while its asymptotics are well understood (freedman1981bootstrapping; peters1984some; bickel1983bootstrapping).

To be valid asymptotically, the bootstrap requires some form of exchangeability of the errors. When such exchangeability does not hold, the method may have to be altered substantially. For instance, in regression models with clustered errors, the “cluster wild bootstrap” method (cameron2008bootstrap) resamples the entire cluster residuals with replacement, and switches their signs at random. Moreover, although it is not necessary in theory, bootstrap methods in practice rely on asymptotic normality, which is generally not plausible with small samples. This requires additional complex alterations, such as acceleration or refinement (diciccio1988review; diciccio1996bootstrap). In short, a bootstrap approach needs to adapt the problem to the method, rather than the method to the problem.

As a robust alternative for inference we propose the residual randomization method. To test a linear hypothesis on parameters, H0:aβ=a0, with ap,a0 fixed, the proposed method relies on the following two key ideas. First, we make an invariance assumption on the errors, such as ε=d𝚐ε, where 𝚐𝒢 and 𝒢 is an algebraic group of nn linear maps. For example, if 𝒢 is the set of all permutations, the assumption is equivalent to exchangeability of errors. Second, we choose a test statistic Tn, such that Tn=tn(ε) under the null hypothesis, and tn:n has an asymptotic distribution law. For example, Tn=na(β^-β) and tn(u)=na(XX)-1Xu satisfy this requirement, where β^ is the standard OLS estimator. Then, inference on β that is valid in finite samples is possible, in principle, by comparing the observed value of Tn against the set of alternative values {tn(𝚐ε):𝚐𝒢}.

In practice, however, this method is infeasible because the errors are generally unknown, except when we are testing a joint hypothesis on all β. The feasible method is then to compare Tn with respect to the set {tn(𝚐εo^):𝚐𝒢}, where εo^ are the residuals from OLS constrained under H0. The feasible method is no longer valid in finite samples, in general, but is asymptotically valid under certain conditions depending on the choice of 𝒢. An overarching sufficient condition for validity, which we prove in this paper (see Theorem 1), is that (1/n)XGX is similar to (1/n)XX, in the sense that (XX)-1XGXpbI, where GUnif(𝒢), b is scalar and I is the identity matrix. Intuitively, this ensures that the choice of 𝒢 does not alter substantially the information structure of the problem, so that tn(Gεo^) is sufficiently close to tn(Gε). Moreover, a construction of 𝒢 may be possible so that the aforementioned similarity condition holds exactly in the sample, leading to a randomization test of significance that achieves the nominal level for any finite n.

In related work, the idea of residual randomization was first developed in two seminal papers by freedman1983nonstochastic; freedman1983significance. Their key result was to show that the permutation distribution of a suitable test statistic converges to a standard distribution. freedman1983significance described their idea as the extension of Fisher’s randomization test (fisher1935design) to observational studies. Interestingly, Freedman did not carry out the extension to more complex 𝒢, but instead produced an important series of papers on bootstrap-based variations of his permutation inference method, including the residual bootstrap (bickel1981some; freedman1981bootstrapping; peters1984some; freedman1984bootstrapping). This work stimulated an important sequence of papers on permutation tests in linear regression models, which highlighted the asymptotic properties and power of residual randomization (anderson2001permutation; kennedy1995randomization; manly2018randomization; ter1992permutation). All of these methods, including the original work of freedman1983nonstochastic, also rely on residual transformations, but only consider exchangeability as the main invariance assumption, i.e., 𝒢 denotes only permutations. Furthermore, they do not analyze complex error structures, such clustered errors with one-way or two-way clustering, as we do in this paper.

Recently, randomization tests have received increased attention, especially in causal inference, due to their robustness and minimal assumptions (imbens2015causal; rosenbaum2002observational; lehmann2006testing; gerber2012field; athey2018exact; basse2019randomization; ding2016randomization; ding2017paradox; cattaneo2015randomization; canay2017randomization; canay2018wild; chernozhukov2017exact). In the context of clustered errors, canay2018wild have made the important connection between randomization testing and “wild bootstrap” (wu1986jackknife). Specifically, they proved asymptotic validity of the wild bootstrap when clusters grow in size but are finite in number, under the homogeneity condition that XX within each cluster is similar to the population matrix; that is, canay2018wild consider 𝒢 as the set of diagonal matrices with clustered random signs on the diagonal. In Section 4 of this paper, we extend this result and show that validity is possible even without the homogeneity condition, provided that the errors have some suitable of symmetry, such as exchangeability within clusters.

Our paper is organized as follows. Section 2 presents the details of the residual randomization method, and the key technical result for validity. In Section 3, we discuss simple error structures, and in Section 4 we consider clustered structures, including two-way clustering. For each particular error structure we derive theoretical conditions of asymptotic validity. Section 5 presents extensive empirical evaluations that show the benefits of the proposed randomization method, including a construction of an exact test for an instance of the famous Behrens–Fisher problem in Section 5.2.2. Finally, in Section 6 we discuss residual randomization with high-dimensional regression models and autocorrelated errors, showing notable empirical success in both cases.

2 The residual randomization method

We begin with the regression model of Equation (1), written compactly as

y=Xβ+ε, (2)

where y,ε are n-dimensional column vectors, X is the n×p matrix of covariates, and β is the p-dimensional vector of parameters; here, p is fixed and n>p. We focus on testing linear hypotheses on the regression parameters, β, of the following form:

H0:aβ=a0, (3)

where a=(a1,,ap)p and a0 are fixed and known. These hypotheses include the individual significance tests, H0,j:βj=0, for j=1,,p. The inversion of these tests can produce confidence intervals.

Our proposed method relies on two main components:

  1. (a)

    An inferential primitive, 𝒢, as an algebraic group of nn linear maps.

  2. (b)

    An invariant, tn:n, as a measurable function, such that

    tn(ε)=dtn(𝚐ε),for all𝚐𝒢,and any finiten. (4)

The idea is that if there exists a test statistic Tn, such that Tn=tn(ε) under H0, then the randomization test that compares Tn with the values {tn(𝚐ε):𝚐𝒢} is exact, i.e., valid for any finite n. In the context of linear regression, a natural approach is to define the test statistic as

Tn=n(aβ^-a0), (5)

where β^ is the standard OLS estimate of β, and the invariant as tn(u)=na(XX)-1Xu. Then, by standard OLS theory, Tn=tn(ε) under H0, as required. The corresponding p-value,

pval=P{tn(Gε)tTn=t},GUnif(𝒢),

is exact (lehmann2006testing, Theorem 15.2.3).

In practice, this test is infeasible because the errors are unknown, and so we rely on estimated residuals. The approach is then to calculate the constrained OLS estimator βo^, such that aβo^=a0, and then obtain the restricted residuals, εo^=y-Xβo^. The approximate p-value,

pval^=P{tn(Grεo^)tTn=t}, (6)

where GrUnif(𝒢)r=1,,R, with R fixed, defines a feasible randomization test that is also valid asymptotically under certain conditions, which we formalize in Theorem 1.

To simplify, we choose 𝒢 as a group of real-valued n×n matrices. We also satisfy Equation (4), for any tn, by the following stronger invariance assumption:

ε=d𝚐ε,for all𝚐𝒢,and any finiten. (7)

These simplifications add useful conceptual clarity. For example, when 𝒢 is the set of all n×n permutation matrices, Equation (7) is equivalent to exchangeability of errors; or when 𝒢 denotes signed diagonal matrices, the assumption is equivalent to error symmetry around zero, also known as “orthant symmetry” (efron1969student). Moreover, finite-sample invariance in Equation (7) may be necessary even in asymptotic settings, notably to show asymptotic validity of cluster sign tests when the number of clusters is finite but their sizes grow (see Section 4). Finally, under a suitable construction, finite-sample invariance can produce tests that are feasible and valid even in finite samples; see Section 2.3 for details.

2.1 Concrete procedure

For the sake of concreteness, we define here the basic procedure for the residual randomization test. All subsequent tests in this paper are derivatives of this procedure for various choices of 𝒢.

  1. 1.

    Given y,X, calculate the restricted OLS estimate, βo^, such that aβo^=a0, and then calculate the corresponding constrained residuals, εo^=y-Xβo^.

  2. 2.

    Define the test statistic, Tn=n(aβ^-a0), and let Tn=t be the observed value.

  3. 3.

    Repeat r=1,,R, with R fixed:
    Sample GrUnif(𝒢), and store tn(Grεo^), where tn(u)=na(XX)-1Xu.

  4. 4.

    Calculate the one-sided p-value as in Equation (6), or the two-sided p-value:

    pval2^=min(P{tn(Grεo^)tTn=t},P{tn(Grεo^)tTn=t}). (8)

At target level α(0,1), the test decision may be described by the function

ϕ(y;X)=𝕀{pval2^α/2}. (9)

There are two main differences between the residual randomization procedure defined above and residual bootstrap. First, in the randomization procedure we can use any inferential primitive 𝒢 beyond just permutations. This is especially useful with complex error structures, such as clustered errors, and can also be useful for sensitivity analysis (see Section 5.1). Second, the residual randomization procedure does not rely on any form of parametric assumptions to assess significance because it calculates the reference distribution empirically through random transformations of the residuals (Step 3). In contrast, bootstrap errors typically rely on asymptotic normality.

Remark 2.1.  As an intuition for why this procedure is valid, consider the idealized residual randomization test, which essentially conditions on event A={εR(ε;𝒢)} with R(u;𝒢)={𝚐u:𝚐𝒢}. Because 𝒢 is an algebraic group, R(ε;𝒢) is the orbit of ε in the error sample space. Orbits are special group-theoretic objects with the key property that they define a partition of their domain set. Thus, Equation (7) implies that tn(ε)A=dtn(Gε)ε, where GUnif(𝒢). This guarantees validity of the idealized randomization test (lehmann2006testing, Chapter 15), and suggests asymptotic validity of the feasible residual test. Such inference based on algebraic orbits is reminiscent of the “structure of inference” (fraser68).

Remark 2.2. A confidence interval for βj is possible through test inversion (rosenbaum2002observational, Section 2.6.2). Specifically, define a so that aj=1, and is zero otherwise. Then, for a sequence of a0 values, calculate the p-value of the test, say, pval(a0). Given Aj={a0:pval(a0)α/2}, we report [min(Aj),max(Aj)] as the 100(1-α)% confidence interval for βj.

Remark 2.3. In some tests, the values of tn(Gεo^) in Step 3 may often repeat. Then, a correction is needed for the test to achieve the target test level. See Appendix D.1 for details.

2.2 Asymptotic validity

Here, we develop theory on the asymptotic validity of the residual randomization test in Section 2.1. We start with a general theorem, and then specialize in the following sections.

Assumption 1.

For the linear regression model of Equation (2) suppose that:

  1. (a)

    X is non-stochastic; XX/n is bounded and invertible;

  2. (b)

    XX/nS, where S is positive definite; and

  3. (c)

    there exists random variable 𝒵p, such that 1nXεd𝒵.

Assumptions 1(a) and (b) are common regularity conditions on X to guarantee that the estimators, β^ and βo^, are well-behaved. Assumption 1(c) requires sufficient regularity in X so that the test statistic has an asymptotic distribution law. This is only an existence assumption, which is relatively weak because 𝒵 does not have to be normal necessarily, and may even be unknown.

Theorem 1.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2) with inferential primitive G. Let GUnif(G), and suppose that

(XX)-1(XGX)pbI,b0. (10)

Then, the residual randomization procedure of Section 2.1 is asymptotically valid; that is,

limsupnE(ϕ(y;X))α.

The proof of Theorem 1 is straightforward. The idea is to show that tn(Gεo^) of the approximate test is asymptotically close to tn(Gε) of the idealized test. The difference between these two values is exactly equal to na(XX)-1XGX(βo^-β). Under the similarity condition of Equation (10), this quantity is proportional to na(βo^-β), which is zero under the null hypothesis.

Remark 2.4. The Gram matrix XX captures the information structure of the problem, also known as Fisher information. With that in mind, the condition in Equation (10) requires, intuitively, that the choice of inferential primitive, 𝒢, does not alter substantially the information structure of the problem. We clarify here that 𝒢 grows with n, but this dependence is left implicit to simplify notation; e.g., it is implicit in the limit of Equation (10).

Remark 2.5. Theorem 1 does not require finite-sample invariance, as defined in Equation (7), but actually a weaker assumption of asymptotic invariance:

(tn(𝚐ε),tn(ε))d(T,T),for all𝚐𝒢,and some lawT. (11)

However, we use the finite-sample invariance in Equation (7) as our main working assumption in this paper due to its conceptual clarity; see also the discussion around Equation (7) for more detailed arguments.

Remark 2.6. Theorem 1 describes only sufficient conditions for asymptotic validity. It is therefore possible that some conditions do not hold but the test is still valid asympotically—we discuss one case in Section 3.1. It can also happen that the test is valid but powerless. For example, the residual randomization test for the intercept model, yi=β0+εi, has zero power under only exchangeability of εi, since here the OLS estimator (i.e., the sample mean, y¯) is invariant to permutations.

2.3 Finite sample validity

Here, we discuss a construction of feasible and exact randomization tests as a complement to the asymptotic result of Theorem 1. The idea is to cluster some of the datapoints in a way such that the similarity condition in Equation (10) holds in the sample. This could mean an artificial clustering of the data done by the analyst, as opposed to some natural clustering, e.g., by state or school.

We need the following notation. A clustering, Cn, denotes a partition of datapoints {1,,n}, and is indexed by c. Let [c]{1,,n} denote the set of datapoints in the c-th cluster. Let us also define cluster membership indicators, namely, 𝟙c=(𝕀{1[c]},,𝕀{n[c]}) and 𝔻c=𝟙c𝟙c; thus, 𝟙c is the n-dimensional column vector with elements that are one only for the members of [c], and are zero otherwise; 𝔻c is the corresponding diagonal n×n matrix.

Theorem 2.

For fixed n, consider a clustering Cn with |Cn|=J<. Suppose that for every cluster c there exists bcR such that XcXc=bcXX, where Xc=DcX are the cluster covariates. For some inferential primitive G, suppose that every member transform, gG, can be decomposed as g=c=1JscDc, where scR. Then, the residual randomization test with G as the primitive is exact in finite samples, i.e., E(ϕ(y,X))=α.

Intuitively, Theorem 2 suggests the following construction of exact residual randomization tests. First, split the data in J clusters, such that each matrix XX within each cluster is similar to the population matrix. Second, define transforms jointly on the cluster level; e.g., flipping the signs of cluster residuals. If these transforms form an algebraic group and reflect the true error invariance, then the resulting test is exact. We illustrate this idea in an instance of the Behrens-Fisher problem in Section 5.2.2, where we split a binary covariate in clusters, and then apply a cluster sign test. The resulting test is exact even though there are only 30 data points.

3 Simple error structures

In this section, we begin with simple error structures. For each structure we specify an appropriate inferential primitive 𝒢, and then derive conditions for asymptotic validity of residual randomization.

3.1 Exchangeable errors

The simplest error structure is when εi are exchangeable, so that ε=d𝚐ε, for any n×n permutation matrix 𝚐. We may write 𝚐=i=1n𝟙i𝟙π(i), where 𝟙i is the binary n-dimensional column vector that is one only at the i-th element, and π is a permutation of n elements, i.e., a bijection of {1,,n} on itself. Let Πn be the space of all such permutations. Then, the inferential primitive is

𝒢𝗉={i=1n𝟙i𝟙π(i):πΠn}. (12)

Operationally, the residual randomization test of Section 2.1 using 𝒢𝗉 as the primitive repeatedly calculates the test statistic, tn(), on permutations of the constrained residuals, εo^.

Theorem 3.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp. Then, the residual randomization test using Gp as the primitive is asymptotically valid.

Notably, the condition of Theorem 1 does not hold here because there is bias between tn(Gεo^) and tn(Gε), the test statistics of the feasible and idealized test, respectively. Intuitively, this bias exists because a permutation matrix 𝚐 does not completely de-correlate the columns of X. However, the theorem shows that this bias is negligible in the limit.

3.2 Heteroskedastic errors

We continue with independent and heteroskedastic errors, where E(εi2xi)=σi2. In this setting, the seminal papers of white1980heteroskedasticity; eicker1963asymptotic have established asymptotically correct inference for β through heteroskedasticity-consistent variance estimation. However, in small samples these methods can be severely biased (chesher1987bias), especially when the design is highly leveraged; mackinnon1985some; davidson2008wild; godfrey2001significance provide additional theory and simulations illustrating the problem.

For residual randomization, we can assume that the errors are symmetric around zero, and thus their distribution to be invariant to random sign flips. Formally, ε=d𝚐ε, where 𝚐=i=1nsi𝟙i𝟙i and s=(si){-1,+1}n is a vector of signs. The inferential primitive is defined as

𝒢𝗌={i=1nsi𝟙i𝟙i:s{-1,+1}n}. (13)

Operationally, the residual randomization test using G𝗌 as the primitive repeatedly calculates the test statistic, tn(), on constrained residuals with their signs randomly flipped.

Theorem 4.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGs. Then, the residual randomization test using Gs as the primitive is asymptotically valid.

The residual randomization test with G𝗌 as the inferential primitive is very similar to the “wild bootstrap” (wu1986jackknife). This bootstrap variant resamples data sets as yi=xiβ^+vif(ε^i), where vi is a zero-mean random variable, and f is a transform of the residuals, usually the identity f(ε^i)=ε^i. mammen1993bootstrap proposed sampling vi as -(5-1)/2 or (5+1)/2, such that E(vi)=0,E(vi2)=1, and E(vi3)=1; these moment conditions lead to fast convergence of the bootstrap according to Edgeworth expansions; see also (liu1988bootstrap).

Remark 3.1. Recent work has pointed out that the aforementioned analytical approximations of the wild bootstrap are untrustworthy, and suggest using random signs for vi, also known as the Rademacher distribution (davidson2008wild; flachaire2005bootstrapping; horowitz2001bootstrap). The residual randomization approach sides with the latter choice. Random signs express symmetry around zero, which is more natural than the symmetries expressed by other choices.

Remark 3.2.  The proof of Theorem 4 does not require that the errors are symmetric around zero in finite samples, but only that the test statistic has such symmetry asymptotically in the sense of Equation (11). Theorem 4 can therefore be more general. We choose to present the narrower formulation here because the finite-sample condition of sign symmetry is conceptually clearer than the asymptotic condition; see also Section 2 and Remark 2.6 for additional discussion.

4 Clustered error structures

Here, we consider problems with a natural clustering of the data, for example, a clustering by city or state, school or classroom. moulton1986random and arellano1987practitioners provide classical results on cluster-robust variance estimation, showing that regular OLS may be significantly biased when the cluster structure is ignored. Asymptotically correct inference is possible when the number of clusters grows (white1984asymptotic; hansen2007asymptotic; carter2017asymptotic), or when the number is fixed (ibragimov2010t; ibragimov2016inference; bester2011inference; conley2011inference). However, all these parametric methods rely on asymptotics that can be biased in practice, especially when the number of clusters is small (donald2007inference; imbens2016robust; cameron2008bootstrap; cameron2015practitioner); see also angrist2008mostly and (abadie2017should) on practical discussions about the need for cluster error adjustments.

Such small sample issues can be alleviated by “cluster wild bootstrap” (cameron2008bootstrap). However, the standard bootstrap asymptotics don’t work with a finite number of clusters. In current work, canay2018wild analyze the cluster wild bootstrap in a randomization framework, and prove its asymptotic validity when the number of clusters is finite but the individual cluster sizes grow. Their result requires that XX is homogeneous across clusters, and similar to the population XX. In this section, we extend this result in various ways, and show, for instance, that such homogeneity is not necessary when errors are exchangeable within clusters. We summarize all our results in the following section, and then proceed to individual cases.

4.1 Overview of results

To proceed we need some additional notation. Let C1,,Cn denote a sequence of clusterings of {1},,{1,,n}, respectively. Let Jn=|Cn| and let i[c] denote that datapoint i is in c-th cluster.  The clusterings may grow infinitely but they exhibit a nested structure:

i[c]underCni[c]under allCm,mn, (14)

so that a datapoint stays within its assigned cluster. This assumption is without practical loss of generality, while it allows us to posit within-cluster limits (e.g., means) as n grows.

In this paper, we consider three types of cluster invariances: (1) exchangeability within clusters; (2) sign symmetry across clusters; and (3) double invariance, where both symmetries hold. In all cases we assume, as usual, that the errors are independent across clusters. Our technical results on validity of residual randomization for all such invariances are summarized in Table 1.

Cluster size, Jn=|Cn|
inferential primitive Jn=J< Jn
exchangeability within clusters (𝒢𝗉) X𝟙/n=0, or HA X𝟙/n=0, or HA
sign symmetry across clusters (𝒢𝗌) HA HA, or c=1Jnnc2/n20.
double invariance (𝒢𝗉+𝗌) X𝟙/n=0, or HA X𝟙/n=0, or HA, or c=1Jnnc2/n20.
Table 1: Summary of conditions that lead to asymptotic validity of residual randomization with clustered errors; “HA” corresponds to the homogeneity assumption of Equation (15). The inferential primitives, 𝒢𝗉,𝒢𝗌,𝒢𝗉+𝗌, are defined in Sections 4.2-4.4.

The assumption of cluster homogeneity, denoted by “HA” in Table 1, stands out. For example, when the number of clusters is fixed (Jn=J<), the homogeneity assumption posits that, for every cluster c=1,,J,

(XX)-1X𝔻cXbcI,bc0, (15)

where 𝔻c is the diagonal matrix of cluster membership. As mentioned before, Equation (15) implies a homogeneous covariate correlation structure across clusters in the limit, and was first identified by canay2018wild in the context of cluster wild bootstrap. Interestingly, this condition is the asymptotic equivalent of the finite-sample homogeneity condition in Theorem 2, which can yield exact tests in finite samples. However, homogeneity is not necessary for validity when the errors are exchangeable within clusters (see first row of Table 1). In this case, permuting the residuals, after cluster covariates have been centered, also leads to valid inference, except for the intercept β0 that is differenced out. On the other hand, when Jn the additional condition cnc2/n20 can also lead to valid inference, beyond homogeneity or centered clusters (see second column of Table 1). This requires, intuitively, that no cluster dominates the others in size.

4.2 Exchangeability within clusters

We start with the cluster variant of exchangeability in Section 3.1. Formally, let π(;Cn) be a permutation of n elements within clusters of Cn; i.e., if i[c], for some [c]Cn, then π(i;Cn)[c]. Let Π(Cn) denote the space of all such permutations. Then, finite-sample invariance implies that ε=d𝚐ε, where 𝚐=i=1n𝟙i𝟙π(i), for some unique πΠ(Cn). The inferential primitive is

𝒢𝗉(Cn)={i=1n𝟙i𝟙π(i):πΠ(Cn)}. (16)

Operationally, the residual randomization test using 𝒢𝗉(Cn) as the primitive repeatedly calculates the test statistic, tn(), on random within-cluster permutations of the residuals, εo^.

Theorem 5.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. Then, the residual randomization test using Gp(Cn) as the primitive is asymptotically valid if the clusters are centered; or if the homogeneity condition of Equation (15) holds and mincnc.

Theorem 5 shows two conditions for asymptotic validity of a residual randomization test that permutes the residuals within clusters. First, when the covariates are centered the test is valid even when the number of clusters is fixed; however, this precludes inference on β0. Second, when the homogeneity condition holds and all clusters grow in size, the test can be analyzed as a permutation within each cluster separately, and then validity follows from Theorem 3.

4.3 Sign symmetry across clusters

When the errors are not exchangeable within clusters, we may assume that the entire cluster errors are independent and sign-symmetric, similar to Section 3.2. Then, finite-sample invariance implies that ε=d𝚐ε, where 𝚐=c=1Jnsci[c]𝟙i𝟙i, and sc=±1. The inferential primitive is

𝒢𝗌(Cn)={c=1Jnsci[c]𝟙i𝟙i:s{-1,+1}Jn}. (17)

Operationally, the residual randomization test using 𝒢𝗌(Cn) as the primitive repeatedly calculates the test statistic, tn(), on cluster residuals with their signs randomly flipped.

Theorem 6.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGs(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. If Jn=J is fixed and mincnc, then the residual randomization test using Gs(Cn) as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds. If Jn, then the test is valid also when c=1Jnnc2/n20.

Theorem 6 shows two conditions for validity of a residual randomization test that randomly flips the signs of cluster residuals. The homogeneity condition is required when the number of clusters is finite because tn(Gεo^)-tn(Gε) does not converge in this setting. However, when the number of clusters grows, then the condition cnc2/n20 also guarantees validity of the test. This condition roughly requires that no cluster dominates the others in the limit.

Remark 4.1. The residual randomization test here is similar to the “cluster wild bootstrap” method of cameron2015practitioner. The main difference is that cluster wild bootstrap resamples the cluster residuals with replacement. canay2018wild proved the asymptotic validity of cluster wild bootstrap with finite clusters under the cluster homogeneity condition of Equation (15). In the following section, we show that homogeneity is not required when errors are also exchangeable within clusters. When exchangeability does not hold, however, the proof of Theorem 6 shows that validity is still possible if the covariates are uncorrelated. This suggests running the randomization test on de-correlated covariates (e.g., through PCA).

Remark 4.2. canay2018wild also pointed out that asymptotic validity with a fixed number of clusters requires a finite-sample invariance assumption, i.e., the symmetry of errors around zero in finite samples. This is not surprising since we cannot estimate cluster variance consistently if the number of clusters is fixed. See also Remarks 2.6 and 3.2 regarding our discussion thread on such contrast between finite-sample and asymptotic invariance.

4.4 Double invariance

Here, we make both invariance assumptions from the two previous sections, namely exchangeability within clusters and sign symmetry across clusters. Formally, ε=d𝚐ε, where 𝚐=c=1Jnsci[c]𝟙i𝟙π(i), where sc are cluster signs and πΠ(Cn). The inferential primitive is defined as follows:

𝒢𝗉+𝗌(Cn)={c=1Jnsci[c]𝟙i𝟙π(i):s{-1,+1}Jn,πΠ(Cn)}. (18)

Operationally, the residual randomization test of Section 2.1 using 𝒢𝗉+𝗌 as the primitive repeatedly calculates the test statistic, tn(), on permuted cluster residuals with randomly flipped signs.

Theorem 7.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp+s(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. If Jn=J is fixed and mincnc, then the residual randomization test using Gp+s(Cn) as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds, or when the clusters are centered. If Jn, then the test is valid also when cnc2/n20.

The idea behind the double invariant, 𝒢𝗌+𝗉(Cn), is to combine the strengths of both cluster permutations and random signs. This procedure should be used when there is no concern of heteroskedasticity within clusters, because it will generally lead to more powerful tests and sharper inference compared to the the simple cluster sign test (or the wild bootstrap). We illustrate this point in the experiments of Section 5.2.

4.5 Two-way clustering

In many problems, the data could be clustered in two ways. For example, with state-school data one cluster could be the state and another could be the school; or in panel data, the clusters could be state and time. Here, we discuss a residual randomization test for such error structure, and prove conditions for validity. These technical conditions are more subtle than in one-way clustering because the data size may grow in either one or both cluster dimensions.

In related work, cameron2011robust proposed estimation based on the inclusion-exclusion principle, where standard robust error estimates are used for every cluster individually and in combinations. aronow2015; fafchamps2007formation; conley1999gmm study sandwich estimators. However, all these estimators have the typical small-sample problems, including a reliance on normal asymptotics. Additionally, they may be ill-defined (e.g., yielding non-positive definite estimates), and can be computationally cumbersome.

Our approach starts by visualizing an arrangement of the errors in a two-dimensional array. Data fall in cells that correspond to a “row clustering” and “column clustering”, namely Rn and Cn, indexed by r and c, respectively. Without loss of generality we assume that the clusters have the same size, so that r,c=1,,Jn. For datapoint i, the value r(i){1,,Jn} denotes its assigned row cluster, and c(i){1,,Jn} its column cluster. The error structure can be represented without loss of generality as follows. Let i be the kth observation in row cluster r and column cluster c, then we can write

εi=ϵr+ϵc+ϵrc(k),wherer(i)=r,c(i)=c,kth replication in cell. (19)

The common row and column terms induce the two-way correlation in the errors. We will assume, as usual, that the error ϵrc(k) is pure noise, so that it obeys any desirable invariance. Inference therefore depends on the assumptions we put on ϵr,ϵc.

Suppose that ϵr,ϵc are individually exchangeable and there is only one replication per cell. Then, the errors have the simple form εi=ϵr+ϵc+ϵrc, and are thus invariant to entire row and column permutations. Specifically, define Π(Rn,Cn) as the following group of permutations:

Π(Rn,Cn)={πΠn:r(i)=r(j)r(π(i))=r(π(j)),andc(i)=c(j)c(π(i))=c(π(j)).}

That is, Π(Rn,Cn) is the set of permutations for which if i and j are in the same row (or column), then both i and j are mapped to the same row (or column) under any πΠ(Rn,Cn). Moreover, Π(Rn,Cn) is a subgroup of Πn. The inferential primitive can thus be defined as follows:

𝒢𝗉(Rn,Cn)={i=1n𝟙i𝟙π(i):πΠ(Rn,Cn)}. (20)

Operationally, the residual randomization test of Section 2.1 using 𝒢𝗉(Rn,Cn) as the primitive essentially permutes the residuals row- and column-wise.

Theorem 8.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp(Rn,Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters with respect to both Rn and Cn. When the number of clusters of both Rn,Cn grows indefinitely, the residual randomization test using Gp(Rn,Cn) as the primitive is asymptotically valid. When only the row (or only the column) cluster grows indefinitely, the same randomization test is valid if the covariates have been centered column-wise (or row-wise).

Notably, the conditions for validity in Theorem 8 are similar to Theorem 3 on unclustered exchangeability, and are actually weaker than Theorem 5 on one-way clustered exchangeability. Intuitively, in one-way clustering the randomization bias depends on the cluster covariate means, which requires additional assumptions, such as cluster homogeneity. In two-way clustering the randomization bias depends on the global means, and so validity follows exactly as in Theorem 3.

Remark 4.3. The permutation test described above can be extended to multi-way clustered errors. As in the two-way clustering case, this would require an assumption of additivity of cluster errors and exchangeability within clusters.

Remark 4.4. A related test known as “quadratic assignment procedure”(QAP) (hubert1986assignment; hubert1976quadratic; krackhardt1988predicting) is used in network analysis to test independence between network data (christakis2013social). In a dyadic regression model, for instance, QAP tests permute the rows and vectors of the dependent variable, y, and refit the model. As such, a QAP test acts as an F-test, but sometimes is misused to assess significance of individual parameters. More appropriate for significance is the residual randomization test based on 𝒢𝗉(Rn,Cn), which is essentially the residual version of the QAP test. dekker2007sensitivity have made a similar point based on their adaptation of the residual permutation method of freedman1983nonstochastic.

Remark 4.5. If there are replications within cells, and exchangeability still holds, then we could simply average data y,X within cells, and then perform the randomization test as above. Such test operates on the errors εi=ϵr+ϵc+ϵ¯rc, which are still invariant to row and column permutations. The averaging may preclude inference for parameters that are differenced out.

Remark 4.6.  Inference with heteroskedasticity in two-way clustered errors is more challenging. It may be tempting to switch the signs of the entire row or column residuals, but this would be wrong because changing one row affects all columns, and changing one column affects all rows. One approach is to follow a decomposition similar to ANOVA, where we difference out row and column averages. Such test operates on the transformed errors εi=εi-ε¯r(i)-ε¯c(i)+ε¯=ϵrc-ϵ¯r(i)-ϵ¯c(i)+ϵ¯. If the number of rows (and columns) grows, then the averages converge to zero. The errors εi could then be treated as approximately satisfying sign symmetry. Theorem 1 could be extended to include such approximate symmetries, but we leave this for future work.

5 Applications and experiments

In this section, we perform an extensive evaluation of the residual randomization tests introduced earlier. Our comparisons include both the bootstrap and its variants, and also standard asymptotic methods from the robust standard error literature. Exact algorithmic details on the methods used here can be found in Appendix D.

5.1 Hormone data

In the hormone data of efron1986bootstrap, there are 27 medical devices, each producing a data point (yi,xi); here, xi is number of hours the device is worn by a patient, and yi is remaining amount of hormone on the device. The model is simple linear regression, yi=β0+β1xi+εi, where εi are zero-mean, independent, and possibly not normal errors. efron1986bootstrap illustrated bootstrap in estimating standard errors for β1. Here, we use the residual randomization method to test H0:β1=0, and then invert the test to get a confidence interval. Thus, we set a=(0,1) and a0=0 in the null hypothesis of Equation (3).

We start by assuming only exchangeability of errors, and so we use the inferential primitive 𝒢𝗉, defined in Equation (12). In this simple setting, the randomization procedure is equivalent to regressing permutations of the constrained residuals, εio^=yi-y¯, with xi, and then comparing the estimated slopes of these regressions with the standard OLS estimate β^1. We calculate the p-values based on R=2000 iterations of the randomization procedure.

The null hypothesis, H0:β1=0, is strongly rejected (p-value 0) with the residual randomization test. To produce a confidence interval we follow the process outlined in Remark 2.3. Specifically, we consider a sequence of null hypotheses, H0:β1=β10, for various values of β10. Using the output of regular OLS, we may focus the range to β10[-0.1,-0.03]. The (two-sided) p-values for the entire sequence of tests are shown in Figure 1. From the plot, we can extract the values of β10 that lead to a p-value larger than 0.025, and determine the 95% confidence interval. In the figure, the interval is marked by the two vertical lines, and is roughly equal to [-0.0668,-0.0478].

Figure 1: Histogram of p-values for a sequence of tests, H0:β1=β10, for β10[-0.1,-0.03]. The horizontal dashed line marks the 0.025 threshold. The two vertical lines mark the range of values for β1 for which the null hypothesis cannot be rejected. This forms a confidence interval for β1.

We can extend this analysis to include heteroskedastic errors. For example, we can repeat the test inversion described above, this time using random signs as the inferential primitive; see Equation (13). Moreover, the data can be clustered according to the manufacturer of the device; in the data there are 3 manufacturers, each providing 9 data points. We can therefore do inference using within-cluster permutations or random signs, as described in Section 4. The results of this analysis are shown in Table 2, juxtaposed with results from regular OLS and bootstrap (efron1986bootstrap, Table 9.2). We see that the standard errors from all methods roughly agree. The exception is the results from cluster random signs, which cannot produce significance at the 5% level since there are only 3 clusters. This would be concerning if we were not willing to assume exchangeability of errors within clusters.

In summary, the numerical example of this section illustrates that the residual randomization method provides nonparametric estimation of the standard error in a straightforward manner like bootstrap. Additionally, the example shows that the randomization method can be useful for sensitivity analysis simply by doing inference under different invariance assumptions. In doing so, the randomization method is more versatile than bootstrap since one single procedure (presented in Section 2.1) underlies all of the aforementioned tests, whereas a bootstrap approach would require the use of an appropriate variant each time.

inference method midpoint estimate s.e. 95% interval
OLS -0.0574 .0045 (-0.0665,-0.0482).
bootstrap -0.0574 .0043 (-0.0660,-0.0488)
randomization
methods
permutations -0.0573 .0048 (-0.0668,-0.0477)
random signs -0.0595 .0045 (-0.0686,-0.0504)
permutations, within
-0.0609 .0043 (-0.0695,-0.0522)
signs, cluster
double
-0.0582 0.0050 (-0.0682,-0.0482)
Table 2: Residual randomization Inference on β1 in the hormone data. “permutations” assumes exchangeable errors; “random signs” assumes error symmetry around zero; “permutations, within” assumes exchangeable errors within clusters defined by the device manufacturer; “sign, across” assumes error symmetry around zero on the cluster level; “double” assumes both cluster invariances.

5.2 Clustered errors

Here, we illustrate the theory and methods of Section 4 on clustered errors. There are two main takeaways. First, using richer inferential primitives can yield substantially more powerful tests (Section 5.2.1). Second, exact tests in finite samples are possible by appropriately adapting to the data structure. We illustrate the idea in the famous Behrens–Fisher problem (Section 5.2.2).

5.2.1 Simulated study

Here, we use a familiar simulation setup (cameron2008bootstrap; bruce2018; canay2018wild) with one-way clustering, and errors of the form

εi=ϵc+ϵic,whereϵicN(0,1),i[c]. (21)

The random effects term, ϵc, can induce within-cluster correlation. We consider two cases, ϵc=0 and ϵcN(0,1). All errors ϵc,ϵic are independent. The data are simulated as yi=β0+xiβ+εi, with xi a scalar. Following cameron2008bootstrap, xi can also have a clustered structure:

xi=xc+xic,wherexicN(0,1),i[c]. (22)

For the cluster x-term we consider two cases: (1) xcN(0,1), and (2) xc.5LN(0,1), the log-normal distribution. The second choice is interesting because log-normal covariates produce a regression model with high leverage points, which makes the estimation problem harder. We consider c=1,,J clusters with J=10,15,20; each cluster has 30 datapoints, and thus there are i=1,,30J datapoints in total. When we want to induce heteroskedasticity, we scale the errors εi in Equation (21) by 3|xi|. Throughout all simulations, we set β1=0, and β0=0 for the homoskedastic case, and β0=1 for the heteroskedastic.

The results for testing the true null hypothesis, H0:β1=0, are shown in Table 3. The table shows rejection frequencies over 5000 replications at 5% significance level, and includes cluster robust error estimates (see Appendix D.3 for details). For each replication, every randomization test uses R=2000 replications internally.

First, we focus on Panel (A) with homoskedastic errors. As expected, standard OLS achieves the nominal level when there is no cluster correlation (ϵc=0), but fails considerably when the errors are clustered. The cluster robust errors are a definite improvement but are also far from the nominal level. This is likely due to the small sample, since rejection rates do improve as J grows. The residual randomization tests work well across the board: rejection rates are around 5% even when the covariates are log-normal, as shown in Columns marked by (2).

We now turn our attention to the heteroskedastic setting in Panel (B) of Table 3. Standard OLS now fails across all settings, as expected. The robust error method works roughly as well as in Panel (A). Notably, the log-normal covariates seem to affect the robust method even more in this setting. For example, the rejection rates can be around 14% with log-normal covariates, almost 3× the nominal level. On the other hand, the cluster sign randomization test shows remarkable robustness. It mostly achieves the nominal rate with normal covariates. And with log-normal covariates, its worst rate is around 8% when J=10, but otherwise it remains close to the nominal level of 5%. All other randomization tests don’t work well. This is expected because these tests include permutations within clusters, which is invalid when there is heteroskedasticity.

Next, we discuss power results when testing the false null hypothesis, H0:β1=0.1, shown in Table 4. We omit the results for the heteroskedastic panel in this case, since all tests except for the sign test don’t come close to the correct level. Overall, the randomization tests achieve good power compared to robust OLS. The double randomization test is more powerful than both individual tests in the left sub-panel of Table 4, where ϵc=0. In the right sub-panel, where the errors are homoskedastic and clustered, the double test remains more powerful than the sign test.

Notably, in that same sub-panel, the permutation test now performs much better than the rest, even compared to the double test. For example, when J=20 the permutation test is roughly 6× more powerful than the other tests. This may seem to contradict our argument that more invariances generally lead to more power. The result can be explained, however. By construction of the simulation, with xc and xic both zero-mean and independent normals, the randomized Gram matrix, XGX, for the permutation test is similar to the population XX. The similarity condition in Theorem 1 holds with good precision, suggesting an almost exact test. To check this theory, we can try to set xi=xc+xic, where xcN(c,1), and xicN(0,0.12), which produces heterogeneous clusters by emphasizing the differences between clusters (terms ϵc) and dampening their similarity (terms ϵic). With this setup, and all else equal, the rejection rates are now 0.137,0.053, and 0.141 over 5000 replications for the sign, permutation, and double test, respectively, compared to 0.119,0.392 and 0.146 of Table 4. This reversal cautions that testing power depends heavily on the problem structure.

(A) Homoskedastic
cluster error term (ϵc)
ϵc=0 (non-clustered errors) ϵcN(0,1)
number of clusters (J)
J=10 J=15 J=20 J=10 J=15 J=20
OLS (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2)
standard 0.057 0.051 0.054 0.051 0.050 0.053 0.490 0.382 0.480 0.394 0.493 0.421
HC2, robust 0.086 0.090 0.076 0.079 0.061 0.077 0.103 0.110 0.081 0.089 0.081 0.090
Randomizations
signs, across 0.059 0.047 0.054 0.049 0.047 0.054 0.053 0.055 0.056 0.048 0.055 0.050
perms.,within 0.053 0.050 0.046 0.050 0.052 0.051 0.051 0.048 0.048 0.051 0.054 0.056
double 0.061 0.054 0.056 0.052 0.051 0.056 0.055 0.052 0.054 0.046 0.051 0.050
(B) Heteroskedastic
ϵc=0 ϵcN(0,1)
J=10 J=15 J=20 J=10 J=15 J=20
OLS (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2)
standard 0.228 0.249 0.244 0.264 0.239 0.286 0.278 0.274 0.301 0.303 0.301 0.309
cluster robust 0.095 0.140 0.085 0.116 0.074 0.114 0.100 0.126 0.091 0.124 0.075 0.121
Randomizations
signs, across 0.055 0.084 0.055 0.072 0.052 0.072 0.049 0.065 0.059 0.071 0.056 0.072
perms., within 0.152 0.162 0.155 0.148 0.145 0.153 0.154 0.150 0.166 0.149 0.154 0.144
double 0.205 0.194 0.198 0.174 0.183 0.170 0.166 0.167 0.168 0.163 0.150 0.155
Table 3: Rejection rates for testing true H0:β1=0 in cluster simulation. Method “signs, across” randomly switches the signs of cluster residuals (similar to cluster wild bootstrap, but without replacement). Method “perms., within” permutes the residuals within clusters; and “double” performs both operations. See Equations (17), (16) and (18), respectively, for definitions. Column (1) corresponds to normal xc in Equation (22); column (2) corresponds to log-normal xc.
(A) Homoskedastic
cluster error term (ϵc)
ϵc=0 (non-clustered errors) ϵcN(0,1)
number of clusters (J)
J=10 J=15 J=20 J=10 J=15 J=20
OLS (1) (2) (1) (2) (1) (2) (1) (2) (1) (2) (1) (2)
standard 0.649 0.575 0.816 0.742 0.923 0.859 0.572 0.495 0.587 0.536 0.629 0.589
cluster robust 0.646 0.582 0.804 0.730 0.905 0.849 0.166 0.207 0.162 0.209 0.177 0.237
Randomizations
signs, across 0.532 0.450 0.744 0.638 0.875 0.791 0.119 0.128 0.129 0.147 0.152 0.177
perms., within 0.390 0.391 0.533 0.537 0.665 0.666 0.392 0.380 0.544 0.558 0.670 0.666
double 0.620 0.532 0.796 0.698 0.904 0.833 0.146 0.164 0.154 0.178 0.168 0.213
Table 4: Rejection rates for testing the false hypothesis H0:β1=0.1 in cluster simulation. Results for the Panel (B) on heteroskedastic errors have been omitted. As before, Column (1) corresponds to normal xc in Equation (22); column (2) corresponds to log-normal xc.

In summary, the results of this section reveal certain interesting trade-offs in randomization inference with clustered errors. When exchangeability is not plausible, the only option seems to be the cluster sign test. However, when exchangeability is plausible, then either the permutation test or the double test should be used. The choice depends on where most of the variation comes from, and could be aided by ANOVA. For example, if there is no substantial variation across clusters, then the permutation test should be used. If there is substantial variation both across and within clusters, then the double test is likely more appropriate.

5.2.2 Behrens–Fisher problem: an exact test

In this section, we illustrate the construction of an exact residual randomization test in the context of the famous Behrens–Fisher problem (scheffe1970practical). This problem has been used to illustrate the limitations of robust standard error methods in small samples (angrist2008mostly; imbens2016robust). We use a simulation setup similar to (angrist2008mostly, Section 8.2) and (imbens2016robust, Section II), where the data generating process is defined as follows:

yi=β0+β1di+εi.

Here, di{0,1}, and εi are independent with E(εi)=0 and Var(εi)=diσ12+(1-di)σ02, with both variances unknown.

Following the treatment effects terminology of angrist2008mostly, datapoint i denotes an experimental unit, which is treated when di=1 and is in control when di=0. There are n=30 units in total. Let n1=idi=3 and n0=n-n1=27 be the number of treated and control units, respectively. The OLS estimate for β1 satisfies β^1-β1=ε¯1-ε¯0, where ε¯1 and ε¯0 are the error averages for treated and control units, respectively. Inference for β1 is complicated because both σ12 and σ02 are unknown, and the standardized statistic based on β^1 is not exactly t-distributed. This is known as the Behrens–Fisher problem.

One popular approach due to welch1951comparison is to approximate the distribution of the statistic (β^1-β1)/σ^, where σ^ is some robust standard error estimate. The approximation follows by the adjustment of the robust estimate by a factor f so that fσ^/σ matches the first two moments of a chi-square with f degrees of freedom, where σ2=σ12/n1+σ02/n0 is the true variance. Because this adjustment depends on unknown quantities, imbens2016robust recommend an adjustment based on the bias reduction method of mccaffrey2002bias. Another approach altogether is to use the wild bootstrap, or more generally the sign tests of Section 3.2.

Remarkably, we can construct an exact randomization test using the approach of Section 2.3. The idea of this approach is to construct a clustering of the datapoints, such that XX within each cluster is similar to the population XX. In this simple example, we actually have a closed-form formula for the population quantity:

XX=n(1d¯d¯d¯)=30(1.1.1.1).

This suggests to split the units in clusters, so that in each cluster there is the same proportion of treated units, d¯. Thus, we can split the units in 3 clusters, with 1 treated unit and 9 control units per cluster. Let C0 denote such clustering. Then, for any cluster cC0,

XcXc=10(1.1.1.1)XX.

Next, consider the primitive 𝒢𝗌(C0) defined in Equation (17). This set forms an algebraic group that contains 8 elements, and describes a 3-cluster sign test. Then, the conditions of Theorem 2 hold, and the residual randomization test with 𝒢𝗌(C0) as the primitive is exact in finite samples.

In summary, the exact randomization test in this problem proceeds as follows. For a null hypothesis, H0:β1=β10, we take the following steps.

  1. 1.

    Calculate the residuals εo^=y-y¯𝟙-β10d, and the observed test statistic, Tn=n(β^1-β10).

  2. 2.

    Define tn(u)=n[du/n1-(1-d)u/n0]=n(u¯1-u¯0).

  3. 3.

    For every 𝚐𝒢𝗌(C0), calculate tn(𝚐εo^).

A p-value is obtained by comparing the test statistic values from Step 3 with the observed value. This p-value is exact since Theorem 2 implies that tn(𝚐εo^)=tn(𝚐ε), for every 𝚐𝒢𝗌(C0).

To illustrate numerically, we analyze three methods in a simulated study of this problem: the “BM” correction discussed by imbens2016robust, the sign test of Section 3.2 (similar to wild bootstrap), and the exact test described above. Following imbens2016robust, we fix σ1=1 and consider σ0=0.5,1,2,5. We always test the null hypothesis, H0:β1=0, and report three panels, corresponding to true values 0,1 and 2 for β1. We also consider three different error types: normal, t3, and a mixture. For the mixture, we sample εi.5N(-1,.252)+.5N(1,.252).

The results of the simulation are shown in Table 5. We see that all methods do well when σ0=0.5 or σ0=1, and the errors are normal (see first two columns across panels). The exception is perhaps the cluster sign test that over-rejects when σ0=.5. With normal errors, all tests become more conservative except for the exact test (see Columns 3 and 4) as σ0 grows. We get the same picture with t-errors, with only an apparent reduction in power for all tests. As σ0 increases all tests become increasingly conservative and powerless, except for the exact test. For instance, when β1=1.0 the BM test rejects only 0.8% of the time with normal errors and σ0=5, whereas the same number is 21.5% when σ0=.5. Not surprisingly, the exact test shows a remarkable robustness across panels. It is affected by increased σ0, or non-normal errors, but only slightly.

With mixture errors all tests except for the exact test badly over-reject. For example, in Panel (A) we see that the BM method rejects 25% of the time when σ0=.5. In contrast, the same number is 5% when the errors are normal. This indicates that normality may be crucial for the analytic tests. As σ0 increases these tests again become increasingly conservative and powerless. For instance, when β1=1.0 we see that BM rejects only 0.9% of the time. The exact test behaves fine here as well. It maintains the correct level, and does not lose power under various alternatives.

Panel (A). True β1=0.0
Error type, εi
normal t3 mixture
Control standard deviation, σ0
Method 0.5 1 2 5 0.5 1 2 5 0.5 1 2 5
BM 0.050 0.028 0.010 0.002 0.034 0.015 0.004 0.000 0.252 0.225 0.034 0.003
r-sign 0.095 0.012 0.000 0.000 0.067 0.012 0.001 0.000 0.213 0.010 0.001 0.000
r-exact 0.048 0.052 0.052 0.050 0.055 0.057 0.054 0.049 0.050 0.046 0.058 0.049
Panel (B). True β1=1.0
Method 0.5 1 2 5 0.5 1 2 5 0.5 1 2 5
BM 0.215 0.161 0.069 0.008 0.146 0.086 0.028 0.003 0.122 0.130 0.119 0.009
r-sign 0.448 0.149 0.007 0.000 0.270 0.065 0.003 0.000 0.214 0.122 0.004 0.000
r-exact 0.124 0.116 0.111 0.073 0.098 0.101 0.081 0.062 0.094 0.083 0.093 0.073
Panel (C). True β1=2.0
Method 0.5 1 2 5 0.5 1 2 5 0.5 1 2 5
BM 0.553 0.511 0.359 0.049 0.418 0.332 0.166 0.016 0.326 0.310 0.183 0.055
r-sign 0.899 0.632 0.090 0.000 0.655 0.290 0.032 0.000 0.978 0.673 0.070 0.001
r-exact 0.172 0.177 0.168 0.119 0.147 0.145 0.131 0.089 0.197 0.197 0.173 0.127
Table 5: Rejection rates of cluster sign test (r-sign), and exact randomization test (r-exact) for the Behrens–Fisher problem of Section 5.2.2.

In summary, this simulation points to the following conclusions. First, robust error estimates and the BM correction work well with normal errors, but they can be very conservative and lose power in small samples. This may be undesirable, especially in problems where the treatment effect is expected to be weak. Second, heavier-tailed errors do affect the tests, but not catastrophically so as long as the errors are smooth and normal-like. Additional simulations with Cauchy errors (not shown here), for instance, support this point. Third, the tests behave badly with stronger deviations from normality. Resampling methods like wild bootstrap, which are considered a good guard against that, don’t hold up. Here, such sign tests badly over-reject in small samples.

On a high level, it speaks to the flexibility and power of the residual randomization method that an exact test could be constructed here. The caveat is, of course, that such construction requires a careful study of the problem structure, and is not always possible.

5.3 Dyadic regression

To illustrate residual randomization inference with two-way clustering, we consider a simulation study on dyadic regression inspired by related work (cameron2011robust; aronow2015). The data generating model is defined as follows:

yi=β0+β1|xr(i)-xc(i)|+εi.

Here, i denotes a dyad of row and column elements (clusters); r(i){1,,m} is the row cluster of i, and c(i){1,,m} is the column cluster of i, with c(i)r(i). For n dyads it thus holds n=m(m-1)/2. We fix β0=1 and H0:β1=1, and test β1=1 for validity, and β1=1.2 for power. We consider a similar structure for the errors: εi=ϵr(i)+ϵc(i)+ui. The errors ui are independent standard normal, uiN(0,1). We also consider two choices for covariates x and errors ϵ. For the covariates, xjN(0,1) or xjLN(0,1), the standard log-normal distribution, where j=1,,m. For the errors, either ϵjN(0,1), or ϵj.5N(-1,.252)+.5N(1,.252), the mixture distribution introduced in Section 5.2.2. This leads to four different simulation settings. For every setting we consider m=10,25,50 clusters to see how the various methods perform with respect to sample size, which here increases quadratically—n=100,625,2500, respectively.

The results are shown in Table 6. We present results on three methods: a HC2 robust estimator as a strawman, the two-way cluster robust estimator (cameron2011robust; aronow2015), and the double permutation test of Section 4.5. We see that the standard HC2 estimator performs badly, which is expected since this estimator ignores the doubly clustered nature of the errors. The two-way cluster robust estimator is a clear improvement. It achieves the nominal level in many settings. However, it is also unstable, and is strongly affected by the error and covariate distributions. For example, in Panel (A) the estimator over-rejects at 11.5% when n=100, and only gets down to 6% when n=2500, when xi are log-normal and εi are normal.

In contrast, the double permutation test is notably robust and largely unaffected by the covariate distribution. In Panel (A) its rejection level is always around 5%, with the worst rate being around 6% when n=100. We see that the randomization test trades power for such robustness. This is shown in Panel (B), where the two-way test can be much more powerful than the randomization test. For instance, when both errors and covariates are normal and n=2500, the two-way test rejects 98% of the time, whereas the randomization test rejects 25%, almost 4× smaller. Under the true null, both tests are at the nominal level. The situation is noticeably improved when the covariates are log-normal. For instance, when the errors are normal and the covariates are log-normal, the aforementioned rates are 99.7% and 61%, respectively.

Panel (A). True β1=1.0
Error-covariate, (εi,xi)
(normal, normal) (normal, lognormal) (mixture, normal) (mixture, lognormal)
Sample size, n
Method 100 625 2500 100 625 2500 100 625 2500 100 625 2500
HC2 0.320 0.167 0.118 0.392 0.330 0.238 0.322 0.172 0.131 0.437 0.414 0.311
2-way robust 0.090 0.061 0.046 0.114 0.091 0.062 0.080 0.041 0.050 0.101 0.091 0.057
double perm. 0.060 0.057 0.052 0.047 0.055 0.045 0.053 0.037 0.057 0.053 0.057 0.050
Panel (B). True β1=1.2
100 625 2500 100 625 2500 100 625 2500 100 625 2500
HC2 0.363 0.279 0.359 0.488 0.616 0.734 0.360 0.267 0.279 0.470 0.543 0.675
2-way robust 0.150 0.537 0.981 0.301 0.788 0.997 0.144 0.524 0.983 0.286 0.775 0.997
double perm. 0.075 0.134 0.252 0.155 0.372 0.609 0.079 0.144 0.245 0.157 0.390 0.601
Table 6: Rejection rates for HC2 robust errors, two-way robust errors, and the double permutation test of Section 4.5 in the dyadic regression simulated study of Section 5.3. Null hypothesis is H0:β1=1.0.

In summary, the simulation results paint a mixed picture. If robustness is a concern, then the residual randomization test should be used. If power is more important, then we could use a two-way cluster robust estimator. In this case, we may improve the test through covariate transformations, e.g., through log-transforms. We caution, however, that the two-way cluster robust method faces the additional problem that the variance estimate is not always positive-definite, and it is unclear what the best remedy is in such cases. These bad examples were removed from the simulation.

Remark 5.1. In an influential paper, petersen2009estimating performs an empirical comparison of standard error estimates in financial panel data with firm-clustered and time-clustered errors. The double permutation test of Section 4.5 could also be used in this context. A better model is perhaps errors with autocorrelated structure, which we discuss in the following section.

6 Extensions

In this section, we consider regression with autocorrelated errors, and high-dimensional regression. We mainly propose ideas without further theoretical investigation, and point to supporting evidence from empirical evaluations. Our goal is to show the potential of residual randomization to address an even broader spectrum of problems than what has been presented so far.

6.1 High-dimensional regression

Here, we consider high-dimensional regression, where the number of covariates possibly exceeds the number of datapoints. The main problem we face here is that standard OLS does no longer work, and so we have to revise our main testing procedure with a penalized estimator.

Specifically, consider ridge regression with scalar penalty λ, namely β^ridge=(XX+λI)-1Xy. Since y=Xβ+ε, we can re-write the estimator as β^ridge=(I-λPλ-1)β+Pλ-1Xε, where Pλ=(XX+λI). Under the null hypothesis in Equation (3) it follows that

aβ^ridge-a0+λaPλ-1β=aPλ-1Xε.

We cannot use the randomization test immediately here because the term λaPλ-1β is unknown. To proceed, we could use a plug-in estimate such as β^ridge, or β^lasso, the LASSO estimate. Given the plug-in estimate, the residual randomization test could be adapted as follows:

  1. 1.

    Given y,X, calculate the LASSO and ridge regression estimates, β^ridge and β^lasso. Calculate the residuals, ε^=y-Xβ^lasso.

  2. 2.

    Let Tn=n(aβ^ridge-a0+λaPλ-1β^lasso) be the observed test statistic value.

  3. 3.

    Repeat: Sample GUnif(𝒢) and store tn(Gε^), where tn(u)=naPλ-1Xu.

  4. 4.

    Compare the value from Step 2 with the values from Step 3 to calculate the p-value.

The inferential primitive, 𝒢, could be defined as in earlier sections, e.g., as permutations or random signs on the residuals.

We now proceed to an empirical evaluation of this test, and conclude with some remarks at the end of this section. We compare with the hdi package of dezeure2015high, and particularly on a method that relies on a de-sparsified LASSO estimator (zhang2014confidence)—details are in Appendix D. We choose this method because in an extensive simulation study (dezeure2015high, Section 2.5) this method performed better on aggregate than a variety of other methods, including the nested covariance test of  lockhart2014significance and the de-biasing method of javanmard2014confidence. As all these tests, including residual randomization, are conditional, in future work we plan to compare also with unconditional LASSO-based tests (belloni2014inference; chernozhukov2018double).

For a precise comparison we run the randomization test described above using the same simulation study, but extended to include more error types. More specifically, there are p=500 covariates and n=100 data points, and XNp(0,Σ), where Σ is a Toeplitz matrix with ij-th element equal to 0.9|i-j|. Only s0 are the active parameters, with s0=3 or s0=15. There are 6 different ways to define the active parameters: randomly as U(-2,2),U(0,2),U(0,4), or fixed at 1, 2, or 10. So, in total there are 12 different simulation settings. The errors ε are i.i.d. N(0,1), and we also consider t3 and Cauchy. For every setting, we take 50 samples of X. For every setting and X sample, we draw ε and generate outcomes y=Xβ+ε to run the tests, and repeat this process 100 times. This gives us 12×50×100=6000 total test comparisons. Because both tests are computationally expensive we run the simulation on a cluster using 600 cores with x86, 64-bit architecture at 2.1Ghz. It takes about 15 minutes to complete the entire simulation, totaling to about a week of wall-clock time.

The results of this simulation study with normal errors are shown in Figure 2, split in 6 panels corresponding to the 6 different definitions of the active parameters. Here, we aggregate the data for s0=3 and s0=15, and present the individual plots in Appendix E.1. In the simulations below we test all individual hypotheses H0,j:βj=0, and for the randomization test we perform a simple Bonferroni correction; the hdi package makes a similar correction. The plots show the power of the two methods defined as the proportion of active parameters that a method is able to uncover (i.e., number of true rejections over s0). We will also report results on the family-wise error rate (FWER) defined as the proportion of false rejections over all comparisons.

We see that the randomization method is consistently more powerful than the desparsified LASSO method. The difference is even more pronounced when the signal is weak. For example, when the active parameters are equal to 1 (bottom, left panel) the power of the randomization method is around 95% compared to 55% for the LASSO method. As the signal becomes stronger this difference becomes smaller (see other bottom panels). Marginally, the Q1/Q2/Q3 quartiles for power are 0.76,0.95,1.00 for the randomization method, and 0.46,0.70,1.00 for the LASSO method. The quartiles for FWER are (.52,.97,3.5)×10-2 and (.80,1.6,5.8)×10-4 for randomization and LASSO, respectively. We see that both methods are conservative, but residual randomization reaches much closer to the nominal 5% level. Further improvements may be possible since the Bonferroni correction used for the randomization method is likely very conservative. In Appendix E.1 we present more results using t3 errors. In that setting, both tests lose some power due to the heavier-tailed errors, but the overall picture remains the same.

Figure 2: Power of de-sparsified LASSO method (dezeure2015high) and residual randomization in the high-dimensional simulation of Section 6.1. The different panels indicate how the active parameters are generated. The figure aggregates power data for both s0=3 and s0=15.

Remark 6.1. The residual randomization test of this section can easily be adapted to test for heteroskedastic errors, or even clustered errors, by choosing the appropriate primitive 𝒢. It would be interesting to compare the result randomization test to the approaches of cattaneo2018inference and d2018cluster in the context of high-dimensional inference with heteroskedastic and clustered errors, respectively.

6.2 Autocorrelated errors

Here, we consider a model with autocorrelated errors,

yt=xtβ+εt, (23)

where t=1,,n, is used as the index, and indicates time. Suppose the following error structure:

εt=ρtεt-1+ut, (24)

where ut are independent and symmetric around zero, and |ρt|1. This structure is similar to AR(1) but we do allow some form of non-stationarity since ρt may not converge. Our proposal here relies on the simple observation that εt is conditionally symmetric around zero given εt-1=0, that is, εt=d-εt{εt-1=0}. We can therefore group the datapoints in consecutive clusters, such that at the cluster endpoints the errors are zero. In practice, since we don’t have access to the true errors, we can cluster the restricted residuals, εo^, in a few clusters, such that the endpoints correspond to the lowest values of |εo^|, and then perform a cluster sign test as defined in Section 4.3. The full procedure can be described as follows.

  1. 1.

    Calculate the restricted residuals, εo^=y-Xβo^.

  2. 2.

    Order their absolute values, |ε^to|, and select the J+1 smallest values. Denote the corresponding timepoints as t0,,tJ.

  3. 3.

    Define the clustering, Cn={{t0,,t1},{t1+1,,t2},,{tJ-1+1,tJ}}.

  4. 4.

    Perform the cluster sign test based on 𝒢𝗌(Cn) as defined in Section 4.3.

We refer to this cluster sign test as the reflection test, since it rests upon the assumption that the error time series, εt, can be reflected around the time axis. Because it can happen that |Cn|<J we distinguish two variants of the reflection test. The unconditional variant rejects when |Cn|<J, while the conditional variant only decides when |Cn|J, but is otherwise undefined. In terms of validity, Theorem 6 already contains the main intuition. Under our model assumptions the cluster sizes, nc, are bounded random variables. It follows that cnc2/n2p0, which by Theorem 6 is sufficient for asymptotic validity. The full proof needs to address the fact that the clustering here is random and approximate, as it conditions on values ε^to0 instead of these values being exactly zero. We leave the full proof for future work.

To illustrate the reflection test, we show results from a simulated study. We use the model of Equation (23) with the error structure of Equation (24). Throughout this study, we fix ρt=ρ, with ρ{0,0.3,0.5,0.8}. We set β0=β1=0 and report the rejection rates under the true null, H0:β1=0 on n=100 datapoints. There are four models for the covariates: (i) xit=N(0,1), or (ii) xit=LN(0,1), or (iii) xit=ρxi,t-1+N(0,1), or (iv) xit=ρxi,t-1+LN(0,1), where xit is the i-th component of xt, and LN is the log-normal. Thus, in settings (i) and (ii) xit are independent, and in settings (iii) and (iv) xit are autocorrelated. The errors are either εit=ρεi,t-1+N(0,1), or εit=ρεi,t-1+mix, where mix=.5N(-1,.252)+.5N(1,.252) was introduced in Section 5.2.2.

The results of the simulation are shown in Table 7. We include results on standard OLS as a strawman, and heteroskedasticity and autocorrelation consistent errors (HAC) (white1980heteroskedasticity; mackinnon1985some; andrews1991heteroskedasticity), the latter as provided by the vcovHAC function in R’s sandwich package (zeileis2004econometric). For the reflection test we set J=6, and report results for both the conditional and unconditional reflection test. The rejection rate for the conditional reflection test is calculated only over the cases for which the test returns a definite decision.

First, we focus on Panel (A) where ρ=0.3. We see that standard OLS is actually performing well, achieving in many cases the nominal 5% level. HAC errors are worse due to the small sample, and are strongly affected by lognormal covariates—for instance, rejection rates jump from 6.6% to 11.2% from Column (i) to Column (ii). Using a mixture distribution for the errors also seems to affect HAC errors disproportionately. For example, the aforementioned numbers are now 6.6% and 14.5%, respectively, when the ut errors are from the mixture. Next, we look at Panel (B) where ρ=0.8. We now see that standard OLS is greatly affected by the increased error autocorrelation. For example, with normal ut and autocorrelated xit, the rejection rates are now around 33% compared to 7% in Panel (A). HAC errors are more robust but they still are affected by lognormal and autocorrelated covariates, and generally reject 2× to 3× higher than the nominal level.

Panel (A): ρ=0.3
Error εt=ρεt-1+utut=
normal mixture
Covariates xt
iid autocorrelated iid autocorrelated
Method (i) (ii) (iii) (iv) (i) (ii) (iii) (iv)
OLS 0.052 0.054 0.073 0.078 0.053 0.050 0.073 0.071
HAC 0.066 0.112 0.065 0.112 0.066 0.145 0.070 0.130
reflection test, uncond. 0.031 0.030 0.034 0.034 0.045 0.048 0.042 0.042
reflection test, cond. 0.051 0.048 0.054 0.055 0.053 0.057 0.050 0.049
Panel (B): ρ=0.8
(i) (ii) (iii) (iv) (i) (ii) (iii) (iv)
OLS 0.048 0.048 0.341 0.339 0.049 0.050 0.336 0.346
HAC 0.050 0.087 0.104 0.128 0.053 0.097 0.102 0.141
reflection test, uncond. 0.022 0.023 0.024 0.027 0.031 0.029 0.032 0.030
reflection test, cond. 0.049 0.052 0.055 0.061 0.053 0.050 0.052 0.051
Table 7: Rejection rates for standard OLS, HAC errors, and the reflection test for the simulated study of Section 6.2.

The reflection tests compare favorably. They maintain the nominal level in virtually all settings. We also see that the unconditional test is more conservative. This is expected since the test does not reject when |Cn|<J. The conditional test decides when |Cn|J, and maintains the nominal level almost exactly.

Remark 6.2.  In our simulations, in 60-80% of replications we have |Cn|6 when ρ=0.3; when ρ=0.8 this number drops to 40-55%. In the extreme case where the autocorrelation is very high (e.g., ρ=0.99) the same number is about 1%. In such cases the conditional reflection test rarely returns a definite decision. This is actually a feature, not a bug. When ρ is very high, the error process is almost a random walk, indicating that inference is not possible. The reflection test indirectly captures this limitation by producing empty Cn, and not returning a value.

Remark 6.3. The reflection test does not require estimation of the unknown error correlation ρ, and even works when ρ is time-dependent. Furthermore, it may be straightforward to extend to autocorrelation lags of order higher than one by defining reflection points on consecutive residuals, e.g., conditioning on both ε^to0 and ε^t+1o0. This may work well when autocorrelation is small enough or when the data size is big enough.

7 Discussion

While the residual randomization method has performed well in extensive empirical evaluations with complex error structures, there are nonetheless several open questions.

First, all our theoretical results are on asymptotic validity, while power can only be indirectly studied through our theory; e.g., by looking at the variance of tn(Gεo^)-tn(Gε), the difference between the feasible and the idealized test statistic. It would be interesting to know, for example, when additional invariances can yield higher powered tests. Such result could offer insights on the relative power benefits of the randomization approach over a more standard bootstrap approach. This analysis is complicated because it not only depends on the particular error distribution, or the problem structure, but also on the choice of the test statistic. Second, residual randomization showed notable performance in settings with autocorrelated errors and high-dimensional regression, which are not covered by our theory. Developing theory for such settings is challenging. For example, even though the reflection test described in Section 6.2 is a cluster sign test, the clustering we condition on is such that the desirable invariance assumption holds only approximately. It would thus be interesting to extend Theorem 1 to include approximate error invariances, which are only exact in the limit. This is reminiscent of randomization testing under approximate symmetry conditions, as studied by canay2017randomization.

Finally, extending residual randomization beyond linear models is important. The key technical challenge would likely be that the difference tn(Gεo^)-tn(Gε), which determines the asymptotic validity of the randomization method, is no longer easy to express due to non-linearity. Perhaps we could extend the idea in Section 6.2 of residual randomization with autocorrelated errors. For instance, suppose we can express β^ as β^=β+h(y,X;β)+fn(ε), so that aβ^-ah(y,X;β)-a0=afn(ε) under the null. For the test statistic, Tn, we could use a plug-in estimate for h, i.e., define Tn=aβ^-ah(y,X;β^)-a0, and then compare its observed value to the set {afn(𝚐εo^)}.

8 Conclusion

In this paper, we developed a residual randomization method for inference in regression models with complex error structure. We proved asymptotic validity for a variety of structures, particularly clustered errors with one-way and two-way clustering. We also illustrated construction of an exact randomization test in the Behrens–Fisher problem. Through extensive simulations, we showed that the proposed method is more robust in small samples than asymptotic methods, and is more flexible and data-adaptive than bootstrap. Extensions to high-dimensional regression and models with autocorrelated errors were also considered with notable empirical success.

References

Appendix A Proofs of Main Theorems

Theorem 1.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2) with inferential primitive G. Let GUnif(G), and suppose that

(XX)-1(XGX)pbI,b0. (10)

Then, the residual randomization procedure of Section 2.1 is asymptotically valid; that is,

limsupnE(ϕ(y;X))α.
Proof.

Let Sn=(1/n)XX. By Assumption 1(b), SnS. Moreover,

tn(u)=na(XX)-1Xu=aSn-11nXu.

The convergence of constrained OLS, n(βo^-β)dω, for some random variable ω, is guaranteed by Assumptions 1(a)-(c), and the convergence of the unconstrained estimator, n(β^-β)=Sn-11nXεdS-1𝒵. Furthermore, aω=0, since aβo^=a0 by definition of the constrained least squares estimator.

In the randomization test we sample GrUnif(𝒢), r=1,,R, as i.i.d. samples uniformly from 𝒢, where R is fixed. By assumption of the theorem, for every r=1,,R,

Sn-11n(XGrX)pbI.

Thus, we obtain successively:

tn(Grεo^)-tn(Grε) =aSn-11nX(Grεo^-Grε)
=aSn-11nX[Gr(εo^-ε)]  [From linearity of Gr]
=-aSn-11nX{Gr(X(βo^-β))}  [since εo^-ε=X(β-βo^)]
=-aSn-1(1nXGrX)=bIn(βo^-β)da(bI)ω  [from Slutsky’s theorem]
=-baω=0.  [from constrained OLS] (11)

Since tn(Grεo^)-tn(Grε)d0, convergence in probability also follows, tn(Grεo^)=tn(Grε)+oP(1).

Now, we turn our attention to the variables Tn(r)tn(Grε), r=1,R. The test statistic is defined as Tn=n(aβ^-a0), and so Tn=H0tn(ε) under the null hypothesis. By Assumption 1(c) we have

Tn=H0tn(ε)=aSn-11nXεdaS-1𝒵T. (12)

Thus, under the null, Tn converges in distribution. By the invariance assumption, ε=dGrε, variable Tn(r) also has a limit law, such that

Tn(r)dTr,andTr=dT. (13)

Furthermore, define T^n(r)=tn(Grεo^), so that T^n(r)=Tn(r)+oP(1)dTr. Then, it follows

(Tn,{T^n(r):r=1,,R})d(T,{Tr:r=1,,R}),

and so the randomization test that compares Tn with {T^n(r):r=1,,R} is asymptotically valid, i.e.,

limsupnP{Tnq1-α({T^n(r):r=1,,R})}P{Tq1-α({Tr:r=1,,R})}α,

where q is the quantile function, and the first inequality follows from the Portmanteau theorem (lehmann2006testing, Theorem 11.2.1) and the continuous mapping theorem; the last inequality follows from Equation (13) and the randomization test property (lehmann2006testing, Theorem 15.2.2). ∎

Theorem 2.

For fixed n, consider a clustering Cn with |Cn|=J<. Suppose that for every cluster c there exists bcR such that XcXc=bcXX, where Xc=DcX are the cluster covariates. For some inferential primitive G, suppose that every member transform, gG, can be decomposed as g=c=1JscDc, where scR. Then, the residual randomization test with G as the primitive is exact in finite samples, i.e., E(ϕ(y,X))=α.

Proof.

By definition, Xc=𝔻cX and 𝔻c𝔻c=𝔻c since 𝔻c is binary-valued and diagonal. Thus, XcXc=X𝔻c𝔻cX=X𝔻cX. In Equation (A) of the proof of Theorem 1 we now have:

tn(Grεo^)-tn(Grε) =-a(XX)-1(1nXGrX)n(βo^-β)
=-a(XX)-1[1nX(cscr𝔻c)X]n(βo^-β)  [we use the representation Gr=cscr𝔻c]
=-cscra(XX)-1[1nX𝔻cX]n(βo^-β)
=-cscra(XX)-1(1/n)XcXcn(βo^-β)  [because X𝔻cX=XcXc]
=-c(1/n)scrbca(βo^-β)  [by cluster similarity assumption]
=-c(1/n)scrbc(a0-a0)=0.   [by constrained OLS, βo^ and the null hypothesis].

This result implies that the randomization values of the test statistic, tn(Grεo^), are exactly equal to the true values, tn(Grε). The test is therefore exact, i.e., valid in finite samples for any finite n. ∎

Appendix B Proofs of Validity Theorems

B.1 Preliminaries

For every column x of covariate matrix X, it follows from Assumption 1(a) that

x¯=1ni=1nxi=O(1),andx2¯=1ni=1nxi2=O(1). (14)

Boundedness for the first moment follows from Jensen’s inequality, e.g., x¯2x2¯=O(1). Using the notation of Section 2.3, for a set A{1,,n} the vector 𝟙A is one at i-th element only if iA, and is zero otherwise. Also, 𝟙i is shorthand for 𝟙{i}. The matrix 𝔻A=𝟙A𝟙A is the n×n diagonal matrix that is one at the (i,i)-th element only if iA, and is zero otherwise.

B.2 Main proofs

Theorem 3.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp. Then, the residual randomization test using Gp as the primitive is asymptotically valid.

Proof.

The proof technique for the validity theorems of this section and the next is as follows. First, we consider the matrix XGX, which is a random p×p matrix with m-th element equal to xGz, where x denotes the -th column of X, and z the m-th column. Suprressing the dependence on and m, we focus on the quantity wn=1nxGz. Generalizing from an expression for wn can lead us to a result for XGX. Then, we can check whether the condition in Equation (10) of Theorem 1 holds.

In the permutation context, G=i=1n𝟙i𝟙π(i), where π is a random permutation from Πn. Then,

E(zπ(i))=1nj=1nzj=z¯,andE(zπ(i)2)=1nj=1nzj2=z2¯, (15)

where the expectation is with respect to the randomness of π. Furthermore, for fixed i,j, ij,

E(zπ(i)zπ(j)ij) =E{zπ(j)E(zπ(i)π(j),ij)}
=E(zπ(j)k=1,kπ(j)n1n-1zk)=1n-1E{zπ(j)(nz¯-zπ(j))}=1n-1(nz¯2-z2¯). (16)

Also,

xGz =x(i=1n𝟙i𝟙π(i))z=i=1n(x𝟙i)(𝟙π(i)z)=i=1nxizπ(i), (17)

where the last equality follows from the property 𝟙ix=xi, by definition of 𝟙i (see Section B.1). Then,

E(wn)=1nxE(G)z =1ni=1nE(xizπ(i)))=1ni=1nxiE(zπ(i))=1ni=1nxiz¯=x¯z¯. (18)

Here, the first equation follows from (17), and the second from (15).

It follows that E(1nXGX)=μμ, where μ=X𝟙/n are the column sample means of X. Next, we derive the variance of wn.

Var(wn) =E(wn2)-x¯2z¯2=1n2E{(i=1nxizπ(i))2}-x¯2z¯2  [from Equation (17)]
=1n2i=1nxi2E(zπ(i)2)+1n2ijxixjE(zπ(i)zπ(j))-x¯2z¯2
=1n2i=1nxi2z2¯+1n2(n-1)ijxixj(nz¯2-z2¯)-x¯2z¯2  [from Eqs. (15) and (B.2)]
=1nx2¯z2¯+(nz¯2-z2¯)n(n-1)(nx¯2-x2¯)-x¯2z¯2  [since ijxixj=ixijixj=n2x¯2-nx2¯.]
=1n-1(x2¯-x¯2)(z2¯-z¯2)1n-1sx2sz2, (19)

where s2 indicates sample variance. All variances are bounded by Assumption 1, and so Var(wn)p0. Therefore we obtain: 1nXGXpμμ. Note that X has ones as the first column for the intercept term. Then, the first column of (XX)/n is equal to μ, the column sample means. In vector form: (XX)𝟙1/n=μ, where 𝟙1 is the p×1 vector with only the first element being one. It follows that

[(XX)𝟙1/n]μ=μμ(XX)-1μμ=𝟙1μ/n.

This means that

Ann(XX)-11nXGXp𝟙1μ.

Let ωn=n(βo^-β), for which we know that ωndω from Theorem 1. Equation (A) in the proof of Theorem 1 can now be written as:

tn(Grεo^)-tn(Grε) =-aAnωn=-a𝟙1μωn+oP(1)
=-a1n(𝟙X(βo^-β)/n)+oP(1)  [since μ=X𝟙/n]
=-a1n(εo^¯-ε¯)+oP(1)p0,[from LLN] (20)

This result implies that tn(Gεo^)=tn(Gε)+oP(1). The rest of the proof in Theorem 1 goes through to prove validity even though the condition in Equation (10) of the theorem does not hold. ∎

Theorem 4.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGs. Then, the residual randomization test using Gs as the primitive is asymptotically valid.

Proof.

Here, GUnif(𝒢𝗌) is equivalent to G=i=1nSi𝟙i𝟙i, where S1,,Sn are independent random signs. As in the proof of Theorem 3, let x,z be any two columns of X, and define wn=1nxGz. Then,

1nxGz =1nx(i=1nSi𝟙i𝟙i)z=1ni=1nSi(x𝟙i)(𝟙iz)=1ni=1nSixizi. (21)

It follows that,

E(wn)=1nxE(G)z =1ni=1nE(Sixizi)  [from Equation (21)]
=1ni=1nxiziE(Si)
=0  [Since Si is random sign]. (22)

Next, we derive the variance of wn.

Var(wn)=E(wn2) =1n2E{(i=1nSixizi)2}  [from Equation (B.2)]
=1n2i=1nxi2zi2E(Si2)+1n2ijxixjzizjE(SiSj)
=1n2i=1nxi2zi2+1n2ijxixjzizjE(Si)=0E(Sj)=0  [since Si,Sj are independent when ij]
1n1ni=1nxi21ni=1nzi2  [from Cauchy-Schwartz inequality]
=1n(x2¯)1/2(z2¯)1/2=O(1/n)0,

where the last limit follows from Equation (14). Thus, 1nXGXp0, condition (10) of Theorem 1 holds, and the test is asymptotically valid. ∎

Theorem 5.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. Then, the residual randomization test using Gp(Cn) as the primitive is asymptotically valid if the clusters are centered; or if the homogeneity condition of Equation (15) holds and mincnc.

Proof.

Here, GUnif{𝒢𝗉(Cn)} is equivalent to G=ci[c]𝟙i𝟙π(i), where π is a random within-cluster permutation from Π(Cn). Let x,z be any two columns of X, and define wn=1nxGz. For some i[c]:

E(zπ(i))=1nci[c]zizc¯,andE(zπ(i)2)=1nci[c]zi2zc2¯. (23)

For cluster c:

i[c],j[c],ijxixj=i[c]xij[c],jixj=i[c]xi(ncxc¯-xi)=nc(ncxc¯2-xc2¯). (24)

We follow similar steps to Equation (23) to obtain

E(zπ(i)zπ(j)i,j[c],ij) =E{zπ(j)E(zπ(i)π(j),i,j[c],ij)}=E{zπ(j)nczc¯-zπ(j)nc-1j[c]}
=1nc-1(nczc¯2-zc2¯). (25)

By definition,

xGz =x(c=1Jni[c]𝟙i𝟙π(i))z=ci[c](x𝟙i)(𝟙π(i)z)=ci[c]xizπ(i). (26)

Note that we keep the double summation to distinguish between clusters. We derive the expectation of wn:

E(wn)=1nxE(G)z =1nci[c]E(xizπ(i))  [from Equation (26)]
=1nci[c]xiE(zπ(i))
=1nci[c]xizc¯
=c(nc/n)xc¯zc¯mn. (27)

Next, we derive the variance.

Var(wn)=E(wn2)-mn2 =1n2E{(ci[c]xizπ(i))2}-mn2  [from Equation (26)]
=1n2ci[c]xi2E(zπ(i)2)+1n2ci[c],j[c],ijxixjE(zπ(i)zπ(j))
       +1n2c,c,cci[c],j[c]xixjE(zπ(i)zπ(j))-mn2
=1n2ci[c]xi2zc2¯+1n2ci[c],j[c],ijxixj1nc-1(nczc¯2-zc2¯)
       +1n2c,c,cci[c],j[c]xixjE(zπ(i)zπ(j))-mn2  [from Eqs. (23) and (B.2)]
=1ncncnxc2¯zc2¯(A)+c(nczc¯2-zc2¯)n2(nc-1)nc(ncxc¯2-xc2¯)(B)
       +1n2c,c,cci[c],j[c]xixjE(zπ(i)zπ(j))(C)-mn2  [from Equation (24)] (28)

We consider the above terms successively. To simplify notation, we define

Rcncnxc¯zc¯.

Notice that Rc=O(1) and mn=cRc.

Term (A). First, xc2¯=O(1) and zc2¯=O(1) from the within-cluster boundedness assumption. Thus, the term (A) is asymptotically negligible:

(A)=1ncncn=1O(1)=O(1/n).

Term (B).  The term be can be approximated as follows,

(B)=c(nczc¯2-zc2¯)n2(nc-1)nc(ncxc¯2-xc2¯)=cRc2[1+O(1nc)]+O(ncn2).

Term (C). Because permutations are independent across clusters, for i[c],j[c], with cc, we have E(zπ(i)zπ(j))=E(zπ(i))E(zπ(j))=zc¯zc¯. Thus,

(C)=c,c,ccncncn2xc¯zc¯xc¯zc¯.

By definition of Rc, we obtain (C)=c,c,ccRcRc.

We use these results on the individual terms to finish the calculation of the variance.

Var(wn) =(A)+(B)+(C)-(cRc)2
=O(1/n)+cRc2+cRc2O(1/nc)+cO(nc/n2)+c,c,ccRcRc-(cRc)2
=O(1/n)+cRc2O(1/nc)+cO(nc/n2)
=O(1/n)+cO(nc2/n2)O(1/nc)+cO(nc/n2)=O(1/n)+1nO(cnc/n)=O(1/n)0.

When the cluster covariates are centered, so that xc¯=0 for every column x, it follows that wnp 0 and so 1nXGXp0. This implies that the test is asymptotically valid by Theorem 1. Regarding the homogeneity condition, this is equivalent to:

1ncXcXcmcΣ.

When this condition holds and when mincnc, then we have essentially independent permutation tests within clusters, and Theorem 3 thus can be applied within each cluster. ∎

Theorem 6.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGs(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. If Jn=J is fixed and mincnc, then the residual randomization test using Gs(Cn) as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds. If Jn, then the test is valid also when c=1Jnnc2/n20.

Proof.

Here, GUnif{𝒢𝗌(Cn)} is equivalent to G=c=1JnSci[c]𝟙i𝟙i, where Sc are independent random cluster signs. As in the preceding proofs, let x,z be any two columns of X, and let us focus on the quantity wn=1nxGz. By definition,

xGz =x(cSci[c]𝟙i𝟙i)z=cSci[c](x𝟙i)(𝟙iz)=cSci[c]xizirccScrc.

It holds that ScrcScrc for cc since Sc,Sc are independent. Thus,

E(wn)=1nxE(G)z =1nE(cScrc)
=1ncE(Sc)rc
=0  [Since Sc is random sign]. (29)

Next, we derive the variance.

Var(wn)=E(wn2) =1n2E{(cScrc)2}=1n2cE(Sc2rc2)+1n2ccE(ScrcScrc)
=1n2cE(Sc2)=1rc2+1n2ccE(Scrc)=0E(Scrc)=0  [since Sc,Sc are iid random signs.]
=1n2crc2=1n2ci[c]xi2zi2+1n2ci,j[c],ijxixjzizj
=1n2i=1nxi2zi2O(1/n),proof of Theorem 4+1n2ci[c]xizij[c],jixjzj
=O(1/n)+1n2ci[c]xizi(ncxzc¯-xizi)   [where xzc¯=(1/nc)i[c]xizi].
=O(1/n)+c(nc2/n2)(xzc¯)2-1nc(nc/n)xzc2¯   [where xzc2¯=(1/nc)i[c]xi2zi2].
=c(nc2/n2)(xzc¯)2+O(1/n). (30)

When cnc2/n20 then wnp0. This implies 1nXGXp0 and so the test is valid from Theorem 1.

When Jn=J is fixed and mincnc, and the homogeneity condition holds, then we can adapt the proof of Theorem 2; the only difference is that here tn(Grεo^)-tn(Grε)=oP(1) under the homogeneity condition, whereas the difference is exactly zero in Theorem 2. ∎

Theorem 7.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp+s(Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters. If Jn=J is fixed and mincnc, then the residual randomization test using Gp+s(Cn) as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds, or when the clusters are centered. If Jn, then the test is valid also when cnc2/n20.

Proof.

Here, GUnif{𝒢𝗉+𝗌(Cn)} is equivalent to G=c=1JnSci[c]𝟙i𝟙π(i), where Sc are independent random cluster signs, and π is a random permutation within the clusters defined by Cn. As in the preceding proofs, let x,z be any two columns of X, and let us focus on the quantity wn=1nxGz. By definition,

xGz =x(cSci[c]𝟙i𝟙π(i))z=cSci[c](x𝟙i)(𝟙π(i)z)=cSci[c]xizπ(i)rccScrc. (31)

We have ScrcScrc for cc since Sc,Sc are independent. Thus,

E(wn)=1nxE(G)z =1nE(cScrc)=1ncE(Sc)rc=0.  [Since Sc is random sign].

Next, we derive the variance.

Var(wn)=E(wn2) =1n2E{(cScrc)2}=1n2cE(Sc2rc2)+1n2ccE(ScRcScRc)
=1n2cE(Sc2)=1rc2+1n2ccE(Scrc)=0E(Scrc)=0  [since Sc,Sc are iid random signs.]
=1n2crc2=1n2cE(i[c]xi2zπ(i)2+i,j[c],jixixjzπ(i)zπ(j))
=1n2ci[c]xi2E(zπ(i)2)+1n2ci,j[c],jixixjE(zπ(i)zπ(j))
=1n2ci[c]xi2zc2¯+1n2ci,j[c],jixixj1nc-1(nczc¯2-zc2¯)  [from Equation (24)]
=1ncncnxc2¯zc2¯=O(1/n)+c(nczc¯2-zc2¯)n2(nc-1)nc(ncxc¯2-xc2¯)  [from Equation (B.2)]
=O(1/n)+c(nc2/n2)(xc¯)2(zc¯)2.  [see last steps of Theorem 5]

As before, when cnc2/n20 then wnp0. This implies 1nXGXp0 and so the test is valid from Theorem 1. The same holds when the clusters are centered, so that xc¯=0 for every column of X.

When Jn=J is fixed and mincnc, then conditional on the cluster signs we have a permutation test within each cluster. Hence, Theorem 3 can be applied to show that the test is valid conditionally on the signs, and thus it is valid unconditionally as well. ∎

Theorem 8.

Suppose that Assumptions 1(a)-(c) hold for the regression model of Equation (2), and ε=dgε, for all gGp(Rn,Cn). Suppose also that Assumptions 1(a)-(b) hold within clusters with respect to both Rn and Cn. When the number of clusters of both Rn,Cn grows indefinitely, the residual randomization test using Gp(Rn,Cn) as the primitive is asymptotically valid. When only the row (or only the column) cluster grows indefinitely, the same randomization test is valid if the covariates have been centered column-wise (or row-wise).

Proof.

Let G be a random two-way permutation matrix. We can write this as G=i𝟙i𝟙π(i), where the two-way structure is encoded in π according to Equation (18). As in the main text let r(i){1,,m} denote the row of data point i, and c(i){1,,m} the column of i.

Case 1: Row/column clusterings’ size . In this case, let m be the size of clusterings Rn,Cn, so that n=m2. We assume same size for the two clusterings to simplify the analysis. The proof still goes through even when the sizes are different, and the test only converges when both sizes grow indefinitely.

For some row r (or column c) we also define:

x¯r=1mi,r(i)=rxi,andx¯c=1mi,c(i)=cxi,
x2¯r=1mi,r(i)=rxi2,andx2¯c=1mi,c(i)=cxi2,

Despite the two-way structure, π maps any datapoint i to any other datapoint in {1,,n} with equal marginal probability. We obtain the properties:

E(xπ(i)) =1ni=1nxi=x¯.
E(xπ(i)2) =1ni=1nxi2=x2¯.
E(xπ(i)x¯r(π(i))) =E(x¯rE(xπ(i)r(π(i))=r))=E((x¯r)2)=1mr=1m(x¯r)2τx,R2.
E(xπ(i)x¯c(π(i))) =E(x¯cE(xπ(i)c(π(i))=c))=E((x¯c)2)=1mc=1m(x¯c)2τx,C2. (32)

The quantity E(xπ(j)π(i),ji) will also be useful, and is complicated by the structure of the two-way clustering. For example, if j and i are in the same row, so that r(j)=r(i), then we know that j will be moved by π to the same row as i; i.e., r(π(j))=r(π(i)). In this case, the expected value of xπ(j) is the average of row r(π(i)) except for where i is mapped to; i.e., the expected value is 1m-1(mx¯r(π(i))-xπ(i)). With similar calculations for the other two cases where j and i share the same column or share no row or column, we get:

E(xπ(j)π(i),ji) ={1m-1(mx¯r(π(i))-xπ(i)),ifr(j)=r(i);1m-1(mx¯c(π(i))-xπ(i)),ifc(j)=c(i);1n-2m+1(nx¯-mx¯r(π(i))-mx¯c(π(i))+xπ(i)),otherwise. (33)

From this, we obtain

E(xπ(i)xπ(j)ji)=E(xπ(i)E(xπ(j)π(i),ji))={1m-1(mτx,R2-x2¯),ifr(j)=r(i);1m-1(mτx,C2-x2¯),ifc(j)=c(i);1n-2m+1[n(x¯)2-mτx,R2-mτx,C2+x2¯],otherwise. (34)

Let x,z be any two columns of X, and define wn=1nxGz. We derive the mean of this variable:

E(wn) =E(1nxGz)=1nE(i=1nxizπ(i))=1ni=1nxiE(zπ(i))=1ni=1nxiz¯=x¯z¯, (35)

which follows from Equations (B.2). From this we conclude that

E(1nXGX)=μμ,μ=X𝟙/n.

This is the same result as in the non-clustered permutation test in Theorem 3.

Next, we derive the variance.

Var(wn)=E(wn2)-x¯2z¯2 =1n2E{(i=1nxizπ(i))2}-x¯2z¯2  [from Equation (35)]
=1n2i=1nxi2E(zπ(i)2)(A)+1n2ijxixjE(zπ(i)zπ(j))(B)-x¯2z¯2. (36)

We analyze the terms individually. For the first one,

(A)=1n2i=1nxi2z2¯=1nx2¯z2¯. (37)

For the second term,

(B) =1n2i=1nxij=1,jinxjE(zπ(i)zπ(j)ji)
=1n2i=1nxi[ji(r(j)=r(i)xjmτz,R2-z2¯m-1+c(j)=c(i)xjmτz,C2-z2¯m-1+otherxjnz¯2-mτz,R2-mτz,C2+z2¯n-2m+1)]
=1n2i=1nxi[(mτz,R2-z2¯)(mx¯r(i)-xi)m-1+(mτz,C2-z2¯)(mx¯c(i)-xi)m-1
verylongeqs.+[nz¯2-mτz,R2-mτz,C2+z2¯](nx¯-mx¯r(i)-mx¯c(i)+xi)n-2m+1].
=1n2(mτz,R2-z2¯)(m3τx,R2-nx2¯)m-1+1n2(mτz,C2-z2¯)(m3τx,C2-nx2¯)m-1
verylongeqs.+1n2[nz¯2-mτz,R2-mτz,C2+z2¯](n2x¯2-m3τx,R2-m3τx,C2+nx2¯)n-2m+1]. (38)

where we used ixix¯r(i)=m2τx,R2 and ixix¯c(i)=m2τx,C2. To simplify, for some column u define σu,R2=τu,R2-(1/m)u2¯ and σu2=u¯2-(1/m)τu,R2-(1/m)τu,C2+(1/m2)u2¯. Because n=m2 we obtain

(B) =1m-1(σx,R2σz,R2+σz,C2σz,C2)+nn-2m+1σx2σz2.

We gather the terms (A) and (B) to obtain:

Var(wn)=1nx2¯z2¯+1m-1(σx,R2σz,R2+σz,C2σz,C2)+nn-2m+1σx2σz2-x¯2z¯2=O(1/m),

following from the boundedness of first and second moments within rows and columns. If m, then wn-x¯z¯p0, as in Theorem 3. The test is thus valid asymptotically by Theorem 1.

Case 2: Size of only one clustering (either row or column) . Here, we consider the case where the size of only one clustering grows indefinitely, whereas the size of other is fixed. Without loss of generality suppose that the column clustering diverges. Define the respective sizes as |Rn|=mR< and |Cn|=mC, such that n=mRmC.

With the new definitions, we have:

x¯r=1mCi,r(i)=rxi,andx¯c=1mRi,c(i)=cxi,
x2¯r=1mCi,r(i)=rxi2,andx2¯c=1mRi,c(i)=cxi2,
E(xπ(i)x¯r(π(i)))=1mRr=1mR(x¯r)2τx,R2.
E(xπ(i)x¯c(π(i)))=1mCc=1mC(x¯c)2τx,C2. (39)
E(xπ(j)π(i),ji) ={1mC-1(mCx¯r(π(i))-xπ(i)),ifr(j)=r(i);1mR-1(mRx¯c(π(i))-xπ(i)),ifc(j)=c(i);1n-mR-mC+1(nx¯-mCx¯r(π(i))-mRx¯c(π(i))+xπ(i)),otherwise. (40)

From this, we obtain

E(xπ(i)xπ(j)ji)=E(xπ(i)E(xπ(j)π(i),ji))={1mC-1(mCτx,R2-x2¯),ifr(j)=r(i);1mR-1(mCτx,C2-x2¯),ifc(j)=c(i);1n-mR-mC+1[n(x¯)2-mCτx,R2-mRτx,C2+x2¯],otherwise. (41)

As before, E(wn)=x¯z¯. For the variance:

Var(wn)=1n2i=1nxi2E(zπ(i)2)(A)+1n2ijxixjE(zπ(i)zπ(j))(B)-x¯2z¯2.

We analyze the terms individually. For the first one,

(A)=1n2i=1nxi2z2¯=1nx2¯z2¯.

For the second term,

(B) =1n2i=1nxij=1,jinxjE(zπ(i)zπ(j)ji)
=1n2i=1nxi[ji(r(j)=r(i)xjmCτz,R2-z2¯mC-1+c(j)=c(i)xjmRτz,C2-z2¯mR-1+otherxjnz¯2-mCτz,R2-mRτz,C2+z2¯n-mR-mC+1)]
=1n2i=1nxi[(mCτz,R2-z2¯)(mCx¯r(i)-xi)mC-1+(mRτz,C2-z2¯)(mRx¯c(i)-xi)mR-1
verylongeqs.+[nz¯2-mCτz,R2-mRτz,C2+z2¯](nx¯-mCx¯r(i)-mRx¯c(i)+xi)n-mR-mC+1].
=1n2(mCτz,R2-z2¯)(nmCτx,R2-nx2¯)mC-1+1n2(mRτz,C2-z2¯)(nmRτx,C2-nx2¯)mR-1
verylongeqs.+1n2[nz¯2-mCτz,R2-mRτz,C2+z2¯](n2x¯2-nmCτx,R2-nmRτx,C2+nx2¯)n-mR-mC+1]. (42)

where we used ixix¯r(i)=mCmRτx,R2=nτx,R2; and, similarly, ixix¯c(i)=nτx,C2. To simplify, for some column u define σu,R2=τu,R2-(1/mC)u2¯ and σu2=u¯2-(mC/n)τu,R2-(mR/n)τu,C2+(1/n)u2¯.

For these definitions, it follows

(B) =mC2n(mC-1)σx,R2σz,R2+mR2n(mR-1)σz,C2σz,C2+nn-mR-mC+1σx2σz2.

We gather the terms (A) and (B). Since mR=O(1) and mC=n/mR, we finally get: