Abstract
We develop a randomizationbased method for inference in regression models.The basis of inference is an invariance assumption on the regression errors,such as invariance to permutations or random signs. To test significance, therandomization method repeatedly calculates a suitable test statistic overtransformations of the regression residuals according to the invariant.Inversion of the test can produce confidence intervals. We prove generalconditions for asymptotic validity of this residual randomization test andillustrate in many models, including clustered errors with oneway or twowayclustering structure. We also show that finitesample validity is possibleunder a suitable construction, and illustrate with an exact test for a case ofthe BehrensFisher problem. The proposed method offers four main advantagesover the bootstrap: (1) it addresses the inference problem in a unified way,while bootstrap typically needs to be adapted to the task; (2) it can be morepowerful by exploiting a richer and more flexible set of invariances thanexchangeability; (3) it does not rely on asymptotic normality; and (4) it canbe valid in finite samples. In extensive empirical evaluations, including highdimensional regression and autocorrelated errors, the proposed method performsfavorably against many alternatives, including bootstrap variants andasymptotic robust error methods.
Quick Read (beta)
Life After Bootstrap:
Residual Randomization Inference in Regression Models
Abstract
We develop a randomizationbased method for inference in regression models. The basis of inference is an invariance assumption on the regression errors, such as invariance to permutations or random signs. To test significance, the randomization method repeatedly calculates a suitable test statistic over transformations of the regression residuals according to the invariant. Inversion of the test can produce confidence intervals. We prove general conditions for asymptotic validity of this residual randomization test and illustrate in many models, including clustered errors with oneway or twoway clustering structure. We also show that finitesample validity is possible under a suitable construction, and illustrate with an exact test for a case of the Behrens–Fisher problem. The proposed method offers four main advantages over the bootstrap: (1) it addresses the inference problem in a unified way, while bootstrap typically needs to be adapted to the task; (2) it can be more powerful by exploiting a richer and more flexible set of invariances than exchangeability; (3) it does not rely on asymptotic normality; and (4) it can be valid in finite samples. In extensive empirical evaluations, including highdimensional regression and autocorrelated errors, the proposed method performs favorably against many alternatives, including bootstrap variants and asymptotic robust error methods.
Keywords: bootstrap, randomization, invariance, inferential primitive, exact inference.
Contents
 1 Introduction
 2 The residual randomization method
 3 Simple error structures
 4 Clustered error structures
 5 Applications and experiments
 6 Extensions
 7 Discussion
 8 Conclusion
 A Proofs of Main Theorems
 B Proofs of Validity Theorems
 C Concrete Algorithms
 D Details on simulations
 E Additional experiments
1 Introduction
The bootstrap (efron1979bootstrap; efron1981nonparametric) is a widely popular method for nonparametric estimation of standard errors. In a regression model,
$${y}_{i}={x}_{i}^{\prime}\beta +{\epsilon}_{i},i=1,\mathrm{\dots},n,$$  (1) 
the residual bootstrap (freedman1981bootstrapping; efron1986bootstrap) uses an estimate $\widehat{\beta}$ (e.g., from OLS) to bootstrap datasets $\{({y}_{i}^{\ast},{x}_{i})\}$, where ${y}_{i}^{\ast}={x}_{i}^{\prime}\widehat{\beta}+{\widehat{\epsilon}}_{i}^{\ast}$, and $({\widehat{\epsilon}}_{1}^{\ast},\mathrm{\dots},{\widehat{\epsilon}}_{n}^{\ast})$ is a sample with replacement from the residuals, ${\widehat{\epsilon}}_{i}={y}_{i}{x}_{i}\widehat{\beta}$, or the centered residuals, ${\widehat{\epsilon}}_{i}\overline{\widehat{\epsilon}}$. The residual bootstrap is more flexible and generally adapts better to the problem than other bootstrap alternatives^{2}^{2} 2 Another approach is the pairs bootstrap (freedman1981bootstrapping), which generates $\{({y}_{i}^{\ast},{x}_{i}^{\ast})\}$ by sampling with replacement from the original data. Because pairs bootstrap does not condition on $x$, it does not adapt to the problem geometry well, and generally performs poorly in small samples (flachaire2005bootstrapping; mackinnon2002bootstrap). because it conditions on covariates, while its asymptotics are well understood (freedman1981bootstrapping; peters1984some; bickel1983bootstrapping).
To be valid asymptotically, the bootstrap requires some form of exchangeability of the errors. When such exchangeability does not hold, the method may have to be altered substantially. For instance, in regression models with clustered errors, the “cluster wild bootstrap” method (cameron2008bootstrap) resamples the entire cluster residuals with replacement, and switches their signs at random. Moreover, although it is not necessary in theory, bootstrap methods in practice rely on asymptotic normality, which is generally not plausible with small samples. This requires additional complex alterations, such as acceleration or refinement (diciccio1988review; diciccio1996bootstrap). In short, a bootstrap approach needs to adapt the problem to the method, rather than the method to the problem.
As a robust alternative for inference we propose the residual randomization method. To test a linear hypothesis on parameters, ${H}_{0}:{a}^{\prime}\beta ={a}_{0}$, with $a\in {\mathbb{R}}^{p},{a}_{0}\in \mathbb{R}$ fixed, the proposed method relies on the following two key ideas. First, we make an invariance assumption on the errors, such as $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, where $\U0001d690\in \mathcal{G}$ and $\mathcal{G}$ is an algebraic group of ${\mathbb{R}}^{n}\to {\mathbb{R}}^{n}$ linear maps. For example, if $\mathcal{G}$ is the set of all permutations, the assumption is equivalent to exchangeability of errors. Second, we choose a test statistic ${T}_{n}$, such that ${T}_{n}={t}_{n}(\epsilon )$ under the null hypothesis, and ${t}_{n}:{\mathbb{R}}^{n}\to \mathbb{R}$ has an asymptotic distribution law. For example, ${T}_{n}=\sqrt{n}{a}^{\prime}(\widehat{\beta}\beta )$ and ${t}_{n}(u)=\sqrt{n}{a}^{\prime}{({X}^{\top}X)}^{1}{X}^{\top}u$ satisfy this requirement, where $\widehat{\beta}$ is the standard OLS estimator. Then, inference on $\beta $ that is valid in finite samples is possible, in principle, by comparing the observed value of ${T}_{n}$ against the set of alternative values $\{{t}_{n}(\U0001d690\epsilon ):\U0001d690\in \mathcal{G}\}$.
In practice, however, this method is infeasible because the errors are generally unknown, except when we are testing a joint hypothesis on all $\beta $. The feasible method is then to compare ${T}_{n}$ with respect to the set $\{{t}_{n}(\U0001d690\widehat{{\epsilon}^{\mathrm{o}}}):\U0001d690\in \mathcal{G}\}$, where $\widehat{{\epsilon}^{\mathrm{o}}}$ are the residuals from OLS constrained under ${H}_{0}$. The feasible method is no longer valid in finite samples, in general, but is asymptotically valid under certain conditions depending on the choice of $\mathcal{G}$. An overarching sufficient condition for validity, which we prove in this paper (see Theorem 1), is that $(1/n){X}^{\top}GX$ is similar to $(1/n){X}^{\top}X$, in the sense that ${({X}^{\top}X)}^{1}{X}^{\top}GX\stackrel{p}{\to}bI$, where $G\in \mathrm{Unif}(\mathcal{G})$, $b$ is scalar and $I$ is the identity matrix. Intuitively, this ensures that the choice of $\mathcal{G}$ does not alter substantially the information structure of the problem, so that ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}})$ is sufficiently close to ${t}_{n}(G\epsilon )$. Moreover, a construction of $\mathcal{G}$ may be possible so that the aforementioned similarity condition holds exactly in the sample, leading to a randomization test of significance that achieves the nominal level for any finite $n$.
In related work, the idea of residual randomization was first developed in two seminal papers by freedman1983nonstochastic; freedman1983significance. Their key result was to show that the permutation distribution of a suitable test statistic converges to a standard distribution. freedman1983significance described their idea as the extension of Fisher’s randomization test (fisher1935design) to observational studies. Interestingly, Freedman did not carry out the extension to more complex $\mathcal{G}$, but instead produced an important series of papers on bootstrapbased variations of his permutation inference method, including the residual bootstrap (bickel1981some; freedman1981bootstrapping; peters1984some; freedman1984bootstrapping). This work stimulated an important sequence of papers on permutation tests in linear regression models, which highlighted the asymptotic properties and power of residual randomization (anderson2001permutation; kennedy1995randomization; manly2018randomization; ter1992permutation). All of these methods, including the original work of freedman1983nonstochastic, also rely on residual transformations, but only consider exchangeability as the main invariance assumption, i.e., $\mathcal{G}$ denotes only permutations. Furthermore, they do not analyze complex error structures, such clustered errors with oneway or twoway clustering, as we do in this paper.
Recently, randomization tests have received increased attention, especially in causal inference, due to their robustness and minimal assumptions (imbens2015causal; rosenbaum2002observational; lehmann2006testing; gerber2012field; athey2018exact; basse2019randomization; ding2016randomization; ding2017paradox; cattaneo2015randomization; canay2017randomization; canay2018wild; chernozhukov2017exact). In the context of clustered errors, canay2018wild have made the important connection between randomization testing and “wild bootstrap” (wu1986jackknife). Specifically, they proved asymptotic validity of the wild bootstrap when clusters grow in size but are finite in number, under the homogeneity condition that ${X}^{\top}X$ within each cluster is similar to the population matrix; that is, canay2018wild consider $\mathcal{G}$ as the set of diagonal matrices with clustered random signs on the diagonal. In Section 4 of this paper, we extend this result and show that validity is possible even without the homogeneity condition, provided that the errors have some suitable of symmetry, such as exchangeability within clusters.
Our paper is organized as follows. Section 2 presents the details of the residual randomization method, and the key technical result for validity. In Section 3, we discuss simple error structures, and in Section 4 we consider clustered structures, including twoway clustering. For each particular error structure we derive theoretical conditions of asymptotic validity. Section 5 presents extensive empirical evaluations that show the benefits of the proposed randomization method, including a construction of an exact test for an instance of the famous Behrens–Fisher problem in Section 5.2.2. Finally, in Section 6 we discuss residual randomization with highdimensional regression models and autocorrelated errors, showing notable empirical success in both cases.
2 The residual randomization method
We begin with the regression model of Equation (1), written compactly as
$$y=X\beta +\epsilon ,$$  (2) 
where $y,\epsilon $ are $n$dimensional column vectors, $X$ is the $n\times p$ matrix of covariates, and $\beta $ is the $p$dimensional vector of parameters; here, $p$ is fixed and $n>p$. We focus on testing linear hypotheses on the regression parameters, $\beta $, of the following form:
$${H}_{0}:{a}^{\prime}\beta ={a}_{0},$$  (3) 
where $a=({a}_{1},\mathrm{\dots},{a}_{p})\in {\mathbb{R}}^{p}$ and ${a}_{0}\in \mathbb{R}$ are fixed and known. These hypotheses include the individual significance tests, ${H}_{0,j}:{\beta}_{j}=0$, for $j=1,\mathrm{\dots},p$. The inversion of these tests can produce confidence intervals.
Our proposed method relies on two main components:

(a)
An inferential primitive, $\mathcal{G}$, as an algebraic group of ${\mathbb{R}}^{n}\to {\mathbb{R}}^{n}$ linear maps.

(b)
An invariant, ${t}_{n}:{\mathbb{R}}^{n}\to \mathbb{R}$, as a measurable function, such that
$${t}_{n}(\epsilon )\stackrel{d}{=}{t}_{n}(\U0001d690\epsilon ),\text{for all}\U0001d690\in \mathcal{G},\text{and any finite}n.$$ (4)
The idea is that if there exists a test statistic ${T}_{n}$, such that ${T}_{n}={t}_{n}(\epsilon )$ under ${H}_{0}$, then the randomization test that compares ${T}_{n}$ with the values $\{{t}_{n}(\U0001d690\epsilon ):\U0001d690\in \mathcal{G}\}$ is exact, i.e., valid for any finite $n$. In the context of linear regression, a natural approach is to define the test statistic as
$${T}_{n}=\sqrt{n}({a}^{\prime}\widehat{\beta}{a}_{0}),$$  (5) 
where $\widehat{\beta}$ is the standard OLS estimate of $\beta $, and the invariant as ${t}_{n}(u)=\sqrt{n}{a}^{\prime}{({X}^{\top}X)}^{1}{X}^{\top}u$. Then, by standard OLS theory, ${T}_{n}={t}_{n}(\epsilon )$ under ${H}_{0}$, as required. The corresponding pvalue,
$$\mathrm{pval}=P\{{t}_{n}(G\epsilon )\ge t\mid {T}_{n}=t\},G\sim \mathrm{Unif}(\mathcal{G}),$$ 
is exact (lehmann2006testing, Theorem 15.2.3).
In practice, this test is infeasible because the errors are unknown, and so we rely on estimated residuals. The approach is then to calculate the constrained OLS estimator $\widehat{{\beta}^{\mathrm{o}}}$, such that ${a}^{\prime}\widehat{{\beta}^{\mathrm{o}}}={a}_{0}$, and then obtain the restricted residuals, $\widehat{{\epsilon}^{\mathrm{o}}}=yX\widehat{{\beta}^{\mathrm{o}}}$. The approximate pvalue,
$$\widehat{\mathrm{pval}}=P\{{t}_{n}({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}})\ge t\mid {T}_{n}=t\},$$  (6) 
where ${G}_{r}\sim \mathrm{Unif}(\mathcal{G})$, $r=1,\mathrm{\dots},R$, with $R$ fixed, defines a feasible randomization test that is also valid asymptotically under certain conditions, which we formalize in Theorem 1.
To simplify, we choose $\mathcal{G}$ as a group of realvalued $n\times n$ matrices. We also satisfy Equation (4), for any ${t}_{n}$, by the following stronger invariance assumption:
$$\epsilon \stackrel{d}{=}\U0001d690\epsilon ,\text{for all}\U0001d690\in \mathcal{G},\text{and any finite}n.$$  (7) 
These simplifications add useful conceptual clarity. For example, when $\mathcal{G}$ is the set of all $n\times n$ permutation matrices, Equation (7) is equivalent to exchangeability of errors; or when $\mathcal{G}$ denotes signed diagonal matrices, the assumption is equivalent to error symmetry around zero, also known as “orthant symmetry” (efron1969student). Moreover, finitesample invariance in Equation (7) may be necessary even in asymptotic settings, notably to show asymptotic validity of cluster sign tests when the number of clusters is finite but their sizes grow (see Section 4). Finally, under a suitable construction, finitesample invariance can produce tests that are feasible and valid even in finite samples; see Section 2.3 for details.
2.1 Concrete procedure
For the sake of concreteness, we define here the basic procedure for the residual randomization test. All subsequent tests in this paper are derivatives of this procedure for various choices of $\mathcal{G}$.

1.
Given $y,X,$ calculate the restricted OLS estimate, $\widehat{{\beta}^{\mathrm{o}}}$, such that ${a}^{\prime}\widehat{{\beta}^{\mathrm{o}}}={a}_{0}$, and then calculate the corresponding constrained residuals, $\widehat{{\epsilon}^{\mathrm{o}}}=yX\widehat{{\beta}^{\mathrm{o}}}$.

2.
Define the test statistic, ${T}_{n}=\sqrt{n}({a}^{\prime}\widehat{\beta}{a}_{0})$, and let ${T}_{n}=t$ be the observed value.

3.
Repeat $r=1,\mathrm{\dots},R$, with $R$ fixed:
Sample ${G}_{r}\sim \mathrm{Unif}(\mathcal{G})$, and store ${t}_{n}({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}})$, where ${t}_{n}(u)=\sqrt{n}{a}^{\prime}{({X}^{\top}X)}^{1}{X}^{\top}u$. 
4.
Calculate the onesided pvalue as in Equation (6), or the twosided pvalue:
$$\widehat{{\mathrm{pval}}_{2}}=\mathrm{min}(P\{{t}_{n}({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}})\ge t\mid {T}_{n}=t\},P\{{t}_{n}({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}})\le t\mid {T}_{n}=t\}).$$ (8)
At target level $\alpha \in (0,1)$, the test decision may be described by the function
$$\varphi (y;X)=\mathbb{I}\{\widehat{{\mathrm{pval}}_{2}}\le \alpha /2\}.$$  (9) 
There are two main differences between the residual randomization procedure defined above and residual bootstrap. First, in the randomization procedure we can use any inferential primitive $\mathcal{G}$ beyond just permutations. This is especially useful with complex error structures, such as clustered errors, and can also be useful for sensitivity analysis (see Section 5.1). Second, the residual randomization procedure does not rely on any form of parametric assumptions to assess significance because it calculates the reference distribution empirically through random transformations of the residuals (Step 3). In contrast, bootstrap errors typically rely on asymptotic normality.
Remark 2.1. As an intuition for why this procedure is valid, consider the idealized residual randomization test, which essentially conditions on event $A=\{\epsilon \in R(\epsilon ;\mathcal{G})\}$ with $R(u;\mathcal{G})=\{\U0001d690u:\U0001d690\in \mathcal{G}\}$. Because $\mathcal{G}$ is an algebraic group, $R(\epsilon ;\mathcal{G})$ is the orbit of $\epsilon $ in the error sample space. Orbits are special grouptheoretic objects with the key property that they define a partition of their domain set. Thus, Equation (7) implies that ${t}_{n}(\epsilon )\mid A\stackrel{d}{=}{t}_{n}(G\epsilon )\mid \epsilon ,$ where $G\sim \mathrm{Unif}(\mathcal{G})$. This guarantees validity of the idealized randomization test (lehmann2006testing, Chapter 15), and suggests asymptotic validity of the feasible residual test. Such inference based on algebraic orbits is reminiscent of the “structure of inference” (fraser68).
Remark 2.2. A confidence interval for ${\beta}_{j}$ is possible through test inversion (rosenbaum2002observational, Section 2.6.2). Specifically, define $a$ so that ${a}_{j}=1$, and is zero otherwise. Then, for a sequence of ${a}_{0}$ values, calculate the pvalue of the test, say, $\mathrm{pval}({a}_{0})$. Given ${A}_{j}=\{{a}_{0}:\mathrm{pval}({a}_{0})\ge \alpha /2\}$, we report $[\mathrm{min}({A}_{j}),\mathrm{max}({A}_{j})]$ as the $100(1\alpha )\%$ confidence interval for ${\beta}_{j}$.
Remark 2.3. In some tests, the values of ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}})$ in Step 3 may often repeat. Then, a correction is needed for the test to achieve the target test level. See Appendix D.1 for details.
2.2 Asymptotic validity
Here, we develop theory on the asymptotic validity of the residual randomization test in Section 2.1. We start with a general theorem, and then specialize in the following sections.
Assumption 1.
For the linear regression model of Equation (2) suppose that:

(a)
$X$ is nonstochastic; ${X}^{\top}X/n$ is bounded and invertible;

(b)
${X}^{\top}X/n\to S$, where $S$ is positive definite; and

(c)
there exists random variable $\mathcal{Z}\in {\mathbb{R}}^{p}$, such that $\frac{1}{\sqrt{n}}{X}^{\top}\epsilon \stackrel{d}{\to}\mathcal{Z}.$
Assumptions 1(a) and (b) are common regularity conditions on $X$ to guarantee that the estimators, $\widehat{\beta}$ and $\widehat{{\beta}^{\mathrm{o}}}$, are wellbehaved. Assumption 1(c) requires sufficient regularity in $X$ so that the test statistic has an asymptotic distribution law. This is only an existence assumption, which is relatively weak because $\mathcal{Z}$ does not have to be normal necessarily, and may even be unknown.
Theorem 1.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2) with inferential primitive $\mathrm{G}$. Let $G\mathrm{\sim}\mathrm{Unif}\mathit{}\mathrm{(}\mathrm{G}\mathrm{)}$, and suppose that
$${({X}^{\top}X)}^{1}({X}^{\top}GX)\stackrel{p}{\to}bI,b\ge 0.$$  (10) 
Then, the residual randomization procedure of Section 2.1 is asymptotically valid; that is,
$$lim\underset{n\to \mathrm{\infty}}{sup}\mathrm{E}(\varphi (y;X))\le \alpha .$$ 
The proof of Theorem 1 is straightforward. The idea is to show that ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}})$ of the approximate test is asymptotically close to ${t}_{n}(G\epsilon )$ of the idealized test. The difference between these two values is exactly equal to $\sqrt{n}{a}^{\prime}{({X}^{\top}X)}^{1}{X}^{\top}GX(\widehat{{\beta}^{\mathrm{o}}}\beta )$. Under the similarity condition of Equation (10), this quantity is proportional to $\sqrt{n}{a}^{\prime}(\widehat{{\beta}^{\mathrm{o}}}\beta )$, which is zero under the null hypothesis.
Remark 2.4. The Gram matrix ${X}^{\top}X$ captures the information structure of the problem, also known as Fisher information. With that in mind, the condition in Equation (10) requires, intuitively, that the choice of inferential primitive, $\mathcal{G}$, does not alter substantially the information structure of the problem. We clarify here that $\mathcal{G}$ grows with $n$, but this dependence is left implicit to simplify notation; e.g., it is implicit in the limit of Equation (10).
Remark 2.5. Theorem 1 does not require finitesample invariance, as defined in Equation (7), but actually a weaker assumption of asymptotic invariance:
$$({t}_{n}(\U0001d690\epsilon ),{t}_{n}(\epsilon ))\stackrel{d}{\to}(T,T),\text{for all}\U0001d690\in \mathcal{G},\text{and some law}T.$$  (11) 
However, we use the finitesample invariance in Equation (7) as our main working assumption in this paper due to its conceptual clarity; see also the discussion around Equation (7) for more detailed arguments.
Remark 2.6. Theorem 1 describes only sufficient conditions for asymptotic validity. It is therefore possible that some conditions do not hold but the test is still valid asympotically—we discuss one case in Section 3.1. It can also happen that the test is valid but powerless. For example, the residual randomization test for the intercept model, ${y}_{i}={\beta}_{0}+{\epsilon}_{i}$, has zero power under only exchangeability of ${\epsilon}_{i}$, since here the OLS estimator (i.e., the sample mean, $\overline{y}$) is invariant to permutations.
2.3 Finite sample validity
Here, we discuss a construction of feasible and exact randomization tests as a complement to the asymptotic result of Theorem 1. The idea is to cluster some of the datapoints in a way such that the similarity condition in Equation (10) holds in the sample. This could mean an artificial clustering of the data done by the analyst, as opposed to some natural clustering, e.g., by state or school.
We need the following notation. A clustering, ${\mathrm{C}}^{n}$, denotes a partition of datapoints $\{1,\mathrm{\dots},n\}$, and is indexed by $c$. Let $[c]\subseteq \{1,\mathrm{\dots},n\}$ denote the set of datapoints in the $c$th cluster. Let us also define cluster membership indicators, namely, ${\mathrm{\U0001d7d9}}_{c}={(\mathbb{I}\{1\in [c]\},\mathrm{\dots},\mathbb{I}\{n\in [c]\})}^{\prime}$ and ${\mathbb{D}}_{c}={\mathrm{\U0001d7d9}}_{c}{\mathrm{\U0001d7d9}}_{c}^{\prime}$; thus, ${\mathrm{\U0001d7d9}}_{c}$ is the $n$dimensional column vector with elements that are one only for the members of $[c]$, and are zero otherwise; ${\mathbb{D}}_{c}$ is the corresponding diagonal $n\times n$ matrix.
Theorem 2.
For fixed $n$, consider a clustering ${\mathrm{C}}^{n}$ with $$. Suppose that for every cluster $c$ there exists ${b}_{c}\mathrm{\in}\mathrm{R}$ such that ${X}_{c}^{\mathrm{\top}}\mathit{}{X}_{c}\mathrm{=}{b}_{c}\mathit{}{X}^{\mathrm{\top}}\mathit{}X$, where ${X}_{c}\mathrm{=}{\mathrm{D}}_{c}\mathit{}X$ are the cluster covariates. For some inferential primitive $\mathrm{G}$, suppose that every member transform, $\mathrm{g}\mathrm{\in}\mathrm{G}$, can be decomposed as $\mathrm{g}\mathrm{=}{\mathrm{\sum}}_{c\mathrm{=}\mathrm{1}}^{J}{s}_{c}\mathit{}{\mathrm{D}}_{c}$, where ${s}_{c}\mathrm{\in}\mathrm{R}$. Then, the residual randomization test with $\mathrm{G}$ as the primitive is exact in finite samples, i.e., $\mathrm{E}\mathit{}\mathrm{(}\varphi \mathit{}\mathrm{(}y\mathrm{,}X\mathrm{)}\mathrm{)}\mathrm{=}\alpha $.
Intuitively, Theorem 2 suggests the following construction of exact residual randomization tests. First, split the data in $J$ clusters, such that each matrix ${X}^{\top}X$ within each cluster is similar to the population matrix. Second, define transforms jointly on the cluster level; e.g., flipping the signs of cluster residuals. If these transforms form an algebraic group and reflect the true error invariance, then the resulting test is exact. We illustrate this idea in an instance of the BehrensFisher problem in Section 5.2.2, where we split a binary covariate in clusters, and then apply a cluster sign test. The resulting test is exact even though there are only 30 data points.
3 Simple error structures
In this section, we begin with simple error structures. For each structure we specify an appropriate inferential primitive $\mathcal{G}$, and then derive conditions for asymptotic validity of residual randomization.
3.1 Exchangeable errors
The simplest error structure is when ${\epsilon}_{i}$ are exchangeable, so that $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, for any $n\times n$ permutation matrix $\U0001d690$. We may write $\U0001d690={\sum}_{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}$, where ${\mathrm{\U0001d7d9}}_{i}$ is the binary $n$dimensional column vector that is one only at the $i$th element, and $\pi $ is a permutation of $n$ elements, i.e., a bijection of $\{1,\mathrm{\dots},n\}$ on itself. Let ${\mathrm{\Pi}}_{n}$ be the space of all such permutations. Then, the inferential primitive is
$${\mathcal{G}}^{\U0001d5c9}=\{\sum _{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}:\pi \in {\mathrm{\Pi}}_{n}\}.$$  (12) 
Operationally, the residual randomization test of Section 2.1 using ${\mathcal{G}}^{\U0001d5c9}$ as the primitive repeatedly calculates the test statistic, ${t}_{n}(\cdot )$, on permutations of the constrained residuals, $\widehat{{\epsilon}^{\mathrm{o}}}$.
Theorem 3.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}$. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}$ as the primitive is asymptotically valid.
Notably, the condition of Theorem 1 does not hold here because there is bias between ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}})$ and ${t}_{n}(G\epsilon )$, the test statistics of the feasible and idealized test, respectively. Intuitively, this bias exists because a permutation matrix $\U0001d690$ does not completely decorrelate the columns of $X$. However, the theorem shows that this bias is negligible in the limit.
3.2 Heteroskedastic errors
We continue with independent and heteroskedastic errors, where $\mathrm{E}({\epsilon}_{i}^{2}\mid {x}_{i})={\sigma}_{i}^{2}$. In this setting, the seminal papers of white1980heteroskedasticity; eicker1963asymptotic have established asymptotically correct inference for $\beta $ through heteroskedasticityconsistent variance estimation. However, in small samples these methods can be severely biased (chesher1987bias), especially when the design is highly leveraged; mackinnon1985some; davidson2008wild; godfrey2001significance provide additional theory and simulations illustrating the problem.
For residual randomization, we can assume that the errors are symmetric around zero, and thus their distribution to be invariant to random sign flips. Formally, $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, where $\U0001d690={\sum}_{i=1}^{n}{s}_{i}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}$ and $s=({s}_{i})\in {\{1,+1\}}^{n}$ is a vector of signs. The inferential primitive is defined as
$${\mathcal{G}}^{\U0001d5cc}=\{\sum _{i=1}^{n}{s}_{i}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}:s\in {\{1,+1\}}^{n}\}.$$  (13) 
Operationally, the residual randomization test using ${G}^{\U0001d5cc}$ as the primitive repeatedly calculates the test statistic, ${t}_{n}(\cdot )$, on constrained residuals with their signs randomly flipped.
Theorem 4.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{s}}$. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{s}}$ as the primitive is asymptotically valid.
The residual randomization test with ${G}^{\U0001d5cc}$ as the inferential primitive is very similar to the “wild bootstrap” (wu1986jackknife). This bootstrap variant resamples data sets as ${y}_{i}^{\ast}={x}_{i}^{\prime}\widehat{\beta}+{v}_{i}f({\widehat{\epsilon}}_{i})$, where ${v}_{i}$ is a zeromean random variable, and $f$ is a transform of the residuals, usually the identity $f({\widehat{\epsilon}}_{i})={\widehat{\epsilon}}_{i}$. mammen1993bootstrap proposed sampling ${v}_{i}$ as $(\sqrt{5}1)/2$ or $(\sqrt{5}+1)/2$, such that $E({v}_{i})=0,\mathrm{E}({v}_{i}^{2})=1$, and $\mathrm{E}({v}_{i}^{3})=1$; these moment conditions lead to fast convergence of the bootstrap according to Edgeworth expansions; see also (liu1988bootstrap).
Remark 3.1. Recent work has pointed out that the aforementioned analytical approximations of the wild bootstrap are untrustworthy, and suggest using random signs for ${v}_{i}$, also known as the Rademacher distribution (davidson2008wild; flachaire2005bootstrapping; horowitz2001bootstrap). The residual randomization approach sides with the latter choice. Random signs express symmetry around zero, which is more natural than the symmetries expressed by other choices.
Remark 3.2. The proof of Theorem 4 does not require that the errors are symmetric around zero in finite samples, but only that the test statistic has such symmetry asymptotically in the sense of Equation (11). Theorem 4 can therefore be more general. We choose to present the narrower formulation here because the finitesample condition of sign symmetry is conceptually clearer than the asymptotic condition; see also Section 2 and Remark 2.6 for additional discussion.
4 Clustered error structures
Here, we consider problems with a natural clustering of the data, for example, a clustering by city or state, school or classroom. moulton1986random and arellano1987practitioners provide classical results on clusterrobust variance estimation, showing that regular OLS may be significantly biased when the cluster structure is ignored. Asymptotically correct inference is possible when the number of clusters grows (white1984asymptotic; hansen2007asymptotic; carter2017asymptotic), or when the number is fixed (ibragimov2010t; ibragimov2016inference; bester2011inference; conley2011inference). However, all these parametric methods rely on asymptotics that can be biased in practice, especially when the number of clusters is small (donald2007inference; imbens2016robust; cameron2008bootstrap; cameron2015practitioner); see also angrist2008mostly and (abadie2017should) on practical discussions about the need for cluster error adjustments.
Such small sample issues can be alleviated by “cluster wild bootstrap” (cameron2008bootstrap). However, the standard bootstrap asymptotics don’t work with a finite number of clusters. In current work, canay2018wild analyze the cluster wild bootstrap in a randomization framework, and prove its asymptotic validity when the number of clusters is finite but the individual cluster sizes grow. Their result requires that ${X}^{\top}X$ is homogeneous across clusters, and similar to the population ${X}^{\top}X$. In this section, we extend this result in various ways, and show, for instance, that such homogeneity is not necessary when errors are exchangeable within clusters. We summarize all our results in the following section, and then proceed to individual cases.
4.1 Overview of results
To proceed we need some additional notation. Let ${\mathrm{C}}^{1},\mathrm{\dots},{\mathrm{C}}^{n}$ denote a sequence of clusterings of $\{1\},\mathrm{\dots},\{1,\mathrm{\dots},n\}$, respectively. Let ${J}_{n}={\mathrm{C}}^{n}$ and let $i\in [c]$ denote that datapoint $i$ is in $c$th cluster. The clusterings may grow infinitely but they exhibit a nested structure:
$$i\in [c]\text{under}{\mathrm{C}}^{n}\Rightarrow i\in [c]\text{under all}{\mathrm{C}}^{m},m\ge n,$$  (14) 
so that a datapoint stays within its assigned cluster. This assumption is without practical loss of generality, while it allows us to posit withincluster limits (e.g., means) as $n$ grows.
In this paper, we consider three types of cluster invariances: (1) exchangeability within clusters; (2) sign symmetry across clusters; and (3) double invariance, where both symmetries hold. In all cases we assume, as usual, that the errors are independent across clusters. Our technical results on validity of residual randomization for all such invariances are summarized in Table 1.
Cluster size, ${J}_{n}={\mathrm{C}}^{n}$  

inferential primitive  $$  ${J}_{n}\to \mathrm{\infty}$ 
exchangeability within clusters (${\mathcal{G}}^{\U0001d5c9}$)  ${X}^{\top}\mathrm{\U0001d7d9}/n=0$, or $\mathrm{HA}$  ${X}^{\top}\mathrm{\U0001d7d9}/n=0$, or $\mathrm{HA}$ 
sign symmetry across clusters (${\mathcal{G}}^{\U0001d5cc}$)  $\mathrm{HA}$  $\mathrm{HA}$, or ${\sum}_{c=1}^{{J}_{n}}{n}_{c}^{2}/{n}^{2}\to 0.$ 
double invariance (${\mathcal{G}}^{\U0001d5c9+\U0001d5cc}$)  ${X}^{\top}\mathrm{\U0001d7d9}/n=0$, or $\mathrm{HA}$  ${X}^{\top}\mathrm{\U0001d7d9}/n=0$, or $\mathrm{HA}$, or ${\sum}_{c=1}^{{J}_{n}}{n}_{c}^{2}/{n}^{2}\to 0.$ 
The assumption of cluster homogeneity, denoted by “HA” in Table 1, stands out. For example, when the number of clusters is fixed $$, the homogeneity assumption posits that, for every cluster $c=1,\mathrm{\dots},J$,
$${({X}^{\top}X)}^{1}{X}^{\top}{\mathbb{D}}_{c}X\to {b}_{c}I,{b}_{c}\ge 0,$$  (15) 
where ${\mathbb{D}}_{c}$ is the diagonal matrix of cluster membership. As mentioned before, Equation (15) implies a homogeneous covariate correlation structure across clusters in the limit, and was first identified by canay2018wild in the context of cluster wild bootstrap. Interestingly, this condition is the asymptotic equivalent of the finitesample homogeneity condition in Theorem 2, which can yield exact tests in finite samples. However, homogeneity is not necessary for validity when the errors are exchangeable within clusters (see first row of Table 1). In this case, permuting the residuals, after cluster covariates have been centered, also leads to valid inference, except for the intercept ${\beta}_{0}$ that is differenced out. On the other hand, when ${J}_{n}\to \mathrm{\infty}$ the additional condition ${\sum}_{c}{n}_{c}^{2}/{n}^{2}\to 0$ can also lead to valid inference, beyond homogeneity or centered clusters (see second column of Table 1). This requires, intuitively, that no cluster dominates the others in size.
4.2 Exchangeability within clusters
We start with the cluster variant of exchangeability in Section 3.1. Formally, let $\pi (\cdot ;{\mathrm{C}}^{n})$ be a permutation of $n$ elements within clusters of ${\mathrm{C}}^{n}$; i.e., if $i\in [c]$, for some $[c]\in {\mathrm{C}}^{n}$, then $\pi (i;{\mathrm{C}}^{n})\in [c]$. Let $\mathrm{\Pi}({\mathrm{C}}^{n})$ denote the space of all such permutations. Then, finitesample invariance implies that $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, where $\U0001d690={\sum}_{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}$, for some unique $\pi \in \mathrm{\Pi}({\mathrm{C}}^{n})$. The inferential primitive is
$${\mathcal{G}}^{\U0001d5c9}({\mathrm{C}}^{n})=\{\sum _{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}:\pi \in \mathrm{\Pi}({\mathrm{C}}^{n})\}.$$  (16) 
Operationally, the residual randomization test using ${\mathcal{G}}^{\U0001d5c9}({\mathrm{C}}^{n})$ as the primitive repeatedly calculates the test statistic, ${t}_{n}(\cdot )$, on random withincluster permutations of the residuals, $\widehat{{\epsilon}^{\mathrm{o}}}$.
Theorem 5.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid if the clusters are centered; or if the homogeneity condition of Equation (15) holds and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$.
Theorem 5 shows two conditions for asymptotic validity of a residual randomization test that permutes the residuals within clusters. First, when the covariates are centered the test is valid even when the number of clusters is fixed; however, this precludes inference on ${\beta}_{0}$. Second, when the homogeneity condition holds and all clusters grow in size, the test can be analyzed as a permutation within each cluster separately, and then validity follows from Theorem 3.
4.3 Sign symmetry across clusters
When the errors are not exchangeable within clusters, we may assume that the entire cluster errors are independent and signsymmetric, similar to Section 3.2. Then, finitesample invariance implies that $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, where $\U0001d690={\sum}_{c=1}^{{J}_{n}}{s}_{c}{\sum}_{i\in [c]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}$, and ${s}_{c}=\pm 1$. The inferential primitive is
$${\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}^{n})=\{\sum _{c=1}^{{J}_{n}}{s}_{c}\sum _{i\in [c]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}:s\in {\{1,+1\}}^{{J}_{n}}\}.$$  (17) 
Operationally, the residual randomization test using ${\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}^{n})$ as the primitive repeatedly calculates the test statistic, ${t}_{n}(\cdot )$, on cluster residuals with their signs randomly flipped.
Theorem 6.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. If ${J}_{n}\mathrm{=}J$ is fixed and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$, then the residual randomization test using ${\mathrm{G}}^{\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds. If ${J}_{n}\mathrm{\to}\mathrm{\infty}$, then the test is valid also when ${\mathrm{\sum}}_{c\mathrm{=}\mathrm{1}}^{{J}_{n}}{n}_{c}^{\mathrm{2}}\mathrm{/}{n}^{\mathrm{2}}\mathrm{\to}\mathrm{0}$.
Theorem 6 shows two conditions for validity of a residual randomization test that randomly flips the signs of cluster residuals. The homogeneity condition is required when the number of clusters is finite because ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}}){t}_{n}(G\epsilon )$ does not converge in this setting. However, when the number of clusters grows, then the condition ${\sum}_{c}{n}_{c}^{2}/{n}^{2}\to 0$ also guarantees validity of the test. This condition roughly requires that no cluster dominates the others in the limit.
Remark 4.1. The residual randomization test here is similar to the “cluster wild bootstrap” method of cameron2015practitioner. The main difference is that cluster wild bootstrap resamples the cluster residuals with replacement. canay2018wild proved the asymptotic validity of cluster wild bootstrap with finite clusters under the cluster homogeneity condition of Equation (15). In the following section, we show that homogeneity is not required when errors are also exchangeable within clusters. When exchangeability does not hold, however, the proof of Theorem 6 shows that validity is still possible if the covariates are uncorrelated. This suggests running the randomization test on decorrelated covariates (e.g., through PCA).
Remark 4.2. canay2018wild also pointed out that asymptotic validity with a fixed number of clusters requires a finitesample invariance assumption, i.e., the symmetry of errors around zero in finite samples. This is not surprising since we cannot estimate cluster variance consistently if the number of clusters is fixed. See also Remarks 2.6 and 3.2 regarding our discussion thread on such contrast between finitesample and asymptotic invariance.
4.4 Double invariance
Here, we make both invariance assumptions from the two previous sections, namely exchangeability within clusters and sign symmetry across clusters. Formally, $\epsilon \stackrel{d}{=}\U0001d690\epsilon $, where $\U0001d690={\sum}_{c=1}^{{J}_{n}}{s}_{c}{\sum}_{i\in [c]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}$, where ${s}_{c}$ are cluster signs and $\pi \in \mathrm{\Pi}({\mathrm{C}}^{n})$. The inferential primitive is defined as follows:
$${\mathcal{G}}^{\U0001d5c9+\U0001d5cc}({\mathrm{C}}^{n})=\{\sum _{c=1}^{{J}_{n}}{s}_{c}\sum _{i\in [c]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}:s\in {\{1,+1\}}^{{J}_{n}},\pi \in \mathrm{\Pi}({\mathrm{C}}^{n})\}.$$  (18) 
Operationally, the residual randomization test of Section 2.1 using ${\mathcal{G}}^{\U0001d5c9+\U0001d5cc}$ as the primitive repeatedly calculates the test statistic, ${t}_{n}(\cdot )$, on permuted cluster residuals with randomly flipped signs.
Theorem 7.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}\mathrm{+}\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. If ${J}_{n}\mathrm{=}J$ is fixed and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$, then the residual randomization test using ${\mathrm{G}}^{\mathrm{p}\mathrm{+}\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds, or when the clusters are centered. If ${J}_{n}\mathrm{\to}\mathrm{\infty}$, then the test is valid also when ${\mathrm{\sum}}_{c}{n}_{c}^{\mathrm{2}}\mathrm{/}{n}^{\mathrm{2}}\mathrm{\to}\mathrm{0}$.
The idea behind the double invariant, ${\mathcal{G}}^{\U0001d5cc+\U0001d5c9}({\mathrm{C}}^{n})$, is to combine the strengths of both cluster permutations and random signs. This procedure should be used when there is no concern of heteroskedasticity within clusters, because it will generally lead to more powerful tests and sharper inference compared to the the simple cluster sign test (or the wild bootstrap). We illustrate this point in the experiments of Section 5.2.
4.5 Twoway clustering
In many problems, the data could be clustered in two ways. For example, with stateschool data one cluster could be the state and another could be the school; or in panel data, the clusters could be state and time. Here, we discuss a residual randomization test for such error structure, and prove conditions for validity. These technical conditions are more subtle than in oneway clustering because the data size may grow in either one or both cluster dimensions.
In related work, cameron2011robust proposed estimation based on the inclusionexclusion principle, where standard robust error estimates are used for every cluster individually and in combinations. aronow2015; fafchamps2007formation; conley1999gmm study sandwich estimators. However, all these estimators have the typical smallsample problems, including a reliance on normal asymptotics. Additionally, they may be illdefined (e.g., yielding nonpositive definite estimates), and can be computationally cumbersome.
Our approach starts by visualizing an arrangement of the errors in a twodimensional array. Data fall in cells that correspond to a “row clustering” and “column clustering”, namely ${\mathrm{R}}^{n}$ and ${\mathrm{C}}^{n}$, indexed by $r$ and $c$, respectively. Without loss of generality we assume that the clusters have the same size, so that $r,c=1,\mathrm{\dots},{J}_{n}$. For datapoint $i$, the value $r(i)\in \{1,\mathrm{\dots},{J}_{n}\}$ denotes its assigned row cluster, and $c(i)\in \{1,\mathrm{\dots},{J}_{n}\}$ its column cluster. The error structure can be represented without loss of generality as follows. Let $i$ be the $k$th observation in row cluster $r$ and column cluster $c$, then we can write
$${\epsilon}_{i}={\u03f5}_{r}+{\u03f5}_{c}+{\u03f5}_{rc(k)},\text{where}r(i)=r,c(i)=c,k\text{th replication in cell.}$$  (19) 
The common row and column terms induce the twoway correlation in the errors. We will assume, as usual, that the error ${\u03f5}_{rc(k)}$ is pure noise, so that it obeys any desirable invariance. Inference therefore depends on the assumptions we put on ${\u03f5}_{r},{\u03f5}_{c}$.
Suppose that ${\u03f5}_{r},{\u03f5}_{c}$ are individually exchangeable and there is only one replication per cell. Then, the errors have the simple form ${\epsilon}_{i}={\u03f5}_{r}+{\u03f5}_{c}+{\u03f5}_{rc}$, and are thus invariant to entire row and column permutations. Specifically, define $\mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$ as the following group of permutations:
$$\mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})=\{\pi \in {\mathrm{\Pi}}_{n}:r(i)=r(j)\Rightarrow r(\pi (i))=r(\pi (j)),\text{and}c(i)=c(j)\Rightarrow c(\pi (i))=c(\pi (j)).\}$$ 
That is, $\mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$ is the set of permutations for which if $i$ and $j$ are in the same row (or column), then both $i$ and $j$ are mapped to the same row (or column) under any $\pi \in \mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$. Moreover, $\mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$ is a subgroup of ${\mathrm{\Pi}}_{n}$. The inferential primitive can thus be defined as follows:
$${\mathcal{G}}^{\U0001d5c9}({\mathrm{R}}^{n},{\mathrm{C}}^{n})=\{\sum _{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi (i)}^{\prime}:\pi \in \mathrm{\Pi}({\mathrm{R}}^{n},{\mathrm{C}}^{n})\}.$$  (20) 
Operationally, the residual randomization test of Section 2.1 using ${\mathcal{G}}^{\U0001d5c9}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$ as the primitive essentially permutes the residuals row and columnwise.
Theorem 8.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters with respect to both ${\mathrm{R}}^{n}$ and ${\mathrm{C}}^{n}$. When the number of clusters of both ${\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}$ grows indefinitely, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid. When only the row (or only the column) cluster grows indefinitely, the same randomization test is valid if the covariates have been centered columnwise (or rowwise).
Notably, the conditions for validity in Theorem 8 are similar to Theorem 3 on unclustered exchangeability, and are actually weaker than Theorem 5 on oneway clustered exchangeability. Intuitively, in oneway clustering the randomization bias depends on the cluster covariate means, which requires additional assumptions, such as cluster homogeneity. In twoway clustering the randomization bias depends on the global means, and so validity follows exactly as in Theorem 3.
Remark 4.3. The permutation test described above can be extended to multiway clustered errors. As in the twoway clustering case, this would require an assumption of additivity of cluster errors and exchangeability within clusters.
Remark 4.4. A related test known as “quadratic assignment procedure”(QAP) (hubert1986assignment; hubert1976quadratic; krackhardt1988predicting) is used in network analysis to test independence between network data (christakis2013social). In a dyadic regression model, for instance, QAP tests permute the rows and vectors of the dependent variable, $y$, and refit the model. As such, a QAP test acts as an Ftest, but sometimes is misused to assess significance of individual parameters. More appropriate for significance is the residual randomization test based on ${\mathcal{G}}^{\U0001d5c9}({\mathrm{R}}^{n},{\mathrm{C}}^{n})$, which is essentially the residual version of the QAP test. dekker2007sensitivity have made a similar point based on their adaptation of the residual permutation method of freedman1983nonstochastic.
Remark 4.5. If there are replications within cells, and exchangeability still holds, then we could simply average data $y,X$ within cells, and then perform the randomization test as above. Such test operates on the errors ${\epsilon}_{i}^{\prime}={\u03f5}_{r}+{\u03f5}_{c}+{\overline{\u03f5}}_{rc}$, which are still invariant to row and column permutations. The averaging may preclude inference for parameters that are differenced out.
Remark 4.6. Inference with heteroskedasticity in twoway clustered errors is more challenging. It may be tempting to switch the signs of the entire row or column residuals, but this would be wrong because changing one row affects all columns, and changing one column affects all rows. One approach is to follow a decomposition similar to ANOVA, where we difference out row and column averages. Such test operates on the transformed errors ${\epsilon}_{i}^{\prime}={\epsilon}_{i}{\overline{\epsilon}}_{r(i)}{\overline{\epsilon}}_{c(i)}+\overline{\epsilon}={\u03f5}_{rc}{\overline{\u03f5}}_{r(i)}{\overline{\u03f5}}_{c(i)}+\overline{\u03f5}$. If the number of rows (and columns) grows, then the averages converge to zero. The errors ${\epsilon}_{i}^{\prime}$ could then be treated as approximately satisfying sign symmetry. Theorem 1 could be extended to include such approximate symmetries, but we leave this for future work.
5 Applications and experiments
In this section, we perform an extensive evaluation of the residual randomization tests introduced earlier. Our comparisons include both the bootstrap and its variants, and also standard asymptotic methods from the robust standard error literature. Exact algorithmic details on the methods used here can be found in Appendix D.
5.1 Hormone data
In the hormone data of efron1986bootstrap, there are 27 medical devices, each producing a data point $({y}_{i},{x}_{i})$; here, ${x}_{i}$ is number of hours the device is worn by a patient, and ${y}_{i}$ is remaining amount of hormone on the device. The model is simple linear regression, ${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{i}+{\epsilon}_{i}$, where ${\epsilon}_{i}$ are zeromean, independent, and possibly not normal errors. efron1986bootstrap illustrated bootstrap in estimating standard errors for ${\beta}_{1}$. Here, we use the residual randomization method to test ${H}_{0}:{\beta}_{1}=0$, and then invert the test to get a confidence interval. Thus, we set $a={(0,1)}^{\prime}$ and ${a}_{0}=0$ in the null hypothesis of Equation (3).
We start by assuming only exchangeability of errors, and so we use the inferential primitive ${\mathcal{G}}^{\U0001d5c9}$, defined in Equation (12). In this simple setting, the randomization procedure is equivalent to regressing permutations of the constrained residuals, $\widehat{{\epsilon}_{i}^{\mathrm{o}}}={y}_{i}\overline{y}$, with ${x}_{i}$, and then comparing the estimated slopes of these regressions with the standard OLS estimate ${\widehat{\beta}}_{1}$. We calculate the pvalues based on $R=2000$ iterations of the randomization procedure.
The null hypothesis, ${H}_{0}:{\beta}_{1}=0$, is strongly rejected (pvalue $\approx 0$) with the residual randomization test. To produce a confidence interval we follow the process outlined in Remark 2.3. Specifically, we consider a sequence of null hypotheses, ${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$, for various values of ${\beta}_{1}^{0}$. Using the output of regular OLS, we may focus the range to ${\beta}_{1}^{0}\in [0.1,0.03]$. The (twosided) pvalues for the entire sequence of tests are shown in Figure 1. From the plot, we can extract the values of ${\beta}_{1}^{0}$ that lead to a pvalue larger than $0.025$, and determine the 95% confidence interval. In the figure, the interval is marked by the two vertical lines, and is roughly equal to $[0.0668,0.0478]$.
We can extend this analysis to include heteroskedastic errors. For example, we can repeat the test inversion described above, this time using random signs as the inferential primitive; see Equation (13). Moreover, the data can be clustered according to the manufacturer of the device; in the data there are 3 manufacturers, each providing 9 data points. We can therefore do inference using withincluster permutations or random signs, as described in Section 4. The results of this analysis are shown in Table 2, juxtaposed with results from regular OLS and bootstrap (efron1986bootstrap, Table 9.2). We see that the standard errors from all methods roughly agree. The exception is the results from cluster random signs, which cannot produce significance at the 5% level since there are only 3 clusters. This would be concerning if we were not willing to assume exchangeability of errors within clusters.
In summary, the numerical example of this section illustrates that the residual randomization method provides nonparametric estimation of the standard error in a straightforward manner like bootstrap. Additionally, the example shows that the randomization method can be useful for sensitivity analysis simply by doing inference under different invariance assumptions. In doing so, the randomization method is more versatile than bootstrap since one single procedure (presented in Section 2.1) underlies all of the aforementioned tests, whereas a bootstrap approach would require the use of an appropriate variant each time.
inference method  midpoint estimate  s.e.  95% interval  

OLS  $0.0574$  $.0045$  $(0.0665,0.0482)$.  
bootstrap  $0.0574$  $.0043$  $(0.0660,0.0488)$  

permutations  $0.0573$  $.0048$  $(0.0668,0.0477)$  
random signs  $0.0595$  $.0045$  $(0.0686,0.0504)$  

$0.0609$  $.0043$  $(0.0695,0.0522)$  

$\ast $  $\ast $  $\ast $  

$0.0582$  $0.0050$  $(0.0682,0.0482)$ 
5.2 Clustered errors
Here, we illustrate the theory and methods of Section 4 on clustered errors. There are two main takeaways. First, using richer inferential primitives can yield substantially more powerful tests (Section 5.2.1). Second, exact tests in finite samples are possible by appropriately adapting to the data structure. We illustrate the idea in the famous Behrens–Fisher problem (Section 5.2.2).
5.2.1 Simulated study
Here, we use a familiar simulation setup (cameron2008bootstrap; bruce2018; canay2018wild) with oneway clustering, and errors of the form
$${\epsilon}_{i}={\u03f5}_{c}+{\u03f5}_{ic},\text{where}{\u03f5}_{ic}\sim N(0,1),i\in [c].$$  (21) 
The random effects term, ${\u03f5}_{c}$, can induce withincluster correlation. We consider two cases, ${\u03f5}_{c}=0$ and ${\u03f5}_{c}\sim N(0,1)$. All errors ${\u03f5}_{c},{\u03f5}_{ic}$ are independent. The data are simulated as ${y}_{i}={\beta}_{0}+{x}_{i}\beta +{\epsilon}_{i}$, with ${x}_{i}$ a scalar. Following cameron2008bootstrap, ${x}_{i}$ can also have a clustered structure:
$${x}_{i}={\mathrm{x}}_{c}+{\mathrm{x}}_{ic},\text{where}{\mathrm{x}}_{ic}\sim N(0,1),i\in [c].$$  (22) 
For the cluster $x$term we consider two cases: (1) ${\mathrm{x}}_{c}\sim N(0,1)$, and (2) ${\mathrm{x}}_{c}\sim .5LN(0,1)$, the lognormal distribution. The second choice is interesting because lognormal covariates produce a regression model with high leverage points, which makes the estimation problem harder. We consider $c=1,\mathrm{\dots},J$ clusters with $J=10,15,20$; each cluster has 30 datapoints, and thus there are $i=1,\mathrm{\dots},30J$ datapoints in total. When we want to induce heteroskedasticity, we scale the errors ${\epsilon}_{i}$ in Equation (21) by $3{x}_{i}$. Throughout all simulations, we set ${\beta}_{1}=0$, and ${\beta}_{0}=0$ for the homoskedastic case, and ${\beta}_{0}=1$ for the heteroskedastic.
The results for testing the true null hypothesis, ${H}_{0}:{\beta}_{1}=0$, are shown in Table 3. The table shows rejection frequencies over 5000 replications at 5% significance level, and includes cluster robust error estimates (see Appendix D.3 for details). For each replication, every randomization test uses $R=2000$ replications internally.
First, we focus on Panel (A) with homoskedastic errors. As expected, standard OLS achieves the nominal level when there is no cluster correlation $({\u03f5}_{c}=0)$, but fails considerably when the errors are clustered. The cluster robust errors are a definite improvement but are also far from the nominal level. This is likely due to the small sample, since rejection rates do improve as $J$ grows. The residual randomization tests work well across the board: rejection rates are around 5% even when the covariates are lognormal, as shown in Columns marked by (2).
We now turn our attention to the heteroskedastic setting in Panel (B) of Table 3. Standard OLS now fails across all settings, as expected. The robust error method works roughly as well as in Panel (A). Notably, the lognormal covariates seem to affect the robust method even more in this setting. For example, the rejection rates can be around 14% with lognormal covariates, almost $3\times $ the nominal level. On the other hand, the cluster sign randomization test shows remarkable robustness. It mostly achieves the nominal rate with normal covariates. And with lognormal covariates, its worst rate is around 8% when $J=10$, but otherwise it remains close to the nominal level of 5%. All other randomization tests don’t work well. This is expected because these tests include permutations within clusters, which is invalid when there is heteroskedasticity.
Next, we discuss power results when testing the false null hypothesis, ${H}_{0}:{\beta}_{1}=0.1$, shown in Table 4. We omit the results for the heteroskedastic panel in this case, since all tests except for the sign test don’t come close to the correct level. Overall, the randomization tests achieve good power compared to robust OLS. The double randomization test is more powerful than both individual tests in the left subpanel of Table 4, where ${\u03f5}_{c}=0$. In the right subpanel, where the errors are homoskedastic and clustered, the double test remains more powerful than the sign test.
Notably, in that same subpanel, the permutation test now performs much better than the rest, even compared to the double test. For example, when $J=20$ the permutation test is roughly 6$\times $ more powerful than the other tests. This may seem to contradict our argument that more invariances generally lead to more power. The result can be explained, however. By construction of the simulation, with ${\mathrm{x}}_{c}$ and ${\mathrm{x}}_{ic}$ both zeromean and independent normals, the randomized Gram matrix, ${X}^{\top}GX$, for the permutation test is similar to the population ${X}^{\top}X$. The similarity condition in Theorem 1 holds with good precision, suggesting an almost exact test. To check this theory, we can try to set ${x}_{i}={\mathrm{x}}_{c}+{\mathrm{x}}_{ic}$, where ${\mathrm{x}}_{c}\sim N(c,1)$, and ${\mathrm{x}}_{ic}\sim N(0,{0.1}^{2})$, which produces heterogeneous clusters by emphasizing the differences between clusters (terms ${\u03f5}_{c}$) and dampening their similarity (terms ${\u03f5}_{ic}$). With this setup, and all else equal, the rejection rates are now $0.137,0.053$, and $0.141$ over 5000 replications for the sign, permutation, and double test, respectively, compared to $0.119,0.392$ and $0.146$ of Table 4. This reversal cautions that testing power depends heavily on the problem structure.
(A) Homoskedastic  

cluster error term (${\u03f5}_{c}$)  
${\u03f5}_{c}=0$ (nonclustered errors)  ${\u03f5}_{c}\sim N(0,1)$  
number of clusters ($J$)  
$J=10$  $J=15$  $J=20$  $J=10$  $J=15$  $J=20$  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.057  0.051  0.054  0.051  0.050  0.053  0.490  0.382  0.480  0.394  0.493  0.421 
HC2, robust  0.086  0.090  0.076  0.079  0.061  0.077  0.103  0.110  0.081  0.089  0.081  0.090 
Randomizations  
signs, across  0.059  0.047  0.054  0.049  0.047  0.054  0.053  0.055  0.056  0.048  0.055  0.050 
perms.,within  0.053  0.050  0.046  0.050  0.052  0.051  0.051  0.048  0.048  0.051  0.054  0.056 
double  0.061  0.054  0.056  0.052  0.051  0.056  0.055  0.052  0.054  0.046  0.051  0.050 
(B) Heteroskedastic  
${\u03f5}_{c}=0$  ${\u03f5}_{c}\sim N(0,1)$  
$J=10$  $J=15$  $J=20$  $J=10$  $J=15$  $J=20$  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.228  0.249  0.244  0.264  0.239  0.286  0.278  0.274  0.301  0.303  0.301  0.309 
cluster robust  0.095  0.140  0.085  0.116  0.074  0.114  0.100  0.126  0.091  0.124  0.075  0.121 
Randomizations  
signs, across  0.055  0.084  0.055  0.072  0.052  0.072  0.049  0.065  0.059  0.071  0.056  0.072 
perms., within  0.152  0.162  0.155  0.148  0.145  0.153  0.154  0.150  0.166  0.149  0.154  0.144 
double  0.205  0.194  0.198  0.174  0.183  0.170  0.166  0.167  0.168  0.163  0.150  0.155 
(A) Homoskedastic  

cluster error term (${\u03f5}_{c}$)  
${\u03f5}_{c}=0$ (nonclustered errors)  ${\u03f5}_{c}\sim N(0,1)$  
number of clusters ($J$)  
$J=10$  $J=15$  $J=20$  $J=10$  $J=15$  $J=20$  
OLS  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2)  (1)  (2) 
standard  0.649  0.575  0.816  0.742  0.923  0.859  0.572  0.495  0.587  0.536  0.629  0.589 
cluster robust  0.646  0.582  0.804  0.730  0.905  0.849  0.166  0.207  0.162  0.209  0.177  0.237 
Randomizations  
signs, across  0.532  0.450  0.744  0.638  0.875  0.791  0.119  0.128  0.129  0.147  0.152  0.177 
perms., within  0.390  0.391  0.533  0.537  0.665  0.666  0.392  0.380  0.544  0.558  0.670  0.666 
double  0.620  0.532  0.796  0.698  0.904  0.833  0.146  0.164  0.154  0.178  0.168  0.213 
In summary, the results of this section reveal certain interesting tradeoffs in randomization inference with clustered errors. When exchangeability is not plausible, the only option seems to be the cluster sign test. However, when exchangeability is plausible, then either the permutation test or the double test should be used. The choice depends on where most of the variation comes from, and could be aided by ANOVA. For example, if there is no substantial variation across clusters, then the permutation test should be used. If there is substantial variation both across and within clusters, then the double test is likely more appropriate.
5.2.2 Behrens–Fisher problem: an exact test
In this section, we illustrate the construction of an exact residual randomization test in the context of the famous Behrens–Fisher problem (scheffe1970practical). This problem has been used to illustrate the limitations of robust standard error methods in small samples (angrist2008mostly; imbens2016robust). We use a simulation setup similar to (angrist2008mostly, Section 8.2) and (imbens2016robust, Section II), where the data generating process is defined as follows:
$${y}_{i}={\beta}_{0}+{\beta}_{1}{d}_{i}+{\epsilon}_{i}.$$ 
Here, ${d}_{i}\in \{0,1\}$, and ${\epsilon}_{i}$ are independent with $\mathrm{E}({\epsilon}_{i})=0$ and $\mathrm{Var}({\epsilon}_{i})={d}_{i}{\sigma}_{1}^{2}+(1{d}_{i}){\sigma}_{0}^{2}$, with both variances unknown.
Following the treatment effects terminology of angrist2008mostly, datapoint $i$ denotes an experimental unit, which is treated when ${d}_{i}=1$ and is in control when ${d}_{i}=0$. There are $n=30$ units in total. Let ${n}_{1}={\sum}_{i}{d}_{i}=3$ and ${n}_{0}=n{n}_{1}=27$ be the number of treated and control units, respectively. The OLS estimate for ${\beta}_{1}$ satisfies ${\widehat{\beta}}_{1}{\beta}_{1}={\overline{\epsilon}}_{1}{\overline{\epsilon}}_{0}$, where ${\overline{\epsilon}}_{1}$ and ${\overline{\epsilon}}_{0}$ are the error averages for treated and control units, respectively. Inference for ${\beta}_{1}$ is complicated because both ${\sigma}_{1}^{2}$ and ${\sigma}_{0}^{2}$ are unknown, and the standardized statistic based on ${\widehat{\beta}}_{1}$ is not exactly $t$distributed. This is known as the Behrens–Fisher problem.
One popular approach due to welch1951comparison is to approximate the distribution of the statistic $({\widehat{\beta}}_{1}{\beta}_{1})/\widehat{\sigma}$, where $\widehat{\sigma}$ is some robust standard error estimate. The approximation follows by the adjustment of the robust estimate by a factor $f$ so that $f\widehat{\sigma}/\sigma $ matches the first two moments of a chisquare with $f$ degrees of freedom, where ${\sigma}^{2}={\sigma}_{1}^{2}/{n}_{1}+{\sigma}_{0}^{2}/{n}_{0}$ is the true variance. Because this adjustment depends on unknown quantities, imbens2016robust recommend an adjustment based on the bias reduction method of mccaffrey2002bias. Another approach altogether is to use the wild bootstrap, or more generally the sign tests of Section 3.2.
Remarkably, we can construct an exact randomization test using the approach of Section 2.3. The idea of this approach is to construct a clustering of the datapoints, such that ${X}^{\top}X$ within each cluster is similar to the population ${X}^{\top}X$. In this simple example, we actually have a closedform formula for the population quantity:
$${X}^{\top}X=n\left(\begin{array}{cc}\hfill 1\hfill & \hfill \overline{d}\hfill \\ \hfill \overline{d}\hfill & \hfill \overline{d}\hfill \end{array}\right)=30\left(\begin{array}{cc}\hfill 1\hfill & \hfill .1\hfill \\ \hfill .1\hfill & \hfill .1\hfill \end{array}\right).$$ 
This suggests to split the units in clusters, so that in each cluster there is the same proportion of treated units, $\overline{d}$. Thus, we can split the units in 3 clusters, with 1 treated unit and 9 control units per cluster. Let ${\mathrm{C}}_{0}$ denote such clustering. Then, for any cluster $c\in {\mathrm{C}}_{0}$,
$${X}_{c}^{\top}{X}_{c}=10\left(\begin{array}{cc}\hfill 1\hfill & \hfill .1\hfill \\ \hfill .1\hfill & \hfill .1\hfill \end{array}\right)\propto {X}^{\top}X.$$ 
Next, consider the primitive ${\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}_{0})$ defined in Equation (17). This set forms an algebraic group that contains 8 elements, and describes a 3cluster sign test. Then, the conditions of Theorem 2 hold, and the residual randomization test with ${\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}_{0})$ as the primitive is exact in finite samples.
In summary, the exact randomization test in this problem proceeds as follows. For a null hypothesis, ${H}_{0}:{\beta}_{1}={\beta}_{1}^{0}$, we take the following steps.

1.
Calculate the residuals $\widehat{{\epsilon}^{\mathrm{o}}}=y\overline{y}\mathrm{\U0001d7d9}{\beta}_{1}^{0}d$, and the observed test statistic, ${T}_{n}=\sqrt{n}({\widehat{\beta}}_{1}{\beta}_{1}^{0})$.

2.
Define ${t}_{n}(u)=\sqrt{n}[{d}^{\prime}u/{n}_{1}{(1d)}^{\prime}u/{n}_{0}]=\sqrt{n}({\overline{u}}_{1}{\overline{u}}_{0})$.

3.
For every $\U0001d690\in {\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}_{0})$, calculate ${t}_{n}(\U0001d690\widehat{{\epsilon}^{\mathrm{o}}})$.
A pvalue is obtained by comparing the test statistic values from Step 3 with the observed value. This pvalue is exact since Theorem 2 implies that ${t}_{n}(\U0001d690\widehat{{\epsilon}^{\mathrm{o}}})={t}_{n}(\U0001d690\epsilon )$, for every $\U0001d690\in {\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}_{0})$.
To illustrate numerically, we analyze three methods in a simulated study of this problem: the “BM” correction discussed by imbens2016robust, the sign test of Section 3.2 (similar to wild bootstrap), and the exact test described above. Following imbens2016robust, we fix ${\sigma}_{1}=1$ and consider ${\sigma}_{0}=0.5,1,2,5$. We always test the null hypothesis, ${H}_{0}:{\beta}_{1}=0$, and report three panels, corresponding to true values $0,1$ and $2$ for ${\beta}_{1}$. We also consider three different error types: normal, ${t}_{3}$, and a mixture. For the mixture, we sample ${\epsilon}_{i}\sim .5N(1,{.25}^{2})+.5N(1,{.25}^{2})$.
The results of the simulation are shown in Table 5. We see that all methods do well when ${\sigma}_{0}=0.5$ or ${\sigma}_{0}=1$, and the errors are normal (see first two columns across panels). The exception is perhaps the cluster sign test that overrejects when ${\sigma}_{0}=.5$. With normal errors, all tests become more conservative except for the exact test (see Columns 3 and 4) as ${\sigma}_{0}$ grows. We get the same picture with $t$errors, with only an apparent reduction in power for all tests. As ${\sigma}_{0}$ increases all tests become increasingly conservative and powerless, except for the exact test. For instance, when ${\beta}_{1}=1.0$ the BM test rejects only 0.8% of the time with normal errors and ${\sigma}_{0}=5$, whereas the same number is 21.5% when ${\sigma}_{0}=.5$. Not surprisingly, the exact test shows a remarkable robustness across panels. It is affected by increased ${\sigma}_{0}$, or nonnormal errors, but only slightly.
With mixture errors all tests except for the exact test badly overreject. For example, in Panel (A) we see that the BM method rejects 25% of the time when ${\sigma}_{0}=.5$. In contrast, the same number is 5% when the errors are normal. This indicates that normality may be crucial for the analytic tests. As ${\sigma}_{0}$ increases these tests again become increasingly conservative and powerless. For instance, when ${\beta}_{1}=1.0$ we see that BM rejects only 0.9% of the time. The exact test behaves fine here as well. It maintains the correct level, and does not lose power under various alternatives.
Panel (A). True ${\beta}_{1}=0.0$  

Error type, ${\epsilon}_{i}$  
normal  ${t}_{3}$  mixture  
Control standard deviation, ${\sigma}_{0}$  
Method  0.5  1  2  5  0.5  1  2  5  0.5  1  2  5 
BM  0.050  0.028  0.010  0.002  0.034  0.015  0.004  0.000  0.252  0.225  0.034  0.003 
rsign  0.095  0.012  0.000  0.000  0.067  0.012  0.001  0.000  0.213  0.010  0.001  0.000 
rexact  0.048  0.052  0.052  0.050  0.055  0.057  0.054  0.049  0.050  0.046  0.058  0.049 
Panel (B). True ${\beta}_{1}=1.0$  
Method  0.5  1  2  5  0.5  1  2  5  0.5  1  2  5 
BM  0.215  0.161  0.069  0.008  0.146  0.086  0.028  0.003  0.122  0.130  0.119  0.009 
rsign  0.448  0.149  0.007  0.000  0.270  0.065  0.003  0.000  0.214  0.122  0.004  0.000 
rexact  0.124  0.116  0.111  0.073  0.098  0.101  0.081  0.062  0.094  0.083  0.093  0.073 
Panel (C). True ${\beta}_{1}=2.0$  
Method  0.5  1  2  5  0.5  1  2  5  0.5  1  2  5 
BM  0.553  0.511  0.359  0.049  0.418  0.332  0.166  0.016  0.326  0.310  0.183  0.055 
rsign  0.899  0.632  0.090  0.000  0.655  0.290  0.032  0.000  0.978  0.673  0.070  0.001 
rexact  0.172  0.177  0.168  0.119  0.147  0.145  0.131  0.089  0.197  0.197  0.173  0.127 
In summary, this simulation points to the following conclusions. First, robust error estimates and the BM correction work well with normal errors, but they can be very conservative and lose power in small samples. This may be undesirable, especially in problems where the treatment effect is expected to be weak. Second, heaviertailed errors do affect the tests, but not catastrophically so as long as the errors are smooth and normallike. Additional simulations with Cauchy errors (not shown here), for instance, support this point. Third, the tests behave badly with stronger deviations from normality. Resampling methods like wild bootstrap, which are considered a good guard against that, don’t hold up. Here, such sign tests badly overreject in small samples.
On a high level, it speaks to the flexibility and power of the residual randomization method that an exact test could be constructed here. The caveat is, of course, that such construction requires a careful study of the problem structure, and is not always possible.
5.3 Dyadic regression
To illustrate residual randomization inference with twoway clustering, we consider a simulation study on dyadic regression inspired by related work (cameron2011robust; aronow2015). The data generating model is defined as follows:
$${y}_{i}={\beta}_{0}+{\beta}_{1}{x}_{r(i)}{x}_{c(i)}+{\epsilon}_{i}.$$ 
Here, $i$ denotes a dyad of row and column elements (clusters); $r(i)\in \{1,\mathrm{\dots},m\}$ is the row cluster of $i$, and $c(i)\in \{1,\mathrm{\dots},m\}$ is the column cluster of $i$, with $c(i)\le r(i)$. For $n$ dyads it thus holds $n=m(m1)/2$. We fix ${\beta}_{0}=1$ and ${H}_{0}:{\beta}_{1}=1$, and test ${\beta}_{1}=1$ for validity, and ${\beta}_{1}=1.2$ for power. We consider a similar structure for the errors: ${\epsilon}_{i}={\u03f5}_{r(i)}+{\u03f5}_{c(i)}+{u}_{i}$. The errors ${u}_{i}$ are independent standard normal, ${u}_{i}\sim N(0,1)$. We also consider two choices for covariates $x$ and errors $\u03f5$. For the covariates, ${x}_{j}\sim N(0,1)$ or ${x}_{j}\sim LN(0,1)$, the standard lognormal distribution, where $j=1,\mathrm{\dots},m$. For the errors, either ${\u03f5}_{j}\sim N(0,1)$, or ${\u03f5}_{j}\sim .5N(1,{.25}^{2})+.5N(1,{.25}^{2})$, the mixture distribution introduced in Section 5.2.2. This leads to four different simulation settings. For every setting we consider $m=10,25,50$ clusters to see how the various methods perform with respect to sample size, which here increases quadratically—$n=100,625,2500$, respectively.
The results are shown in Table 6. We present results on three methods: a HC2 robust estimator as a strawman, the twoway cluster robust estimator (cameron2011robust; aronow2015), and the double permutation test of Section 4.5. We see that the standard HC2 estimator performs badly, which is expected since this estimator ignores the doubly clustered nature of the errors. The twoway cluster robust estimator is a clear improvement. It achieves the nominal level in many settings. However, it is also unstable, and is strongly affected by the error and covariate distributions. For example, in Panel (A) the estimator overrejects at 11.5% when $n=100$, and only gets down to 6% when $n=2500$, when ${x}_{i}$ are lognormal and ${\epsilon}_{i}$ are normal.
In contrast, the double permutation test is notably robust and largely unaffected by the covariate distribution. In Panel (A) its rejection level is always around 5%, with the worst rate being around $6\%$ when $n=100$. We see that the randomization test trades power for such robustness. This is shown in Panel (B), where the twoway test can be much more powerful than the randomization test. For instance, when both errors and covariates are normal and $n=2500$, the twoway test rejects 98% of the time, whereas the randomization test rejects 25%, almost $4\times $ smaller. Under the true null, both tests are at the nominal level. The situation is noticeably improved when the covariates are lognormal. For instance, when the errors are normal and the covariates are lognormal, the aforementioned rates are $99.7\%$ and $61\%$, respectively.
Panel (A). True ${\beta}_{1}=1.0$  

Errorcovariate, $({\epsilon}_{i},{x}_{i})$  
(normal, normal)  (normal, lognormal)  (mixture, normal)  (mixture, lognormal)  
Sample size, $n$  
Method  100  625  2500  100  625  2500  100  625  2500  100  625  2500 
HC2  0.320  0.167  0.118  0.392  0.330  0.238  0.322  0.172  0.131  0.437  0.414  0.311 
2way robust  0.090  0.061  0.046  0.114  0.091  0.062  0.080  0.041  0.050  0.101  0.091  0.057 
double perm.  0.060  0.057  0.052  0.047  0.055  0.045  0.053  0.037  0.057  0.053  0.057  0.050 
Panel (B). True ${\beta}_{1}=1.2$  
100  625  2500  100  625  2500  100  625  2500  100  625  2500  
HC2  0.363  0.279  0.359  0.488  0.616  0.734  0.360  0.267  0.279  0.470  0.543  0.675 
2way robust  0.150  0.537  0.981  0.301  0.788  0.997  0.144  0.524  0.983  0.286  0.775  0.997 
double perm.  0.075  0.134  0.252  0.155  0.372  0.609  0.079  0.144  0.245  0.157  0.390  0.601 
In summary, the simulation results paint a mixed picture. If robustness is a concern, then the residual randomization test should be used. If power is more important, then we could use a twoway cluster robust estimator. In this case, we may improve the test through covariate transformations, e.g., through logtransforms. We caution, however, that the twoway cluster robust method faces the additional problem that the variance estimate is not always positivedefinite, and it is unclear what the best remedy is in such cases. These bad examples were removed from the simulation.
Remark 5.1. In an influential paper, petersen2009estimating performs an empirical comparison of standard error estimates in financial panel data with firmclustered and timeclustered errors. The double permutation test of Section 4.5 could also be used in this context. A better model is perhaps errors with autocorrelated structure, which we discuss in the following section.
6 Extensions
In this section, we consider regression with autocorrelated errors, and highdimensional regression. We mainly propose ideas without further theoretical investigation, and point to supporting evidence from empirical evaluations. Our goal is to show the potential of residual randomization to address an even broader spectrum of problems than what has been presented so far.
6.1 Highdimensional regression
Here, we consider highdimensional regression, where the number of covariates possibly exceeds the number of datapoints. The main problem we face here is that standard OLS does no longer work, and so we have to revise our main testing procedure with a penalized estimator.
Specifically, consider ridge regression with scalar penalty $\lambda $, namely ${\widehat{\beta}}^{\mathrm{ridge}}={({X}^{\top}X+\lambda I)}^{1}{X}^{\top}y.$ Since $y=X\beta +\epsilon $, we can rewrite the estimator as ${\widehat{\beta}}^{\mathrm{ridge}}=(I\lambda {P}_{\lambda}^{1})\beta +{P}_{\lambda}^{1}{X}^{\top}\epsilon $, where ${P}_{\lambda}=({X}^{\top}X+\lambda I)$. Under the null hypothesis in Equation (3) it follows that
$${a}^{\prime}{\widehat{\beta}}^{\mathrm{ridge}}{a}_{0}+\lambda {a}^{\prime}{P}_{\lambda}^{1}\beta ={a}^{\prime}{P}_{\lambda}^{1}{X}^{\top}\epsilon .$$ 
We cannot use the randomization test immediately here because the term $\lambda {a}^{\prime}{P}_{\lambda}^{1}\beta $ is unknown. To proceed, we could use a plugin estimate such as ${\widehat{\beta}}^{\mathrm{ridge}}$, or ${\widehat{\beta}}^{\mathrm{lasso}}$, the LASSO estimate. Given the plugin estimate, the residual randomization test could be adapted as follows:

1.
Given $y,X,$ calculate the LASSO and ridge regression estimates, ${\widehat{\beta}}^{\mathrm{ridge}}$ and ${\widehat{\beta}}^{\mathrm{lasso}}$. Calculate the residuals, $\widehat{\epsilon}=yX{\widehat{\beta}}^{\mathrm{lasso}}$.

2.
Let ${T}_{n}=\sqrt{n}({a}^{\prime}{\widehat{\beta}}^{\mathrm{ridge}}{a}_{0}+\lambda {a}^{\prime}{P}_{\lambda}^{1}{\widehat{\beta}}^{\mathrm{lasso}})$ be the observed test statistic value.

3.
Repeat: Sample $G\sim \mathrm{Unif}(\mathcal{G})$ and store ${t}_{n}(G\widehat{\epsilon})$, where ${t}_{n}(u)=\sqrt{n}{a}^{\prime}{P}_{\lambda}^{1}{X}^{\top}u$.

4.
Compare the value from Step 2 with the values from Step 3 to calculate the pvalue.
The inferential primitive, $\mathcal{G}$, could be defined as in earlier sections, e.g., as permutations or random signs on the residuals.
We now proceed to an empirical evaluation of this test, and conclude with some remarks at the end of this section. We compare with the hdi package of dezeure2015high, and particularly on a method that relies on a desparsified LASSO estimator (zhang2014confidence)—details are in Appendix D. We choose this method because in an extensive simulation study (dezeure2015high, Section 2.5) this method performed better on aggregate than a variety of other methods, including the nested covariance test of lockhart2014significance and the debiasing method of javanmard2014confidence. As all these tests, including residual randomization, are conditional, in future work we plan to compare also with unconditional LASSObased tests (belloni2014inference; chernozhukov2018double).
For a precise comparison we run the randomization test described above using the same simulation study, but extended to include more error types. More specifically, there are $p=500$ covariates and $n=100$ data points, and $X\sim {N}_{p}(0,\mathrm{\Sigma})$, where $\mathrm{\Sigma}$ is a Toeplitz matrix with $ij$th element equal to ${0.9}^{ij}$. Only ${s}_{0}$ are the active parameters, with ${s}_{0}=3$ or ${s}_{0}=15$. There are 6 different ways to define the active parameters: randomly as $U(2,2),U(0,2),U(0,4)$, or fixed at 1, 2, or 10. So, in total there are 12 different simulation settings. The errors $\epsilon $ are i.i.d. $N(0,1)$, and we also consider ${t}_{3}$ and Cauchy. For every setting, we take 50 samples of $X$. For every setting and $X$ sample, we draw $\epsilon $ and generate outcomes $y=X\beta +\epsilon $ to run the tests, and repeat this process 100 times. This gives us $12\times 50\times 100=6000$ total test comparisons. Because both tests are computationally expensive we run the simulation on a cluster using 600 cores with x86, 64bit architecture at 2.1Ghz. It takes about 15 minutes to complete the entire simulation, totaling to about a week of wallclock time.
The results of this simulation study with normal errors are shown in Figure 2, split in 6 panels corresponding to the 6 different definitions of the active parameters. Here, we aggregate the data for ${s}_{0}=3$ and ${s}_{0}=15$, and present the individual plots in Appendix E.1. In the simulations below we test all individual hypotheses ${H}_{0,j}:{\beta}_{j}=0$, and for the randomization test we perform a simple Bonferroni correction; the hdi package makes a similar correction. The plots show the power of the two methods defined as the proportion of active parameters that a method is able to uncover (i.e., number of true rejections over ${s}_{0}$). We will also report results on the familywise error rate (FWER) defined as the proportion of false rejections over all comparisons.
We see that the randomization method is consistently more powerful than the desparsified LASSO method. The difference is even more pronounced when the signal is weak. For example, when the active parameters are equal to 1 (bottom, left panel) the power of the randomization method is around 95% compared to 55% for the LASSO method. As the signal becomes stronger this difference becomes smaller (see other bottom panels). Marginally, the Q1/Q2/Q3 quartiles for power are $0.76,0.95,1.00$ for the randomization method, and $0.46,0.70,1.00$ for the LASSO method. The quartiles for FWER are $(.52,.97,3.5)\times {10}^{2}$ and $(.80,1.6,5.8)\times {10}^{4}$ for randomization and LASSO, respectively. We see that both methods are conservative, but residual randomization reaches much closer to the nominal 5% level. Further improvements may be possible since the Bonferroni correction used for the randomization method is likely very conservative. In Appendix E.1 we present more results using ${t}_{3}$ errors. In that setting, both tests lose some power due to the heaviertailed errors, but the overall picture remains the same.
Remark 6.1. The residual randomization test of this section can easily be adapted to test for heteroskedastic errors, or even clustered errors, by choosing the appropriate primitive $\mathcal{G}$. It would be interesting to compare the result randomization test to the approaches of cattaneo2018inference and d2018cluster in the context of highdimensional inference with heteroskedastic and clustered errors, respectively.
6.2 Autocorrelated errors
Here, we consider a model with autocorrelated errors,
$${y}_{t}={x}_{t}^{\prime}\beta +{\epsilon}_{t},$$  (23) 
where $t=1,\mathrm{\dots},n$, is used as the index, and indicates time. Suppose the following error structure:
$${\epsilon}_{t}={\rho}_{t}{\epsilon}_{t1}+{u}_{t},$$  (24) 
where ${u}_{t}$ are independent and symmetric around zero, and ${\rho}_{t}\le 1$. This structure is similar to AR(1) but we do allow some form of nonstationarity since ${\rho}_{t}$ may not converge. Our proposal here relies on the simple observation that ${\epsilon}_{t}$ is conditionally symmetric around zero given ${\epsilon}_{t1}=0$, that is, ${\epsilon}_{t}\stackrel{d}{=}{\epsilon}_{t}\mid \{{\epsilon}_{t1}=0\}$. We can therefore group the datapoints in consecutive clusters, such that at the cluster endpoints the errors are zero. In practice, since we don’t have access to the true errors, we can cluster the restricted residuals, $\widehat{{\epsilon}^{\mathrm{o}}}$, in a few clusters, such that the endpoints correspond to the lowest values of $\widehat{{\epsilon}^{\mathrm{o}}}$, and then perform a cluster sign test as defined in Section 4.3. The full procedure can be described as follows.

1.
Calculate the restricted residuals, $\widehat{{\epsilon}^{\mathrm{o}}}=yX\widehat{{\beta}^{\mathrm{o}}}$.

2.
Order their absolute values, ${\widehat{\epsilon}}_{t}^{\mathrm{o}}$, and select the $J+1$ smallest values. Denote the corresponding timepoints as ${t}_{0},\mathrm{\dots},{t}_{J}$.

3.
Define the clustering, ${\mathrm{C}}^{n}=\{\{{t}_{0},\mathrm{\dots},{t}_{1}\},\{{t}_{1}+1,\mathrm{\dots},{t}_{2}\},\mathrm{\dots},\{{t}_{J1}+1,{t}_{J}\}\}$.

4.
Perform the cluster sign test based on ${\mathcal{G}}^{\U0001d5cc}({\mathrm{C}}^{n})$ as defined in Section 4.3.
We refer to this cluster sign test as the reflection test, since it rests upon the assumption that the error time series, ${\epsilon}_{t}$, can be reflected around the time axis. Because it can happen that $$ we distinguish two variants of the reflection test. The unconditional variant rejects when $$, while the conditional variant only decides when ${\mathrm{C}}^{n}\ge J$, but is otherwise undefined. In terms of validity, Theorem 6 already contains the main intuition. Under our model assumptions the cluster sizes, ${n}_{c}$, are bounded random variables. It follows that ${\sum}_{c}{n}_{c}^{2}/{n}^{2}\stackrel{p}{\to}0$, which by Theorem 6 is sufficient for asymptotic validity. The full proof needs to address the fact that the clustering here is random and approximate, as it conditions on values ${\widehat{\epsilon}}_{t}^{\mathrm{o}}\approx 0$ instead of these values being exactly zero. We leave the full proof for future work.
To illustrate the reflection test, we show results from a simulated study. We use the model of Equation (23) with the error structure of Equation (24). Throughout this study, we fix ${\rho}_{t}=\rho $, with $\rho \in \{0,0.3,0.5,0.8\}$. We set ${\beta}_{0}={\beta}_{1}=0$ and report the rejection rates under the true null, ${H}_{0}:{\beta}_{1}=0$ on $n=100$ datapoints. There are four models for the covariates: (i) ${x}_{it}=N(0,1)$, or (ii) ${x}_{it}=LN(0,1)$, or (iii) ${x}_{it}=\rho {x}_{i,t1}+N(0,1)$, or (iv) ${x}_{it}=\rho {x}_{i,t1}+LN(0,1)$, where ${x}_{it}$ is the $i$th component of ${x}_{t}$, and $LN$ is the lognormal. Thus, in settings (i) and (ii) ${x}_{it}$ are independent, and in settings (iii) and (iv) ${x}_{it}$ are autocorrelated. The errors are either ${\epsilon}_{it}=\rho {\epsilon}_{i,t1}+N(0,1)$, or ${\epsilon}_{it}=\rho {\epsilon}_{i,t1}+\mathrm{mix}$, where $\mathrm{mix}=.5N(1,{.25}^{2})+.5N(1,{.25}^{2})$ was introduced in Section 5.2.2.
The results of the simulation are shown in Table 7. We include results on standard OLS as a strawman, and heteroskedasticity and autocorrelation consistent errors (HAC) (white1980heteroskedasticity; mackinnon1985some; andrews1991heteroskedasticity), the latter as provided by the vcovHAC function in R’s sandwich package (zeileis2004econometric). For the reflection test we set $J=6$, and report results for both the conditional and unconditional reflection test. The rejection rate for the conditional reflection test is calculated only over the cases for which the test returns a definite decision.
First, we focus on Panel (A) where $\rho =0.3$. We see that standard OLS is actually performing well, achieving in many cases the nominal 5% level. HAC errors are worse due to the small sample, and are strongly affected by lognormal covariates—for instance, rejection rates jump from 6.6% to 11.2% from Column (i) to Column (ii). Using a mixture distribution for the errors also seems to affect HAC errors disproportionately. For example, the aforementioned numbers are now 6.6% and 14.5%, respectively, when the ${u}_{t}$ errors are from the mixture. Next, we look at Panel (B) where $\rho =0.8$. We now see that standard OLS is greatly affected by the increased error autocorrelation. For example, with normal ${u}_{t}$ and autocorrelated ${x}_{it}$, the rejection rates are now around 33% compared to 7% in Panel (A). HAC errors are more robust but they still are affected by lognormal and autocorrelated covariates, and generally reject $2\times $ to $3\times $ higher than the nominal level.
Panel (A): $\rho =0.3$  

Error ${\epsilon}_{t}=\rho {\epsilon}_{t1}+{u}_{t}$, ${u}_{t}=\mathrm{\dots}$  
normal  mixture  
Covariates ${x}_{t}$  
iid  autocorrelated  iid  autocorrelated  
Method  (i)  (ii)  (iii)  (iv)  (i)  (ii)  (iii)  (iv) 
OLS  0.052  0.054  0.073  0.078  0.053  0.050  0.073  0.071 
HAC  0.066  0.112  0.065  0.112  0.066  0.145  0.070  0.130 
reflection test, uncond.  0.031  0.030  0.034  0.034  0.045  0.048  0.042  0.042 
reflection test, cond.  0.051  0.048  0.054  0.055  0.053  0.057  0.050  0.049 
Panel (B): $\rho =0.8$  
(i)  (ii)  (iii)  (iv)  (i)  (ii)  (iii)  (iv)  
OLS  0.048  0.048  0.341  0.339  0.049  0.050  0.336  0.346 
HAC  0.050  0.087  0.104  0.128  0.053  0.097  0.102  0.141 
reflection test, uncond.  0.022  0.023  0.024  0.027  0.031  0.029  0.032  0.030 
reflection test, cond.  0.049  0.052  0.055  0.061  0.053  0.050  0.052  0.051 
The reflection tests compare favorably. They maintain the nominal level in virtually all settings. We also see that the unconditional test is more conservative. This is expected since the test does not reject when $$. The conditional test decides when ${\mathrm{C}}^{n}\ge J$, and maintains the nominal level almost exactly.
Remark 6.2. In our simulations, in 6080% of replications we have ${\mathrm{C}}^{n}\ge 6$ when $\rho =0.3$; when $\rho =0.8$ this number drops to 4055%. In the extreme case where the autocorrelation is very high (e.g., $\rho =0.99$) the same number is about 1%. In such cases the conditional reflection test rarely returns a definite decision. This is actually a feature, not a bug. When $\rho $ is very high, the error process is almost a random walk, indicating that inference is not possible. The reflection test indirectly captures this limitation by producing empty ${\mathrm{C}}^{n}$, and not returning a value.
Remark 6.3. The reflection test does not require estimation of the unknown error correlation $\rho $, and even works when $\rho $ is timedependent. Furthermore, it may be straightforward to extend to autocorrelation lags of order higher than one by defining reflection points on consecutive residuals, e.g., conditioning on both ${\widehat{\epsilon}}_{t}^{\mathrm{o}}\approx 0$ and ${\widehat{\epsilon}}_{t+1}^{\mathrm{o}}\approx 0$. This may work well when autocorrelation is small enough or when the data size is big enough.
7 Discussion
While the residual randomization method has performed well in extensive empirical evaluations with complex error structures, there are nonetheless several open questions.
First, all our theoretical results are on asymptotic validity, while power can only be indirectly studied through our theory; e.g., by looking at the variance of ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}}){t}_{n}(G\epsilon )$, the difference between the feasible and the idealized test statistic. It would be interesting to know, for example, when additional invariances can yield higher powered tests. Such result could offer insights on the relative power benefits of the randomization approach over a more standard bootstrap approach. This analysis is complicated because it not only depends on the particular error distribution, or the problem structure, but also on the choice of the test statistic. Second, residual randomization showed notable performance in settings with autocorrelated errors and highdimensional regression, which are not covered by our theory. Developing theory for such settings is challenging. For example, even though the reflection test described in Section 6.2 is a cluster sign test, the clustering we condition on is such that the desirable invariance assumption holds only approximately. It would thus be interesting to extend Theorem 1 to include approximate error invariances, which are only exact in the limit. This is reminiscent of randomization testing under approximate symmetry conditions, as studied by canay2017randomization.
Finally, extending residual randomization beyond linear models is important. The key technical challenge would likely be that the difference ${t}_{n}(G\widehat{{\epsilon}^{\mathrm{o}}}){t}_{n}(G\epsilon )$, which determines the asymptotic validity of the randomization method, is no longer easy to express due to nonlinearity. Perhaps we could extend the idea in Section 6.2 of residual randomization with autocorrelated errors. For instance, suppose we can express $\widehat{\beta}$ as $\widehat{\beta}=\beta +h(y,X;\beta )+{f}_{n}(\epsilon )$, so that ${a}^{\prime}\widehat{\beta}{a}^{\prime}h(y,X;\beta ){a}_{0}={a}^{\prime}{f}_{n}(\epsilon )$ under the null. For the test statistic, ${T}_{n}$, we could use a plugin estimate for $h$, i.e., define ${T}_{n}={a}^{\prime}\widehat{\beta}{a}^{\prime}h(y,X;\widehat{\beta}){a}_{0}$, and then compare its observed value to the set $\{{a}^{\prime}{f}_{n}(\U0001d690\widehat{{\epsilon}^{\mathrm{o}}})\}$.
8 Conclusion
In this paper, we developed a residual randomization method for inference in regression models with complex error structure. We proved asymptotic validity for a variety of structures, particularly clustered errors with oneway and twoway clustering. We also illustrated construction of an exact randomization test in the Behrens–Fisher problem. Through extensive simulations, we showed that the proposed method is more robust in small samples than asymptotic methods, and is more flexible and dataadaptive than bootstrap. Extensions to highdimensional regression and models with autocorrelated errors were also considered with notable empirical success.
References
Appendix A Proofs of Main Theorems
Theorem 1.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2) with inferential primitive $\mathrm{G}$. Let $G\mathrm{\sim}\mathrm{Unif}\mathit{}\mathrm{(}\mathrm{G}\mathrm{)}$, and suppose that
$${\left({X}^{\top}X\right)}^{1}\left({X}^{\top}GX\right)\stackrel{p}{\to}bI,b\ge 0.$$  (10) 
Then, the residual randomization procedure of Section 2.1 is asymptotically valid; that is,
$$lim\underset{n\to \mathrm{\infty}}{sup}\mathrm{E}\left(\varphi (y;X)\right)\le \alpha .$$ 
Proof.
Let ${S}_{n}=\left(1/n\right){X}^{\top}X$. By Assumption 1(b), ${S}_{n}\to S$. Moreover,
$${t}_{n}\left(u\right)=\sqrt{n}{a}^{\prime}{\left({X}^{\top}X\right)}^{1}{X}^{\top}u={a}^{\prime}{S}_{n}^{1}\frac{1}{\sqrt{n}}{X}^{\top}u.$$ 
The convergence of constrained OLS, $\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\stackrel{d}{\to}\omega ,$ for some random variable $\omega $, is guaranteed by Assumptions 1(a)(c), and the convergence of the unconstrained estimator, $\sqrt{n}\left(\widehat{\beta}\beta \right)={S}_{n}^{1}\frac{1}{\sqrt{n}}{X}^{\top}\epsilon \stackrel{d}{\to}{S}^{1}\mathcal{Z}$. Furthermore, ${a}^{\prime}\omega =0$, since ${a}^{\prime}\widehat{{\beta}^{\mathrm{o}}}={a}_{0}$ by definition of the constrained least squares estimator.
In the randomization test we sample ${G}_{r}\sim \mathrm{Unif}\left(\mathcal{G}\right)$, $r=1,\mathrm{\dots},R$, as i.i.d. samples uniformly from $\mathcal{G}$, where $R$ is fixed. By assumption of the theorem, for every $r=1,\mathrm{\dots},R$,
$${S}_{n}^{1}\frac{1}{n}\left({X}^{\top}{G}_{r}X\right)\stackrel{p}{\to}bI.$$ 
Thus, we obtain successively:
${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right){t}_{n}\left({G}_{r}\epsilon \right)$  $={a}^{\prime}{S}_{n}^{1}{\displaystyle \frac{1}{\sqrt{n}}}{X}^{\top}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}{G}_{r}\epsilon \right)$  
$={a}^{\prime}{S}_{n}^{1}{\displaystyle \frac{1}{\sqrt{n}}}{X}^{\top}\left[{G}_{r}\left(\widehat{{\epsilon}^{\mathrm{o}}}\epsilon \right)\right]\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{From linearity of}}{G}_{r}\right]$  
$={a}^{\prime}{S}_{n}^{1}{\displaystyle \frac{1}{\sqrt{n}}}{X}^{\top}\left\{{G}_{r}\left(X\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\right)\right\}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}\widehat{{\epsilon}^{\mathrm{o}}}\epsilon =X\left(\beta \widehat{{\beta}^{\mathrm{o}}}\right)\right]$  
$={a}^{\prime}\underset{=bI}{\underset{\u23df}{{S}_{n}^{1}\left({\displaystyle \frac{1}{n}}{X}^{\top}{G}_{r}X\right)}}\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\stackrel{d}{\to}{a}^{\prime}\left(bI\right)\omega \mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Slutsky\u2019s theorem}}\right]$  
$=b{a}^{\prime}\omega =0.\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from constrained OLS}}\right]$  (11) 
Since ${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right){t}_{n}\left({G}_{r}\epsilon \right)\stackrel{d}{\to}0$, convergence in probability also follows, ${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right)={t}_{n}\left({G}_{r}\epsilon \right)+{o}_{P}\left(1\right)$.
Now, we turn our attention to the variables ${T}_{n}^{\left(r\right)}\triangleq {t}_{n}\left({G}_{r}\epsilon \right)$, $r=1,\mathrm{\dots}R$. The test statistic is defined as ${T}_{n}=\sqrt{n}\left({a}^{\prime}\widehat{\beta}{a}_{0}\right)$, and so ${T}_{n}\stackrel{{H}_{0}}{=}{t}_{n}\left(\epsilon \right)$ under the null hypothesis. By Assumption 1(c) we have
$${T}_{n}\stackrel{{H}_{0}}{=}{t}_{n}\left(\epsilon \right)={a}^{\prime}{S}_{n}^{1}\frac{1}{\sqrt{n}}{X}^{\top}\epsilon \stackrel{d}{\to}{a}^{\prime}{S}^{1}\mathcal{Z}\triangleq T.$$  (12) 
Thus, under the null, ${T}_{n}$ converges in distribution. By the invariance assumption, $\epsilon \stackrel{d}{=}{G}_{r}\epsilon $, variable ${T}_{n}^{\left(r\right)}$ also has a limit law, such that
$${T}_{n}^{\left(r\right)}\stackrel{d}{\to}{T}_{r},\text{and}{T}_{r}\stackrel{d}{=}T.$$  (13) 
Furthermore, define ${\widehat{T}}_{n}^{\left(r\right)}={t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right)$, so that ${\widehat{T}}_{n}^{\left(r\right)}={T}_{n}^{\left(r\right)}+{o}_{P}\left(1\right)\stackrel{d}{\to}{T}_{r}$. Then, it follows
$$({T}_{n},\{{\widehat{T}}_{n}^{\left(r\right)}:r=1,\mathrm{\dots},R\})\stackrel{d}{\to}(T,\{{T}_{r}:r=1,\mathrm{\dots},R\}),$$ 
and so the randomization test that compares ${T}_{n}$ with $\{{\widehat{T}}_{n}^{\left(r\right)}:r=1,\mathrm{\dots},R\}$ is asymptotically valid, i.e.,
$lim\underset{n\to \mathrm{\infty}}{sup}P\{{T}_{n}\ge {q}_{1\alpha}\left(\{{\widehat{T}}_{n}^{\left(r\right)}:r=1,\mathrm{\dots},R\}\right)\}\le P\{T\ge {q}_{1\alpha}\left(\{{T}_{r}:r=1,\mathrm{\dots},R\}\right)\}\le \alpha ,$ 
where $q$ is the quantile function, and the first inequality follows from the Portmanteau theorem (lehmann2006testing, Theorem 11.2.1) and the continuous mapping theorem; the last inequality follows from Equation (13) and the randomization test property (lehmann2006testing, Theorem 15.2.2). ∎
Theorem 2.
For fixed $n$, consider a clustering ${\mathrm{C}}^{n}$ with $$. Suppose that for every cluster $c$ there exists ${b}_{c}\mathrm{\in}\mathrm{R}$ such that ${X}_{c}^{\mathrm{\top}}\mathit{}{X}_{c}\mathrm{=}{b}_{c}\mathit{}{X}^{\mathrm{\top}}\mathit{}X$, where ${X}_{c}\mathrm{=}{\mathrm{D}}_{c}\mathit{}X$ are the cluster covariates. For some inferential primitive $\mathrm{G}$, suppose that every member transform, $\mathrm{g}\mathrm{\in}\mathrm{G}$, can be decomposed as $\mathrm{g}\mathrm{=}{\mathrm{\sum}}_{c\mathrm{=}\mathrm{1}}^{J}{s}_{c}\mathit{}{\mathrm{D}}_{c}$, where ${s}_{c}\mathrm{\in}\mathrm{R}$. Then, the residual randomization test with $\mathrm{G}$ as the primitive is exact in finite samples, i.e., $\mathrm{E}\mathit{}\mathrm{(}\varphi \mathit{}\mathrm{(}y\mathrm{,}X\mathrm{)}\mathrm{)}\mathrm{=}\alpha $.
Proof.
By definition, ${X}_{c}={\mathbb{D}}_{c}X$ and ${\mathbb{D}}_{c}{\mathbb{D}}_{c}={\mathbb{D}}_{c}$ since ${\mathbb{D}}_{c}$ is binaryvalued and diagonal. Thus, ${X}_{c}^{\top}{X}_{c}={X}^{\top}{\mathbb{D}}_{c}{\mathbb{D}}_{c}X={X}^{\top}{\mathbb{D}}_{c}X$. In Equation (A) of the proof of Theorem 1 we now have:
${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right){t}_{n}\left({G}_{r}\epsilon \right)$  $={a}^{\prime}{\left({X}^{\top}X\right)}^{1}\left({\displaystyle \frac{1}{n}}{X}^{\top}{G}_{r}X\right)\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)$  
$={a}^{\prime}{\left({X}^{\top}X\right)}^{1}\left[{\displaystyle \frac{1}{n}}{X}^{\top}\left({\displaystyle \sum _{c}}{s}_{cr}{\mathbb{D}}_{c}\right)X\right]\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{we use the representation}}{G}_{r}={\sum}_{c}{s}_{cr}{\mathbb{D}}_{c}\right]$  
$={\displaystyle \sum _{c}}{s}_{cr}{a}^{\prime}{\left({X}^{\top}X\right)}^{1}\left[{\displaystyle \frac{1}{n}}{X}^{\top}{\mathbb{D}}_{c}X\right]\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)$  
$={\displaystyle \sum _{c}}{s}_{cr}{a}^{\prime}{\left({X}^{\top}X\right)}^{1}\left(1/n\right){X}_{c}^{\top}{X}_{c}\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{because}}{X}^{\top}{\mathbb{D}}_{c}X={X}_{c}^{\top}{X}_{c}\right]$  
$={\displaystyle \sum _{c}}\left(1/\sqrt{n}\right){s}_{cr}{b}_{c}{a}^{\prime}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{by cluster similarity assumption}}\right]$  
$={\displaystyle \sum _{c}}(1/\sqrt{n}){s}_{cr}{b}_{c}({a}_{0}{a}_{0})=0.\mathit{\hspace{1em}\hspace{1em}\hspace{0.25em}}\left[\mathit{\text{by constrained OLS,}}\widehat{{\beta}^{\mathrm{o}}}\mathit{\text{and the null hypothesis}}\right].$ 
This result implies that the randomization values of the test statistic, ${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right)$, are exactly equal to the true values, ${t}_{n}\left({G}_{r}\epsilon \right)$. The test is therefore exact, i.e., valid in finite samples for any finite $n$. ∎
Appendix B Proofs of Validity Theorems
B.1 Preliminaries
For every column $x$ of covariate matrix $X$, it follows from Assumption 1(a) that
$$\overline{x}=\frac{1}{n}\sum _{i=1}^{n}{x}_{i}=O\left(1\right),\text{and}\overline{{x}^{2}}=\frac{1}{n}\sum _{i=1}^{n}{x}_{i}^{2}=O\left(1\right).$$  (14) 
Boundedness for the first moment follows from Jensen’s inequality, e.g., ${\overline{x}}^{2}\le \overline{{x}^{2}}=O\left(1\right)$. Using the notation of Section 2.3, for a set $A\subseteq \{1,\mathrm{\dots},n\}$ the vector ${\mathrm{\U0001d7d9}}_{A}$ is one at $i$th element only if $i\in A$, and is zero otherwise. Also, ${\mathrm{\U0001d7d9}}_{i}$ is shorthand for ${\mathrm{\U0001d7d9}}_{\left\{i\right\}}$. The matrix ${\mathbb{D}}_{A}={\mathrm{\U0001d7d9}}_{A}{\mathrm{\U0001d7d9}}_{A}^{\prime}$ is the $n\times n$ diagonal matrix that is one at the $(i,i)$th element only if $i\in A$, and is zero otherwise.
B.2 Main proofs
Theorem 3.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}$. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}$ as the primitive is asymptotically valid.
Proof.
The proof technique for the validity theorems of this section and the next is as follows. First, we consider the matrix ${X}^{\top}GX$, which is a random $p\times p$ matrix with $\mathrm{\ell}m$th element equal to ${x}^{\prime}Gz$, where $x$ denotes the $\mathrm{\ell}$th column of $X$, and $z$ the $m$th column. Suprressing the dependence on $\mathrm{\ell}$ and $m$, we focus on the quantity ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz\in \mathbb{R}$. Generalizing from an expression for ${w}_{n}$ can lead us to a result for ${X}^{\top}GX$. Then, we can check whether the condition in Equation (10) of Theorem 1 holds.
In the permutation context, $G={\sum}_{i=1}^{n}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}$, where $\pi $ is a random permutation from ${\mathrm{\Pi}}_{n}$. Then,
$$\mathrm{E}\left({z}_{\pi \left(i\right)}\right)=\frac{1}{n}\sum _{j=1}^{n}{z}_{j}=\overline{z},\text{and}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)=\frac{1}{n}\sum _{j=1}^{n}{z}_{j}^{2}=\overline{{z}^{2}},$$  (15) 
where the expectation is with respect to the randomness of $\pi $. Furthermore, for fixed $i,j$, $i\ne j$,
$\mathrm{E}({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\mid i\ne j)$  $=\mathrm{E}\left\{{z}_{\pi \left(j\right)}\mathrm{E}({z}_{\pi \left(i\right)}\mid \pi \left(j\right),i\ne j)\right\}$  
$=\mathrm{E}\left({z}_{\pi \left(j\right)}{\displaystyle \sum _{k=1,k\ne \pi \left(j\right)}^{n}}{\displaystyle \frac{1}{n1}}{z}_{k}\right)={\displaystyle \frac{1}{n1}}\mathrm{E}\left\{{z}_{\pi \left(j\right)}\left(n\overline{z}{z}_{\pi \left(j\right)}\right)\right\}={\displaystyle \frac{1}{n1}}\left(n{\overline{z}}^{2}\overline{{z}^{2}}\right).$  (16) 
Also,
${x}^{\prime}Gz$  $={x}^{\prime}\left({\displaystyle \sum _{i=1}^{n}}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}\right)z={\displaystyle \sum _{i=1}^{n}}\left({x}^{\prime}{\mathrm{\U0001d7d9}}_{i}\right)\left({\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}z\right)={\displaystyle \sum _{i=1}^{n}}{x}_{i}{z}_{\pi \left(i\right)},$  (17) 
where the last equality follows from the property ${\mathrm{\U0001d7d9}}_{i}^{\prime}x={x}_{i}$, by definition of ${\mathrm{\U0001d7d9}}_{i}$ (see Section B.1). Then,
$\mathrm{E}\left({w}_{n}\right)={\displaystyle \frac{1}{n}}{x}^{\prime}\mathrm{E}\left(G\right)z$  $={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}\mathrm{E}\left({x}_{i}{z}_{\pi \left(i\right)}\right))={\displaystyle \frac{1}{n}}{\displaystyle \sum}{}_{i=1}{}^{n}x{}_{i}\mathrm{E}({z}_{\pi \left(i\right)})={\displaystyle \frac{1}{n}}{\displaystyle \sum}{}_{i=1}{}^{n}x{}_{i}\overline{z}=\overline{x}\overline{z}.$  (18) 
Here, the first equation follows from (17), and the second from (15).
It follows that $\mathrm{E}\left(\frac{1}{n}{X}^{\top}GX\right)=\mu {\mu}^{\prime}$, where $\mu ={X}^{\top}\mathrm{\U0001d7d9}/n$ are the column sample means of $X$. Next, we derive the variance of ${w}_{n}$.
$\mathrm{Var}\left({w}_{n}\right)$  $=\mathrm{E}\left({w}_{n}^{2}\right){\overline{x}}^{2}{\overline{z}}^{2}={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{i=1}^{n}}{x}_{i}{z}_{\pi \left(i\right)}\right)}^{2}\right\}{\overline{x}}^{2}{\overline{z}}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{17}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i\ne j}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right){\overline{x}}^{2}{\overline{z}}^{2}$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}\overline{{z}^{2}}+{\displaystyle \frac{1}{{n}^{2}\left(n1\right)}}{\displaystyle \sum _{i\ne j}}{x}_{i}{x}_{j}\left(n{\overline{z}}^{2}\overline{{z}^{2}}\right){\overline{x}}^{2}{\overline{z}}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Eqs.(}}\mathit{\text{15}}\mathit{\text{) and(}}\mathit{\text{B.2}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{n}}\overline{{x}^{2}}\cdot \overline{{z}^{2}}+{\displaystyle \frac{\left(n{\overline{z}}^{2}\overline{{z}^{2}}\right)}{n\left(n1\right)}}\left(n{\overline{x}}^{2}\overline{{x}^{2}}\right){\overline{x}}^{2}{\overline{z}}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}{\sum}_{i\ne j}{x}_{i}{x}_{j}={\sum}_{i}{x}_{i}{\sum}_{j\ne i}{x}_{j}={n}^{2}{\overline{x}}^{2}n\overline{{x}^{2}}\mathit{\text{.}}\right]$  
$={\displaystyle \frac{1}{n1}}\left(\overline{{x}^{2}}{\overline{x}}^{2}\right)\left(\overline{{z}^{2}}{\overline{z}}^{2}\right)\triangleq {\displaystyle \frac{1}{n1}}{s}_{x}^{2}{s}_{z}^{2},$  (19) 
where ${s}^{2}$ indicates sample variance. All variances are bounded by Assumption 1, and so $\mathrm{Var}\left({w}_{n}\right)\stackrel{p}{\to}0$. Therefore we obtain: $\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}\mu {\mu}^{\prime}.$ Note that $X$ has ones as the first column for the intercept term. Then, the first column of $\left({X}^{\top}X\right)/n$ is equal to $\mu $, the column sample means. In vector form: $\left({X}^{\top}X\right){\mathrm{\U0001d7d9}}_{1}/n=\mu $, where ${\mathrm{\U0001d7d9}}_{1}$ is the $p\times 1$ vector with only the first element being one. It follows that
$$\left[\left({X}^{\top}X\right){\mathrm{\U0001d7d9}}_{1}/n\right]{\mu}^{\prime}=\mu {\mu}^{\prime}\Rightarrow {\left({X}^{\top}X\right)}^{1}\mu {\mu}^{\prime}={\mathrm{\U0001d7d9}}_{1}{\mu}^{\prime}/n.$$ 
This means that
$${A}_{n}\triangleq n{\left({X}^{\top}X\right)}^{1}\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}{\mathrm{\U0001d7d9}}_{1}{\mu}^{\prime}.$$ 
Let ${\omega}_{n}=\sqrt{n}\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)$, for which we know that ${\omega}_{n}\stackrel{d}{\to}\omega $ from Theorem 1. Equation (A) in the proof of Theorem 1 can now be written as:
${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right){t}_{n}\left({G}_{r}\epsilon \right)$  $={a}^{\prime}{A}_{n}{\omega}_{n}={a}^{\prime}{\mathrm{\U0001d7d9}}_{1}{\mu}^{\prime}{\omega}_{n}+{o}_{P}\left(1\right)$  
$={a}_{1}\sqrt{n}\left({\mathrm{\U0001d7d9}}^{\top}X\left(\widehat{{\beta}^{\mathrm{o}}}\beta \right)/n\right)+{o}_{P}\left(1\right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}\mu ={X}^{\top}\mathrm{\U0001d7d9}/n\right]$  
$={a}_{1}\sqrt{n}\left(\overline{\widehat{{\epsilon}^{\mathrm{o}}}}\overline{\epsilon}\right)+{o}_{P}\left(1\right)\stackrel{p}{\to}0,\left[\mathit{\text{from LLN}}\right]$  (20) 
This result implies that ${t}_{n}\left(G\widehat{{\epsilon}^{\mathrm{o}}}\right)={t}_{n}\left(G\epsilon \right)+{o}_{P}\left(1\right)$. The rest of the proof in Theorem 1 goes through to prove validity even though the condition in Equation (10) of the theorem does not hold. ∎
Theorem 4.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{s}}$. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{s}}$ as the primitive is asymptotically valid.
Proof.
Here, $G\sim \mathrm{Unif}\left({\mathcal{G}}^{\U0001d5cc}\right)$ is equivalent to $G={\sum}_{i=1}^{n}{S}_{i}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}$, where ${S}_{1},\mathrm{\dots},{S}_{n}$ are independent random signs. As in the proof of Theorem 3, let $x,z$ be any two columns of $X$, and define ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz\in \mathbb{R}$. Then,
$\frac{1}{n}}{x}^{\prime}Gz$  $={\displaystyle \frac{1}{n}}{x}^{\prime}\left({\displaystyle \sum _{i=1}^{n}}{S}_{i}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}\right)z={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{S}_{i}\left({x}^{\prime}{\mathrm{\U0001d7d9}}_{i}\right)\left({\mathrm{\U0001d7d9}}_{i}^{\prime}z\right)={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{S}_{i}{x}_{i}{z}_{i}.$  (21) 
It follows that,
$\mathrm{E}\left({w}_{n}\right)={\displaystyle \frac{1}{n}}{x}^{\prime}\mathrm{E}\left(G\right)z$  $={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}\mathrm{E}\left({S}_{i}{x}_{i}{z}_{i}\right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{21}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}{z}_{i}\mathrm{E}\left({S}_{i}\right)$  
$=0\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{Since}}{S}_{i}\mathit{\text{is random sign}}\right].$  (22) 
Next, we derive the variance of ${w}_{n}$.
$\mathrm{Var}\left({w}_{n}\right)=\mathrm{E}\left({w}_{n}^{2}\right)$  $={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{i=1}^{n}}{S}_{i}{x}_{i}{z}_{i}\right)}^{2}\right\}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{B.2}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}{z}_{i}^{2}\mathrm{E}\left({S}_{i}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i\ne j}}{x}_{i}{x}_{j}{z}_{i}{z}_{j}\mathrm{E}\left({S}_{i}{S}_{j}\right)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}{z}_{i}^{2}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i\ne j}}{x}_{i}{x}_{j}{z}_{i}{z}_{j}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{i}\right)}}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{j}\right)}}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}{S}_{i},{S}_{j}\mathit{\text{are independent when}}i\ne j\right]$  
$\le {\displaystyle \frac{1}{n}}\sqrt{{\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}}\sqrt{{\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{z}_{i}^{2}}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from CauchySchwartz inequality}}\right]$  
$={\displaystyle \frac{1}{n}}{\left(\overline{{x}^{2}}\right)}^{1/2}\cdot {\left(\overline{{z}^{2}}\right)}^{1/2}=O\left(1/n\right)\to 0,$ 
where the last limit follows from Equation (14). Thus, $\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}0$, condition (10) of Theorem 1 holds, and the test is asymptotically valid. ∎
Theorem 5.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. Then, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid if the clusters are centered; or if the homogeneity condition of Equation (15) holds and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$.
Proof.
Here, $G\sim \mathrm{Unif}\left\{{\mathcal{G}}^{\U0001d5c9}\left({\mathrm{C}}^{n}\right)\right\}$ is equivalent to $G={\sum}_{c}{\sum}_{i\in \left[c\right]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}$, where $\pi $ is a random withincluster permutation from $\mathrm{\Pi}\left({\mathrm{C}}^{n}\right)$. Let $x,z$ be any two columns of $X$, and define ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz\in \mathbb{R}$. For some $i\in \left[c\right]$:
$$\mathrm{E}\left({z}_{\pi \left(i\right)}\right)=\frac{1}{{n}_{c}}\sum _{{i}^{\prime}\in \left[c\right]}{z}_{{i}^{\prime}}\triangleq \overline{{z}_{c}},\text{and}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)=\frac{1}{{n}_{c}}\sum _{{i}^{\prime}\in \left[c\right]}{z}_{{i}^{\prime}}^{2}\triangleq \overline{{z}_{c}^{2}}.$$  (23) 
For cluster $c$:
$$\sum _{i\in \left[c\right],j\in \left[c\right],i\ne j}{x}_{i}{x}_{j}=\sum _{i\in \left[c\right]}{x}_{i}\sum _{j\in \left[c\right],j\ne i}{x}_{j}=\sum _{i\in \left[c\right]}{x}_{i}\left({n}_{c}\overline{{x}_{c}}{x}_{i}\right)={n}_{c}\left({n}_{c}{\overline{{x}_{c}}}^{2}\overline{{x}_{c}^{2}}\right).$$  (24) 
We follow similar steps to Equation (23) to obtain
$\mathrm{E}({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\mid i,j\in \left[c\right],i\ne j)$  $=\mathrm{E}\left\{{z}_{\pi \left(j\right)}\mathrm{E}({z}_{\pi \left(i\right)}\mid \pi \left(j\right),i,j\in \left[c\right],i\ne j)\right\}=\mathrm{E}\{{z}_{\pi \left(j\right)}{\displaystyle \frac{{n}_{c}\overline{{z}_{c}}{z}_{\pi \left(j\right)}}{{n}_{c}1}}\mid j\in \left[c\right]\}$  
$={\displaystyle \frac{1}{{n}_{c}1}}\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right).$  (25) 
By definition,
${x}^{\prime}Gz$  $={x}^{\prime}\left({\displaystyle \sum _{c=1}^{{J}_{n}}}{\displaystyle \sum _{i\in \left[c\right]}}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}\right)z={\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}\left({x}^{\prime}{\mathrm{\U0001d7d9}}_{i}\right)\left({\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}z\right)={\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{\pi \left(i\right)}.$  (26) 
Note that we keep the double summation to distinguish between clusters. We derive the expectation of ${w}_{n}$:
$\mathrm{E}\left({w}_{n}\right)={\displaystyle \frac{1}{n}}{x}^{\prime}\mathrm{E}\left(G\right)z$  $={\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}\mathrm{E}\left({x}_{i}{z}_{\pi \left(i\right)}\right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{26}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}\mathrm{E}\left({z}_{\pi \left(i\right)}\right)$  
$={\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}\overline{{z}_{c}}$  
$={\displaystyle \sum _{c}}\left({n}_{c}/n\right)\overline{{x}_{c}}\cdot \overline{{z}_{c}}\triangleq {m}_{n}.$  (27) 
Next, we derive the variance.
$\mathrm{Var}\left({w}_{n}\right)=\mathrm{E}\left({w}_{n}^{2}\right){m}_{n}^{2}$  $={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{\pi \left(i\right)}\right)}^{2}\right\}{m}_{n}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{26}}\mathit{\text{)}}\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right],j\in \left[c\right],i\ne j}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)$  
$+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c,{c}^{\prime},c\ne {c}^{\prime}}}{\displaystyle \sum _{i\in \left[c\right],j\in {\left[c\right]}^{\prime}}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right){m}_{n}^{2}$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}\overline{{z}_{c}^{2}}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right],j\in \left[c\right],i\ne j}}{x}_{i}{x}_{j}{\displaystyle \frac{1}{{n}_{c}1}}\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right)$  
$+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c,{c}^{\prime},c\ne {c}^{\prime}}}{\displaystyle \sum _{i\in \left[c\right],j\in {\left[c\right]}^{\prime}}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right){m}_{n}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Eqs.(}}\mathit{\text{23}}\mathit{\text{) and(}}\mathit{\text{B.2}}\mathit{\text{)}}\right]$  
$=\underset{\left(A\right)}{\underset{\u23df}{{\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}{\displaystyle \frac{{n}_{c}}{n}}\overline{{x}_{c}^{2}}\cdot \overline{{z}_{c}^{2}}}}+\underset{\left(B\right)}{\underset{\u23df}{{\displaystyle \sum _{c}}{\displaystyle \frac{\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right)}{{n}^{2}\left({n}_{c}1\right)}}{n}_{c}\left({n}_{c}{\overline{{x}_{c}}}^{2}\overline{{x}_{c}^{2}}\right)}}$  
$+\underset{\left(C\right)}{\underset{\u23df}{{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c,{c}^{\prime},c\ne {c}^{\prime}}}{\displaystyle \sum _{i\in \left[c\right],j\in {\left[c\right]}^{\prime}}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)}}{m}_{n}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{24}}\mathit{\text{)}}\right]$  (28) 
We consider the above terms successively. To simplify notation, we define
$${R}_{c}\triangleq \frac{{n}_{c}}{n}\overline{{x}_{c}}\cdot \overline{{z}_{c}}.$$ 
Notice that ${R}_{c}=O\left(1\right)$ and ${m}_{n}={\sum}_{c}{R}_{c}$.
Term $(A)$. First, $\overline{{x}_{c}^{2}}=O\left(1\right)$ and $\overline{{z}_{c}^{2}}=O\left(1\right)$ from the withincluster boundedness assumption. Thus, the term $\left(A\right)$ is asymptotically negligible:
$$\left(A\right)=\frac{1}{n}\underset{=1}{\underset{\u23df}{\sum _{c}\frac{{n}_{c}}{n}}}O\left(1\right)=O\left(1/n\right).$$ 
Term $(B)$. The term be can be approximated as follows,
$$\left(B\right)=\sum _{c}\frac{\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right)}{{n}^{2}\left({n}_{c}1\right)}{n}_{c}\left({n}_{c}{\overline{{x}_{c}}}^{2}\overline{{x}_{c}^{2}}\right)=\sum _{c}{R}_{c}^{2}\left[1+O\left(\frac{1}{{n}_{c}}\right)\right]+O\left(\frac{{n}_{c}}{{n}^{2}}\right).$$ 
Term $(C)$. Because permutations are independent across clusters, for $i\in \left[c\right],j\in {\left[c\right]}^{\prime}$, with $c\ne {c}^{\prime}$, we have $\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)=\mathrm{E}\left({z}_{\pi \left(i\right)}\right)\mathrm{E}\left({z}_{\pi \left(j\right)}\right)=\overline{{z}_{c}}\cdot \overline{{z}_{{c}^{\prime}}}$. Thus,
$$\left(C\right)=\sum _{c,{c}^{\prime},c\ne {c}^{\prime}}\frac{{n}_{c}{n}_{{c}^{\prime}}}{{n}^{2}}\overline{{x}_{c}}\cdot \overline{{z}_{c}}\cdot \overline{{x}_{{c}^{\prime}}}\cdot \overline{{z}_{{c}^{\prime}}}.$$ 
By definition of ${R}_{c}$, we obtain $\left(C\right)={\sum}_{c,{c}^{\prime},c\ne {c}^{\prime}}{R}_{c}{R}_{{c}^{\prime}}$.
We use these results on the individual terms to finish the calculation of the variance.
$\mathrm{Var}\left({w}_{n}\right)$  $=\left(A\right)+\left(B\right)+\left(C\right){\left({\displaystyle \sum _{c}}{R}_{c}\right)}^{2}$  
$=O\left(1/n\right)+{\displaystyle \sum _{c}}{R}_{c}^{2}+{\displaystyle \sum _{c}}{R}_{c}^{2}O\left(1/{n}_{c}\right)+{\displaystyle \sum _{c}}O\left({n}_{c}/{n}^{2}\right)+{\displaystyle \sum _{c,{c}^{\prime},c\ne {c}^{\prime}}}{R}_{c}{R}_{{c}^{\prime}}{\left({\displaystyle \sum _{c}}{R}_{c}\right)}^{2}$  
$=O\left(1/n\right)+{\displaystyle \sum _{c}}{R}_{c}^{2}O\left(1/{n}_{c}\right)+{\displaystyle \sum _{c}}O\left({n}_{c}/{n}^{2}\right)$  
$=O\left(1/n\right)+{\displaystyle \sum _{c}}O\left({n}_{c}^{2}/{n}^{2}\right)O\left(1/{n}_{c}\right)+{\displaystyle \sum _{c}}O\left({n}_{c}/{n}^{2}\right)=O\left(1/n\right)+{\displaystyle \frac{1}{n}}O\left({\displaystyle \sum _{c}}{n}_{c}/n\right)=O\left(1/n\right)\to 0.$ 
When the cluster covariates are centered, so that $\overline{{x}_{c}}=0$ for every column $x$, it follows that ${w}_{n}\stackrel{p}{\to}\mathrm{\hspace{0.25em}0}$ and so $\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}0$. This implies that the test is asymptotically valid by Theorem 1. Regarding the homogeneity condition, this is equivalent to:
$$\frac{1}{{n}_{c}}{X}_{c}^{\top}{X}_{c}\to {m}_{c}\mathrm{\Sigma}.$$ 
When this condition holds and when ${\mathrm{min}}_{c}{n}_{c}\to \mathrm{\infty}$, then we have essentially independent permutation tests within clusters, and Theorem 3 thus can be applied within each cluster. ∎
Theorem 6.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. If ${J}_{n}\mathrm{=}J$ is fixed and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$, then the residual randomization test using ${\mathrm{G}}^{\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds. If ${J}_{n}\mathrm{\to}\mathrm{\infty}$, then the test is valid also when ${\mathrm{\sum}}_{c\mathrm{=}\mathrm{1}}^{{J}_{n}}{n}_{c}^{\mathrm{2}}\mathrm{/}{n}^{\mathrm{2}}\mathrm{\to}\mathrm{0}$.
Proof.
Here, $G\sim \mathrm{Unif}\left\{{\mathcal{G}}^{\U0001d5cc}\left({\mathrm{C}}^{n}\right)\right\}$ is equivalent to $G={\sum}_{c=1}^{{J}_{n}}{S}_{c}{\sum}_{i\in \left[c\right]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}$, where ${S}_{c}$ are independent random cluster signs. As in the preceding proofs, let $x,z$ be any two columns of $X$, and let us focus on the quantity ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz$. By definition,
${x}^{\prime}Gz$  $={x}^{\prime}\left({\displaystyle \sum _{c}}{S}_{c}{\displaystyle \sum _{i\in \left[c\right]}}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{i}^{\prime}\right)z={\displaystyle \sum _{c}}{S}_{c}{\displaystyle \sum _{i\in \left[c\right]}}\left({x}^{\prime}{\mathrm{\U0001d7d9}}_{i}\right)\left({\mathrm{\U0001d7d9}}_{i}^{\prime}z\right)={\displaystyle \sum _{c}}{S}_{c}\underset{{r}_{c}}{\underset{\u23df}{{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{i}}}\triangleq {\displaystyle \sum _{c}}{S}_{c}{r}_{c}.$ 
It holds that ${S}_{c}{r}_{c}\text{\u22a7}{S}_{{c}^{\prime}}{r}_{{c}^{\prime}}$ for $c\ne {c}^{\prime}$ since ${S}_{c},{S}_{{c}^{\prime}}$ are independent. Thus,
$\mathrm{E}\left({w}_{n}\right)={\displaystyle \frac{1}{n}}{x}^{\prime}\mathrm{E}\left(G\right)z$  $={\displaystyle \frac{1}{n}}\mathrm{E}\left({\displaystyle \sum _{c}}{S}_{c}{r}_{c}\right)$  
$={\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}\mathrm{E}\left({S}_{c}\right){r}_{c}$  
$=0\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{Since}}{S}_{c}\mathit{\text{is random sign}}\right].$  (29) 
Next, we derive the variance.
$\mathrm{Var}\left({w}_{n}\right)=\mathrm{E}\left({w}_{n}^{2}\right)$  $={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{c}}{S}_{c}{r}_{c}\right)}^{2}\right\}={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}\mathrm{E}\left({S}_{c}^{2}{r}_{c}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c\ne {c}^{\prime}}}\mathrm{E}\left({S}_{c}{r}_{c}{S}_{{c}^{\prime}}{r}_{{c}^{\prime}}\right)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}\underset{=1}{\underset{\u23df}{\mathrm{E}\left({S}_{c}^{2}\right)}}{r}_{c}^{2}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c\ne {c}^{\prime}}}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{c}{r}_{c}\right)}}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{{c}^{\prime}}{r}_{{c}^{\prime}}\right)}}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}{S}_{c},{S}_{{c}^{\prime}}\mathit{\text{are iid random signs.}}\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{r}_{c}^{2}={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}{z}_{i}^{2}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i,j\in \left[c\right],i\ne j}}{x}_{i}{x}_{j}{z}_{i}{z}_{j}$  
$=\underset{O\left(1/n\right),\text{proof of Theorem}\text{4}}{\underset{\u23df}{{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}{z}_{i}^{2}}}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{i}{\displaystyle \sum _{j\in \left[c\right],j\ne i}}{x}_{j}{z}_{j}$  
$=O\left(1/n\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{i}\left({n}_{c}\overline{{\mathrm{xz}}_{c}}{x}_{i}{z}_{i}\right)\mathit{\hspace{1em}\hspace{1em}\hspace{0.25em}}\left[\mathit{\text{where}}\overline{{\mathrm{xz}}_{c}}=\left(1/{n}_{c}\right){\sum}_{i\in \left[c\right]}{x}_{i}{z}_{i}\right].$  
$=O\left(1/n\right)+{\displaystyle \sum _{c}}\left({n}_{c}^{2}/{n}^{2}\right){\left(\overline{{\mathrm{xz}}_{\mathrm{c}}}\right)}^{2}{\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}\left({n}_{c}/n\right)\overline{{\mathrm{xz}}_{c}^{2}}\mathit{\hspace{1em}\hspace{1em}\hspace{0.25em}}\left[\mathit{\text{where}}\overline{{\mathrm{xz}}_{c}^{2}}=\left(1/{n}_{c}\right){\sum}_{i\in \left[c\right]}{x}_{i}^{2}{z}_{i}^{2}\right].$  
$={\displaystyle \sum _{c}}\left({n}_{c}^{2}/{n}^{2}\right){\left(\overline{{\mathrm{xz}}_{\mathrm{c}}}\right)}^{2}+O\left(1/n\right).$  (30) 
When ${\sum}_{c}{n}_{c}^{2}/{n}^{2}\to 0$ then ${w}_{n}\stackrel{p}{\to}0$. This implies $\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}0$ and so the test is valid from Theorem 1.
When ${J}_{n}=J$ is fixed and ${\mathrm{min}}_{c}{n}_{c}\to \mathrm{\infty}$, and the homogeneity condition holds, then we can adapt the proof of Theorem 2; the only difference is that here ${t}_{n}\left({G}_{r}\widehat{{\epsilon}^{\mathrm{o}}}\right){t}_{n}\left({G}_{r}\epsilon \right)={o}_{P}\left(1\right)$ under the homogeneity condition, whereas the difference is exactly zero in Theorem 2. ∎
Theorem 7.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}\mathrm{+}\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters. If ${J}_{n}\mathrm{=}J$ is fixed and ${\mathrm{min}}_{c}\mathit{}{n}_{c}\mathrm{\to}\mathrm{\infty}$, then the residual randomization test using ${\mathrm{G}}^{\mathrm{p}\mathrm{+}\mathrm{s}}\mathit{}\mathrm{(}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid when the homogeneity condition of Equation (15) holds, or when the clusters are centered. If ${J}_{n}\mathrm{\to}\mathrm{\infty}$, then the test is valid also when ${\mathrm{\sum}}_{c}{n}_{c}^{\mathrm{2}}\mathrm{/}{n}^{\mathrm{2}}\mathrm{\to}\mathrm{0}$.
Proof.
Here, $G\sim \mathrm{Unif}\left\{{\mathcal{G}}^{\U0001d5c9+\U0001d5cc}\left({\mathrm{C}}^{n}\right)\right\}$ is equivalent to $G={\sum}_{c=1}^{{J}_{n}}{S}_{c}{\sum}_{i\in \left[c\right]}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi}{\left(i\right)}^{\prime}$, where ${S}_{c}$ are independent random cluster signs, and $\pi $ is a random permutation within the clusters defined by ${\mathrm{C}}^{n}$. As in the preceding proofs, let $x,z$ be any two columns of $X$, and let us focus on the quantity ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz\in \mathbb{R}$. By definition,
${x}^{\prime}Gz$  $={x}^{\prime}\left({\displaystyle \sum _{c}}{S}_{c}{\displaystyle \sum _{i\in \left[c\right]}}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}\right)z={\displaystyle \sum _{c}}{S}_{c}{\displaystyle \sum _{i\in \left[c\right]}}\left({x}^{\prime}{\mathrm{\U0001d7d9}}_{i}\right)\left({\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\top}z\right)={\displaystyle \sum _{c}}{S}_{c}\underset{{r}_{c}}{\underset{\u23df}{{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}{z}_{\pi \left(i\right)}}}\triangleq {\displaystyle \sum _{c}}{S}_{c}{r}_{c}.$  (31) 
We have ${S}_{c}{r}_{c}\text{\u22a7}{S}_{{c}^{\prime}}{r}_{{c}^{\prime}}$ for $c\ne {c}^{\prime}$ since ${S}_{c},{S}_{{c}^{\prime}}$ are independent. Thus,
$\mathrm{E}\left({w}_{n}\right)={\displaystyle \frac{1}{n}}{x}^{\prime}\mathrm{E}\left(G\right)z$  $={\displaystyle \frac{1}{n}}\mathrm{E}\left({\displaystyle \sum _{c}}{S}_{c}{r}_{c}\right)={\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}\mathrm{E}\left({S}_{c}\right){r}_{c}=0.\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{Since}}{S}_{c}\mathit{\text{is random sign}}\right].$ 
Next, we derive the variance.
$\mathrm{Var}\left({w}_{n}\right)=\mathrm{E}\left({w}_{n}^{2}\right)$  $={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{c}}{S}_{c}{r}_{c}\right)}^{2}\right\}={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}\mathrm{E}\left({S}_{c}^{2}{r}_{c}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c\ne {c}^{\prime}}}\mathrm{E}\left({S}_{c}{R}_{c}{S}_{{c}^{\prime}}{R}_{{c}^{\prime}}\right)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}\underset{=1}{\underset{\u23df}{\mathrm{E}\left({S}_{c}^{2}\right)}}{r}_{c}^{2}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c\ne {c}^{\prime}}}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{c}{r}_{c}\right)}}\underset{=0}{\underset{\u23df}{\mathrm{E}\left({S}_{{c}^{\prime}}{r}_{{c}^{\prime}}\right)}}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{since}}{S}_{c},{S}_{{c}^{\prime}}\mathit{\text{are iid random signs.}}\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{r}_{c}^{2}={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}\mathrm{E}\left({\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}{z}_{\pi \left(i\right)}^{2}+{\displaystyle \sum _{i,j\in \left[c\right],j\ne i}}{x}_{i}{x}_{j}{z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i,j\in \left[c\right],j\ne i}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i\in \left[c\right]}}{x}_{i}^{2}\overline{{z}_{c}^{2}}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{c}}{\displaystyle \sum _{i,j\in \left[c\right],j\ne i}}{x}_{i}{x}_{j}{\displaystyle \frac{1}{{n}_{c}1}}\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{24}}\mathit{\text{)}}\right]$  
$=\underset{=O\left(1/n\right)}{\underset{\u23df}{{\displaystyle \frac{1}{n}}{\displaystyle \sum _{c}}{\displaystyle \frac{{n}_{c}}{n}}\overline{{x}_{c}^{2}}\cdot \overline{{z}_{c}^{2}}}}+{\displaystyle \sum _{c}}{\displaystyle \frac{\left({n}_{c}{\overline{{z}_{c}}}^{2}\overline{{z}_{c}^{2}}\right)}{{n}^{2}\left({n}_{c}1\right)}}{n}_{c}\left({n}_{c}{\overline{{x}_{c}}}^{2}\overline{{x}_{c}^{2}}\right)\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{B.2}}\mathit{\text{)}}\right]$  
$=O(1/n)+{\displaystyle \sum _{c}}({n}_{c}^{2}/{n}^{2}){\left(\overline{{x}_{c}}\right)}^{2}{\left(\overline{{z}_{c}}\right)}^{2}.\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{see last steps of Theorem}}\mathit{\text{5}}\right]$ 
As before, when ${\sum}_{c}{n}_{c}^{2}/{n}^{2}\to 0$ then ${w}_{n}\stackrel{p}{\to}0$. This implies $\frac{1}{n}{X}^{\top}GX\stackrel{p}{\to}0$ and so the test is valid from Theorem 1. The same holds when the clusters are centered, so that $\overline{{x}_{c}}=0$ for every column of $X$.
When ${J}_{n}=J$ is fixed and ${\mathrm{min}}_{c}{n}_{c}\to \mathrm{\infty}$, then conditional on the cluster signs we have a permutation test within each cluster. Hence, Theorem 3 can be applied to show that the test is valid conditionally on the signs, and thus it is valid unconditionally as well. ∎
Theorem 8.
Suppose that Assumptions 1(a)(c) hold for the regression model of Equation (2), and $\epsilon \stackrel{d}{\mathrm{=}}\mathrm{g}\mathit{}\epsilon $, for all $\mathrm{g}\mathrm{\in}{\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}\mathrm{)}$. Suppose also that Assumptions 1(a)(b) hold within clusters with respect to both ${\mathrm{R}}^{n}$ and ${\mathrm{C}}^{n}$. When the number of clusters of both ${\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}$ grows indefinitely, the residual randomization test using ${\mathrm{G}}^{\mathrm{p}}\mathit{}\mathrm{(}{\mathrm{R}}^{n}\mathrm{,}{\mathrm{C}}^{n}\mathrm{)}$ as the primitive is asymptotically valid. When only the row (or only the column) cluster grows indefinitely, the same randomization test is valid if the covariates have been centered columnwise (or rowwise).
Proof.
Let $G$ be a random twoway permutation matrix.
We can write this as
$G={\sum}_{i}{\mathrm{\U0001d7d9}}_{i}{\mathrm{\U0001d7d9}}_{\pi \left(i\right)}^{\prime}$, where the twoway structure is encoded in $\pi $ according to
Equation (18).
As in the main text let $r\left(i\right)\in \{1,\mathrm{\dots},m\}$ denote the row of data point $i$, and
$c\left(i\right)\in \{1,\mathrm{\dots},m\}$ the column of $i$.
Case 1: Row/column clusterings’ size $\to \mathrm{\infty}$. In this case, let $m$ be the size of clusterings ${\mathrm{R}}^{n},{\mathrm{C}}^{n}$, so that $n={m}^{2}$. We assume same size for the two clusterings to simplify the analysis. The proof still goes through even when the sizes are different, and the test only converges when both sizes grow indefinitely.
For some row $r$ (or column $c$) we also define:
${\overline{x}}_{r}={\displaystyle \frac{1}{m}}{\displaystyle \sum _{i,r\left(i\right)=r}}{x}_{i},\text{and}{\overline{x}}_{c}={\displaystyle \frac{1}{m}}{\displaystyle \sum _{i,c\left(i\right)=c}}{x}_{i},$  
${\overline{{x}^{2}}}_{r}={\displaystyle \frac{1}{m}}{\displaystyle \sum _{i,r\left(i\right)=r}}{x}_{i}^{2},\text{and}{\overline{{x}^{2}}}_{c}={\displaystyle \frac{1}{m}}{\displaystyle \sum _{i,c\left(i\right)=c}}{x}_{i}^{2},$ 
Despite the twoway structure, $\pi $ maps any datapoint $i$ to any other datapoint in $\{1,\mathrm{\dots},n\}$ with equal marginal probability. We obtain the properties:
$\mathrm{E}\left({x}_{\pi \left(i\right)}\right)$  $={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}=\overline{x}.$  
$\mathrm{E}\left({x}_{\pi \left(i\right)}^{2}\right)$  $={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}=\overline{{x}^{2}}.$  
$\mathrm{E}\left({x}_{\pi \left(i\right)}{\overline{x}}_{r\left(\pi \left(i\right)\right)}\right)$  $=\mathrm{E}\left({\overline{x}}_{r}\mathrm{E}({x}_{\pi \left(i\right)}\mid r\left(\pi \left(i\right)\right)=r)\right)=\mathrm{E}\left({\left({\overline{x}}_{r}\right)}^{2}\right)={\displaystyle \frac{1}{m}}{\displaystyle \sum _{r=1}^{m}}{\left({\overline{x}}_{r}\right)}^{2}\triangleq {\tau}_{x,R}^{2}.$  
$\mathrm{E}\left({x}_{\pi \left(i\right)}{\overline{x}}_{c\left(\pi \left(i\right)\right)}\right)$  $=\mathrm{E}\left({\overline{x}}_{c}\mathrm{E}({x}_{\pi \left(i\right)}\mid c\left(\pi \left(i\right)\right)=c)\right)=\mathrm{E}\left({\left({\overline{x}}_{c}\right)}^{2}\right)={\displaystyle \frac{1}{m}}{\displaystyle \sum _{c=1}^{m}}{\left({\overline{x}}_{c}\right)}^{2}\triangleq {\tau}_{x,C}^{2}.$  (32) 
The quantity $\mathrm{E}({x}_{\pi \left(j\right)}\mid \pi \left(i\right),j\ne i)$ will also be useful, and is complicated by the structure of the twoway clustering. For example, if $j$ and $i$ are in the same row, so that $r\left(j\right)=r\left(i\right)$, then we know that $j$ will be moved by $\pi $ to the same row as $i$; i.e., $r\left(\pi \left(j\right)\right)=r\left(\pi \left(i\right)\right)$. In this case, the expected value of ${x}_{\pi \left(j\right)}$ is the average of row $r\left(\pi \left(i\right)\right)$ except for where $i$ is mapped to; i.e., the expected value is $\frac{1}{m1}\left(m{\overline{x}}_{r\left(\pi \left(i\right)\right)}{x}_{\pi \left(i\right)}\right)$. With similar calculations for the other two cases where $j$ and $i$ share the same column or share no row or column, we get:
$\mathrm{E}({x}_{\pi \left(j\right)}\mid \pi \left(i\right),j\ne i)$  $=\{\begin{array}{cc}\frac{1}{m1}\left(m{\overline{x}}_{r\left(\pi \left(i\right)\right)}{x}_{\pi \left(i\right)}\right),\text{if}r\left(j\right)=r\left(i\right);\hfill & \\ \frac{1}{m1}\left(m{\overline{x}}_{c\left(\pi \left(i\right)\right)}{x}_{\pi \left(i\right)}\right),\text{if}c\left(j\right)=c\left(i\right);\hfill & \\ \frac{1}{n2m+1}\left(n\overline{x}m{\overline{x}}_{r\left(\pi \left(i\right)\right)}m{\overline{x}}_{c\left(\pi \left(i\right)\right)}+{x}_{\pi \left(i\right)}\right),\text{otherwise}.\hfill & \end{array}$  (33) 
From this, we obtain
$\mathrm{E}({x}_{\pi \left(i\right)}{x}_{\pi \left(j\right)}\mid j\ne i)=\mathrm{E}\left({x}_{\pi \left(i\right)}\mathrm{E}({x}_{\pi \left(j\right)}\mid \pi \left(i\right),j\ne i)\right)=\{\begin{array}{cc}\frac{1}{m1}\left(m{\tau}_{x,R}^{2}\overline{{x}^{2}}\right),\text{if}r\left(j\right)=r\left(i\right);\hfill & \\ \frac{1}{m1}\left(m{\tau}_{x,C}^{2}\overline{{x}^{2}}\right),\text{if}c\left(j\right)=c\left(i\right);\hfill & \\ \frac{1}{n2m+1}\left[n{\left(\overline{x}\right)}^{2}m{\tau}_{x,R}^{2}m{\tau}_{x,C}^{2}+\overline{{x}^{2}}\right],\text{otherwise}.\hfill & \end{array}$  (34) 
Let $x,z$ be any two columns of $X$, and define ${w}_{n}=\frac{1}{n}{x}^{\prime}Gz\in \mathbb{R}$. We derive the mean of this variable:
$\mathrm{E}\left({w}_{n}\right)$  $=\mathrm{E}\left({\displaystyle \frac{1}{n}}{x}^{\prime}Gz\right)={\displaystyle \frac{1}{n}}\mathrm{E}\left({\displaystyle \sum _{i=1}^{n}}{x}_{i}{z}_{\pi \left(i\right)}\right)={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}\mathrm{E}\left({z}_{\pi \left(i\right)}\right)={\displaystyle \frac{1}{n}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}\overline{z}=\overline{x}\cdot \overline{z},$  (35) 
which follows from Equations (B.2). From this we conclude that
$$\mathrm{E}\left(\frac{1}{n}{X}^{\top}GX\right)=\mu {\mu}^{\prime},\mu ={X}^{\top}\mathrm{\U0001d7d9}/n.$$ 
This is the same result as in the nonclustered permutation test in Theorem 3.
Next, we derive the variance.
$\mathrm{Var}\left({w}_{n}\right)=\mathrm{E}\left({w}_{n}^{2}\right){\overline{x}}^{2}{\overline{z}}^{2}$  $={\displaystyle \frac{1}{{n}^{2}}}\mathrm{E}\left\{{\left({\displaystyle \sum _{i=1}^{n}}{x}_{i}{z}_{\pi \left(i\right)}\right)}^{2}\right\}{\overline{x}}^{2}{\overline{z}}^{2}\mathit{\hspace{1em}\hspace{1em}}\left[\mathit{\text{from Equation(}}\mathit{\text{35}}\mathit{\text{)}}\right]$  
$=\underset{\left(A\right)}{\underset{\u23df}{{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)}}+\underset{\left(B\right)}{\underset{\u23df}{{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i\ne j}}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)}}{\overline{x}}^{2}{\overline{z}}^{2}.$  (36) 
We analyze the terms individually. For the first one,
$\left(A\right)={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}^{2}\overline{{z}^{2}}={\displaystyle \frac{1}{n}}\overline{{x}^{2}}\overline{{z}^{2}}.$  (37) 
For the second term,
$\left(B\right)$  $={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}{\displaystyle \sum _{j=1,j\ne i}^{n}}{x}_{j}\mathrm{E}({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\mid j\ne i)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}\left[{\displaystyle \sum _{j\ne i}}\left({\displaystyle \sum _{r\left(j\right)=r\left(i\right)}}{x}_{j}{\displaystyle \frac{m{\tau}_{z,R}^{2}\overline{{z}^{2}}}{m1}}+{\displaystyle \sum _{c\left(j\right)=c\left(i\right)}}{x}_{j}{\displaystyle \frac{m{\tau}_{z,C}^{2}\overline{{z}^{2}}}{m1}}+{\displaystyle \sum _{\text{other}}}{x}_{j}{\displaystyle \frac{n{\overline{z}}^{2}m{\tau}_{z,R}^{2}m{\tau}_{z,C}^{2}+\overline{{z}^{2}}}{n2m+1}}\right)\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}[{\displaystyle \frac{\left(m{\tau}_{z,R}^{2}\overline{{z}^{2}}\right)\left(m{\overline{x}}_{r\left(i\right)}{x}_{i}\right)}{m1}}+{\displaystyle \frac{\left(m{\tau}_{z,C}^{2}\overline{{z}^{2}}\right)\left(m{\overline{x}}_{c\left(i\right)}{x}_{i}\right)}{m1}}$  
${v}{e}{r}{y}{l}{o}{n}{g}{e}{q}{s}{.}+{\displaystyle \frac{\left[n{\overline{z}}^{2}m{\tau}_{z,R}^{2}m{\tau}_{z,C}^{2}+\overline{{z}^{2}}\right]\left(n\overline{x}m{\overline{x}}_{r\left(i\right)}m{\overline{x}}_{c\left(i\right)}+{x}_{i}\right)}{n2m+1}}].$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left(m{\tau}_{z,R}^{2}\overline{{z}^{2}}\right)\left({m}^{3}{\tau}_{x,R}^{2}n\overline{{x}^{2}}\right)}{m1}}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left(m{\tau}_{z,C}^{2}\overline{{z}^{2}}\right)\left({m}^{3}{\tau}_{x,C}^{2}n\overline{{x}^{2}}\right)}{m1}}$  
${v}{e}{r}{y}{l}{o}{n}{g}{e}{q}{s}{.}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left[n{\overline{z}}^{2}m{\tau}_{z,R}^{2}m{\tau}_{z,C}^{2}+\overline{{z}^{2}}\right]\left({n}^{2}{\overline{x}}^{2}{m}^{3}{\tau}_{x,R}^{2}{m}^{3}{\tau}_{x,C}^{2}+n\overline{{x}^{2}}\right)}{n2m+1}}].$  (38) 
where we used ${\sum}_{i}{x}_{i}{\overline{x}}_{r\left(i\right)}={m}^{2}{\tau}_{x,R}^{2}$ and ${\sum}_{i}{x}_{i}{\overline{x}}_{c\left(i\right)}={m}^{2}{\tau}_{x,C}^{2}$. To simplify, for some column $u$ define ${\sigma}_{u,R}^{2}={\tau}_{u,R}^{2}\left(1/m\right)\overline{{u}^{2}}$ and ${\sigma}_{u}^{2}={\overline{u}}^{2}\left(1/m\right){\tau}_{u,R}^{2}\left(1/m\right){\tau}_{u,C}^{2}+\left(1/{m}^{2}\right)\overline{{u}^{2}}$. Because $n={m}^{2}$ we obtain
$\left(B\right)$  $={\displaystyle \frac{1}{m1}}\left({\sigma}_{x,R}^{2}{\sigma}_{z,R}^{2}+{\sigma}_{z,C}^{2}{\sigma}_{z,C}^{2}\right)+{\displaystyle \frac{n}{n2m+1}}{\sigma}_{x}^{2}{\sigma}_{z}^{2}.$ 
We gather the terms (A) and (B) to obtain:
$$\mathrm{Var}\left({w}_{n}\right)=\frac{1}{n}\overline{{x}^{2}}\overline{{z}^{2}}+\frac{1}{m1}\left({\sigma}_{x,R}^{2}{\sigma}_{z,R}^{2}+{\sigma}_{z,C}^{2}{\sigma}_{z,C}^{2}\right)+\frac{n}{n2m+1}{\sigma}_{x}^{2}{\sigma}_{z}^{2}{\overline{x}}^{2}{\overline{z}}^{2}=O\left(1/m\right),$$ 
following from the boundedness of first and second moments within rows and columns.
If $m\to \mathrm{\infty}$, then ${w}_{n}\overline{x}\cdot \overline{z}\stackrel{p}{\to}0$, as in Theorem 3.
The test is thus valid asymptotically by Theorem 1.
Case 2: Size of only one clustering (either row or column) $\to \mathrm{\infty}$. Here, we consider the case where the size of only one clustering grows indefinitely, whereas the size of other is fixed. Without loss of generality suppose that the column clustering diverges. Define the respective sizes as $$ and $\left{\mathrm{C}}^{n}\right={m}_{C}\to \mathrm{\infty}$, such that $n={m}_{R}{m}_{C}$.
With the new definitions, we have:
${\overline{x}}_{r}={\displaystyle \frac{1}{{m}_{C}}}{\displaystyle \sum _{i,r\left(i\right)=r}}{x}_{i},\text{and}{\overline{x}}_{c}={\displaystyle \frac{1}{{m}_{R}}}{\displaystyle \sum _{i,c\left(i\right)=c}}{x}_{i},$  
${\overline{{x}^{2}}}_{r}={\displaystyle \frac{1}{{m}_{C}}}{\displaystyle \sum _{i,r\left(i\right)=r}}{x}_{i}^{2},\text{and}{\overline{{x}^{2}}}_{c}={\displaystyle \frac{1}{{m}_{R}}}{\displaystyle \sum _{i,c\left(i\right)=c}}{x}_{i}^{2},$  
$\mathrm{E}\left({x}_{\pi \left(i\right)}{\overline{x}}_{r\left(\pi \left(i\right)\right)}\right)={\displaystyle \frac{1}{{m}_{R}}}{\displaystyle \sum _{r=1}^{{m}_{R}}}{\left({\overline{x}}_{r}\right)}^{2}\triangleq {\tau}_{x,R}^{2}.$  
$\mathrm{E}\left({x}_{\pi \left(i\right)}{\overline{x}}_{c\left(\pi \left(i\right)\right)}\right)={\displaystyle \frac{1}{{m}_{C}}}{\displaystyle \sum _{c=1}^{{m}_{C}}}{\left({\overline{x}}_{c}\right)}^{2}\triangleq {\tau}_{x,C}^{2}.$  (39) 
$\mathrm{E}({x}_{\pi \left(j\right)}\mid \pi \left(i\right),j\ne i)$  $=\{\begin{array}{cc}\frac{1}{{m}_{C}1}\left({m}_{C}{\overline{x}}_{r\left(\pi \left(i\right)\right)}{x}_{\pi \left(i\right)}\right),\text{if}r\left(j\right)=r\left(i\right);\hfill & \\ \frac{1}{{m}_{R}1}\left({m}_{R}{\overline{x}}_{c\left(\pi \left(i\right)\right)}{x}_{\pi \left(i\right)}\right),\text{if}c\left(j\right)=c\left(i\right);\hfill & \\ \frac{1}{n{m}_{R}{m}_{C}+1}\left(n\overline{x}{m}_{C}{\overline{x}}_{r\left(\pi \left(i\right)\right)}{m}_{R}{\overline{x}}_{c\left(\pi \left(i\right)\right)}+{x}_{\pi \left(i\right)}\right),\text{otherwise}.\hfill & \end{array}$  (40) 
From this, we obtain
$\mathrm{E}({x}_{\pi \left(i\right)}{x}_{\pi \left(j\right)}\mid j\ne i)=\mathrm{E}\left({x}_{\pi \left(i\right)}\mathrm{E}({x}_{\pi \left(j\right)}\mid \pi \left(i\right),j\ne i)\right)=\{\begin{array}{cc}\frac{1}{{m}_{C}1}\left({m}_{C}{\tau}_{x,R}^{2}\overline{{x}^{2}}\right),\text{if}r\left(j\right)=r\left(i\right);\hfill & \\ \frac{1}{{m}_{R}1}\left({m}_{C}{\tau}_{x,C}^{2}\overline{{x}^{2}}\right),\text{if}c\left(j\right)=c\left(i\right);\hfill & \\ \frac{1}{n{m}_{R}{m}_{C}+1}\left[n{\left(\overline{x}\right)}^{2}{m}_{C}{\tau}_{x,R}^{2}{m}_{R}{\tau}_{x,C}^{2}+\overline{{x}^{2}}\right],\text{otherwise}.\hfill & \end{array}$  (41) 
As before, $\mathrm{E}\left({w}_{n}\right)=\overline{x}\overline{z}.$ For the variance:
$$\mathrm{Var}\left({w}_{n}\right)=\underset{\left(A\right)}{\underset{\u23df}{\frac{1}{{n}^{2}}\sum _{i=1}^{n}{x}_{i}^{2}\mathrm{E}\left({z}_{\pi \left(i\right)}^{2}\right)}}+\underset{\left(B\right)}{\underset{\u23df}{\frac{1}{{n}^{2}}\sum _{i\ne j}{x}_{i}{x}_{j}\mathrm{E}\left({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\right)}}{\overline{x}}^{2}{\overline{z}}^{2}.$$ 
We analyze the terms individually. For the first one,
$$\left(A\right)=\frac{1}{{n}^{2}}\sum _{i=1}^{n}{x}_{i}^{2}\overline{{z}^{2}}=\frac{1}{n}\overline{{x}^{2}}\overline{{z}^{2}}.$$ 
For the second term,
$\left(B\right)$  $={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}{\displaystyle \sum _{j=1,j\ne i}^{n}}{x}_{j}\mathrm{E}({z}_{\pi \left(i\right)}{z}_{\pi \left(j\right)}\mid j\ne i)$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}\left[{\displaystyle \sum _{j\ne i}}\left({\displaystyle \sum _{r\left(j\right)=r\left(i\right)}}{x}_{j}{\displaystyle \frac{{m}_{C}{\tau}_{z,R}^{2}\overline{{z}^{2}}}{{m}_{C}1}}+{\displaystyle \sum _{c\left(j\right)=c\left(i\right)}}{x}_{j}{\displaystyle \frac{{m}_{R}{\tau}_{z,C}^{2}\overline{{z}^{2}}}{{m}_{R}1}}+{\displaystyle \sum _{\text{other}}}{x}_{j}{\displaystyle \frac{n{\overline{z}}^{2}{m}_{C}{\tau}_{z,R}^{2}{m}_{R}{\tau}_{z,C}^{2}+\overline{{z}^{2}}}{n{m}_{R}{m}_{C}+1}}\right)\right]$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \sum _{i=1}^{n}}{x}_{i}[{\displaystyle \frac{\left({m}_{C}{\tau}_{z,R}^{2}\overline{{z}^{2}}\right)\left({m}_{C}{\overline{x}}_{r\left(i\right)}{x}_{i}\right)}{{m}_{C}1}}+{\displaystyle \frac{\left({m}_{R}{\tau}_{z,C}^{2}\overline{{z}^{2}}\right)\left({m}_{R}{\overline{x}}_{c\left(i\right)}{x}_{i}\right)}{{m}_{R}1}}$  
${v}{e}{r}{y}{l}{o}{n}{g}{e}{q}{s}{.}+{\displaystyle \frac{\left[n{\overline{z}}^{2}{m}_{C}{\tau}_{z,R}^{2}{m}_{R}{\tau}_{z,C}^{2}+\overline{{z}^{2}}\right]\left(n\overline{x}{m}_{C}{\overline{x}}_{r\left(i\right)}{m}_{R}{\overline{x}}_{c\left(i\right)}+{x}_{i}\right)}{n{m}_{R}{m}_{C}+1}}].$  
$={\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left({m}_{C}{\tau}_{z,R}^{2}\overline{{z}^{2}}\right)\left(n{m}_{C}{\tau}_{x,R}^{2}n\overline{{x}^{2}}\right)}{{m}_{C}1}}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left({m}_{R}{\tau}_{z,C}^{2}\overline{{z}^{2}}\right)\left(n{m}_{R}{\tau}_{x,C}^{2}n\overline{{x}^{2}}\right)}{{m}_{R}1}}$  
${v}{e}{r}{y}{l}{o}{n}{g}{e}{q}{s}{.}+{\displaystyle \frac{1}{{n}^{2}}}{\displaystyle \frac{\left[n{\overline{z}}^{2}{m}_{C}{\tau}_{z,R}^{2}{m}_{R}{\tau}_{z,C}^{2}+\overline{{z}^{2}}\right]\left({n}^{2}{\overline{x}}^{2}n{m}_{C}{\tau}_{x,R}^{2}n{m}_{R}{\tau}_{x,C}^{2}+n\overline{{x}^{2}}\right)}{n{m}_{R}{m}_{C}+1}}].$  (42) 
where we used ${\sum}_{i}{x}_{i}{\overline{x}}_{r\left(i\right)}={m}_{C}{m}_{R}{\tau}_{x,R}^{2}=n{\tau}_{x,R}^{2}$; and, similarly, ${\sum}_{i}{x}_{i}{\overline{x}}_{c\left(i\right)}=n{\tau}_{x,C}^{2}$. To simplify, for some column $u$ define ${\sigma}_{u,R}^{2}={\tau}_{u,R}^{2}\left(1/{m}_{C}\right)\overline{{u}^{2}}$ and ${\sigma}_{u}^{2}={\overline{u}}^{2}\left({m}_{C}/n\right){\tau}_{u,R}^{2}\left({m}_{R}/n\right){\tau}_{u,C}^{2}+\left(1/n\right)\overline{{u}^{2}}$.
For these definitions, it follows
$\left(B\right)$  $={\displaystyle \frac{{m}_{C}^{2}}{n\left({m}_{C}1\right)}}{\sigma}_{x,R}^{2}{\sigma}_{z,R}^{2}+{\displaystyle \frac{{m}_{R}^{2}}{n\left({m}_{R}1\right)}}{\sigma}_{z,C}^{2}{\sigma}_{z,C}^{2}+{\displaystyle \frac{n}{n{m}_{R}{m}_{C}+1}}{\sigma}_{x}^{2}{\sigma}_{z}^{2}.$ 
We gather the terms (A) and (B). Since ${m}_{R}=O\left(1\right)$ and ${m}_{C}=n/{m}_{R}\to \mathrm{\infty}$, we finally get: