Data fission: splitting a single data point

Abstract

Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offer an alternative route to accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
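
To make the additive-Gaussian-noise idea concrete, the following is a minimal sketch of one such split, under assumptions the abstract does not spell out: $X \sim N(\mu, \sigma^2)$ with $\sigma$ known, external noise $Z \sim N(0, \sigma^2)$ drawn independently of $X$, and the particular choice $f(X) = X + \tau Z$, $g(X) = X - Z/\tau$ for a tuning constant $\tau > 0$. Under these assumptions $\mathrm{Cov}(f(X), g(X)) = \sigma^2 - \sigma^2 = 0$, so the two jointly Gaussian parts are independent, and $X = (f(X) + \tau^2 g(X))/(1+\tau^2)$ is recovered from both together. The function names `gaussian_fission` and `reconstruct` are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_fission(x, sigma, tau=1.0, rng=None):
    """Split Gaussian data x into two parts using external Gaussian noise.

    Assumes x ~ N(mu, sigma^2) with sigma known (an assumption of this sketch).
    Returns f = x + tau * z and g = x - z / tau, where z ~ N(0, sigma^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    z = rng.normal(0.0, sigma, size=x.shape)  # noise drawn independently of x
    f = x + tau * z
    g = x - z / tau
    return f, g


def reconstruct(f, g, tau=1.0):
    """Recover x exactly from both parts: f + tau^2 * g = (1 + tau^2) * x."""
    return (np.asarray(f) + tau**2 * np.asarray(g)) / (1.0 + tau**2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(2.0, 1.5, size=100_000)
    f, g = gaussian_fission(x, sigma=1.5, tau=0.7, rng=rng)
    # The sample correlation should be close to zero, and the
    # reconstruction should match x up to floating-point error.
    print("corr(f, g):", np.corrcoef(f, g)[0, 1])
    print("max reconstruction error:", np.max(np.abs(reconstruct(f, g, 0.7) - x)))
```

In this sketch, a larger $\tau$ adds more noise to $f(X)$ and leaves more information in $g(X)$, loosely playing the role that the split point $m$ plays in ordinary sample splitting.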
