Data thinning for convolution-closed distributions

Abstract

We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. It is similar in spirit to -- but distinct from, and more easily applicable than -- a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.
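
To make the splitting concrete, the following is a minimal sketch for the Poisson case only; it is not taken from the paper's code. It relies on the standard fact that if X ~ Poisson(lambda) and X1 | X ~ Binomial(X, eps), then X1 ~ Poisson(eps * lambda) and X2 = X - X1 ~ Poisson((1 - eps) * lambda) are independent. The function name `thin_poisson`, the parameter name `eps`, and the simulated values are assumptions chosen for illustration.

```python
import numpy as np


def thin_poisson(x, eps=0.5, rng=None):
    """Split Poisson counts x into two parts that sum to x.

    Hypothetical helper for illustration. Each unit of x is assigned to the
    first part with probability eps, so x1 | x ~ Binomial(x, eps).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)  # first part
    x2 = x - x1                # second part; x1 + x2 == x by construction
    return x1, x2


# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
lam = 10.0
x = rng.poisson(lam, size=200_000)

x1, x2 = thin_poisson(x, eps=0.3, rng=rng)

# The parts should have means near eps * lam and (1 - eps) * lam,
# and their sample correlation should be close to zero.
print(x1.mean(), x2.mean(), np.corrcoef(x1, x2)[0, 1])
```

In a cross-validation setting, one part could be used to fit a model and the other to evaluate it; a small `eps` leaves more information in the evaluation part, and a large `eps` leaves more in the fitting part.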
