Generative Distribution Embeddings

Abstract

Many real-world problems require reasoning across multiple scales, demanding models which operate not on single data points, but on entire distributions. We introduce generative distribution embeddings (GDE), a framework that lifts autoencoders to the space of distributions. In GDEs, an encoder acts on sets of samples, and the decoder is replaced by a generator which aims to match the input distribution. This framework enables learning representations of distributions by coupling conditional generative models with encoder networks which satisfy a criterion we call distributional invariance. We show that GDEs learn predictive sufficient statistics embedded in the Wasserstein space, such that latent GDE distances approximately recover the $W_2$ distance, and latent interpolation approximately recovers optimal transport trajectories for Gaussian and Gaussian mixture distributions. We systematically benchmark GDEs against existing approaches on synthetic datasets, demonstrating consistently stronger performance. We then apply GDEs to six key problems in computational biology: learning representations of cell populations from lineage-tracing data (150K cells), predicting perturbation effects on single-cell transcriptomes (1M cells), predicting perturbation effects on cellular phenotypes (20M single-cell images), modeling tissue-specific DNA methylation patterns (253M sequences), designing synthetic yeast promoters (34M sequences), and spatiotemporal modeling of viral protein sequences (1M sequences).
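To make the $W_2$ and optimal transport claims concrete, the following is a minimal sketch of the reference quantities for the simplest case, one-dimensional Gaussians. It is not the GDE method and not code from the paper. The function names are hypothetical and chosen for illustration. For $N(m_1, s_1^2)$ and $N(m_2, s_2^2)$, the $W_2$ distance has the closed form $\sqrt{(m_1 - m_2)^2 + (s_1 - s_2)^2}$, and along the optimal transport trajectory between them the mean and standard deviation interpolate linearly. The empirical estimate from two equal-size sample sets uses the sorted (quantile) coupling, which is optimal in one dimension.

```python
import math
import random


def w2_gaussian_1d(m1, s1, m2, s2):
    """Closed-form W_2 distance between N(m1, s1^2) and N(m2, s2^2)."""
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


def w2_empirical_1d(xs, ys):
    """Empirical W_2 between two equal-size 1-D samples via the sorted coupling."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))


def ot_interpolate_gaussian_1d(m1, s1, m2, s2, t):
    """Parameters (mean, std) of the Gaussian at time t on the W_2 geodesic."""
    return ((1 - t) * m1 + t * m2, (1 - t) * s1 + t * s2)


# Illustrative parameters (assumptions, not values from the paper).
rng = random.Random(0)
n = 20000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
ys = [rng.gauss(3.0, 2.0) for _ in range(n)]

exact = w2_gaussian_1d(0.0, 1.0, 3.0, 2.0)    # sqrt(10)
approx = w2_empirical_1d(xs, ys)              # sample-based estimate
mid = ot_interpolate_gaussian_1d(0.0, 1.0, 3.0, 2.0, 0.5)
```

Under these assumptions, the sample-based estimate should be close to the closed-form value, and the midpoint Gaussian sits at half the $W_2$ distance from either endpoint. These are the quantities that latent distances and latent interpolations are reported to approximately recover.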
