Abstract
Determinantal point processes (DPPs) are specific probability distributionsover clouds of points that are used as models and computational tools acrossphysics, probability, statistics, and more recently machine learning. Samplingfrom DPPs is a challenge and therefore we present DPPy, a Python toolbox thatgathers known exact and approximate sampling algorithms for both finite andcontinuous DPPs. The project is hosted on GitHub and equipped with an extensivedocumentation.
Quick Read (beta)
DPPy: Sampling DPPs with Python
Abstract
Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning. Sampling from DPPs is a challenge and therefore we present DPPy, a Python toolbox that gathers known exact and approximate sampling algorithms for both finite and continuous DPPs. The project is hosted on GitHub^{ \faGithub }^{ \faGithub } \faGithub github.com/guilgautier/DPPy and equipped with an extensive documentation.^{ \faBook }^{ \faBook } \faBook dppy.readthedocs.io
in A,…,Z
Guillermo Polito${}^{\mathrm{\u2020}\mathrm{*}}$ [email protected]
Rémi Bardenet${}^{\mathrm{\u2020}}$ [email protected]
Michal Valko${}^{\mathrm{*}\mathrm{\u2021}}$ [email protected]
${}^{\mathrm{\u2020}}$Univ. Lille, CNRS, Centrale Lille, UMR 9189 – CRIStAL, 59651 Villeneuve d’Ascq, France
${}^{\mathrm{*}}$Inria LilleNord Europe, 40 avenue Halley 59650 Villeneuve d’Ascq, France
${}^{\mathrm{\u2021}}$DeepMind Paris, 14 Rue de Londres, 75009 Paris, France
Keywords: determinantal point processes, sampling, MCMC, random matrices, Python
1 Introduction
Determinantal point processes (DPPs) are distributions over configurations of points that encode diversity through a kernel function $K$. They were introduced by Macchi (1975) as models for beams of fermions, and they have since found applications in fields as diverse as probability (Soshnikov, 2000; König, 2004; Hough et al., 2006), statistical physics (Pathria & Beale, 2011), Monte Carlo methods (Bardenet & Hardy, 2016), spatial statistics (Lavancier et al., 2012), and machine learning (ML, Kulesza & Taskar, 2012).
In ML, DPPs mainly serve to model diverse sets of items, as in recommendation (Kathuria et al., 2016; Gartrell et al., 2016) or text summarization (Dupuy & Bach, 2018). Consequently, MLers use mostly finite DPPs, which are distributions over subsets of a finite ground set of cardinality $M$, parametrized by an $M\times M$ kernel matrix $\text{bfK}$. Routine inference tasks such as normalization, marginalization, or sampling have complexity $\text{calO}({M}^{3})$ (Gillenwater, 2014). Like other kernel methods, when $M$ is large, $\text{calO}({M}^{3})$ is a bottleneck.
In terms of software, the R library spatstat (Baddeley & Turner, 2005), a generalpurpose toolbox on spatial point processes, includes sampling and learning of continuous DPPs with stationary kernels, as described by Lavancier et al. (2012). Complementarily, we propose DPPy, a turnkey Python implementation of known general algorithms to sample finite DPPs. We also include algorithms for nonstationary continuous DPPs, e.g., related to random covariance matrices or Monte Carlo methods that are also of interest for MLers.
The DPPy project, hosted on GitHub,\@footnotemarkis already being used by the crossdisciplinary DPP community (Burt et al., 2019; Kammoun, 2018; Poulson, 2019; Dereziński et al., 2019; Gautier et al., 2019). We use Travis^{ \faGears }^{ \faGears } \faGears travisci.com/guilgautier/DPPy for continuous integration and Coveralls^{ \faIndent }^{ \faIndent } \faIndent coveralls.io/github/guilgautier/DPPy for test coverage. Through ReadTheDocs\@footnotemarkwe provide an extensive documentation, which covers the essential mathematical background and showcases the key properties of DPPs through DPPy objects and associated methods. DPPy thus also serves as a tutorial.
2 Definitions
A point process is a random subset of points $\text{calX}=\{{X}_{1},\mathrm{\dots},{X}_{N}\}\subset \text{bbX}$, where the number of points $N$ is itself random. We further add to the definition that $N$ should be almost surely finite and that all points in a sample are distinct. Given a reference measure $\mu $ on $\text{bbX}$, a point process is usually characterized by its $k$correlation function ${\rho}_{k}$ for all $k$, where
$$\text{bbP}\left[\begin{array}{c}\hfill \exists \text{one point of the process in}\hfill \\ \hfill \text{each ball}B(x{}_{i},\phantom{\rule{1.66666666666667pt}{0ex}}\mathrm{d}x{}_{i}),\forall i=1,\mathrm{\dots},k\hfill \end{array}\right]={\rho}_{k}({x}_{1},\mathrm{\dots},{x}_{k})\prod _{i=1}^{k}\mu (\mathrm{d}{x}_{i}),$$ 
see Møller & Waagepetersen (2004, Section 4). The functions ${\rho}_{k}$ describe the interaction among points in $\text{calX}$ by quantifying cooccurrence of points at a set of locations.
A point process $\text{calX}$ on $(\text{bbX},\mu )$ parametrized by a kernel $K:\text{bbX}\times \text{bbX}\to \text{bbC}$ is said to be determinantal, denoted as $\text{calX}\sim \mathrm{DPP}(K)$, if its $k$correlation functions satisfy
$${\rho}_{k}({x}_{1},\mathrm{\dots},{x}_{k})=det{\left[K({x}_{i},{x}_{j})\right]}_{i,j=1}^{k},\forall k\ge 1.$$ 
In ML, most DPPs are in the finite setting where $\text{bbX}=\{1,\mathrm{\dots},M\}$ and $\mu ={\sum}_{i=1}^{M}{\delta}_{i}$. In this context, the kernel function becomes an $M\times M$ matrix $\text{bfK}$, and the correlation functions refer to inclusion probabilities. DPPs are thus often defined as $\text{calX}\sim \mathrm{DPP}(\text{bfK})$ if
$$\text{bbP}[S\subset \text{calX}]=det{\text{bfK}}_{S},\forall S\subset \text{bbX},$$  (1) 
where ${\text{bfK}}_{S}$ denotes the submatrix of $\text{bfK}$ formed by the rows and columns indexed by $S$. The kernel matrix $\text{bfK}$ is commonly assumed to be realsymmetric, in which case the existence and uniqueness of the DPP in Equation 1 is equivalent to the condition that the eigenvalues of $\text{bfK}$ lie in $[0,1]$. The result also holds for general Hermitian kernel functions $K$ with additional assumptions (Soshnikov, 2000, Theorem 3). We note that there are also DPPs with nonsymmetric kernels (Borodin et al., 2010; Gartrell et al., 2019).
Oftentimes, ML practitioners favor a more flexible definition of a DPP in terms of a likelihood kernel $\text{bfL}$, which only requires $\text{bfL}\u2ab00$ so that
$$\text{bbP}[\text{calX}=S]=\frac{det{\text{bfL}}_{S}}{det\left[I+\text{bfL}\right]}\text{,}$$ 
rather than a correlation kernel $0\u2aaf\text{bfK}\u2aafI$. Yet, the $\text{bfL}$ parametrization makes Equation 1 less interpretable and does not cover important cases such as fixed size DPPs which are achievable using projection $\text{bfK}$ kernels. Kulesza & Taskar (2012, Section 5) countered that with $k$DPPs, which can be understood as DPPs parametrized by a likelihood kernel, conditioned to have exactly $k$ elements. However, in general, $k$DPPs are not DPPs.
The main interest in DPPs in ML is that they model diversity while being tractable. Compared to independent sampling with the same marginals, Equation 1 entails
$$\text{bbP}[\{i,j\}\subset \text{calX}]={\text{bfK}}_{ii}{\text{bfK}}_{jj}{\text{bfK}}_{ij}{\text{bfK}}_{ji}=\text{bbP}[\{i\}\subset \text{calX}]\times \text{bbP}[\{j\}\subset \text{calX}]{\text{bfK}}_{ij}{}^{2},$$ 
so that, the larger $\left{\text{bfK}}_{ij}\right$ less likely items $i$ and $j$ cooccur. If ${\text{bfK}}_{ij}$ models the similarity between items $i$ and $j$, DPPs are thus random diverse sets of elements.
Most point processes that encode diversity are not tractable, in the sense that efficient algorithms to sample, marginalize, or compute normalization constants are not available. However, DPPs are amenable to these tasks with polynomial complexity (Gillenwater, 2014). Next, we present the challenging task of sampling, which is the core of DPPy.
3 Sampling determinantal point processes
We assume henceforth that $K$ is realsymmetric and satisfies suitable conditions Soshnikov (2000, Theorem 3) so that its spectral decomposition is available
$$K(x,y)\triangleq \sum _{i=1}^{\mathrm{\infty}}{\lambda}_{i}{\varphi}_{i}(x){\varphi}_{i}(y),\text{with}{\int}_{\text{bbX}}{\varphi}_{i}(x){\varphi}_{j}(x)\mu (\mathrm{d}x)={\delta}_{ij}.$$ 
Note that, in the finite case, the spectral theorem is enough to eigendecompose $\text{bfK}$. Hough et al. (2006, Theorem 7) proved that sampling $\mathrm{DPP}(K)$ can be done in two steps:

1.
draw ${B}_{i}\sim \text{calB}\mathrm{er}({\lambda}_{i})$ independently and denote $\{{i}_{1},\mathrm{\dots},{i}_{N}\}=\{i:{B}_{i}=1\}$,

2.
sample from the DPP with kernel $\stackrel{~}{K}(x,y)={\sum}_{n=1}^{N}{\varphi}_{{i}_{n}}(x){\varphi}_{{i}_{n}}(y)$.
In other words, all DPPs are mixtures of projection DPPs, that are parametrized by an orthogonal projection kernel. In a nutshell, Step 1 selects a component of the mixture and Step 2 generates a sample of the projection $\mathrm{DPP}(\stackrel{~}{K})$. Hough et al. (2006, Algorithm 18) provide a generic projection DPP sampler that we briefly describe. First, the projection DPP with kernel $\stackrel{~}{K}$ has exactly $N=\mathrm{rank}\stackrel{~}{K}$ points, $\mu $almost surely. Then, the sequential aspect of the chain rule applied to sample $({X}_{1},\mathrm{\dots},{X}_{N})$ with probability distribution
$$\frac{det{\left[\stackrel{~}{K}({x}_{p},{x}_{n})\right]}_{p,n=1}^{N}}{N!}\prod _{n=1}^{N}\mu \left(\mathrm{d}{x}_{n}\right)=\frac{{\parallel \mathrm{\Phi}\left({x}_{1}\right)\parallel}^{2}}{N}\mu \left(\mathrm{d}{x}_{1}\right)\prod _{n=2}^{N}\frac{{\mathrm{distance}}^{2}(\mathrm{\Phi}\left({x}_{n}\right),\mathrm{span}{\left\{\mathrm{\Phi}\left({x}_{p}\right)\right\}}_{p=1}^{n1})}{N\left(n1\right)}\mu \left(\mathrm{d}{x}_{n}\right),$$  (2) 
can be discarded to get a valid sample $\{{X}_{1},\mathrm{\dots},{X}_{N}\}\sim \mathrm{DPP}(\stackrel{~}{K})$. To each $x\in \text{bbX}$ we associate a feature vector $\mathrm{\Phi}(x)\triangleq ({\varphi}_{{i}_{1}}(x),\mathrm{\dots},{\varphi}_{{i}_{N}}(x))$, so that $\stackrel{~}{K}(x,y)=\mathrm{\Phi}{(x)}^{\U0001d5b3}\mathrm{\Phi}(y)$.
A few remarks are in order. First, the LHS of Equation 2 defines an exchangeable probability distribution. Second, the successive ratios that appear in the RHS are the normalized conditional densities (w.r.t. $\mu $) that drive the chain rule. The associated normalizing constants are independent of the previous points. The numerators can be written as the ratio of two determinants and further expanded with Woodbury’s formula. They can be identified as the incremental posterior variances in Gaussian process regression with kernel $\stackrel{~}{K}$ (Rasmussen & Williams, 2006, Equation 2.26). Third, the chain rule expressed in Equation 2 has a strong GramSchmidt flavor since it actually comes from a recursive application of the base$\times $height formula. In the end, DPPs favor configuration of points whose feature vectors $\mathrm{\Phi}({x}_{1}),\mathrm{\dots},\mathrm{\Phi}({x}_{N})$ span a large volume, which is another way of understanding repulsiveness. The previous sampling scheme is exact and generic but, except for projection kernels, it requires the eigendecomposition of the underlying kernel.
In the finite setting, this corresponds to an initial $\text{calO}({M}^{3})$ cost, then the complexity of drawing exact samples is of order $\text{calO}(M\text{bbE}{\left[\text{calX}\right]}^{2})$ (see, e.g., Gillenwater, 2014; Tremblay et al., 2018). Besides, there exist some alternative exact samplers. Poulson (2019) and Launay et al. (2018) use a $\text{calO}({M}^{3})$ Choleskybased chain rule on sets; each item in turn is decided to be excluded or included in the sample. Dereziński et al. (2019) first sample an intermediate distribution and correct the bias by thinning the intermediate sample (with size smaller than $M$) using a carefully designed DPP. In certain regimes, this procedure may be more practical with an overall $\text{calO}(M\mathrm{poly}(\text{bbE}\left[\text{calX}\right])\mathrm{polylog}(M))$ cost. In the continuous case, sampling exactly each conditional that appear in the right hand side of Equation 2 can by done by a rejection sampling mechanism with a tailored proposal.
In applications where the costs related to exact sampling are a bottleneck, users rely on approximate sampling. Research has focused mainly on kernel approximation (Affandi et al., 2013) and MCMC samplers (Anari et al., 2016; Li et al., 2016; Gautier et al., 2017).
However, specific DPPs admit efficient exact samplers that do not rely on Equation 2, e.g., uniform spanning trees (UST, Propp & Wilson, 1998, Figure 1(c)) or eigenvalues of random matrices. For instance, a $\beta $ensemble is a set of $N$ points of $\text{bbR}$ with joint distribution
$$ 
For some choices of the weight function $\omega $, the $\beta $ensemble can be sampled by computing the eigenvalues of simple tridiagonal (Dumitriu & Edelman, 2002) or quindiagonal random matrices (Killip & Nenciu, 2004). In particular, ($\beta =2$)ensembles correspond to projection DPPs (König, 2004). They are therefore examples of continuous DPPs that can be sampled exactly in $\text{calO}({N}^{2})$ time, without rejection. Some of these ensembles are of direct interest to MLers. The Laguerre ensemble, for instance, has $\omega $ be a Gamma pdf, and corresponds to the eigenvalues of the empirical covariance matrix of i.i.d. Gaussian vectors, see Figure1(b). Finally, we mention that DPPy also features an exact sampler of the multivariate extension of the Jacobi ensemble which has been central in recent results on fasterthanMonte Carlo numerical integration (Bardenet & Hardy, 2016; Gautier et al., 2019; Mazoyer et al., 2019).
4 The DPPy toolbox
DPPy handles Python objects that fit the natural definition of the corresponding DPPs; see also the documentation\@footnotemarkand the corresponding Jupyter notebooks, which showcase DPPy objects. For example, FiniteDPP(kernel_type="correlation", **{"K": K}) instantiates a finite $\mathrm{DPP}(\text{bfK})$. Its two main methods, .sample_exact() and .sample_mcmc() implement the different exact samplers and current stateoftheart MCMC samplers. To sample $k$$\mathrm{DPP}$s, the additional .sample_exact_k_dpp() and .sample_mcmc_k_dpp() methods are available.
A Laguerre $\beta $ensemble is instantiated as LaguerreEnsemble(beta=2). It can be sampled using either the full matrix model (eigenvalues of random covariance matrix) when $\beta \in \{1,2,4\}$ with .sample_full_model() or the tridiagonal one with .sample_banded_model(), for $\beta >0$. Samples can be displayed via .plot() or .hist() to construct the empirical distribution that converges to the MarčenkoPastur distribution, see Figure 1(b).
DPPy can readily serve as research and teaching support. DPPy is also ready for other contributors to add content and enlarge its scope, e.g., with procedures for learning kernels.
Acknowledgments
We acknowledge funding by European CHISTERA project DELTA, the French Ministry of Higher Education and Research, the NordPasdeCalais Regional Council, Inria and OttovonGuerickeUniversität Magdeburg associatedteam northEuropean project Allocate, and French National Research Agency project BoB (n.ANR16CE230003).
References
 Affandi et al. (2013) Affandi, R. H., Kulesza, A., Fox, E. B., and Taskar, B. Nyström Approximation for LargeScale Determinantal Processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 31, pp. 85–98, 2013.
 Anari et al. (2016) Anari, N., Gharan, S. O., and Rezaei, A. Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes. In Conference on Learning Theory (COLT), pp. 103–115, New York, USA, 2016. PMLR. arXiv:1602.05242.
 Baddeley & Turner (2005) Baddeley, A. and Turner, R. spatstat : An R Package for Analyzing Spatial Point Patterns. Journal of Statistical Software, 12(6):1–42, jan 2005.
 Bardenet & Hardy (2016) Bardenet, R. and Hardy, A. Monte Carlo with Determinantal Point Processes. ArXiv eprints, 2016. arXiv:1605.00361.
 Borodin et al. (2010) Borodin, A., Diaconis, P., and Fulman, J. On adding a list of numbers (and other onedependent determinantal processes). Bulletin of the American Mathematical Society, 47(4):639–670, 2010. arXiv:0904.3740.
 Burt et al. (2019) Burt, D., Rasmussen, C. E., and Wilk, M. V. D. Rates of Convergence for Sparse Variational Gaussian Process Regression. In International Conference on Machine Learning (ICML), pp. 862–871, may 2019. arXiv:1903.03571.
 Dereziński et al. (2019) Dereziński, M., Calandriello, D., and Valko, M. Exact sampling of determinantal point processes with sublinear time preprocessing. In Workshop on Negative Dependence in Machine Learning, International Conference on Machine Learning (ICML), 2019. arXiv:1905.13476.
 Dumitriu & Edelman (2002) Dumitriu, I. and Edelman, A. Matrix Models for Beta Ensembles. Journal of Mathematical Physics, 43(11):5830–5847, 2002. arXiv:mathph/0206043.
 Dupuy & Bach (2018) Dupuy, C. and Bach, F. Learning Determinantal Point Processes in Sublinear Time. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, pp. 244–257, Lanzarote, Spain, 2018. PMLR. arXiv:1610.05925.
 Gartrell et al. (2016) Gartrell, M., Paquet, U., and Koenigstein, N. LowRank Factorization of Determinantal Point Processes for Recommendation. In AAAI Conference on Artificial Intelligence, pp. 1912–1918, 2016. arXiv:1602.05436.
 Gartrell et al. (2019) Gartrell, M., Brunel, V.E., Dohmatob, E., and Krichene, S. Learning Nonsymmetric Determinantal Point Processes. may 2019. arXiv:1905.12962.
 Gautier et al. (2017) Gautier, G., Bardenet, R., and Valko, M. Zonotope hitandrun for efficient sampling from projection DPPs. International Conference on Machine Learning (ICML), pp. 1223–1232, may 2017. arXiv:1705.10498.
 Gautier et al. (2019) Gautier, G., Bardenet, R., and Valko, M. On two ways to use determinantal point processes for Monte Carlo integration. In Workshop on Negative Dependence in Machine Learning, International Conference in Machine Learning (ICML), 2019.
 Gillenwater (2014) Gillenwater, J. Approximate inference for determinantal point processes. PhD thesis, University of Pennsylvania, 2014.
 Hough et al. (2006) Hough, J. B., Krishnapur, M., Peres, Y., and Virág, B. Determinantal Processes and Independence. In Probability Surveys, volume 3, pp. 206–229. The Institute of Mathematical Statistics and the Bernoulli Society, 2006. arXiv:math/0503110.
 Kammoun (2018) Kammoun, M. S. Monotonous subsequences and the descent process of invariant random permutations. Electronic Journal of Probability, 23(0), 2018. arXiv:1805.05253.
 Kathuria et al. (2016) Kathuria, T., Deshpande, A., and Kohli, P. Batched Gaussian Process Bandit Optimization via Determinantal Point Processes. In Neural Information Processing Systems (NIPS), pp. 4206–4214, 2016. arXiv:1611.04088.
 Killip & Nenciu (2004) Killip, R. and Nenciu, I. Matrix models for circular ensembles. International Mathematics Research Notices, 2004(50):2665, 2004. arXiv:math/0410034.
 König (2004) König, W. Orthogonal polynomial ensembles in probability theory. Probab. Surveys, 2:385–447, 2004. arXiv:math/0403090.
 Kulesza & Taskar (2012) Kulesza, A. and Taskar, B. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 5(23):123–286, 2012. arXiv:1207.6083.
 Launay et al. (2018) Launay, C., Galerne, B., and Desolneux, A. Exact Sampling of Determinantal Point Processes without Eigendecomposition. ArXiv eprints, feb 2018. arXiv:1802.08429.
 Lavancier et al. (2012) Lavancier, F., Møller, J., and Rubak, E. Determinantal point process models and statistical inference : Extended version. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 77(4):853–877, may 2012. arXiv:1205.4818.
 Li et al. (2016) Li, C., Jegelka, S., and Sra, S. Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling. In Neural Information Processing Systems (NIPS), pp. 4188–4196, Barcelona, Spain, 2016. arXiv:1608.01008.
 Macchi (1975) Macchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(01):83–122, 1975.
 Mazoyer et al. (2019) Mazoyer, A., Coeurjolly, J.F., and Amblard, P.O. Projections of determinantal point processes. ArXiv eprints, 2019. arXiv:1901.02099v3.
 Møller & Waagepetersen (2004) Møller, J. and Waagepetersen, R. P. Statistical inference and simulation for spatial point processes, volume 23. Chapman & Hall/CRC. 2004. ISBN 1584882654.
 Pathria & Beale (2011) Pathria, R. K. and Beale, P. D. Statistical Mechanics. Academic Press. 2011. ISBN 0123821894.
 Poulson (2019) Poulson, J. Highperformance sampling of generic Determinantal Point Processes. ArXiv eprints, apr 2019. arXiv:1905.00165.
 Propp & Wilson (1998) Propp, J. G. and Wilson, D. B. How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph. Journal of Algorithms, 27(2):170–217, may 1998.
 Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian processes for machine learning. MIT Press. 2006. ISBN 026218253X.
 Soshnikov (2000) Soshnikov, A. Determinantal random point fields. Russian Mathematical Surveys, 55(5):923–975, feb 2000. arXiv:math/0002099.
 Tremblay et al. (2018) Tremblay, N., Barthelme, S., and Amblard, P.O. Optimized Algorithms to Sample Determinantal Point Processes. ArXiv eprints, feb 2018. arXiv:1802.08471.