DPPy: Sampling DPPs with Python

  • 2019-08-12 16:58:41
  • Guillaume Gautier, Guillermo Polito, Rémi Bardenet, Michal Valko
  • 0

Abstract

Determinantal point processes (DPPs) are specific probability distributionsover clouds of points that are used as models and computational tools acrossphysics, probability, statistics, and more recently machine learning. Samplingfrom DPPs is a challenge and therefore we present DPPy, a Python toolbox thatgathers known exact and approximate sampling algorithms for both finite andcontinuous DPPs. The project is hosted on GitHub and equipped with an extensivedocumentation.

 

Quick Read (beta)

DPPy: Sampling DPPs with Python

\nameGuillaume Gautier* \email[email protected]
\nameGuillermo Polito* \email[email protected]
\nameRémi Bardenet \email[email protected]
\nameMichal Valko* \email[email protected]
\addrUniv. Lille, CNRS, Centrale Lille, UMR 9189 – CRIStAL, 59651 Villeneuve d’Ascq, France
\addr*Inria Lille-Nord Europe, 40 avenue Halley 59650 Villeneuve d’Ascq, France
\addrDeepMind Paris, 14 Rue de Londres, 75009 Paris, France
Abstract

Determinantal point processes (DPPs) are specific probability distributions over clouds of points that are used as models and computational tools across physics, probability, statistics, and more recently machine learning. Sampling from DPPs is a challenge and therefore we present DPPy, a Python toolbox that gathers known exact and approximate sampling algorithms for both finite and continuous DPPs. The project is hosted on GitHub \faGithub \faGithub \faGithub github.com/guilgautier/DPPy and equipped with an extensive documentation. \faBook \faBook \faBook dppy.readthedocs.io

\DFNalwaysdouble\foreach\x

in A,…,Z

DPPy: Sampling DPPs with Python Guillaume Gautier* [email protected]
Guillermo Polito* [email protected]
Rémi Bardenet [email protected]
Michal Valko* [email protected]
Univ. Lille, CNRS, Centrale Lille, UMR 9189 – CRIStAL, 59651 Villeneuve d’Ascq, France
*Inria Lille-Nord Europe, 40 avenue Halley 59650 Villeneuve d’Ascq, France
DeepMind Paris, 14 Rue de Londres, 75009 Paris, France

Keywords: determinantal point processes, sampling, MCMC, random matrices, Python

1 Introduction

Determinantal point processes (DPPs) are distributions over configurations of points that encode diversity through a kernel function K. They were introduced by Macchi (1975) as models for beams of fermions, and they have since found applications in fields as diverse as probability (Soshnikov, 2000; König, 2004; Hough et al., 2006), statistical physics (Pathria & Beale, 2011), Monte Carlo methods (Bardenet & Hardy, 2016), spatial statistics (Lavancier et al., 2012), and machine learning (ML, Kulesza & Taskar, 2012).

In ML, DPPs mainly serve to model diverse sets of items, as in recommendation (Kathuria et al., 2016; Gartrell et al., 2016) or text summarization (Dupuy & Bach, 2018). Consequently, MLers use mostly finite DPPs, which are distributions over subsets of a finite ground set of cardinality M, parametrized by an M×M kernel matrix \bfK. Routine inference tasks such as normalization, marginalization, or sampling have complexity \calO(M3) (Gillenwater, 2014). Like other kernel methods, when M is large, \calO(M3) is a bottleneck.

In terms of software, the R library spatstat (Baddeley & Turner, 2005), a general-purpose toolbox on spatial point processes, includes sampling and learning of continuous DPPs with stationary kernels, as described by Lavancier et al.(2012). Complementarily, we propose DPPy, a turnkey Python implementation of known general algorithms to sample finite DPPs. We also include algorithms for non-stationary continuous DPPs, e.g., related to random covariance matrices or Monte Carlo methods that are also of interest for MLers.

The DPPy project, hosted on GitHub,\@footnotemarkis already being used by the cross-disciplinary DPP community (Burt et al., 2019; Kammoun, 2018; Poulson, 2019; Dereziński et al., 2019; Gautier et al., 2019). We use Travis \faGears \faGears \faGears travis-ci.com/guilgautier/DPPy for continuous integration and Coveralls \faIndent \faIndent \faIndent coveralls.io/github/guilgautier/DPPy for test coverage. Through ReadTheDocs\@footnotemarkwe provide an extensive documentation, which covers the essential mathematical background and showcases the key properties of DPPs through DPPy objects and associated methods. DPPy thus also serves as a tutorial.

2 Definitions

A point process is a random subset of points \calX={X1,,XN}\bbX, where the number of points N is itself random. We further add to the definition that N should be almost surely finite and that all points in a sample are distinct. Given a reference measure μ on \bbX, a point process is usually characterized by its k-correlation function ρk for all k, where

\bbP[ one point of the process ineach ball B(xi,dxi),i=1,,k]=ρk(x1,,xk)i=1kμ(dxi),

see Møller & Waagepetersen (2004, Section 4). The functions ρk describe the interaction among points in \calX by quantifying co-occurrence of points at a set of locations.

A point process \calX on (\bbX,μ) parametrized by a kernel K:\bbX×\bbX\bbC is said to be determinantal, denoted as \calXDPP(K), if its k-correlation functions satisfy

ρk(x1,,xk)=det[K(xi,xj)]i,j=1k,k1.

In ML, most DPPs are in the finite setting where \bbX={1,,M} and μ=i=1Mδi. In this context, the kernel function becomes an M×M matrix \bfK, and the correlation functions refer to inclusion probabilities. DPPs are thus often defined as \calXDPP(\bfK) if

\bbP[S\calX]=det\bfKS,S\bbX, (1)

where \bfKS denotes the submatrix of \bfK formed by the rows and columns indexed by S. The kernel matrix \bfK is commonly assumed to be real-symmetric, in which case the existence and uniqueness of the DPP in Equation 1 is equivalent to the condition that the eigenvalues of \bfK lie in [0,1]. The result also holds for general Hermitian kernel functions K with additional assumptions (Soshnikov, 2000, Theorem 3). We note that there are also DPPs with nonsymmetric kernels (Borodin et al., 2010; Gartrell et al., 2019).

Oftentimes, ML practitioners favor a more flexible definition of a DPP in terms of a likelihood kernel \bfL, which only requires \bfL0 so that

\bbP[\calX=S]=det\bfLSdet[I+\bfL],

rather than a correlation kernel 0\bfKI. Yet, the \bfL parametrization makes Equation 1 less interpretable and does not cover important cases such as fixed size DPPs which are achievable using projection \bfK kernels. Kulesza & Taskar (2012, Section 5) countered that with k-DPPs, which can be understood as DPPs parametrized by a likelihood kernel, conditioned to have exactly k elements. However, in general, k-DPPs are not DPPs.

The main interest in DPPs in ML is that they model diversity while being tractable. Compared to independent sampling with the same marginals, Equation 1 entails

\bbP[{i,j}\calX]=\bfKii\bfKjj-\bfKij\bfKji=\bbP[{i}\calX]×\bbP[{j}\calX]-|\bfKij|2,

so that, the larger |\bfKij| less likely items i and j co-occur. If \bfKij models the similarity between items i and j, DPPs are thus random diverse sets of elements.

Most point processes that encode diversity are not tractable, in the sense that efficient algorithms to sample, marginalize, or compute normalization constants are not available. However, DPPs are amenable to these tasks with polynomial complexity (Gillenwater, 2014). Next, we present the challenging task of sampling, which is the core of DPPy.

3 Sampling determinantal point processes

We assume henceforth that K is real-symmetric and satisfies suitable conditions Soshnikov (2000, Theorem 3) so that its spectral decomposition is available

K(x,y)i=1λiϕi(x)ϕi(y),with \bbXϕi(x)ϕj(x)μ(dx)=δij.

Note that, in the finite case, the spectral theorem is enough to eigendecompose \bfK. Hough et al.(2006, Theorem 7) proved that sampling DPP(K) can be done in two steps:

  1. 1.

    draw Bi\calBer(λi) independently and denote {i1,,iN}={i:Bi=1},

  2. 2.

    sample from the DPP with kernel K~(x,y)=n=1Nϕin(x)ϕin(y).

In other words, all DPPs are mixtures of projection DPPs, that are parametrized by an orthogonal projection kernel. In a nutshell, Step 1 selects a component of the mixture and Step 2 generates a sample of the projection DPP(K~). Hough et al.(2006, Algorithm 18) provide a generic projection DPP sampler that we briefly describe. First, the projection DPP with kernel K~ has exactly N=rankK~ points, μ-almost surely. Then, the sequential aspect of the chain rule applied to sample (X1,,XN) with probability distribution

det[K~(xp,xn)]p,n=1NN!n=1Nμ(dxn)=Φ(x1)2Nμ(dx1)n=2Ndistance2(Φ(xn),span{Φ(xp)}p=1n-1)N-(n-1)μ(dxn), (2)

can be discarded to get a valid sample {X1,,XN}DPP(K~). To each x\bbX we associate a feature vector Φ(x)(ϕi1(x),,ϕiN(x)), so that K~(x,y)=Φ(x)𝖳Φ(y).

A few remarks are in order. First, the LHS of Equation 2 defines an exchangeable probability distribution. Second, the successive ratios that appear in the RHS are the normalized conditional densities (w.r.t. μ) that drive the chain rule. The associated normalizing constants are independent of the previous points. The numerators can be written as the ratio of two determinants and further expanded with Woodbury’s formula. They can be identified as the incremental posterior variances in Gaussian process regression with kernel K~ (Rasmussen & Williams, 2006, Equation 2.26). Third, the chain rule expressed in Equation 2 has a strong Gram-Schmidt flavor since it actually comes from a recursive application of the base×height formula. In the end, DPPs favor configuration of points whose feature vectors Φ(x1),,Φ(xN) span a large volume, which is another way of understanding repulsiveness. The previous sampling scheme is exact and generic but, except for projection kernels, it requires the eigendecomposition of the underlying kernel.

In the finite setting, this corresponds to an initial \calO(M3) cost, then the complexity of drawing exact samples is of order \calO(M\bbE[|\calX|]2) (see, e.g., Gillenwater, 2014; Tremblay et al., 2018). Besides, there exist some alternative exact samplers. Poulson (2019) and Launay et al.(2018) use a \calO(M3) Cholesky-based chain rule on sets; each item in turn is decided to be excluded or included in the sample. Dereziński et al.(2019) first sample an intermediate distribution and correct the bias by thinning the intermediate sample (with size smaller than M) using a carefully designed DPP. In certain regimes, this procedure may be more practical with an overall \calO(Mpoly(\bbE[|\calX|])polylog(M)) cost. In the continuous case, sampling exactly each conditional that appear in the right hand side of Equation 2 can by done by a rejection sampling mechanism with a tailored proposal.

In applications where the costs related to exact sampling are a bottleneck, users rely on approximate sampling. Research has focused mainly on kernel approximation (Affandi et al., 2013) and MCMC samplers (Anari et al., 2016; Li et al., 2016; Gautier et al., 2017).

However, specific DPPs admit efficient exact samplers that do not rely on Equation 2, e.g., uniform spanning trees (UST, Propp & Wilson, 1998, Figure 1(c)) or eigenvalues of random matrices. For instance, a β-ensemble is a set of N points of \bbR with joint distribution

1ZN,βp<n|xp-xn|βn=1Nω(xn)dxn,where β>0.

For some choices of the weight function ω, the β-ensemble can be sampled by computing the eigenvalues of simple tridiagonal (Dumitriu & Edelman, 2002) or quindiagonal random matrices (Killip & Nenciu, 2004). In particular, (β=2)-ensembles correspond to projection DPPs (König, 2004). They are therefore examples of continuous DPPs that can be sampled exactly in \calO(N2) time, without rejection. Some of these ensembles are of direct interest to MLers. The Laguerre ensemble, for instance, has ω be a Gamma pdf, and corresponds to the eigenvalues of the empirical covariance matrix of i.i.d. Gaussian vectors, see Figure1(b). Finally, we mention that DPPy also features an exact sampler of the multivariate extension of the Jacobi ensemble which has been central in recent results on faster-than-Monte Carlo numerical integration (Bardenet & Hardy, 2016; Gautier et al., 2019; Mazoyer et al., 2019).

4 The DPPy toolbox

DPPy handles Python objects that fit the natural definition of the corresponding DPPs; see also the documentation\@footnotemarkand the corresponding Jupyter notebooks, which showcase DPPy objects. For example, FiniteDPP(kernel_type="correlation", **{"K": K}) instantiates a finite DPP(\bfK). Its two main methods, .sample_exact() and .sample_mcmc() implement the different exact samplers and current state-of-the-art MCMC samplers. To sample k-DPPs, the additional .sample_exact_k_dpp() and .sample_mcmc_k_dpp() methods are available.

A Laguerre β-ensemble is instantiated as LaguerreEnsemble(beta=2). It can be sampled using either the full matrix model (eigenvalues of random covariance matrix) when β{1,2,4} with .sample_full_model() or the tridiagonal one with .sample_banded_model(), for β>0. Samples can be displayed via .plot() or .hist() to construct the empirical distribution that converges to the Marčenko-Pastur distribution, see Figure 1(b).

DPPy can readily serve as research and teaching support. DPPy is also ready for other contributors to add content and enlarge its scope, e.g., with procedures for learning kernels.

(a) 2D Jacobi ensemble
(b) β=2-Laguerre ensemble
(c) \bfK kernel of UST
Figure 1: Some displays available in DPPy

Acknowledgments

We acknowledge funding by European CHIST-ERA project DELTA, the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-European project Allocate, and French National Research Agency project BoB (n.ANR-16-CE23-0003).

References