Finite mixture models are typically inconsistent for the number of components

Abstract

Scientists and engineers are often interested in learning the number of subpopulations (or components) present in a data set. Practitioners commonly use a Dirichlet process mixture model (DPMM) for this purpose; in particular, they count the number of clusters (i.e., components containing at least one data point) in the DPMM posterior. But Miller and Harrison (2013) warn that the DPMM cluster-count posterior is severely inconsistent for the number of latent components when the data are truly generated from a finite mixture; that is, the cluster-count posterior probability on the true generating number of components goes to zero in the limit of infinite data. A potential alternative is to use a finite mixture model (FMM) with a prior on the number of components. Past work has shown the resulting FMM component-count posterior is consistent. But existing results crucially depend on the assumption that the component likelihoods are perfectly specified. In practice, this assumption is unrealistic, and empirical evidence (Miller and Dunson, 2019) suggests that the FMM posterior on the number of components is sensitive to the likelihood choice. In this paper, we add rigor to data-analysis folk wisdom by proving that under even the slightest model misspecification, the FMM posterior on the number of components is ultraseverely inconsistent: for any finite $k \in \mathbb{N}$, the posterior probability that the number of components is $k$ converges to 0 in the limit of infinite data. We illustrate practical consequences of our theory on simulated and real data sets.
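
Written in symbols, the two failure modes described above can be contrasted as follows. This is only a sketch, and the notation is an assumption introduced here rather than taken from the abstract: $X_{1:N}$ denotes the first $N$ observations, $K$ the number of components (for the DPMM, the number of clusters), $k_0$ the true generating number of components, and $\Pi(\cdot \mid X_{1:N})$ the posterior. The abstract does not state the mode of convergence, so the limits are left unqualified.

$$
\begin{aligned}
\text{severe inconsistency (DPMM, data from a finite mixture):} \quad & \Pi(K = k_0 \mid X_{1:N}) \to 0 \quad \text{as } N \to \infty, \\
\text{ultrasevere inconsistency (FMM, misspecified likelihood):} \quad & \Pi(K = k \mid X_{1:N}) \to 0 \quad \text{as } N \to \infty, \ \text{for every finite } k \in \mathbb{N}.
\end{aligned}
$$

The first statement concerns only the true value $k_0$, whereas the second rules out every finite value of $k$, so the posterior mass must move to ever larger component counts as $N$ grows.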
