Finding Outliers in Gaussian Model-Based Clustering

Abstract

Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the first two often requiring pre-specification of the number of outliers. The fact that the sample Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, which are deemed outliers, according to the subset log-likelihoods, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
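
The beta-distribution fact that the abstract relies on can be illustrated for a single Gaussian sample. The following is a minimal sketch, not the OCLUST algorithm itself. It assumes the standard form of the result: with the sample mean and the unbiased sample covariance, n/(n-1)^2 times the squared Mahalanobis distance follows a Beta(p/2, (n-p-1)/2) distribution. The function name `scaled_mahalanobis` and the simulated data are hypothetical and used only for illustration.

```python
import numpy as np

def scaled_mahalanobis(X):
    """Return n/(n-1)^2 * D_i^2 for each row of X.

    D_i^2 is the squared Mahalanobis distance of row i from the sample mean,
    computed with the unbiased sample covariance (assumed scaling convention).
    """
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)  # unbiased estimate: divides by n - 1
    # Solve S z = centered^T instead of forming the inverse of S explicitly.
    d2 = np.einsum("ij,ji->i", centered, np.linalg.solve(S, centered.T))
    return n / (n - 1) ** 2 * d2

# Simulated data: one p-dimensional Gaussian component with no outliers.
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
b = scaled_mahalanobis(X)

# Moments of the assumed reference distribution Beta(p/2, (n-p-1)/2).
beta_mean = p / (n - 1)
beta_var = 2 * p * (n - p - 1) / ((n - 1) ** 2 * (n + 1))

print("sample mean:", b.mean(), "beta mean:", beta_mean)
print("sample variance:", b.var(), "beta variance:", beta_var)
```

The sample mean of the scaled distances matches the beta mean exactly (up to floating-point error), because the squared distances always sum to (n-1)p under this covariance estimate. The variance agrees only approximately. A point whose scaled distance is implausibly large under the reference distribution is the kind of observation a trimming method would flag.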
