Finding Outliers in Gaussian Model-Based Clustering

  • 2024-04-05 16:01:31
  • Katharine M. Clark, Paul D. McNicholas

Abstract

Clustering, or unsupervised classification, is a task often plagued by outliers. Yet there is a paucity of work on handling outliers in clustering. Outlier identification algorithms tend to fall into three broad categories: outlier inclusion, outlier trimming, and post hoc outlier identification methods, with the first two often requiring pre-specification of the number of outliers. The fact that sample Mahalanobis distance is beta-distributed is used to derive an approximate distribution for the log-likelihoods of subset finite Gaussian mixture models. An algorithm is then proposed that removes the least plausible points, as judged by the subset log-likelihoods, and deems them outliers, until the subset log-likelihoods adhere to the reference distribution. This results in a trimming method, called OCLUST, that inherently estimates the number of outliers.
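
As a rough illustration of the beta-distribution fact the abstract relies on (this is not the OCLUST algorithm itself): for n observations from a p-dimensional Gaussian, with squared Mahalanobis distances D^2 computed from the sample mean and the unbiased sample covariance, the scaled quantity n D^2 / (n - 1)^2 follows a Beta(p/2, (n - p - 1)/2) distribution. The sketch below checks two consequences on simulated bivariate data: every scaled distance lies in [0, 1], and their average equals the beta mean p/(n - 1). The function name, the seed, the sample size, and the restriction to p = 2 are assumptions made for this example only.

```python
import random

def scaled_mahalanobis_2d(points):
    """Return n * D_i^2 / (n - 1)^2 for each bivariate point.

    D_i^2 is the squared Mahalanobis distance of point i from the sample
    mean, using the unbiased sample covariance (divisor n - 1).
    Hypothetical helper for illustration; restricted to p = 2 so the
    covariance matrix can be inverted in closed form.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the unbiased 2x2 sample covariance matrix.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    scaled = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of the 2x2 covariance.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        scaled.append(n * d2 / (n - 1) ** 2)
    return scaled

random.seed(0)  # arbitrary seed, chosen only for reproducibility
n, p = 200, 2
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]
u = scaled_mahalanobis_2d(points)

# Mean of Beta(p/2, (n - p - 1)/2) is (p/2) / ((n - 1)/2) = p / (n - 1).
beta_mean = (p / 2) / ((n - 1) / 2)
print(sum(u) / n, beta_mean)
```

The two printed numbers agree up to floating-point error because the squared distances always sum to (n - 1)p, whatever the data. A fuller check of the distributional claim would compare the empirical quantiles of `u` with those of the beta distribution.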

 
