Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

Abstract

As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
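To make "grouping instances according to shared decision paths" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that path similarity is measured as the fraction of trees in which two samples end in the same leaf (a standard RF proximity), and it uses a simple k-medoids loop as a stand-in for the clustering step, which the abstract does not specify. The leaf table `leaves` is toy data; for a fitted scikit-learn forest, such a table could be obtained with `RandomForestClassifier.apply`. The function names `proximity_matrix`, `assign`, and `k_medoids` are hypothetical.

```python
from itertools import combinations


def proximity_matrix(leaves):
    """leaves[i][t] is the id of the leaf that sample i reaches in tree t.

    Returns prox[i][j], the fraction of trees in which i and j share a leaf.
    """
    n, n_trees = len(leaves), len(leaves[0])
    prox = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def assign(dist, medoids):
    """Label each sample with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def k_medoids(dist, medoids, max_iter=100):
    """Alternate between assignment and medoid update (assumed clustering step)."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties out
                updated.append(old)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids


# Toy leaf table (assumption): 6 samples, 4 trees, two groups with similar paths.
leaves = [
    [0, 0, 1, 2],
    [0, 0, 1, 3],
    [0, 1, 1, 2],
    [4, 5, 6, 7],
    [4, 5, 6, 8],
    [4, 5, 7, 7],
]

prox = proximity_matrix(leaves)
dist = [[1.0 - p for p in row] for row in prox]  # distance = 1 - proximity
labels, medoids = k_medoids(dist, medoids=[0, 3])
print(labels)
```

In this sketch the clusters reflect how the forest routes samples rather than raw feature distances, which is the sense in which the resulting groups are aligned with the model's internal logic.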
