Nonparametric IPSS: Fast, flexible feature selection with false discovery control

  • 2025-07-16 19:29:37
  • Omar Melikechi, David B. Dunson, Jeffrey W. Miller

Abstract

Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives. Here, we introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than p-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 seconds when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.
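
To make the idea of applying stability selection to arbitrary importance scores concrete, the following is a minimal sketch of plain subsampling-based stability selection, not the authors' IPSS procedure. It repeatedly draws random half-samples, scores every feature with a pluggable importance function, and records how often each feature lands among the top-scoring ones. Everything here is an assumption made for illustration: the function names (`abs_correlation`, `selection_frequencies`, `make_toy_data`), the use of absolute Pearson correlation as a stand-in score (the paper uses gradient boosting and random forest importances), the fixed `top_k` cutoff, and the synthetic data. The integration over a path of parameters that the method's name refers to, and the q-value estimation, are not reproduced.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation: a stand-in for any importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def selection_frequencies(X, y, score=abs_correlation,
                          n_subsamples=50, top_k=2, seed=0):
    """Fraction of random half-samples in which each feature is in the top_k.

    X is a list of rows (one list of feature values per sample);
    y is the list of responses. `score` maps (feature column, y) to a number.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        rows = rng.sample(range(n), n // 2)  # half-sample without replacement
        y_sub = [y[i] for i in rows]
        scores = [score([X[i][j] for i in rows], y_sub) for j in range(p)]
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [c / n_subsamples for c in counts]


def make_toy_data(n=200, p=10, seed=1):
    """Synthetic data (hypothetical): only features 0 and 1 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2 * row[0] + 2 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for j, f in enumerate(selection_frequencies(X, y)):
        print(f"feature {j}: selected in {f:.0%} of half-samples")
```

Features whose selection frequency stays high across subsamples are the stable candidates. Swapping `score` for a tree-based importance would move this sketch closer in spirit to the IPSSGB and IPSSRF variants named in the abstract, though still without their false discovery guarantees.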

 
