A Closer Look at Deep Learning Methods on Tabular Datasets

Abstract

Tabular data is prevalent across diverse domains in machine learning. While classical methods like tree-based models have long been effective, Deep Neural Network (DNN)-based methods have recently demonstrated promising performance. However, the diverse characteristics of methods and the inherent heterogeneity of tabular datasets make understanding and interpreting tabular methods both challenging and prone to unstable observations. In this paper, we conduct in-depth evaluations and comprehensive analyses of tabular methods, with a particular focus on DNN-based models, using a benchmark of over 300 tabular datasets spanning a wide range of task types, sizes, and domains. First, we perform an extensive comparison of 32 state-of-the-art deep and tree-based methods, evaluating their average performance across multiple criteria. Although method ranks vary across datasets, we empirically find that top-performing methods tend to concentrate within a small subset of tabular models, regardless of the criteria used. Next, we investigate whether the training dynamics of deep tabular models can be predicted based on dataset properties. This approach not only offers insights into the behavior of deep tabular methods but also identifies a core set of "meta-features" that reflect dataset heterogeneity. The other subset includes datasets where method ranks are consistent with the overall benchmark, acting as a reliable probe for further tabular analysis.
