Abstract
With the growing popularity of deep learning and foundation models fortabular data, the need for standardized and reliable benchmarks is higher thanever. However, current benchmarks are static. Their design is not updated evenif flaws are discovered, model versions are updated, or new models arereleased. To address this, we introduce TabArena, the first continuouslymaintained living tabular benchmarking system. To launch TabArena, we manuallycurate a representative collection of datasets and well-implemented models,conduct a large-scale benchmarking study to initialize a public leaderboard,and assemble a team of experienced maintainers. Our results highlight theinfluence of validation method and ensembling of hyperparameter configurationsto benchmark models at their full potential. While gradient-boosted trees arestill strong contenders on practical tabular datasets, we observe that deeplearning methods have caught up under larger time budgets with ensembling. Atthe same time, foundation models excel on smaller datasets. Finally, we showthat ensembles across models advance the state-of-the-art in tabular machinelearning. We observe that some deep learning models are overrepresented incross-model ensembles due to validation set overfitting, and we encourage modeldevelopers to address this issue. We launch TabArena with a public leaderboard,reproducible code, and maintenance protocols to create a living benchmarkavailable at https://tabarena.ai.