On Learning Representations for Tabular Data Distillation

Abstract

Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, resulting in 226,890 distilled datasets and 548,880 models trained on them, we demonstrate that $\texttt{TDColER}$ is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.
