Abstract
Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across all eight metrics, with up to $22.5\%$ improvement over the state-of-the-art model on pair-wise column correlation estimations. Code is available at https://github.com/MinkaiXu/TabDiff.