Masked Clinical Modelling: A Framework for Synthetic and Augmented Survival Data Generation

Abstract

Access to real clinical data is often restricted due to privacy obligations, creating significant barriers for healthcare research. Synthetic datasets provide a promising solution, enabling secure data sharing and model development. However, most existing approaches focus on data realism rather than utility, that is, whether models trained on synthetic data yield clinically meaningful insights comparable to those trained on real data. In this paper, we present Masked Clinical Modelling (MCM), a framework inspired by masked language modelling and designed for both data synthesis and conditional data augmentation. We evaluate this prototype on the WHAS500 dataset using Cox Proportional Hazards models, focusing on the preservation of hazard ratios as key clinical metrics. Our results show that data generated using the MCM framework improves both discrimination and calibration in survival analysis, outperforming existing methods. MCM demonstrates strong potential to support survival data analysis and broader healthcare applications.
