Abstract
Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models that are fine-tuned and specialized on sensitive data memorize and later regurgitate personal information. This paper presents a privacy-preserving language modeling approach that addresses the problem of language model anonymization and thus promotes model sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, both of which prevent the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches on a medical dataset and compared them against different baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff between maintaining high privacy and retaining high utility.