Towards the Anonymization of the Language Modeling

  • 2025-05-04 22:12:15
  • Antoine Boutet, Lucas Magnana, Juliette Sénéchal, Helain Zimmermann

Abstract

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially because pre-trained models that are fine-tuned and specialized on sensitive data can memorize and later regurgitate personal information. This paper presents a privacy-preserving language modeling approach that addresses the problem of language model anonymization and thus promotes model sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, both of which prevent the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches using a medical dataset and compared them against different baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff, maintaining high privacy while retaining high utility.

 
