Anonymization by Design of Language Modeling

Abstract

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially because models specialized on sensitive data can memorize and later expose confidential information. This paper presents a privacy-by-design language modeling approach that addresses the problem of language model anonymization and thus promotes model sharing. Specifically, we propose a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, both of which prevent the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches on medical datasets and compared them against different baselines. Our results indicate that, by avoiding the memorization of both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer the best trade-off, maintaining high privacy while retaining high utility.
