KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records

  • 2020-12-31 19:06:18
  • Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura, Naoto Hayashi, Osamu Abe, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki

Abstract

Nowadays, mainstream natural language processing (NLP) is empowered by pre-trained language models. In the biomedical domain, only models pre-trained with anonymized data have been published. This policy is acceptable, but there are two questions: Can the privacy policy of language models be different from that of data? What happens if private language models are accidentally made public? We empirically evaluated the privacy risk of language models, using several BERT models pre-trained with the MIMIC-III corpus at different levels of data anonymity and different corpus sizes. We simulated model inversion attacks to obtain the clinical information of target individuals whose full names are already known to attackers. The BERT models were probably low-risk because the Top-100 accuracy of each attack was far below that expected by chance. Moreover, most privacy leakage situations have several common primary factors; therefore, we formalized various privacy leakage scenarios under a universal novel framework named the Knowledge, Anonymization, Resource, and Target (KART) framework. The KART framework helps parameterize complex privacy leakage scenarios and simplifies the comprehensive evaluation. Since the concept of the KART framework is domain agnostic, it can contribute to the establishment of privacy guidelines of language models beyond the biomedical domain.
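To make the Top-100 accuracy comparison concrete, the following is a minimal sketch of how a Top-k accuracy and a chance-level baseline could be computed. It is not the authors' evaluation code. The function names, the toy data, and the uniform-guessing baseline (k distinct random guesses over a candidate set with exactly one correct answer) are all assumptions made for illustration; the abstract does not state how the chance level was derived.

```python
from typing import Sequence


def top_k_accuracy(rankings: Sequence[Sequence[str]],
                   truths: Sequence[str],
                   k: int) -> float:
    """Fraction of targets whose true item is among the first k ranked guesses."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not truths:
        raise ValueError("at least one target is required")
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)


def uniform_chance_level(k: int, n_candidates: int) -> float:
    """Hit rate of k distinct uniformly random guesses (assumed baseline)."""
    if n_candidates <= 0:
        raise ValueError("n_candidates must be positive")
    return min(k, n_candidates) / n_candidates


# Toy data (hypothetical, not taken from the paper): each inner list is an
# attack's ranked guesses for one target individual.
rankings = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"], ["a", "c", "b"]]
truths = ["b", "b", "b", "a"]

acc = top_k_accuracy(rankings, truths, k=2)
chance = uniform_chance_level(k=2, n_candidates=3)
print(f"top-2 accuracy = {acc:.3f}, chance level = {chance:.3f}")
```

Under this assumed baseline, an attack would only suggest leakage if its Top-k accuracy clearly exceeded the chance level; the abstract reports the opposite for the evaluated BERT models.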

 
