Is Your Model Sensitive? SPeDaC: A New Benchmark for Detecting and Classifying Sensitive Personal Data

Abstract

In recent years we have seen exponential growth in applications, including dialogue systems, that handle sensitive personal information. This has brought to light the extremely important issue of personal data protection in virtual environments. First, an effective model should be able to distinguish sentences with sensitive content from neutral sentences. Second, it should be able to identify the category of personal data they contain, so that a different privacy treatment can be considered for each category. Although the literature includes work on automatic sensitive data identification, it is often conducted on different domains or languages without a common benchmark. To fill this gap, we introduce SPeDaC, a new annotated benchmark for the identification of sensitive personal data categories. Furthermore, we provide an extensive evaluation of our dataset, conducted using different baselines and a classifier based on RoBERTa, a neural architecture that achieves strong performance on both the detection of sensitive sentences and the classification of personal data categories.
