Utility-driven assessment of anonymized data via clustering
Ferrão, M. E.
Sousa, Paula Prata
Scientific Data, Vol. 9, No. 456, pp. 1-11, July 2022.
ISSN (print): 2052-4463
Scimago Journal Ranking: 2.41 (in 2022)
Digital Object Identifier: 10.1038/s41597-022-01561-6
In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This
approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law
students. Several anonymized clustering scenarios were compared against the original cluster solution.
The clustering techniques were explored as data utility models in the context of data anonymization,
using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess the utility of
anonymized data through standard metrics, the characteristics of the groups obtained, and the relative risk (a
relevant metric in social sciences research). For the sake of self-containment, we present an overview
of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed
several clustering validity indices to understand to what extent the data structure is preserved, or not,
after data anonymization. The results suggest that for low dimensionality/cardinality datasets the
anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that
relevant field-of-study estimates obtained from anonymized data are biased.
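
As a rough illustration of two notions mentioned above, the following minimal Python sketch checks k-anonymity over a set of quasi-identifiers and computes a relative risk. It is not the procedure used in the study: the toy records, the field names (age, sex, dropout), the 10-year age generalization, and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on all quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())


def generalize_age(record, width=10):
    """Replace the exact age with an interval of the given width
    (one simple, assumed form of generalization)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


def relative_risk(records, exposed, outcome):
    """Risk of the outcome among exposed records divided by the risk
    among unexposed records."""
    exp = [r for r in records if exposed(r)]
    unexp = [r for r in records if not exposed(r)]
    risk_exp = sum(outcome(r) for r in exp) / len(exp)
    risk_unexp = sum(outcome(r) for r in unexp) / len(unexp)
    return risk_exp / risk_unexp


# Hypothetical toy records, not taken from the study's dataset.
students = [
    {"age": 18, "sex": "F", "dropout": 0},
    {"age": 19, "sex": "F", "dropout": 0},
    {"age": 18, "sex": "M", "dropout": 1},
    {"age": 19, "sex": "M", "dropout": 0},
    {"age": 27, "sex": "F", "dropout": 1},
    {"age": 25, "sex": "F", "dropout": 1},
    {"age": 26, "sex": "M", "dropout": 1},
    {"age": 28, "sex": "M", "dropout": 0},
]

quasi_identifiers = ["age", "sex"]
anonymized = [generalize_age(r) for r in students]

k_original = k_anonymity(students, quasi_identifiers)
k_anonymized = k_anonymity(anonymized, quasi_identifiers)

# Relative risk of dropout for students aged 20 or older versus younger ones,
# computed on the original (non-anonymized) toy records.
rr_original = relative_risk(
    students,
    exposed=lambda r: r["age"] >= 20,
    outcome=lambda r: r["dropout"],
)

print(k_original, k_anonymized, rr_original)
```

In this toy setting, generalizing age raises k because records that were unique on (age, sex) now share an age interval. The same generalization also discards the exact ages, which is the kind of information loss whose effect on clustering and on estimates such as the relative risk the study sets out to assess.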