Sharing open data is fundamental to the advancement of science and technology, and it is essential for carrying out research and developing new applications that benefit society. However, these datasets must be shared with appropriate safeguards in place, especially when they contain personal data.
Data anonymization: privacy protection
Data anonymization consists of applying a set of techniques that protect the privacy of personal data, and it is particularly useful when the data is going to be shared with other parties. The goal is to reduce the risk of reidentifying the people who appear in the data. In this context, identification has to be understood in a broad sense: it refers not only to discovering a person’s name or ID card number, but also to deducing someone’s identity because that person has certain unique characteristics (for example, the combination of their birth date and postal code).
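As a rough illustration of how a combination of characteristics can single someone out, the following Python sketch counts which birth date and postal code pairs occur only once in a dataset. The records, field names and the `unique_combinations` helper are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical toy records; all values are invented for illustration.
records = [
    {"birth_date": "1985-03-12", "postal_code": "36201"},
    {"birth_date": "1985-03-12", "postal_code": "36201"},
    {"birth_date": "1990-07-01", "postal_code": "36202"},
    {"birth_date": "1978-11-23", "postal_code": "36201"},
]

def unique_combinations(rows, quasi_identifiers):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]

# Each combination listed here points to a single person in the dataset.
unique = unique_combinations(records, ["birth_date", "postal_code"])
print(unique)
```

Any combination returned by this check identifies exactly one record, even though no name or ID card number is present.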
Several techniques can be applied to anonymize data adequately. These techniques modify the contents of the data, typically by changing some values (e.g., by adding noise that distorts a numerical value) or by deleting them. In general, the more we change the data, the stronger our privacy guarantees will be, but if we change it too much, it could completely lose its value for the studies and applications mentioned above. This problem is commonly known as the privacy-utility trade-off, and finding its optimal point is not trivial.
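The trade-off can be sketched with a small Python example that perturbs numerical values with Gaussian noise and measures how far the result drifts from the original. This is only an assumption-laden illustration: the `add_noise` and `mean_absolute_error` helpers, the noise levels and the sample ages are hypothetical, and mean absolute error is just one possible proxy for utility loss.

```python
import random

def add_noise(values, sigma, seed=0):
    """Perturb each value with zero-mean Gaussian noise of standard deviation sigma."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.gauss(0, sigma) for v in values]

def mean_absolute_error(original, perturbed):
    """Average distance between original and perturbed values (a simple utility-loss proxy)."""
    return sum(abs(o - p) for o, p in zip(original, perturbed)) / len(original)

ages = [23, 35, 41, 58, 62]  # hypothetical values

# More noise hides the true values better, but distorts the data more.
errors = {
    sigma: mean_absolute_error(ages, add_noise(ages, sigma))
    for sigma in (0, 1, 5, 20)
}
for sigma, error in errors.items():
    print(f"sigma={sigma}: mean absolute error = {error:.2f}")
```

With no noise the data is untouched and offers no protection, while the error grows as the noise level increases. Choosing the noise level is where the balance between privacy and utility has to be struck.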
It is also important to highlight that the anonymization process should be irreversible. Simply deleting an identifier (such as an ID card number) or replacing it with another value (which is known as pseudonymization) is not enough to guarantee privacy. To anonymize data correctly, it is necessary to treat the dataset as a whole and understand the risks it might be subject to. If, ideally, we were able to transform a dataset containing personal data to such an extent that it was completely anonymized, its contents would no longer be considered personal data because, in theory, it would be impossible to undo the anonymization process or to perform any kind of reidentification.
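To see why pseudonymization alone falls short, consider the following minimal Python sketch. The `pseudonymize` helper, the field names and the records are hypothetical. It replaces the ID card number with a random token, yet whoever holds the lookup table can reverse the process, and the remaining attributes are left exactly as they were.

```python
import secrets

def pseudonymize(rows, id_field):
    """Replace id_field with a random token; return the new rows and the lookup table."""
    mapping = {}
    out = []
    for row in rows:
        # Reuse the same token if the identifier has been seen before.
        token = mapping.setdefault(row[id_field], secrets.token_hex(8))
        out.append({**row, id_field: token})
    return out, mapping

# Hypothetical records; all values are invented for illustration.
records = [
    {"id_card": "00000001A", "birth_date": "1990-07-01", "postal_code": "36202"},
    {"id_card": "00000002B", "birth_date": "1978-11-23", "postal_code": "36201"},
]

pseudo, mapping = pseudonymize(records, "id_card")

# Anyone holding the lookup table can undo the substitution.
reverse = {token: original for original, token in mapping.items()}
recovered = [reverse[row["id_card"]] for row in pseudo]

# The other attributes are untouched, so they can still single a person out.
untouched = all(
    p["birth_date"] == r["birth_date"] and p["postal_code"] == r["postal_code"]
    for p, r in zip(pseudo, records)
)
print(recovered, untouched)
```

The substitution hides the identifier at first glance, but it is reversible and does nothing about the characteristics that make a record unique.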
Technological advances in data anonymization
How can we ensure that our data is truly anonymous and that there are no risks to privacy? The answer to this question is simple: there are no absolute guarantees. Even if we were able to perform a perfect anonymization of our data, some aspects will always be out of our control. For example, in a few years another dataset might be released and, if combined with our anonymized dataset, it could lead to the reidentification of a person who was theoretically anonymous. Therefore, the best option is to take a risk-based approach: adapt the anonymization process to each particular case, apply state-of-the-art techniques that offer the best compromise currently available between privacy and utility, and periodically measure the risk to which our data is exposed.
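The scenario of a later dataset being combined with ours can be sketched in Python as a simple join on shared attributes. Everything here is hypothetical: the `link` helper, the field names, and both toy datasets are invented for illustration, and real linkage attacks may rely on approximate rather than exact matches.

```python
def link(anonymized, auxiliary, keys):
    """Pair each anonymized row with the auxiliary row it matches, if that match is unique."""
    index = {}
    for aux in auxiliary:
        index.setdefault(tuple(aux[k] for k in keys), []).append(aux)
    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a single candidate means the person is reidentified
            matches.append((row, candidates[0]))
    return matches

# Our released dataset: no names, but birth date and postal code were kept.
anonymized = [
    {"birth_date": "1990-07-01", "postal_code": "36202", "diagnosis": "asthma"},
    {"birth_date": "1985-03-12", "postal_code": "36201", "diagnosis": "diabetes"},
]

# A hypothetical dataset published later by someone else, this time with names.
auxiliary = [
    {"name": "Person A", "birth_date": "1990-07-01", "postal_code": "36202"},
    {"name": "Person B", "birth_date": "1985-03-12", "postal_code": "36201"},
    {"name": "Person C", "birth_date": "1985-03-12", "postal_code": "36201"},
]

reidentified = link(anonymized, auxiliary, ["birth_date", "postal_code"])
for row, person in reidentified:
    print(person["name"], "->", row["diagnosis"])
```

In this toy case one record is linked to a single named person, while the other remains ambiguous because two people share the same attributes. Rerunning a check of this kind whenever new external data appears is one simple way to measure risk periodically.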
Gradiant provides advanced anonymization solutions that automate these complex tasks, along with metrics to measure the risk of the data and its utility after the anonymization process. Gradiant is currently participating in the H2020 project INFINITECH, researching new advanced techniques for anonymizing personal data, including geolocated data.
Author: Lilian Adkinson, head of Security & Privacy Analytics at Gradiant