Re-Identification Risk Quantification
The Challenge
More and more, organisations are collecting data about their users and customers. This data is then fed into sophisticated analytics, including machine learning algorithms, to unlock insightful information leading to higher value services and products.
The question is how organisations can provide safe access to this data internally, or even share it externally for societal or commercial benefit. The same question applies to different organisations safely sharing data with one another, which they have a strong incentive to do.
Most data custodians recognise the privacy and confidentiality risks in using and sharing their data both within and outside their organisations. However, there is no consistent and repeatable methodology or related tool for data custodians to confidently measure and understand the level of such risks in their data for the purpose of sharing or releasing it.
Our Response
We have designed quantitative and qualitative privacy and confidentiality risk methodologies, with appropriate assessment metrics and frameworks, to understand the risks of sharing or releasing data, or even of just providing access to a wider internal audience. These tools draw on information theory and stochastic models to estimate the residual risks associated with sharing sensitive data.
For example, one of our metrics allows the measurement of re-identification risks for an individual, an event, or a transaction based on factors such as uniqueness, uniformity and/or linkability. Another one of our metrics quantifies the risk of deducing a non-reported value in an aggregated data report.
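To make these two ideas concrete, here is a minimal sketch in Python. It is not the metric implemented in R4, whose details are not given here. It assumes a simplified uniqueness measure (risk of 1/k for a record whose quasi-identifier values are shared by k records) and a simple differencing calculation for a single suppressed cell in a report that publishes a total. The function names, field names and figures are hypothetical.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Assumed uniqueness metric: risk is 1/k, where k is the number of
    records sharing the same values for the chosen quasi-identifiers."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]


def deduce_suppressed(total, reported_values):
    """If a report publishes a total and all but one component,
    the non-reported component follows by subtraction."""
    return total - sum(reported_values)


# Hypothetical unit-record data
records = [
    {"postcode": "2000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2000", "birth_year": 1975, "sex": "M"},
    {"postcode": "2600", "birth_year": 1990, "sex": "F"},
]

risks = reidentification_risk(records, ["postcode", "birth_year", "sex"])
unique_fraction = sum(1 for r in risks if r == 1.0) / len(risks)
print(risks)            # [0.5, 0.5, 1.0, 1.0]
print(unique_fraction)  # 0.5

# Hypothetical aggregated report: total of 100, two of three cells reported
hidden = deduce_suppressed(100, [40, 35])
print(hidden)           # 25
```

A record with risk 1.0 is unique on the chosen attributes, so anyone who knows those attributes about a person could single out that record.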
We have also developed software, such as our Re-identification Risk Ready Reckoner (R4), to implement these metrics and methodologies. R4 generates quantifiable risk assessments that display on a working dashboard and provides data treatment options, such as binning and perturbation, to help data custodians mitigate these risks before re-assessing the risk in the treated data.
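The assess, treat and re-assess loop can be illustrated with a short Python sketch. It does not use R4 or its API. The helper names, bin width and noise scale are assumptions chosen for illustration, and the fraction of unique values stands in for a fuller risk metric.

```python
import random
from collections import Counter


def fraction_unique(values):
    """Share of values that occur exactly once (a simple risk proxy)."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


def bin_value(value, width):
    """Binning: generalise a number to the lower bound of its bin."""
    return (value // width) * width


def perturb(value, scale, rng):
    """Perturbation: add uniform noise drawn from [-scale, scale]."""
    return value + rng.uniform(-scale, scale)


ages = [23, 27, 31, 34, 38, 45, 47, 52]  # hypothetical attribute

before = fraction_unique(ages)             # every age is distinct
binned = [bin_value(a, 10) for a in ages]  # 10-year bins
after = fraction_unique(binned)            # re-assess after treatment

rng = random.Random(0)                     # seeded for repeatability
noisy = [perturb(a, 2.0, rng) for a in ages]

print(before)  # 1.0
print(binned)  # [20, 20, 30, 30, 30, 40, 40, 50]
print(after)   # 0.125
```

Binning lowers the risk proxy at the cost of precision, which is the risk and utility trade-off a data custodian has to weigh when choosing a treatment.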
The Results
Our work is improving awareness of privacy and confidentiality risk in data and helping in the management of that risk across the data ecosystem.
Our privacy and confidentiality risk frameworks and R4 software have been used extensively in several commercial engagements, identifying and measuring re-identification risks in so-called de-identified data pending release (or in some cases already released), as well as inference risks of not-reported data in confidential financial reports.
Through these engagements, we have observed cases where data custodians adjusted their approach to making data available because they better appreciated the risk it carries. In other cases, guided by our framework, data custodians applied targeted transformations to the data to reduce the residual risks, while still maintaining an acceptable level of utility, before releasing it.
Watch a video demonstration of R4 at: https://youtu.be/jcckoxpMCqo?start=873
Applied in the Real World
R4 or its underlying technology has been used in government and industry engagements to help assess risks in real situations, including:
- the Office of the Australian Information Commissioner (OAIC) reports on the MBS/PBS privacy breach
- the Office of the Victorian Information Commissioner (OVIC) report on the myki privacy breach
- the CRC-P round 7 project: Privacy-Preserving Analytics for the Education Technology Industry
- the Queensland Office of the Information Commissioner tabled report on released QLD datasets
R4 in Detail
The Re-identification Risk Ready Reckoner (R4) is a tool that:
- helps data custodians to understand the re-identification risk (RIR) of a dataset,
- offers users data treatment options, such as binning and perturbation, to mitigate that risk, and
- generates quantifiable RIR assessments that display on a working dashboard.
R4 can assess the risk of re-identification in both unit-record and event-based datasets. Through analysis of the data, an R4 user can come to understand those parts of the dataset that have greatest risk of re-identification by examining the potential of different combinations of background knowledge to be used for a re-identification.
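One way to picture this kind of analysis is to enumerate attribute combinations an adversary might know and measure how many records each combination singles out. The Python sketch below is an illustration only, not R4's algorithm. The dataset, attribute names and the `max_size` limit are hypothetical.

```python
from collections import Counter
from itertools import combinations


def unique_fraction(records, attrs):
    """Share of records that are unique on the given attributes."""
    keys = [tuple(r[a] for a in attrs) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def rank_background_knowledge(records, attributes, max_size=2):
    """Score every attribute combination up to max_size, riskiest first."""
    results = []
    for size in range(1, max_size + 1):
        for attrs in combinations(attributes, size):
            results.append((attrs, unique_fraction(records, attrs)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical unit-record data
records = [
    {"postcode": "2000", "birth_year": 1980, "sex": "F"},
    {"postcode": "2000", "birth_year": 1980, "sex": "M"},
    {"postcode": "2000", "birth_year": 1975, "sex": "M"},
    {"postcode": "2600", "birth_year": 1975, "sex": "F"},
]

ranking = rank_background_knowledge(
    records, ["postcode", "birth_year", "sex"]
)
for attrs, score in ranking:
    print(attrs, score)
# First line printed: ('birth_year', 'sex') 1.0
```

In this toy dataset no single attribute singles out more than a quarter of the records, yet birth year and sex together make every record unique, which is why combinations need to be examined rather than attributes in isolation.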
R4 also provides a REST API to allow integration with other systems.
News
- Paul Tyler invited to speak at a webinar organised by the Office of the Information Commissioner, Queensland
- OIC Queensland report based on D61 analysis and report
- OIC Victoria report on the release of myki data based on D61 risk assessment
- D61 contributing to the OAIC investigation report on MBS/PBS data publication
Related Publications
- D. Vatsalan, T. Rakotoarivelo, R. Bhaskar, P. Tyler, D. Ladjal, “Privacy risk quantification in education data using Markov model”, British Journal of Educational Technology 53 (4), 804-821, 2022.
- I. Muhammad, S. Shehroz, E. DeCristofaro, M.A. Kaafar, G. Jourjon, Z. Shafiq, “Measuring, Characterising and Detecting Facebook Like Farms”, ACM Transactions on Privacy and Security (TOPS), 2017.