Privacy-Preserving Access to User Data
Organisations are increasingly collecting and using customer data for different purposes. Enabling access to this data to other organisations can potentially result in a mutually beneficial scenario, not only for the organisations involved but also for the customers who can benefit from improved services. However, enabling access to user data without any privacy mechanism may result in breach of user privacy, including (but not limited to) re-identification of the users in the dataset.
This project aims at assessing privacy risks (such as re-identification) if datasets are released to the public even when identifying information is completely removed, and developing provable privacy-preserving algorithms for access to user data (either online or offline).
How can an organisation tell if a certain dataset has the risk of re-identification without knowing a priori what relevant background information is already public? We model a re-identification algorithm as an algorithm that can link individuals in two different snapshots of the dataset (taken at different times). We then show that any non-trivial re-identification algorithm can only get a few of these links correct. We then aggregate the dataset to different granularities to check if the re-identification algorithm has an advantage close to a trivial algorithm or not. Based on this analysis an organisation can determine at least a lower bound on the privacy risks associated with a purported release of a dataset. This exercise also sheds light on the limitation of syntactic anonymity algorithms such as k-anonymity.
Differentially Private Data Access
Differential privacy is a formal notion of privacy. An algorithm is said to be differentially private if the likelihood ratio of observing a certain result when given as input a dataset without a user’s data and a dataset with the user’s data is bounded (by a small number). This implies that the user does not incur any privacy risk by taking part in the dataset. Many differentially private algorithms have been proposed in the literature. Quite often they also come with provable utility guarantees, which means that the result of any (specific) analysis will remain similar to the original dataset.
Based on the size of the dataset, the type of analysis (class of queries), and the size of the data universe, many of the proposed differentially private algorithms can become prohibitive. Furthermore, depending on the required attributes of the dataset to be released, applying a generic differentially private algorithm may turn out to be an overkill (unnecessarily consuming the privacy budget). The aim of this project is to develop custom differentially private algorithms for release of and/or access to user data. Our partners include Australian Bureau of Statistics (ABS), Department of Social Services (DSS) and Transport for New South Wales (TfNSW).
Our main sub-projects are
- Development of a differentially private algorithm that allows online access to data while minimising the loss in privacy budget as a function of the number of queries asked
- Development of a differentially private algorithm for data release that is computationally efficient as a function of the size of the data universe
- Hassan Jameel Asghar, Data61-CSIRO.
- David Smith, Data61-CSIRO.
- Sirine Mrabet, Data61-CSIRO.
- Ming Ding, Data61-CSIRO.
- Paul Tyler, Data61-CSIRO.
- Kanchana Thilakarathna, Data61-CSIRO.
- Thierry Rakotoarivelo, Data61-CSIRO.
- Dali Kaafar, Data61-CSIRO.
- David Smith et al. 2017, The Application of Mixture Distributions in Query release: Improving Usefulness while Maintaining Differential Privacy. Technical Report (under submission).