Privacy-preserving Data Encoding and Matching (PriDEMatch)


The Challenge

Encoding and matching data is an essential component of any privacy-preserving data analytics application. For example, linking records from datasets held by different organizations requires the data to be encoded, so that the privacy of the individuals represented by those records is preserved while the data remains useful for analytics. More robust, provable privacy protection in the encoding and matching techniques applied to data is therefore needed before individuals’ information can be used legally and ethically in many data-driven applications.

Two categories of techniques have been used for data encoding and matching: cryptographic and probabilistic. Cryptographic techniques are highly accurate, but they are computationally expensive for large-scale data and low-latency applications. Probabilistic techniques, on the other hand, are highly efficient to store, process, and compute on, at the cost of a tunable loss of utility.

We have developed a tool, named PriDEMatch (Privacy-preserving Data Encoding and Matching), that implements probabilistic encoding methods for different applications. It provides a variety of probabilistic techniques to address the challenges of privacy-preserving data applications on big data. Specifically, we combine probabilistic data structures, such as Bloom filters, with differential privacy techniques to develop a suite of efficient and provably privacy-preserving data encoding and matching methods for different types of data. The following figure illustrates an overview of the developed PriDEMatch tool.

[Figure: overview of the PriDEMatch tool]

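As a rough illustration of how a Bloom filter encoding can be combined with a differential privacy mechanism, the sketch below hashes the character bigrams of a string into a bit vector, perturbs the bits with randomized response, and compares encodings with the Dice coefficient. It is a minimal sketch under stated assumptions and not the PriDEMatch implementation: the function names, the filter length, the number of hash functions, and the epsilon value are all hypothetical choices made for this example.

```python
import hashlib
import math
import random


def bigrams(value):
    """Split a string into its set of overlapping character 2-grams."""
    s = value.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bloom_encode(value, num_bits=256, num_hashes=4):
    """Hash each bigram into a Bloom filter held as a list of 0/1 bits.

    num_bits and num_hashes are illustrative defaults (assumptions).
    """
    bits = [0] * num_bits
    for gram in bigrams(value):
        data = gram.encode("utf-8")
        # Double hashing: position_i = (h1 + i * h2) mod num_bits
        h1 = int(hashlib.sha256(data).hexdigest(), 16)
        h2 = int(hashlib.sha512(data).hexdigest(), 16)
        for i in range(num_hashes):
            bits[(h1 + i * h2) % num_bits] = 1
    return bits


def randomized_response(bits, epsilon, rng):
    """Flip each bit independently with probability 1 / (1 + e^epsilon).

    This gives an epsilon guarantee per bit position; the guarantee for a
    whole record depends on how many positions can differ, which this
    sketch does not analyse.
    """
    flip_prob = 1.0 / (1.0 + math.exp(epsilon))
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


def dice(a, b):
    """Dice coefficient between two bit vectors of equal length."""
    common = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * common / total if total else 0.0


rng = random.Random(42)  # fixed seed so the example is repeatable
a = bloom_encode("christine smith")
b = bloom_encode("christina smith")
c = bloom_encode("peter jones")
print("similar names:  ", dice(a, b))
print("different names:", dice(a, c))

# Perturb both encodings before they would be shared for matching.
noisy_a = randomized_response(a, epsilon=3.0, rng=rng)
noisy_b = randomized_response(b, epsilon=3.0, rng=rng)
print("similar names after perturbation:", dice(noisy_a, noisy_b))
```

A smaller epsilon flips more bits, which strengthens privacy and lowers the similarity between encodings of true matches; this is the tunable utility loss described above.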

The Research

We have been actively researching how differential privacy can be combined with probabilistic data structures in the encoding process when matching records from different datasets. Specifically, we explored the following three research themes, designed novel mechanisms in these contexts, and implemented them in the PriDEMatch tool to enhance its performance and capabilities.

  • Implementation of probabilistic encoding methods: Probabilistic data structures have received much attention in both research and practical privacy-preserving applications. PriDEMatch uses Bloom filters, a popular probabilistic method, to encode different types of data. However, other probabilistic data structures, such as Cuckoo filters, are becoming popular because they are highly efficient to store, process, and compute on. This research theme focuses on developing novel probabilistic encoding schemes that can be used with differential privacy to encode different data types.
  • The use of deep learning techniques in the data matching process: Using deep learning (DL) models in a privacy-preserving context is not trivial because of several challenges: (1) data owners need to ensure that training the DL models does not allow an attacker to infer information; (2) all data owners need to have access to training data with similar characteristics; and (3) the complexity of the DL models needs to be optimized. This research theme aims to investigate how DL can be used to link databases while preserving the privacy of the individuals represented by the encoded records.
  • Privacy-Fairness Trade-off: Fairness measures how far a linkage classifier deviates from producing linkage decisions with equal probability for individuals across different groups (for example, the groups ‘male’ and ‘female’ of the attribute gender). As privacy and fairness are not independent of each other, data matching techniques have to balance the trade-off between them. This research theme aims to investigate how fairness constraints can be applied to privacy-preserving data matching techniques.
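
The notion of fairness described in the last theme can be made concrete by comparing the rate of positive linkage decisions across groups. The sketch below computes that rate per group and the largest gap between groups, a quantity often called the demographic parity difference. Whether PriDEMatch uses this particular metric is not stated here; the function names and the toy decisions are assumptions made for illustration only.

```python
from collections import defaultdict


def match_rate_by_group(decisions):
    """Return the fraction of positive linkage decisions for each group.

    decisions: iterable of (group, is_match) pairs, one per classified
    record pair (a hypothetical input format).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for group, is_match in decisions:
        totals[group] += 1
        matches[group] += int(is_match)
    return {group: matches[group] / totals[group] for group in totals}


def parity_gap(decisions):
    """Largest difference in match rate between any two groups (0 = equal)."""
    rates = match_rate_by_group(list(decisions))
    return max(rates.values()) - min(rates.values())


# Toy linkage decisions, invented for this example.
decisions = [
    ("female", True), ("female", True), ("female", False), ("female", True),
    ("male", True), ("male", False), ("male", False), ("male", True),
]
rates = match_rate_by_group(decisions)
gap = parity_gap(decisions)
print(rates)
print("parity gap:", gap)
```

A fairness constraint could then be expressed as an upper bound on this gap. Perturbing the encodings for privacy changes which pairs are classified as matches, which is why the two goals have to be balanced.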


A video demonstration of PriDEMatch is available here.

Related Publications

  • Vaiwsri, Sirintra, Thilina Ranbaduge, and Peter Christen. “Encryption-based sub-string matching for privacy-preserving record linkage.” Journal of Information Security and Applications (2024).
  • Ranbaduge, Thilina, Dinusha Vatsalan, and Ming Ding. “Privacy-preserving Deep Learning based Record Linkage.” IEEE Transactions on Knowledge and Data Engineering (2024).
  • Vidanage, Anushka, Peter Christen, Thilina Ranbaduge, and Rainer Schnell. “A Vulnerability Assessment Framework for Privacy-Preserving Record Linkage.” ACM Transactions on Privacy and Security (2023).
  • Vaiwsri, Sirintra, Thilina Ranbaduge, and Peter Christen. “Accurate and efficient privacy-preserving string matching.” International Journal of Data Science and Analytics (2022).
  • Vidanage, Anushka, Thilina Ranbaduge, Peter Christen, and Rainer Schnell. “A Taxonomy of Attacks on Privacy-Preserving Record Linkage.” Journal of Privacy and Confidentiality (2022).
  • Wu, Nan, Dinusha Vatsalan, Sunny Verma, and Mohamed Ali Kaafar. “Fairness and cost constrained privacy-aware record linkage.” IEEE Transactions on Information Forensics and Security (2022).
  • Vatsalan, Dinusha, Raghav Bhaskar, and Mohamed Ali Kaafar. “Local differentially private fuzzy counting in stream data using probabilistic data structures.” IEEE Transactions on Knowledge and Data Engineering (2022).
  • Vatsalan, Dinusha, Raghav Bhaskar, Aris Gkoulalas-Divanis, and Dimitrios Karapiperis. “Privacy preserving text data encoding and topic modelling.” In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2021.
  • Yu, Joyce, Jakub Nabaglo, Dinusha Vatsalan, Wilko Henecka, and Brian Thorne. “Hyper-parameter optimization for privacy-preserving record linkage.” In ECML PKDD 2020 Workshops: Workshops of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020).
  • Vatsalan, Dinusha, Joyce Yu, Wilko Henecka, and Brian Thorne. “Fairness-aware privacy-preserving record linkage.” In Data Privacy Management, Cryptocurrencies and Blockchain Technology: ESORICS 2020 International Workshops, 2020.