With the increasing demand for government-held data, organizations need effective processes and procedures for removing personal information. A vital tool in this regard is data de-identification, which involves removing personal information from a data set or record. It protects people’s privacy because, once properly de-identified, a data set is considered to no longer include personal information.
What is Data De-identification?
Data de-identification is the process of eliminating Personally Identifiable Information (PII) from any document or other media, including an individual’s Protected Health Information (PHI).
De-identifying data is one of the most direct ways to support compliance and protect identities on communication channels that could be accessed by outsiders or the public.
Data de-identification allows information to be utilized by others without the likelihood of individuals being recognized. It may be used to:
- Safeguard the privacy of people and organizations, such as companies
- Build community trust in how agencies store and handle data
- Guarantee that the spatial location of archaeological or mineral findings or endangered species isn’t publicly accessible
- Reduce risk and minimize the damage caused to people from a data breach
When & Why is Data De-identification Important?
The main goal of data de-identification is to safeguard the confidentiality of people. If a record includes any type or amount of personal data, it can’t be considered de-identified.
At the same time, one of the key reasons for releasing de-identified data is to give others a chance to study the characteristics and values of the raw data for research purposes.
For instance, a state-owned educational organization may employ an agency to study the outcomes or influence of educational policy like a recent expansion of state-subsidized kindergarten programs. The investigators would then request access to data needed to conduct their study (such as records showing the number of students enrolled in kindergarten programs over 10 years).
However, before providing access to the records, the educational organization would de-identify the data to prevent individual identities from being exposed in the data provided to the external agency.
Therefore, data de-identification techniques should also focus on preserving as much of the information’s value as possible while safeguarding the privacy of people. This twofold purpose makes de-identification a significant tool in several contexts, including:
- Responding to access to information requests in a privacy-protective manner.
- Supporting improved marketing based upon customer activity data without disclosing information about the individual customers from whom it was gathered.
- Open data initiatives that seek to promote research, innovation, and the development of new applications and services.
- Allowing for groundbreaking energy research with data on energy consumption that won’t disclose the corresponding users.
- Data sharing within and among organizations to break down silos.
- Supporting leading-edge healthcare research with patient information without violating patient privacy.
- Allowing libraries to maintain the privacy of their visitors concerning their reading and viewing activities while maintaining trends and statistics about which items they access and read.
Data De-identification vs Data Tokenization
Data tokenization is a process of substituting personal data with a random token. Often, a link is maintained between the original information and the token (such as for payment processing on sites). Tokens can be completely random numbers or generated by one-way functions (such as salted hashes).
Unlike encrypted data, tokenized data can’t be deciphered or reverse engineered from the token alone, because there’s no mathematical relationship between a random token and its original value. Simply put, a token can only be returned to its initial form by someone with access to the stored link between the two, if one is kept.
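The link between a token and its original value can be pictured as a lookup table. The following is a minimal sketch, assuming a hypothetical in-memory `TokenVault` class (not any real product’s API); a real vault would be a secured, access-controlled store.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it has no mathematical link to the value.
        token = secrets.token_hex(8)
        while token in self._token_to_value:  # extremely unlikely collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only possible with access to the vault's stored mapping.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111 1111 1111 1111"  # made-up sample value
token = vault.tokenize(card)
print(token != card)                     # True
print(vault.detokenize(token) == card)   # True
```

Anyone who holds only the token learns nothing about the card number; recovering it requires the vault itself.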
Data De-identification vs Data Masking
Data masking is a technique that hides information by replacing it with realistic but fake data. The objective is to create a version that can’t be decoded or reverse engineered. There are a number of ways to change the data, including encryption, character shuffling, and word or character replacement.
Masking is usually applied to things that directly identify an individual like their name and phone number. However, the information must remain usable, appear real, and look consistent. For example, in a call center, masking may be used so that operators can’t view credit card numbers in billing systems.
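As a rough illustration of the call center case, the sketch below masks all but the last four digits of a card number while keeping its spacing, so the value still looks like a card number. The function name `mask_card_number` and its parameters are assumptions made for this example.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` digits, keeping spaces and dashes intact."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            masked.append(ch)  # separators and the trailing digits pass through
    return "".join(masked)


print(mask_card_number("4111 1111 1111 1234"))  # **** **** **** 1234
```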
Data De-identification vs Anonymization
Data Anonymization is a kind of data sanitization process that intends to protect the privacy of individuals. It is the process of removing PII from data sets to maintain the anonymity of individuals whom the data describe. It is often the preferred method for making structured medical datasets secure for sharing.
For instance, you can run personal information such as names, addresses, and social security numbers through a data anonymization process that preserves the data but keeps the source anonymous.
Here’s a table that gives you a snapshot of how de-identification, anonymization, tokenization, and masking compare with one another.
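| Technique | Risk of Re-identification | Example Use Case |
| --- | --- | --- |
| De-identification | | Sharing customer service or HR data with third parties |
| Tokenization | | Payment processing on websites |
| Anonymization | | Creating medical records that can be shared across organizations |
| Masking | | User training, sales demos, or software testing |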
Approaches to Data De-identification
When discussing de-identification techniques, it’s important to understand the two kinds of identifiers: direct identifiers and quasi-identifiers (also called indirect identifiers).
- Direct identifiers are variables that can uniquely identify a person, such as names, email addresses, and social security numbers.
- Quasi-identifiers are variables that don’t identify a person on their own but can do so when combined with other information, and that are also useful for data analysis. Examples include dates, demographic information (like race and origin), and socioeconomic variables.
Understanding this difference is significant as the approaches used to secure the identifiers will depend on how they are categorized.
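One way to picture this categorization is a small routine that drops direct identifiers outright and sets quasi-identifiers aside for further treatment. This is only a sketch: the column names and the `split_record` helper are hypothetical, and real data sets need their own classification.

```python
from typing import Dict, Tuple

# Assumed column names, for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "ssn"}
QUASI_IDENTIFIERS = {"birth_date", "race", "income_band"}


def split_record(record: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Drop direct identifiers and separate quasi-identifiers from the other fields."""
    quasi = {k: v for k, v in record.items() if k in QUASI_IDENTIFIERS}
    other = {
        k: v
        for k, v in record.items()
        if k not in DIRECT_IDENTIFIERS and k not in QUASI_IDENTIFIERS
    }
    return quasi, other


record = {
    "name": "Person A",
    "email": "a@example.com",
    "birth_date": "1990-05-17",
    "race": "Group X",
    "diagnosis_code": "D1",
}
quasi, other = split_record(record)
print(quasi)  # {'birth_date': '1990-05-17', 'race': 'Group X'}
print(other)  # {'diagnosis_code': 'D1'}
```

The quasi-identifier fields would then go through techniques such as generalization rather than being deleted, since they still carry analytical value.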
Now let’s take a look at some of the commonly used data de-identification techniques:
- Redacting information, including via pixelation in digital recording and video footage
- Omission (omitting data in the data set such as full names)
- Differential privacy (describing or analyzing the patterns of groups within the data set while concealing data about individuals)
- Aggregating data
- Suppression (removing values from the data set or substituting some values with ‘missing’)
- Data swapping (for instance swap salaries for individuals within the same area, so the aggregate is still valid)
- Coding or pseudonymization (substituting identifiers with unique, temporary IDs or codes)
- Hashing (applying a one-way function to identifiers, ideally with a secret salt or key so that values can’t simply be guessed and re-hashed)
- Micro-aggregation (forming groups with a certain number of observations and replacing the individual values with the group mean; for example, grouping in threes, so ages 21, 22, and 23 each become 22)
- Removing some variables
- Generalization (such as by substituting the exact birth date with a month and a year)
- k-anonymization (treating attributes that indirectly point to a person’s identity as quasi-identifiers, or QIs, and transforming the data so that at least k individuals share each combination of QI values)
- Adding noise (adding a randomly generated value with mean zero and positive variance to the original variable)
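A few of the techniques above can be sketched in a handful of lines. The code below is illustrative rather than production-ready: the function names, the sample records, and the secret key are all assumptions, and the micro-aggregation routine simply averages any leftover group that is smaller than the requested size.

```python
import hashlib
import hmac
from collections import Counter
from datetime import date


def pseudonymize(identifier: str, key: bytes) -> str:
    """Hashing: keyed one-way hash (HMAC-SHA256), shortened to 16 hex characters."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def generalize_birth_date(d: date) -> str:
    """Generalization: keep only the year and month."""
    return d.strftime("%Y-%m")


def micro_aggregate(values, group_size=3):
    """Micro-aggregation: group neighbouring values and replace each with its group mean."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        mean = sum(values[i] for i in idx) / len(idx)
        for i in idx:
            result[i] = mean  # written back to the value's original position
    return result


def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity check: every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


print(generalize_birth_date(date(1990, 5, 17)))  # 1990-05
print(micro_aggregate([21, 22, 23]))             # [22.0, 22.0, 22.0]

records = [  # made-up sample data
    {"age_bracket": "20-29", "zip3": "021"},
    {"age_bracket": "20-29", "zip3": "021"},
    {"age_bracket": "30-39", "zip3": "021"},
]
print(is_k_anonymous(records, ["age_bracket", "zip3"], k=2))  # False
```

The last check fails because only one record falls in the 30-39 bracket; further generalization or suppression of that record would be needed to reach k = 2.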
Data De-identification Tools
De-identification can be an intricate and technically challenging process. However, there are several automated data de-identification tools or software that can facilitate the process. Some of these tools include:
ARX Data Anonymization Tool
ARX is a comprehensive open-source tool that anonymizes sensitive personal information. It supports an extensive range of privacy and risk models, techniques for data transformation, and techniques to analyze the utility of output data. This tool is used in a variety of contexts including research projects, commercial big data analytics platforms, clinical trial information sharing, and for training purposes.
deid software package
The deid software package includes code and dictionaries that automatically locate and remove PHI in free text from medical records. It was developed and tested using a gold standard corpus of over 2,400 nursing notes that were methodically de-identified by a multi-pass process including various automated methods as well as scrupulous reviews by multiple experts working autonomously.
Google Differential Privacy (DP) Library
Google’s DP library offers a set of building blocks that enable developers to build differentially private applications in C++, Java, and Go. It allows developers to compute metrics about how successfully their apps are engaging their consumers, such as Daily Active Users and Revenue per Active User, in a way that helps ensure individual users can’t be identified or re-identified.
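To give a feel for the idea behind differentially private metrics, here is a minimal sketch of the Laplace mechanism applied to a daily-active-users count. It does not use Google’s library or mirror its API; the function names, the sample user IDs, and the choice of epsilon are assumptions, and production libraries handle details (such as secure noise generation) that this sketch ignores.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on zero with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(user_ids, epsilon: float, rng: random.Random) -> float:
    """Count distinct users, then add Laplace noise with scale 1 / epsilon.

    Adding or removing one user changes the count by at most 1 (sensitivity 1).
    """
    true_count = len(set(user_ids))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
daily_active_users = ["u1", "u2", "u2", "u3", "u4"]  # made-up sample data
noisy = private_count(daily_active_users, epsilon=1.0, rng=rng)
print(round(noisy, 2))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon gives more accurate counts with weaker protection.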
Examples of Data De-identification
Let’s take a look at some examples of de-identification.
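[Table: sample records before and after de-identification. Surviving column labels: Age (generalized to age bracket), Previous Country of Residence, Address (omitted street names). Surviving original address fragments: 1 Acorn St, 3 Beale St, 2 Napier Lane.]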
In the table above, we have:
- Replaced individual names with unique codes so that individuals can’t be identified directly.
- Replaced ages with age brackets through generalization.
- Omitted street names to hide part of each person’s current location.
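The three steps can be sketched in code as follows. Only the street addresses come from the example; the names, ages, countries, towns, the ten-year bracket width, and the `P001`-style code format are hypothetical values chosen for illustration.

```python
def age_bracket(age: int, width: int = 10) -> str:
    """Generalize an exact age to a bracket such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def omit_street(address: str) -> str:
    """Drop the street part, assumed to be everything before the first comma."""
    _, _, rest = address.partition(",")
    return rest.strip()


def de_identify(records):
    result = []
    for i, r in enumerate(records, start=1):
        result.append({
            "id": f"P{i:03d}",  # unique code replaces the name
            "age": age_bracket(r["age"]),
            "previous_country": r["previous_country"],
            "address": omit_street(r["address"]),
        })
    return result


records = [  # hypothetical rows; only the street addresses appear in the example
    {"name": "Person A", "age": 34, "previous_country": "Country X", "address": "1 Acorn St, Town A"},
    {"name": "Person B", "age": 27, "previous_country": "Country Y", "address": "3 Beale St, Town B"},
    {"name": "Person C", "age": 61, "previous_country": "Country Z", "address": "2 Napier Lane, Town C"},
]

for row in de_identify(records):
    print(row)
```

Sequential codes are the simplest option, but they can hint at the original row order, so a shuffled or randomly generated code may be preferable in practice.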