Data De-Identification

With the increasing demand for government-held data, organizations need effective processes and procedures for removing personal information. A vital tool in this regard is data de-identification, which involves removing personal information from a data set or record. It protects people’s privacy because, after de-identification, a data set is considered to no longer include personal information.

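In this article, we are going to discuss:

  • What is Data De-identification?
  • When & Why is Data De-identification Important?
  • Data De-identification vs Data Tokenization
  • Data De-identification vs Data Masking
  • Data De-identification vs Anonymization
  • Approaches to Data De-identification
  • Data De-identification Tools
  • Examples of Data De-identification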

What is Data De-identification?

Data de-identification is the process of eliminating Personally Identifiable Information (PII) from any document or other media, including an individual’s Protected Health Information (PHI).

De-identifying data is often one of the most direct ways to support compliance and protect identities when data travels over channels that outsiders or the public could access.

Data de-identification allows information to be used by others with little risk of individuals being identified. It may be used to:

  • Safeguard the privacy of people and organizations, such as companies
  • Build community trust in how agencies store and handle data
  • Guarantee that the spatial location of archaeological or mineral findings or endangered species isn’t publicly accessible
  • Reduce risk and minimize the damage caused to people from a data breach

When & Why is Data De-identification Important?

The main goal of data de-identification is to safeguard the confidentiality of people. If a record includes any type or amount of personal data, it can’t be considered de-identified. 

At the same time, one of the key reasons for releasing de-identified data is to give others a chance to study the characteristics and values of the raw data for research purposes. 

For instance, a state-owned educational organization may employ an agency to study the outcomes or influence of educational policy like a recent expansion of state-subsidized kindergarten programs. The investigators would then request access to data needed to conduct their study (such as records showing the number of students enrolled in kindergarten programs over 10 years). 

However, before providing access to the records, the educational organization would de-identify the data to prevent individual identities from being exposed in the data provided to the external agency.

Therefore, data de-identification techniques should also focus on preserving as much value in the information as possible while safeguarding the privacy of people. This twofold purpose makes de-identification a significant tool in several contexts, including:

  • Responding to access to information requests in a privacy-protective manner. 
  • Supporting improved marketing based on customer activity data without disclosing information about the individual customers from whom the data was gathered.
  • Open data initiatives that seek to promote research, innovation, and the development of new applications and services. 
  • Allowing for groundbreaking energy research with data on energy consumption that won’t disclose the corresponding users.
  • Data sharing within and among organizations to break down silos.
  • Supporting leading-edge healthcare research with patient information without violating patient privacy.
  • Allowing libraries to protect the privacy of their visitors’ reading and viewing activities while still tracking trends and statistics about which items are accessed and read.

Data De-identification vs Data Tokenization

Data tokenization is the process of substituting personal data with a token. Often, a link is maintained between the original information and the token (such as for payment processing on websites). Tokens can be completely random values or generated by one-way functions (such as salted hashes).

Unlike encrypted data, tokenized data can’t be deciphered with a key. A random token has no mathematical relationship to the original value, so it can’t be reverse engineered from the token alone; the original can be recovered only by looking it up through the maintained link, where one exists.
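To make the idea concrete, here is a minimal sketch of tokenization with random tokens, assuming the link between token and original value is kept in a simple in-memory mapping. The `TokenVault` class and its method names are hypothetical and only for illustration; a real deployment would keep this mapping in a secured, access-controlled store.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # random; not derived from the value
        while token in self._token_to_value:  # regenerate on an unlikely collision
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a party with access to the vault can recover the original value.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111111111111111"  # made-up example value
token = vault.tokenize(card)
```

Because the token comes from a random generator rather than from the card number, nothing about the original can be computed from the token itself; recovery is a lookup, not a decryption.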

Data De-identification vs Data Masking

Data masking is a technique that hides sensitive information by replacing it with realistic but fake data. The objective is to create a version that can’t be decoded or reverse engineered. There are a number of ways to change the data, including encryption, character shuffling, and word or character replacement.

Masking is usually applied to things that directly identify an individual like their name and phone number. However, the information must remain usable, appear real, and look consistent. For example, in a call center, masking may be used so that operators can’t view credit card numbers in billing systems.
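The call center case can be sketched as character replacement that hides all but the last few digits of a card number. The function name `mask_card_number` and its parameters are assumptions made for this illustration, and the card number is made up.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones, keeping separators in place."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            masked.append(ch)  # separators and the trailing digits pass through
    return "".join(masked)


print(mask_card_number("4111 1111 1111 1234"))  # **** **** **** 1234
```

Keeping the separators and the length means the masked value still looks like a card number, which is what lets it remain usable on an operator’s screen.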

Data De-identification vs Anonymization

Data Anonymization is a kind of data sanitization process that intends to protect the privacy of individuals. It is the process of removing PII from data sets to maintain the anonymity of individuals whom the data describe. It is often the preferred method for making structured medical datasets secure for sharing. 

For instance, you can run personal information such as names, addresses, and social security numbers through a data anonymization process that preserves the data but keeps the source anonymous.

Here’s a table that gives you a snapshot of how de-identification, anonymization, tokenization, and masking compare with one another.

 

| Technique | Risk of Re-identification | Naturalness | Example Use Case |
| --- | --- | --- | --- |
| De-identification | Very low | Low | Sharing customer service or HR data with third parties |
| Tokenization | Low | Very low | Payment processing on websites |
| Anonymization | Very low | Low | Creating medical records that can be shared across organizations |
| Masking | Very low | Very low | User training, sales demos, or software testing |

 

Approaches to Data De-identification

When discussing de-identification techniques, it’s important to understand the two kinds of identifiers: direct identifiers and quasi-identifiers (also called indirect identifiers). 

  • Direct identifiers are variables that can uniquely identify a person, such as names, email addresses, and social security numbers. 
  • Quasi-identifiers are variables that don’t identify a person on their own but can do so when combined with other data, and that are often valuable for data analysis. Examples include dates, demographic information (such as race and ethnic origin), and socioeconomic variables.

Understanding this difference is important because the approach used to protect an identifier depends on how it is categorized.

Now let’s take a look at some of the commonly used data de-identification techniques:

  • Redacting information, including via pixelation in digital recording and video footage
  • Omission (omitting data in the data set such as full names)
  • Differential privacy (describing or analyzing the patterns of groups within the data set while concealing data about individuals)
  • Aggregating data
  • Suppression (removing values from the data set or substituting some values with ‘missing’)
  • Data swapping (for instance, swapping salaries between individuals within the same area, so the aggregate is still valid)
  • Coding or pseudonymization (substituting identifiers with unique, temporary IDs or codes)
  • Hashing (applying a one-way function to identifiers)
  • Micro-aggregation (forming groups with a certain number of observations and replacing the individual values with the group mean; for example, with groups of three, ages 21, 22, and 23 each become 22)
  • Removing some variables
  • Generalization (such as by substituting the exact birth date with a month and a year)
  • k-anonymization (treating attributes that indirectly point to a person’s identity as quasi-identifiers (QI) and transforming the data so that at least k individuals share each combination of QI values)
  • Adding noise (perturbing the original variable by adding a random variable with mean zero and positive variance)
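Three of the techniques in this list can be sketched in a few lines: generalization, micro-aggregation, and a k-anonymity check. The function names and sample records are hypothetical, and the micro-aggregation shown is a simplified version that sorts values and groups them in fixed-size chunks.

```python
from collections import Counter
from datetime import date
from statistics import mean


def generalize_birth_date(d: date) -> str:
    """Generalization: keep only the year and month of a birth date."""
    return f"{d.year:04d}-{d.month:02d}"


def micro_aggregate(values, group_size=3):
    """Micro-aggregation: sort, group, and replace each value with its group mean.

    Simplification: a trailing group may be smaller than `group_size`;
    fuller implementations usually merge it into the previous group.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        group_mean = mean(values[i] for i in group)
        for i in group:
            result[i] = group_mean
    return result


def is_k_anonymous(records, quasi_identifiers, k):
    """k-anonymity check: every QI combination must occur at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Made-up sample data for illustration.
records = [
    {"age": "20-25", "city": "Boston"},
    {"age": "20-25", "city": "Boston"},
    {"age": "30-35", "city": "Memphis"},
]
print(generalize_birth_date(date(1990, 7, 14)))       # 1990-07
print(micro_aggregate([23, 21, 22]))                  # [22, 22, 22]
print(is_k_anonymous(records, ["age", "city"], k=2))  # False
```

The last check fails because the Memphis record is the only one with its combination of quasi-identifier values; generalizing further or suppressing that record would be typical ways to reach k = 2.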

Data De-identification Tools

De-identification can be an intricate and technically challenging process. However, there are several automated data de-identification tools or software that can facilitate the process. Some of these tools include:

ARX Data Anonymization Tool

ARX is a comprehensive open-source tool that anonymizes sensitive personal information. It supports an extensive range of privacy and risk models, techniques for data transformation, and techniques to analyze the utility of output data. This tool is used in a variety of contexts including research projects, commercial big data analytics platforms, clinical trial information sharing, and for training purposes.

deid software package

The deid software package includes code and dictionaries that automatically locate and remove PHI in free text from medical records. It was developed and tested using a gold standard corpus of over 2,400 nursing notes that were methodically de-identified by a multi-pass process including various automated methods as well as scrupulous reviews by multiple experts working autonomously.

Google Differential Privacy (DP) Library

Google’s DP library offers a set of building blocks that enable developers to build differentially private applications in C++, Java, and Go. It allows developers to easily access metrics related to how successfully their apps are engaging their consumers, such as Daily Active Users and Revenue per Active User, in a way that helps ensure individual users can’t be identified or re-identified.
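To illustrate the underlying idea only, here is a sketch of a differentially private count using the Laplace mechanism in plain Python. It does not use or reproduce the Google library’s API; the function names are hypothetical. Naive floating-point noise sampling like this has known weaknesses, so production systems should rely on a vetted library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count.

    A count has sensitivity 1 (adding or removing one person changes it
    by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1 / epsilon, rng)


rng = random.Random(42)  # seeded only to make the demo repeatable
noisy_daily_active_users = dp_count(1000, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy; a larger one keeps the reported metric closer to the true value.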

Examples of Data De-identification

Let’s take a look at some examples of de-identification.

| Name (Coding) | Age (Generalized to age bracket) | Previous Country of Residence (Unchanged) | Current Location (Omitted street names) |
| --- | --- | --- | --- |
| Jane Black → #12345 | 21 → 20-25 | China | 1 Acorn St, Boston → Boston |
| Tony Richards → #23456 | 34 → 30-35 | Scotland | 3 Beale St, Memphis → Memphis |
| John Doe → #45676 | 45 → 45-50 | Germany | 2 Napier Lane, San Francisco → San Francisco |

In the table above, we have:

  • Replaced individual names with unique codes so that they become unidentifiable.
  • Replaced ages with age brackets through the process of generalization.
  • Omitted street names to hide part of each person’s current location.
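The three transformations above can be sketched as follows. The function names are hypothetical, the code assumes each location is written as "street, city", and the five-year brackets mirror the table (including its shared boundaries, such as 45 falling in 45-50).

```python
def age_bracket(age: int, width: int = 5) -> str:
    """Generalization: map an exact age to a bracket such as 20-25."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"


def drop_street(location: str) -> str:
    """Omission: keep only the part after the last comma (assumed to be the city)."""
    return location.rsplit(",", 1)[-1].strip()


def de_identify(record: dict, code: str) -> dict:
    """Apply coding, generalization, and omission to one record."""
    return {
        "name": code,  # coding: the name is replaced by a separately assigned code
        "age": age_bracket(record["age"]),
        "country": record["country"],  # left unchanged
        "location": drop_street(record["location"]),
    }


jane = {"name": "Jane Black", "age": 21, "country": "China",
        "location": "1 Acorn St, Boston"}
print(de_identify(jane, "#12345"))
# {'name': '#12345', 'age': '20-25', 'country': 'China', 'location': 'Boston'}
```

The code is passed in rather than computed from the name, so the mapping between names and codes can be stored separately from the released data.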

Data De-Identification With Satori

To learn more about how Satori helps you solve privacy, security and governance challenges, visit our product page.
