How K Anonymity Preserves Data Privacy

Customers today entrust organizations with their personal information, which is used to provide them with better services while enhancing the company’s decision-making. However, a lot of the value in this information still goes unused.

This information could help third-party analysts and researchers answer questions ranging from urban planning to curing deadly diseases. Companies therefore often want to share it with other parties without compromising their customers’ confidentiality, while keeping the data useful enough to support accurate analysis.

So, how do you publicly release a database without compromising individual privacy? Many data owners simply remove unique identifiers such as SSN and name and hope that this is enough. It is not.

According to Prof. Sweeney, the combination of date of birth, gender, and zip code is enough to uniquely identify about 87% of the US population in publicly accessible databases. A real privacy guarantee has to be established mathematically, and this is where K Anonymity helps.

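In this article, we’re going to talk about:

  • What is K Anonymity?
  • K Anonymization Examples
  • K Anonymity For Protecting Privacy
  • K Anonymity Implementation
  • K Anonymity vs L-Diversity
  • K Anonymity vs Differential Privacy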

What is K Anonymity?

K Anonymity is a privacy model commonly applied to protect subjects’ confidentiality when data is shared, by anonymizing the data. In this model, the attributes that could identify a person are suppressed or generalized until every row is identical, on those attributes, to at least (K-1) other rows. At this point, the database is said to be K Anonymous.

Figure: K6 example (making subjects indistinguishable)

K Anonymization Examples

Let’s take a look at an example of how K Anonymization works.

Table 1 shows fictitious data of 12 patients admitted to a healthcare facility.

Table 1: Original Dataset

ID   Age   Zip Code   Diagnosis
1    22    13053      Cancer
2    27    13053      Heart Disease
3    29    13076      Cancer
4    26    13068      Viral Infection
5    50    14850      Heart Disease
6    55    14865      Cancer
7    49    14854      Cancer
8    52    14867      Viral Infection
9    32    13067      Heart Disease
10   38    13477      Heart Disease
11   37    13255      Cancer
12   35    13555      Viral Infection

Table 2 shows the anonymized database. In this table, we’ve applied 4-Anonymization, which means that every combination of quasi-identifier (QI) values that appears in the dataset is shared by at least 4 entries.

Table 2: 4-Anonymous Dataset

ID   Age       Zip Code   Diagnosis
1    [20-30]   130**      Cancer
2    [20-30]   130**      Heart Disease
3    [20-30]   130**      Cancer
4    [20-30]   130**      Viral Infection
5    [40-60]   148**      Heart Disease
6    [40-60]   148**      Cancer
7    [40-60]   148**      Cancer
8    [40-60]   148**      Viral Infection
9    [30-40]   13***      Heart Disease
10   [30-40]   13***      Heart Disease
11   [30-40]   13***      Cancer
12   [30-40]   13***      Viral Infection

This data has 4-anonymity with respect to the attributes ‘Zip Code’ and ‘Age’. That’s because, for any combination of these two attributes found in any row of the table, there are always at least 4 rows with exactly the same values.
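The property described above can be checked mechanically. The following is a minimal Python sketch, not a reference implementation: the function names `smallest_group_size` and `is_k_anonymous` and the tuple layout of the rows are assumptions made for this illustration, and the data is copied from Table 2.

```python
from collections import Counter

# Rows of Table 2 as (age range, masked zip code, diagnosis).
TABLE_2 = [
    ("[20-30]", "130**", "Cancer"),
    ("[20-30]", "130**", "Heart Disease"),
    ("[20-30]", "130**", "Cancer"),
    ("[20-30]", "130**", "Viral Infection"),
    ("[40-60]", "148**", "Heart Disease"),
    ("[40-60]", "148**", "Cancer"),
    ("[40-60]", "148**", "Cancer"),
    ("[40-60]", "148**", "Viral Infection"),
    ("[30-40]", "13***", "Heart Disease"),
    ("[30-40]", "13***", "Heart Disease"),
    ("[30-40]", "13***", "Cancer"),
    ("[30-40]", "13***", "Viral Infection"),
]


def smallest_group_size(rows, qi_indexes):
    """Size of the smallest group of rows sharing the same QI values."""
    groups = Counter(tuple(row[i] for i in qi_indexes) for row in rows)
    return min(groups.values(), default=0)


def is_k_anonymous(rows, qi_indexes, k):
    """True if every QI combination present is shared by at least k rows."""
    return smallest_group_size(rows, qi_indexes) >= k


# Columns 0 (age) and 1 (zip code) are the quasi-identifiers.
print(is_k_anonymous(TABLE_2, qi_indexes=(0, 1), k=4))  # True
print(is_k_anonymous(TABLE_2, qi_indexes=(0, 1), k=5))  # False
```

The check only groups rows on the quasi-identifier columns; the diagnosis column is deliberately ignored, because K Anonymity says nothing about how the sensitive values are distributed inside each group.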

K Anonymity For Protecting Privacy

In many privacy-preserving systems, the ultimate objective is the anonymity of the individuals. On the surface, anonymity just means being nameless. On closer inspection, however, removing names from a dataset is not enough to achieve true anonymization.

Anonymized data can be re-identified by linking it with another dataset. The data may include pieces of information that aren’t unique identifiers by themselves but that identify a person when linked with other datasets.

K Anonymity prevents definite database linkages. It treats attributes that indirectly point to a person’s identity as quasi-identifiers (QI) and processes the data so that at least K individuals share the same combination of QI values. As a result, at worst, the released data narrows an individual entry down to a group of K individuals.
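To make the linkage risk concrete, here is a small Python sketch of an attacker who already knows one person’s age and zip code from some other source. Everything about the attacker’s record (including the name) is invented for illustration, and the helper `matches` is a hypothetical function, not part of any library. The released rows correspond to rows 1 to 4 of Table 1 and Table 2.

```python
# Hypothetical record the attacker holds from another dataset.
public_record = {"name": "Alice Example", "age": 22, "zip": "13053"}

# Names removed, but exact quasi-identifiers kept (rows 1-4 of Table 1).
released_exact = [
    {"age": 22, "zip": "13053", "diagnosis": "Cancer"},
    {"age": 27, "zip": "13053", "diagnosis": "Heart Disease"},
    {"age": 29, "zip": "13076", "diagnosis": "Cancer"},
    {"age": 26, "zip": "13068", "diagnosis": "Viral Infection"},
]

# The same rows after generalization (rows 1-4 of Table 2).
released_generalized = [
    {"age": (20, 30), "zip": "130**", "diagnosis": "Cancer"},
    {"age": (20, 30), "zip": "130**", "diagnosis": "Heart Disease"},
    {"age": (20, 30), "zip": "130**", "diagnosis": "Cancer"},
    {"age": (20, 30), "zip": "130**", "diagnosis": "Viral Infection"},
]


def matches(record, row):
    """True if the attacker's record is consistent with a released row."""
    age = row["age"]
    if isinstance(age, tuple):  # generalized to a (low, high) range
        age_ok = age[0] <= record["age"] <= age[1]
    else:  # exact age
        age_ok = age == record["age"]
    # A masked digit ("*") matches anything; other digits must be equal.
    zip_ok = all(m in ("*", d) for m, d in zip(row["zip"], record["zip"]))
    return age_ok and zip_ok


exact_hits = [r for r in released_exact if matches(public_record, r)]
general_hits = [r for r in released_generalized if matches(public_record, r)]

print(len(exact_hits))    # 1: the person is pinned to a single row
print(len(general_hits))  # 4: the person is hidden among K = 4 rows
```

With exact quasi-identifiers the attacker learns the diagnosis of the matched person directly. After generalization, the same lookup returns four candidate rows, which is the "group of K individuals" described above.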

K Anonymity Implementation

The most common implementations of K Anonymity use transformation techniques such as generalization, suppression, and global recoding.

1. Generalization

Generalization is the practice of replacing a specific value with a more generic one. For instance, zip codes in a dataset can be generalized into broader areas such as counties or municipalities (e.g. changing 13053 to 130**). Ages may be generalized into an age bracket (e.g. grouping 22 into [20-30]).

This technique reduces an attribute’s specificity and so removes identifying information that could otherwise be gleaned from the dataset. You may think of it as sufficiently ‘widening the net.’
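As a rough illustration of the two generalizations mentioned above, the following Python sketch masks trailing zip code digits and maps an age to a bracket. The function names, the default bracket width of 10, and the default of keeping 3 digits are assumptions chosen to reproduce the examples in this article, not fixed rules.

```python
def generalize_age(age, width=10):
    """Replace an exact age with the bracket that contains it."""
    low = (age // width) * width
    return f"[{low}-{low + width}]"


def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


print(generalize_age(22))                # [20-30]
print(generalize_zip("13053"))           # 130**
print(generalize_zip("13477", keep=2))   # 13***
```

Wider brackets and fewer kept digits make groups larger and easier to bring up to K rows, at the cost of less precise data for analysis.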

2. Suppression

Suppression is the process of removing an attribute’s value completely from a dataset. In the above example, suppressing age would mean removing the age data from every record.

Bear in mind that suppression should only be used for data points that aren’t relevant to the purpose of the data collection. For instance, if you’re gathering data to determine the age at which individuals are most likely to develop a particular illness or condition, suppressing the age data would make the dataset useless.

Suppression is best applied to largely irrelevant data points on a case-by-case basis, rather than through a set of overarching rules that apply universally.
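A minimal Python sketch of attribute suppression follows. The function name `suppress_attribute` and the "*" placeholder are assumptions for this illustration; some tools drop the column entirely instead of masking it.

```python
def suppress_attribute(rows, attribute, placeholder="*"):
    """Return copies of the rows with one attribute's value masked out."""
    return [{**row, attribute: placeholder} for row in rows]


rows = [
    {"age": "[20-30]", "zip": "130**", "diagnosis": "Cancer"},
    {"age": "[20-30]", "zip": "130**", "diagnosis": "Heart Disease"},
]

suppressed = suppress_attribute(rows, "age")
print(suppressed[0])  # {'age': '*', 'zip': '130**', 'diagnosis': 'Cancer'}
print(rows[0]["age"])  # [20-30] (the input rows are left unchanged)
```

Because the function returns new dictionaries, the original rows remain available, which makes it easier to decide case by case which attributes can be suppressed without hurting the analysis.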

3. Global Recoding

In this method, continuous or discrete numerical variables are grouped into predefined classes, and the same recoding is applied across the whole dataset. It means that a given specific value is replaced with the same more generic value wherever it appears.

For instance, if global recoding is performed on a dataset, the zip code is generalized in the same way regardless of gender or any other descriptive variable. The recoding process can be single-dimensional or multi-dimensional.

  • In single-dimensional recoding, each attribute is mapped individually (such as zip code).
  • In multi-dimensional recoding, the mapping is a function of several attributes taken together, such as the quasi-identifiers zip code, gender, and date of birth.
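The difference between the two recoding styles can be sketched in Python as follows. The helper names `recode_single` and `recode_multi`, the sample rows, and the specific mapping rules are all assumptions made for illustration; real tools choose the mappings by searching for a recoding that reaches K Anonymity with as little information loss as possible.

```python
def recode_single(rows, mappers):
    """Single-dimensional: each attribute has its own independent mapping."""
    return [
        {**row, **{attr: fn(row[attr]) for attr, fn in mappers.items()}}
        for row in rows
    ]


def recode_multi(rows, attributes, mapper):
    """Multi-dimensional: one mapping over the tuple of attribute values."""
    recoded = []
    for row in rows:
        new_values = mapper(tuple(row[attr] for attr in attributes))
        recoded.append({**row, **dict(zip(attributes, new_values))})
    return recoded


rows = [{"age": 26, "zip": "13068"}, {"age": 32, "zip": "13067"}]

# The zip code is masked the same way whatever the age is.
single = recode_single(rows, {
    "age": lambda a: f"[{a // 10 * 10}-{a // 10 * 10 + 10}]",
    "zip": lambda z: z[:3] + "**",
})


def joint_mapper(values):
    """Hypothetical rule: how much of the zip is masked depends on the age."""
    age, zip_code = values
    if age < 30:
        return ("[20-30]", zip_code[:3] + "**")
    return ("[30-40]", zip_code[:2] + "***")


multi = recode_multi(rows, ("age", "zip"), joint_mapper)

print(single[1])  # {'age': '[30-40]', 'zip': '130**'}
print(multi[1])   # {'age': '[30-40]', 'zip': '13***'}
```

In the single-dimensional case the second row keeps three zip digits because the zip mapping never looks at the age. In the multi-dimensional case the same zip code is masked more aggressively because the mapping sees both values at once.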

K Anonymity vs L-Diversity

L-diversity is a form of group-based anonymization used to maintain privacy in datasets by decreasing the granularity of the data representation through methods such as generalization and suppression.

The L-diversity model is often used as a yardstick to measure whether K Anonymization efforts have gone far enough to avoid re-identification. A dataset is said to satisfy L-diversity if there are at least L well-represented values for every sensitive attribute in every group of records that share the same quasi-identifier values.

In other words, any attribute that is considered sensitive, such as a person’s medical conditions or whether a learner passed or failed an exam, takes on at least L distinct values within each group of K or more records.

This model protects privacy even when the data holder or publisher doesn’t know what information a malicious party may already have about the subjects in a dataset. It helps achieve true anonymity because the values of sensitive attributes are well-represented in every group.
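The simplest reading of "well-represented" is "distinct": each group must contain at least L different sensitive values. The Python sketch below measures that for Table 2. The function name `distinct_l` and the row layout are assumptions for this example, and stricter formulations of L-diversity exist that this check does not cover.

```python
from collections import defaultdict

# Rows of Table 2 as (age range, masked zip code, diagnosis).
TABLE_2 = [
    ("[20-30]", "130**", "Cancer"),
    ("[20-30]", "130**", "Heart Disease"),
    ("[20-30]", "130**", "Cancer"),
    ("[20-30]", "130**", "Viral Infection"),
    ("[40-60]", "148**", "Heart Disease"),
    ("[40-60]", "148**", "Cancer"),
    ("[40-60]", "148**", "Cancer"),
    ("[40-60]", "148**", "Viral Infection"),
    ("[30-40]", "13***", "Heart Disease"),
    ("[30-40]", "13***", "Heart Disease"),
    ("[30-40]", "13***", "Cancer"),
    ("[30-40]", "13***", "Viral Infection"),
]


def distinct_l(rows, qi_indexes, sensitive_index):
    """Smallest number of distinct sensitive values found in any QI group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[i] for i in qi_indexes)
        groups[key].add(row[sensitive_index])
    return min((len(values) for values in groups.values()), default=0)


print(distinct_l(TABLE_2, qi_indexes=(0, 1), sensitive_index=2))  # 3
```

Every group in Table 2 contains three different diagnoses, so under this simple definition the table is 3-diverse as well as 4-anonymous. If all four rows of a group shared one diagnosis, the table would still be 4-anonymous, yet anyone known to be in that group would have their diagnosis revealed.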

K Anonymity vs Differential Privacy

Differential privacy is a system that allows you to publicly share information about a dataset by describing the patterns of groups within the dataset while withholding information about the individuals in it.

The main idea is that if the effect of making one arbitrary replacement in the database is sufficiently small, the query result can’t be used to infer much about any single subject, which protects confidentiality.

In other words, differential privacy is a constraint on the algorithms used to publish aggregate information about a statistical database, and it limits the disclosure of sensitive information about the records in that database.

For instance:

  • Some government agencies use differentially private algorithms to publish demographic information or other statistical aggregates while protecting the confidentiality of survey responses.
  • Many businesses also use differentially private algorithms to gather information about customer behavior while controlling what even in-house analysts can access.

To safeguard the privacy of subjects, differential privacy adds noise to the data to disguise the real values. The noise is calibrated so that it conceals any one subject’s contribution while keeping the impact on the data’s utility small. This means that the statistical results from the dataset shouldn’t be noticeably affected by a single subject’s contribution, because the information represents the characteristics of the whole population.
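One common way to add such noise is the Laplace mechanism, sketched below in Python for a counting query. This is an illustration under stated assumptions rather than a production implementation: the function name `private_count`, the privacy parameter `epsilon = 0.5`, and the fixed random seed are choices made for this example, and the diagnoses are those of Table 1.

```python
import random

# Diagnoses of the 12 patients in Table 1, in row order.
DIAGNOSES = [
    "Cancer", "Heart Disease", "Cancer", "Viral Infection",
    "Heart Disease", "Cancer", "Cancer", "Viral Infection",
    "Heart Disease", "Heart Disease", "Cancer", "Viral Infection",
]


def private_count(values, predicate, epsilon, rng):
    """Noisy count of matching values using Laplace noise."""
    true_count = sum(1 for value in values if predicate(value))
    # Replacing one person's row changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only to make the example repeatable
noisy = private_count(DIAGNOSES, lambda d: d == "Cancer", epsilon=0.5, rng=rng)
print(noisy)  # a value near the true count of 5, different on each draw
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means answers closer to the true count. Averaged over many queries the noise cancels out, which is why repeated queries have to be budgeted in a real deployment.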

Conclusion

Data privacy has garnered a lot of attention in recent years. Because leaks of customer data are a recurring problem, businesses use different approaches to protect their user data.

Many businesses collect customer information for in-house use, and they sometimes make this information publicly accessible through datasets. To safeguard customers’ identities, data engineers use K anonymization, differential privacy, and other approaches to protect private information.

K Anonymity is a robust tool when applied correctly and combined with the right protections, such as access control. It contributes significantly to privacy-enhancing technologies, together with alternative methods like differential privacy algorithms. With big data becoming the norm, data dimensionality keeps growing, along with the number of public datasets that can be used to support re-identification efforts.
