
Data Generalization: The Specifics of Generalizing Data

Data mining is not a concept that emerged overnight with the digital revolution. Its foundations go back about a century, with important groundwork laid in the 1930s. In 1936, Alan Turing proposed a universal machine able to carry out computations comparable to those of today’s computers, an idea that underpins the computing on which data mining depends.

Data mining, also called Knowledge Discovery in Data (KDD), is a technique for extracting patterns and other useful information from huge data sets. Because of advances in data warehousing technologies and the growth of big data, the use of data mining techniques has expanded rapidly in recent decades, helping businesses turn raw data into valuable knowledge.

In this article, you will get an in-depth view of a concept closely tied to data mining: data generalization.

What is Data Generalization?

Data generalization is the process of broadening the classification of data in a database, replacing specific values with more general categories. This lets a user zoom out from individual values to see a broader picture of trends or insights.

Consider an example. If you have a data set containing people’s ages, the data generalization process would look like this:

 

ORIGINAL DATA (AGES)      | GENERALIZED DATA
16, 18                    | 10-19 (2)
21, 23, 27                | 20-29 (3)
32, 32, 36, 38, 39        | 30-39 (5)
44, 47, 47, 48, 49        | 40-49 (5)

Data generalization in data mining substitutes a precise value with a less precise one, which may appear counterintuitive. Still, it is a practical and widely used technique in data mining, analysis, and secure storage.
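As a minimal sketch, the age example can be reproduced in a few lines of Python. The function name generalize_age and the bin width of 10 are illustrative choices, not part of any particular tool.

```python
from collections import Counter

def generalize_age(age: int, width: int = 10) -> str:
    """Map an exact age to a range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

ages = [16, 18, 21, 23, 27, 32, 32, 36, 38, 39, 44, 47, 47, 48, 49]

# Count how many exact ages fall into each generalized range.
counts = Counter(generalize_age(a) for a in ages)
for label in sorted(counts):
    print(f"{label} ({counts[label]})")
# prints:
# 10-19 (2)
# 20-29 (3)
# 30-39 (5)
# 40-49 (5)
```

Each exact age is replaced by the range it falls into, so the individual values are no longer visible while the overall distribution is preserved.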

Two Forms of Data Generalization in Data Mining

There are two main forms of data generalization in data mining: Automated and Declarative.

Automated Generalization distorts values until a given value of k is reached, where k is the k-anonymity threshold: every record must be indistinguishable from at least k - 1 other records on its identifying attributes. Because an algorithm can apply the least distortion required to reach the stated value of k, this method may offer the best balance between privacy and accuracy. You can select which fields are most important for your use case, and those values can be blurred using one of several approaches to achieve any value of k.
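The idea of applying the least distortion that reaches k can be sketched for a single numeric attribute as follows. This is a simplified illustration, not how production anonymization tools work: the function names, the list of candidate widths, and the assumption of a non-empty list of integers are all hypothetical.

```python
from collections import Counter

def bin_label(value: int, width: int) -> str:
    """Label the bin of the given width that contains value."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def generalize_to_k(values, k, widths=(1, 5, 10, 20, 50, 100)):
    """Return (width, labels) for the narrowest candidate width at which
    every bin holds at least k values. Assumes values is non-empty."""
    for width in widths:
        labels = [bin_label(v, width) for v in values]
        if min(Counter(labels).values()) >= k:
            return width, labels
    raise ValueError(f"no candidate width reaches k={k}")

ages = [16, 18, 21, 23, 27, 32, 32, 36, 38, 39, 44, 47, 47, 48, 49]

width, labels = generalize_to_k(ages, k=2)
print(width)                            # 10
print(generalize_to_k(ages, k=3)[0])    # 50
```

With k = 2, ten-year bins are the narrowest candidate that works. Raising k to 3 forces much coarser bins because only two people fall in the 10-19 range, which shows how privacy is traded against accuracy.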

Declarative Generalization, on the other hand, lets you set the bin sizes upfront, such as always rounding dates to whole months. Outliers are sometimes discarded in this procedure, which can skew the data and introduce bias. Keep in mind that declarative generalization does not always lead to k-anonymity.

Even so, it is a good idea to use declarative generalization as a default, so that the recipient of the de-identified material sees only the level of detail they need.
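A small sketch of a declarative rule, assuming a made-up list of visit dates: every date is reduced to its year and month, and the size of the smallest resulting group shows whether a given k happens to be met. The helper names to_month and smallest_group are hypothetical.

```python
from collections import Counter
from datetime import date

def to_month(d: date) -> str:
    """Declarative rule: always report a date as year and month only."""
    return f"{d.year:04d}-{d.month:02d}"

def smallest_group(labels) -> int:
    """Size of the rarest generalized value. The data is k-anonymous on
    this attribute only for k up to this number."""
    return min(Counter(labels).values())

# Made-up visit dates, for illustration only.
visits = [date(2023, 1, 3), date(2023, 1, 17), date(2023, 1, 29),
          date(2023, 2, 8), date(2023, 2, 20), date(2023, 3, 5)]

months = [to_month(d) for d in visits]
print(months)
# ['2023-01', '2023-01', '2023-01', '2023-02', '2023-02', '2023-03']
print(smallest_group(months))   # 1
```

The single March visit is still unique after rounding, so this fixed rule does not reach even k = 2 here, even though it hides the exact day from the recipient.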

Identifiers used in Data Generalization in Data Mining

Identifiers are data points about a subject that can determine their identity and link to other personal information. There are two main types of identifiers: direct identifiers and quasi-identifiers.

Direct identifiers are data points that can identify an individual on their own and allow other data to be linked to that person. A data point can be a direct identifier even if the same value appears more than once in the data. For example, even if two people are named “Mary,” the name is still a direct identifier.

Quasi-identifiers, on the other hand, do not identify a person on their own, but they can do so when combined with additional information. Quasi-identifiers may be unique within one data collection, and they are also likely to appear in other data sets, either already or in the near future.

Suppose you have a data set that includes a person’s gender and zip code. There will usually be enough people of that gender living in that zip code that the person cannot be identified from those two data points alone. However, suppose that person also appears in another data collection that includes their gender, zip code, and other personal information. In that case, someone may connect the two data sets and identify the individual.

When is Data Generalization Important?

Data generalization in data mining allows you to abstract personal data by removing identifying characteristics.

This generalization allows you to examine the data you have collected without jeopardizing the privacy of the people in your dataset. There are several methods for generalizing data, and you should choose the one that makes the most sense for your case. In some circumstances, masking direct identifiers is the best course of action, while in others, you will want to preserve more of the signal for data analytics.

Remember that there is no one-size-fits-all solution for preserving privacy. For that reason, you should also learn about other approaches such as tokenization, redaction, and pseudonymization. Once you understand those concepts, you can apply them as needed to get the most out of your data without jeopardizing privacy.

Data Generalization vs. Data Mining

Telling data generalization apart from data mining need not be difficult.

Data generalization is the process of summarizing data by replacing relatively low-level values with higher-level concepts. In contrast, data mining involves investigating and analyzing large volumes of data to uncover relevant patterns and trends. Put simply, data generalization is a type of descriptive data mining.

Data Generalization vs. Data Aggregation

Data aggregation is a notion linked to, and frequently confused with, data generalization in data mining.

The primary distinction between data generalization and data aggregation is that aggregation builds a class out of other classes that serve as its parts, whereas generalization groups several similar classes under a single, more general class.

Put simply:

DATA GENERALIZATION | DATA AGGREGATION
A technique for grouping objects of similar types into a single general class. | An association between two objects that describes a “has a” relationship.
Indicates an “is a” relationship. | Indicates a “has a” relationship.
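The “is a” and “has a” relationships can be illustrated with a short class sketch. The class names Vehicle, Car, Truck, and Fleet are hypothetical examples chosen for this illustration.

```python
class Vehicle:
    """General class that groups similar, more specific types."""
    def __init__(self, wheels: int):
        self.wheels = wheels

class Car(Vehicle):
    """Generalization: a Car "is a" Vehicle."""
    def __init__(self):
        super().__init__(wheels=4)

class Truck(Vehicle):
    """Generalization: a Truck "is a" Vehicle."""
    def __init__(self):
        super().__init__(wheels=6)

class Fleet:
    """Aggregation: a Fleet "has a" collection of vehicles."""
    def __init__(self, vehicles):
        self.vehicles = list(vehicles)

fleet = Fleet([Car(), Truck()])
print(isinstance(fleet.vehicles[0], Vehicle))   # True
print(isinstance(fleet, Vehicle))               # False
print(sum(v.wheels for v in fleet.vehicles))    # 10
```

A Car can be used anywhere a Vehicle is expected, while a Fleet only contains vehicles and is not itself one.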

Approaches to Data Generalization

There are two basic approaches to Data Generalization in Data Mining:

Data Cube Approach

In most cases, a data cube makes data easier to understand. It is especially helpful for presenting data along dimensions together with the measures that matter to the business. Each cube dimension reflects a different aspect of the database, such as daily, monthly, or yearly sales.

The data in a cube lets you analyze almost any figure for any or all customers, sales agents, products, and so on. As a result, a data cube can help identify trends and analyze performance.

In a nutshell:

  • It is also known as the OLAP (Online Analytical Processing) approach.
  • It is a practical strategy because it helps in charting past sales.
  • The data cube holds the computations and their results in this method.
  • Roll-up and drill-down operations are performed on the data cube.
  • Aggregate functions like count(), sum(), average(), and max() are commonly used in these operations.
  • The resulting materialized views can then be used for decision-making, knowledge discovery, and various other purposes.
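A roll-up can be sketched in plain Python without an OLAP engine. The sales rows and the roll_up helper below are made up for illustration; a real data cube would precompute and store such aggregates rather than recompute them on demand.

```python
from collections import defaultdict

# Made-up fact rows: (year, month, product, sales amount).
sales = [
    (2023, 1, "bread", 120.0),
    (2023, 1, "butter", 80.0),
    (2023, 2, "bread", 100.0),
    (2024, 1, "bread", 150.0),
]

def roll_up(rows, dims):
    """Sum the sales measure, grouped by the chosen dimension indexes."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in dims)
        totals[key] += row[3]
    return dict(totals)

by_month = roll_up(sales, dims=(0, 1))   # finer level: year and month
by_year = roll_up(sales, dims=(0,))      # roll-up: months are summed away

print(by_month)  # {(2023, 1): 200.0, (2023, 2): 100.0, (2024, 1): 150.0}
print(by_year)   # {(2023,): 300.0, (2024,): 150.0}
```

Moving from by_year to by_month corresponds to a drill-down, and the reverse direction to a roll-up. Swapping the sum for a count or a maximum gives the other aggregate functions mentioned above.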

Attribute Oriented Induction

Attribute Oriented Induction is a database mining technique that compresses the original data collection into a generalized relation, giving a concise yet comprehensive summary of a huge data set.

Moreover, attribute generalization in data mining allows similar data collections, originally expressed at a low (primitive) level in a database, to be transformed into more abstract conceptual representations.

In a nutshell:

  • Attribute-Oriented Induction is a query-oriented, generalization-based technique for online data analysis.
  • Generalization is performed by examining the distinct values of each attribute within the relevant data set. Identical generalized tuples are then merged, and their counts accumulated, to perform aggregation.
  • Unlike the data cube approach, which performs offline aggregation before an OLAP or data mining query is submitted for processing, it generalizes the data when the query is issued.
  • It is not restricted to particular measures or to categorical data.
  • Attribute-Oriented Induction uses two methods:
    • Attribute removal
    • Attribute generalization
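The two steps, followed by merging identical tuples, can be sketched as below. The student records, the concept hierarchies (MAJOR_TO_FIELD, CITY_TO_COUNTRY), and the GPA threshold are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter

# Made-up records: (name, major, birth_city, gpa).
students = [
    ("Ann", "Physics", "Vancouver", 3.7),
    ("Bob", "Chemistry", "Toronto", 3.8),
    ("Cid", "History", "Seattle", 3.1),
    ("Dee", "Biology", "Montreal", 3.6),
]

# Hypothetical concept hierarchies used for attribute generalization.
MAJOR_TO_FIELD = {"Physics": "Science", "Chemistry": "Science",
                  "Biology": "Science", "History": "Arts"}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Montreal": "Canada", "Seattle": "USA"}

def gpa_band(gpa: float) -> str:
    """Generalize a numeric GPA into a coarse band."""
    return "excellent" if gpa >= 3.5 else "good"

def induce(rows) -> Counter:
    """Drop the name (attribute removal), generalize the remaining
    attributes, then merge identical tuples and count them."""
    generalized = [
        (MAJOR_TO_FIELD[major], CITY_TO_COUNTRY[city], gpa_band(gpa))
        for _name, major, city, gpa in rows
    ]
    return Counter(generalized)

for row, count in induce(students).items():
    print(row, count)
# prints:
# ('Science', 'Canada', 'excellent') 3
# ('Arts', 'USA', 'good') 1
```

The name is removed because it has too many distinct values and no higher-level concept to generalize to, while the other attributes are climbed up their hierarchies until identical tuples can be merged.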

Examples of Data Generalization

Market Basket Analysis is one of the best-known examples of data generalization in data mining. It is a method for analyzing the purchases customers make in a supermarket.

The idea is to identify the items that a customer buys together. What are the chances that a person who buys bread will also buy butter? This analysis helps companies plan promotions and discounts, and data mining techniques are used to carry it out.
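The bread-and-butter question can be answered with two standard measures, support and confidence. The baskets below are made up, and the function names are illustrative.

```python
# Made-up shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, baskets) -> int:
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)

def support(itemset, baskets) -> float:
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets) -> float:
    """Of the baskets containing the antecedent, the fraction that
    also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

print(support({"bread", "butter"}, baskets))        # 0.6
print(confidence({"bread"}, {"butter"}, baskets))   # 0.75
```

In this sample, 60% of baskets contain both items, and 75% of the customers who bought bread also bought butter, which is the kind of figure used to decide on joint offers.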

Moreover, Market Basket Analysis is commonly used in business reporting for sales or marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting, and similar areas. Other sectors, such as agriculture, are also beginning to find new uses for this analysis.

Conclusion

As businesses recognize the significance of data, they are continuously finding ways to use it to their advantage. As a result, data scientists have become increasingly important to companies worldwide. With this comes the need to protect individuals’ privacy and to comply with regulations, which calls for data generalization as well as other data anonymization strategies.
