What is Data Anonymization?
Data anonymization is a method of information sanitization, which involves removing or encrypting personally identifiable data in a dataset. The goal is to ensure the privacy of the subject’s information. Data anonymization minimizes the risk of information leaks when data is moving across boundaries. It also maintains the structure of the data, enabling analytics post-anonymization.
The European Union’s General Data Protection Regulation (GDPR) encourages organizations to pseudonymize or anonymize the personal data they store about individuals in the EU. Fully anonymized data sets are not classified as personal data, and so are not subject to the rules of GDPR; pseudonymized data, by contrast, is still treated as personal data. Anonymization therefore permits organizations to use the information for broader purposes while remaining compliant and protecting the rights of the data subjects.
Data anonymization is also a core component of HIPAA requirements. HIPAA is a US regulation governing the use of Protected Health Information (PHI) by the healthcare industry and its partners.
This is part of our series of articles about data masking.
In this article:
- Data Anonymization Use Cases
- What Data Should Be Anonymized?
- 6 Data Anonymization Techniques
Data Anonymization Use Cases
Typical cases of data anonymization include:
- Medical research—researchers and healthcare professionals examining data related to the prevalence of a disease among a certain population would use data anonymization. This way they protect the patient’s privacy and adhere to HIPAA standards.
- Marketing enhancements—online retailers often seek to improve when and how they reach their customers, via digital advertisement, social media, emails, and their website. Digital agencies use insights gained from consumer information to meet the increasing need for personalized user experience and to refine their services. Anonymization allows these marketers to leverage data in marketing while remaining compliant.
- Software and product development—developers need realistic data to build tools that can handle real-life challenges, to perform testing, and to improve the effectiveness of existing software. This data should be anonymized because development environments are typically less secure than production environments; if they are breached, no sensitive personal data is compromised.
- Business performance—large organizations often collect employee-related information to increase productivity, optimize performance, and enhance employee safety. By using data anonymization and aggregation, such organizations can access valuable information without causing employees to feel monitored, exploited, or judged.
What Data Should Be Anonymized?
Not all datasets need to undergo anonymization. Every database administrator should identify which datasets need to be made anonymous and which data can safely remain in their original form.
Choosing which datasets to anonymize may seem straightforward. However, “sensitive data” is a subjective idea that changes according to the individual and the sector. For example, contact information could be seen as impersonal by a marketing agency’s manager but viewed as highly sensitive by security personnel.
Most compliance standards and organizational policies agree that Personally Identifiable Information (PII) should be treated as sensitive data and stored safely. Thus, such information is a perfect candidate for anonymization. This still leaves some room for interpretation, because PII might mean different things in different industries, and there is also debate around the legal definition of PII in different territories.
There is broad consensus that certain data counts as PII, irrespective of legal or industry influence. This includes:
- Name—in whatever context it appears, a name is the most significant identifier in a data set. A name sharply narrows the list of people a record could describe. If a cybercriminal obtains this information, they can readily trace the source of a data set, even an encoded one. Thus, names must be anonymized.
- Credit card details—this field deals with credit card numbers, other details like expiration date and CVV, and credit card tokens. They are regarded as highly personal, are unique to the individual, and can have financial implications for the individual if compromised. They must always be protected.
- Mobile numbers—if a cybercriminal gains access to a mobile number they could also gain access to additional, more sensitive data about the individual. Thus, personal phone numbers should always be anonymized.
- Photographs—photographs are a highly reliable means of identification. Often, photographs are collected to verify identity and to ensure security. A dataset containing photos of individuals must be safeguarded, and thus it is a strong candidate for anonymization.
- Passwords—a cybercriminal could easily impersonate someone and gain access to private data by compromising their password. In any backend structure created to store passwords, you should encrypt and/or anonymize the data.
- Security questions—such data sets are also key identifiers. Many software services and web applications use these questions as a step towards granting user access. Given this, it is important to encrypt them.
6 Data Anonymization Techniques
The following are common techniques you can use to anonymize sensitive data.
Data masking involves allowing access to a modified version of sensitive data. This can be achieved by modifying data in real-time, as it is accessed (dynamic data masking), or by creating a mirror version of the database with anonymized data (static data masking). Anonymization can be performed via a range of techniques, including encryption, term or character shuffling, or dictionary substitution.
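As a rough illustration of static masking by character substitution, the Python sketch below builds a masked copy of a few records and leaves the originals untouched. The function names (mask_card_number, mask_email), the record fields, and the choice to keep the last four digits visible are assumptions made for this example and are not taken from any particular masking product.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace every digit except the last `visible` ones, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


def mask_email(email: str, mask_char: str = "*") -> str:
    """Keep the first character of the local part and the whole domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognizable address: mask everything.
        return mask_char * len(email)
    return local[0] + mask_char * (len(local) - 1) + "@" + domain


# Hypothetical source records (illustrative values only).
records = [
    {"id": 1, "card": "4111-1111-1111-1234", "email": "alice@example.com"},
    {"id": 2, "card": "5500 0000 0000 0004", "email": "bob@example.org"},
]

# Static masking: produce a separate, masked copy of the data.
masked_records = [
    {**r, "card": mask_card_number(r["card"]), "email": mask_email(r["email"])}
    for r in records
]
```

Dynamic masking would apply the same kind of function at read time, for example in a query layer, instead of materializing a masked copy.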
Pseudonymization is a method of data de-identification. It replaces private identifiers with pseudonyms or false identifiers. For example, the name “David Bloomberg” might be replaced with “John Smith”. This keeps the data confidential while preserving its statistical accuracy.
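One common way to generate consistent pseudonyms is a keyed hash, so that the same input always maps to the same pseudonym and joins across tables still work. The sketch below uses Python's standard hmac and hashlib modules; the key value, the "user_" prefix, the normalization step, and the function name pseudonymize are all assumptions for illustration. Note that it produces opaque tokens rather than realistic replacement names like the one in the example above.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice it would come from a
# secrets manager and be stored separately from the pseudonymized data.
SECRET_KEY = b"replace-with-a-real-secret-key"


def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    # Normalize so that trivial formatting differences map to the same pseudonym.
    normalized = value.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:length]


# The same person always receives the same pseudonym.
alias_a = pseudonymize("David Bloomberg")
alias_b = pseudonymize("  david bloomberg ")
```

Anyone holding the key can recompute the mapping, which is why pseudonymized data is generally still handled as personal data.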
Generalization deliberately removes some detail from the data to make it less identifiable. Values can be replaced with a range that has logical boundaries. For instance, the house number in an address could be omitted, or replaced with a range spanning 200 house numbers around the original value. The idea is to remove certain identifiers while keeping the data accurate enough to be useful.
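A minimal sketch of both approaches described above, assuming fixed-width buckets and addresses that begin with a numeric house number. The helper names generalize_to_range and drop_house_number are hypothetical.

```python
def generalize_to_range(value: int, width: int) -> str:
    """Replace an exact value with the fixed-width range that contains it."""
    if width <= 0:
        raise ValueError("width must be positive")
    lower = (value // width) * width
    return f"{lower}-{lower + width - 1}"


def drop_house_number(address: str) -> str:
    """Omit a leading house number, keeping the rest of the address."""
    first, _, rest = address.partition(" ")
    return rest if first.isdigit() and rest else address


age_range = generalize_to_range(37, 10)        # "30-39"
house_range = generalize_to_range(1234, 200)   # "1200-1399"
street_only = drop_house_number("221 Baker Street")  # "Baker Street"
```

Wider buckets give stronger protection but less analytical detail, so the width is a per-column judgment call.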
Data swapping, also called shuffling or data permutation, rearranges attribute values in a dataset so that they no longer correspond to the original records. Swapping columns (attributes) that hold recognizable values, such as date of birth, can have a large effect on how well the data is anonymized.
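The following sketch shuffles a single column across records so that each value still appears in the dataset but is no longer tied to its original row. The function name swap_column, the sample records, and the optional seed parameter (included only to make the example reproducible) are assumptions.

```python
import random


def swap_column(records, column, seed=None):
    """Return new records with the values of `column` shuffled across rows."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    # Build new dicts so the original records are left unmodified.
    return [{**r, column: v} for r, v in zip(records, values)]


# Hypothetical records (illustrative values only).
records = [
    {"id": 1, "dob": "1980-01-15", "city": "Lyon"},
    {"id": 2, "dob": "1992-06-30", "city": "Porto"},
    {"id": 3, "dob": "1975-11-02", "city": "Graz"},
    {"id": 4, "dob": "2001-03-21", "city": "Turku"},
]

swapped = swap_column(records, "dob", seed=7)
```

Column-level statistics such as the distribution of birth dates are preserved, while relationships between the shuffled column and the other columns are not. A random shuffle can also leave some values in their original row, especially in small datasets.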
Data perturbation changes the initial dataset slightly by applying rounding and adding random noise. The size of the disturbance must be proportional to the range of the values being modified. It is important to carefully select the base used to modify the original values: if the base is too small, the data will not be sufficiently anonymized, and if it is too large, the data may no longer be recognizable or usable.
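Both operations can be sketched in a few lines of Python. The base of 5, the noise scale, and the function names round_to_base and add_noise are illustrative assumptions, and uniform noise is only one of several possible choices.

```python
import random


def round_to_base(value: float, base: int) -> int:
    """Round a value to the nearest multiple of `base`."""
    if base <= 0:
        raise ValueError("base must be positive")
    return base * round(value / base)


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Add uniform random noise in the interval [-scale, +scale]."""
    return value + rng.uniform(-scale, scale)


rng = random.Random(42)  # seeded only so the example is reproducible
ages = [23, 37, 41, 58]

# A base of 5 is small relative to typical ages, so the data stays usable.
rounded_ages = [round_to_base(a, 5) for a in ages]  # [25, 35, 40, 60]

# Noise scaled to the magnitude of the values (here, salaries).
salaries = [42000.0, 58500.0, 61250.0]
noisy_salaries = [add_noise(s, 1000.0, rng) for s in salaries]
```

Rounding ages to a base of 50 instead would collapse most of them to 0 or 50, which shows how a base that is too large destroys the data's usefulness.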
Synthetic data is algorithmically produced data with no connection to any real case. Artificial datasets are created instead of using or modifying the original dataset, which would risk compromising protection and privacy.
This method builds mathematical models based on patterns or features in the original dataset. Linear regressions, standard deviations, medians, and other statistical methods may be employed to create synthetic outcomes.
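As a deliberately simple illustration, the sketch below fits a mean and standard deviation to one numeric column and samples new values from a normal distribution with those parameters. Treating the column as normally distributed and independent of other columns is a strong assumption made for brevity, and the names fit_normal and sample_synthetic are hypothetical. Realistic synthetic data generators model relationships between columns as well.

```python
import random
import statistics


def fit_normal(values):
    """Estimate the mean and sample standard deviation of a numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=None):
    """Draw `n` synthetic values from a normal distribution fitted to `values`."""
    mu, sigma = fit_normal(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical original column (illustrative values only).
real_ages = [23, 35, 41, 29, 52, 47, 38, 31]

synthetic_ages = sample_synthetic(real_ages, n=5000, seed=1)
```

The synthetic values follow roughly the same overall distribution as the originals, but no synthetic row is derived from a specific real record. A normal distribution can produce implausible values such as negative ages, so generated values usually need validation or clipping.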
Automated Data Classification with Satori
Satori is the first DataSecOps platform that does automated & continuous data classification and sensitive data discovery. This is done without adding any database objects and helps discover new sensitive data immediately, instead of on a scheduled scan.