What Is Data Masking?
Data masking is a technique used to create a version of data that looks structurally similar to the original but hides (masks) sensitive information. The version with the masked information can then be used for various purposes, such as user training or software testing. The main objective of masking data is to create a functional substitute that does not reveal the real data.
The majority of organizations have stringent security controls that protect production data when it rests in storage and when it is in business use. However, sometimes data is used for less secure operations like testing or training, or by third parties outside the organization. This can put the data at risk, and might result in compliance violations.
Data masking offers an alternative that can allow access to information, while protecting sensitive data. Data masking processes use the same data format to emulate the original data, while changing the values of sensitive information.
Data can be altered in many ways, including character shuffling, word or character substitution, and encryption. Each method has its own advantages. Whichever method you choose, the values must be changed in a way that prevents anyone without authorization (for example, someone who lacks the decryption key or mapping table) from reverse engineering the original data.
Here are several examples of data masking:
- Replacing personally-identifying details and names with other symbols and characters
- Moving details around or randomizing sensitive data like names or account numbers
- Scrambling the data, substituting parts of it for other parts from the same dataset
- Deleting or “nulling out” sensitive values within data records
- Encrypting the data to make it infeasible for unauthorized users to access it without a decryption key
Which Data Requires Data Masking?
Here are the most common data types that require data masking:
- Personally identifiable information (PII)—data that can be used to identify certain individuals. This includes information like full name, passport number, driver’s license number, and social security number.
- Protected health information (PHI)—data collected by healthcare service providers for the purpose of identifying appropriate care. This includes insurance information, demographic information, test and laboratory results, medical histories, and health conditions.
- Payment card information—the Payment Card Industry Data Security Standard (PCI DSS) requires merchants that handle credit and debit card transactions to appropriately secure cardholder data.
- Intellectual property (IP)—data related to creations of the mind, including inventions, business plans, designs, and specifications, have high value for an organization and must be protected from unauthorized access and theft.
Types of Data Masking
Here are three common types of data masking:
- Static data masking—involves creating a duplicated version of a dataset, containing fully or partially masked data. The dummy database is maintained separately from the production database.
- Dynamic data masking—alters information in real time, as it is accessed by users. This technique is applied directly to production datasets. It ensures that the original data is seen only by authorized users, and any non-privileged user sees masked data.
- On-the-fly data masking—modifies sensitive information as it is transferred between environments, ensuring that it is masked before it reaches the target environment. This technique is well suited to organizations that migrate data between systems, or that maintain continuous integration or synchronization of disparate datasets.
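As a rough illustration of the dynamic case, the sketch below masks a value at read time depending on who is asking. The `Customer` record, the `"admin"` role name, and the "show the last four digits" policy are all assumptions made for the example, not part of any specific product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    ssn: str  # assumed format: NNN-NN-NNNN


def mask_ssn(ssn: str, visible: int = 4) -> str:
    """Replace every digit except the last `visible` ones with 'X'."""
    total_digits = sum(ch.isdigit() for ch in ssn)
    hidden = max(total_digits - visible, 0)
    out = []
    for ch in ssn:
        if ch.isdigit() and hidden > 0:
            out.append("X")
            hidden -= 1
        else:
            out.append(ch)  # keep separators and the trailing digits
    return "".join(out)


def read_customer(record: Customer, role: str) -> Customer:
    """Hypothetical policy: only the 'admin' role sees the original value."""
    if role == "admin":
        return record
    return Customer(name=record.name, ssn=mask_ssn(record.ssn))


stored = Customer(name="Jane Doe", ssn="123-45-6789")
print(read_customer(stored, "analyst").ssn)  # XXX-XX-6789
print(read_customer(stored, "admin").ssn)    # 123-45-6789
```

The stored record is never modified. Only the view returned to a non-privileged caller is masked, which is what separates this approach from static masking.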
8 Data Masking Techniques
Here are eight common data masking techniques you can use to protect sensitive data within your datasets.
1. Data Pseudonymization
Lets you replace an original data value, such as a name or an e-mail address, with a pseudonym or an alias. This process is reversible: it de-identifies the data but still allows re-identification later if needed.
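One way to picture reversible pseudonymization is a mapping table kept separately from the masked data. The sketch below is a minimal in-memory version. The `Pseudonymizer` class and the `user_` alias prefix are hypothetical, and a real deployment would need to store the mapping securely.

```python
import secrets


class Pseudonymizer:
    """Keeps a two-way mapping between real values and random aliases."""

    def __init__(self):
        self._forward = {}  # real value -> alias
        self._reverse = {}  # alias -> real value

    def pseudonymize(self, value: str) -> str:
        if value not in self._forward:
            alias = f"user_{secrets.token_hex(4)}"
            while alias in self._reverse:  # retry on the rare collision
                alias = f"user_{secrets.token_hex(4)}"
            self._forward[value] = alias
            self._reverse[alias] = value
        return self._forward[value]

    def reidentify(self, alias: str) -> str:
        """Only callers with access to the mapping can reverse an alias."""
        return self._reverse[alias]


p = Pseudonymizer()
alias = p.pseudonymize("alice@example.com")
print(alias)                # a random alias such as user_1a2b3c4d
print(p.reidentify(alias))  # alice@example.com
```

Because the same input always yields the same alias, records belonging to one person stay linked to each other in the masked dataset.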
2. Data Anonymization
A method that encodes the identifiers that connect individuals to the masked data, so that the data can no longer be linked back to them. The goal is to protect the private activity of users while preserving the credibility of the masked data.
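A minimal sketch of one-way encoding follows, assuming a keyed hash (HMAC-SHA256) whose key is generated per run and then discarded. This is only one possible approach, and hashing identifiers does not by itself guarantee anonymity if other columns can still identify a person. The function name `anonymize_column` is hypothetical.

```python
import hashlib
import hmac
import secrets


def anonymize_column(values):
    """Encode identifiers with a throwaway key so they cannot be mapped back."""
    key = secrets.token_bytes(32)  # never stored, so the encoding is one-way
    return [
        hmac.new(key, v.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        for v in values
    ]


user_ids = ["alice", "bob", "alice"]
encoded = anonymize_column(user_ids)
print(encoded[0] == encoded[2])  # True: the same user keeps the same code
print(encoded[0] == encoded[1])  # False: different users get different codes
```

Within a single run, equal identifiers still map to equal codes, so activity per user can be analyzed without revealing who the user is.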
3. Lookup Substitution
You can mask a production database with an added lookup table that provides alternative values to the original, sensitive data. This allows you to use realistic data in a test environment, without exposing the original.
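The idea can be sketched with a small lookup table of replacement names. The table contents, the `substitute_name` helper, and the use of a secret-salted hash to pick an entry are assumptions for illustration. A real lookup table would be much larger to reduce collisions.

```python
import hashlib

# Hypothetical lookup table of realistic but fictitious replacement values.
FAKE_NAMES = ["Alex Morgan", "Sam Taylor", "Jordan Lee", "Casey Kim", "Riley Novak"]


def substitute_name(real_name: str, secret: bytes) -> str:
    """Pick a replacement from the lookup table, consistently for each input."""
    digest = hashlib.sha256(secret + real_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


secret = b"example-secret"  # assumption: kept outside the test environment
masked = substitute_name("Maria Gonzalez", secret)
print(masked in FAKE_NAMES)                                # True
print(masked == substitute_name("Maria Gonzalez", secret)) # True
```

Choosing the entry deterministically means the same real name is replaced by the same substitute everywhere it appears, which keeps test data consistent across tables.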
4. Encryption
Lookup tables are easily compromised, so it is recommended that you encrypt the data so that it can only be accessed with a key or password. The data is unreadable while encrypted but viewable once decrypted, so you should combine encryption with other data masking techniques.
5. Redaction
If the sensitive data is not necessary for QA or development purposes, you can replace it with generic values in the development and testing environment. The trade-off is that the masked data is no longer realistic and does not share the attributes of the original.
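A minimal sketch of replacing sensitive fields with a generic value is shown below. The field names and the `"REDACTED"` placeholder are assumptions for the example.

```python
def redact(record: dict, sensitive_fields: set, placeholder: str = "REDACTED") -> dict:
    """Return a copy of the record with sensitive fields set to a generic value."""
    return {
        field: (placeholder if field in sensitive_fields else value)
        for field, value in record.items()
    }


row = {"id": 17, "name": "Jane Doe", "card_number": "4111111111111111"}
print(redact(row, {"name", "card_number"}))
# {'id': 17, 'name': 'REDACTED', 'card_number': 'REDACTED'}
```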
6. Averaging
If you want to reflect sensitive data in terms of averages or aggregates, but not on an individual basis, you can replace all the values in the table with the average value. For example, if the table lists employee salaries, you can mask the individual salaries by replacing each one with the average salary, so that the column total still matches the real combined salary total.
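The salary example can be sketched as follows, using made-up figures. The function name `mask_with_average` is hypothetical.

```python
from statistics import fmean


def mask_with_average(salaries):
    """Replace every value with the column mean so the total is preserved."""
    if not salaries:
        return []
    average = fmean(salaries)
    return [average] * len(salaries)


salaries = [50_000, 60_000, 70_000]
masked = mask_with_average(salaries)
print(masked)                        # [60000.0, 60000.0, 60000.0]
print(sum(masked) == sum(salaries))  # True
```

With values that do not divide evenly, floating-point rounding can make the masked total differ from the real one by a tiny amount.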
7. Shuffling
If you need to retain uniqueness when masking values, you can protect the data by scrambling it, so that the real values remain but are assigned to different records. In the salary table example, all the actual salaries are still listed, but it won’t be revealed which salary belongs to which employee. This method is best suited to larger datasets, because in a small table the original assignment is easier to guess.
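A short sketch of shuffling a single column is shown below. The row layout and the `shuffle_column` helper are assumptions, and the optional `seed` exists only to make test runs repeatable.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Reassign the values of one column at random across the rows."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Build new rows so the input data is left untouched.
    return [{**row, column: value} for row, value in zip(rows, values)]


employees = [
    {"name": "Ann", "salary": 52_000},
    {"name": "Ben", "salary": 61_000},
    {"name": "Cat", "salary": 75_000},
]
masked = shuffle_column(employees, "salary")
print(sorted(r["salary"] for r in masked))  # [52000, 61000, 75000]
```

A random shuffle can leave some values in their original rows, especially in small tables, which is one reason the technique works better on larger datasets.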
8. Date Switching
If the data in question involves dates that you want to keep confidential, you can apply policies to each data field to obfuscate the real date. For example, you can set back the dates of all active contracts by 100 days. The drawback of this method is that, because the same policy applies to all values in a field, the compromise of one value results in the compromise of all values.
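The 100-day example can be sketched like this. The contract dates are made up and `shift_dates` is a hypothetical helper.

```python
from datetime import date, timedelta


def shift_dates(dates, days=100):
    """Move every date back by the same fixed number of days."""
    offset = timedelta(days=days)
    return [d - offset for d in dates]


contracts = [date(2024, 4, 10), date(2024, 6, 1)]
masked = shift_dates(contracts)
print(masked[0])  # 2024-01-01

# The weakness described above: one known pair reveals the offset for all rows.
leaked_offset = contracts[0] - masked[0]
print(masked[1] + leaked_offset == contracts[1])  # True
```

A fixed shift keeps the intervals between dates intact, which is useful for testing, but it is also what makes the whole field recoverable once a single real date is known.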
What Are the Challenges of Data Masking?
Here are some of the key challenges involved in data masking:
- Format preservation—the data masking solution has to understand the data (i.e., what it represents). When the masking system replaces the original data with inauthentic data, it should preserve the original format. This is especially important for data strings that require a specific order or format, such as dates.
- Referential integrity—the tables in a relational database are connected via primary and foreign keys. When the masking solution obfuscates or replaces the values of a table’s primary key, the same replacement values must be applied consistently to every reference to that key across the database.
- Gender preservation—the masking system should have gender awareness when replacing a person’s name in the database, and be able to detect whether a name is typically male or female. The gender distribution in a table will be altered if the masking system changes names randomly.
- Semantic integrity—databases typically enforce rules that limit the range of values permitted (e.g., the range of salaries). Any masked data must fall within the specified range in order to preserve the semantics (meaning) of the data.
- Data uniqueness—when masking unique data, the masking system should apply unique values for every data element. If the table in question stores employee SSNs, for example, each employee should receive a unique SSN after masking. The frequency distribution of the masked data should be retained, especially if the distribution is meaningful (e.g., a geographic distribution). On average, the masked values in each column should be similar to the originals.
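Several of these constraints can be handled together by building one replacement map and applying it everywhere a key appears. The sketch below assumes two hypothetical tables, `employees` and `payroll`, joined on a nine-digit `emp_id`. It illustrates uniqueness, format preservation, and referential integrity only.

```python
import random


def build_key_map(keys, seed=None):
    """Map each distinct key to a distinct random nine-digit string."""
    rng = random.Random(seed)
    distinct = sorted(set(keys))
    # sample() draws without replacement, so no two keys share a masked value.
    masked = rng.sample(range(100_000_000, 1_000_000_000), len(distinct))
    return {original: str(new) for original, new in zip(distinct, masked)}


def mask_tables(employees, payroll, seed=None):
    """Apply the same key map to the primary key and to every foreign key."""
    key_map = build_key_map([e["emp_id"] for e in employees], seed)
    masked_employees = [{**e, "emp_id": key_map[e["emp_id"]]} for e in employees]
    masked_payroll = [{**p, "emp_id": key_map[p["emp_id"]]} for p in payroll]
    return masked_employees, masked_payroll


employees = [{"emp_id": "123456789", "dept": "QA"}, {"emp_id": "987654321", "dept": "Dev"}]
payroll = [{"emp_id": "987654321", "amount": 4200}, {"emp_id": "123456789", "amount": 3900}]
masked_emp, masked_pay = mask_tables(employees, payroll)

# Every payroll row still points at an existing employee after masking.
print({p["emp_id"] for p in masked_pay} <= {e["emp_id"] for e in masked_emp})  # True
```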
Data Masking Best Practices
Best practices for data masking include:
- Data discovery—before you can protect your data, you need to understand what data you hold and distinguish between types of information with different degrees of sensitivity. Security and business experts typically collaborate to produce a comprehensive inventory of all the data elements across an enterprise.
- Survey of circumstances—the security director responsible for determining the availability of sensitive data should review the circumstances in which the data is stored and used, and decide on the appropriate masking strategy for each type of data.
- Masking implementation—for large enterprises, it is not realistic to apply a single data masking technique across all datasets. Each type of data has to be considered in terms of its structure, engineering, and usage needs.
- Masking testing—this involves testing the results of the data masking techniques. The QA and testing teams must verify that the data masking techniques used produce the desired outcomes. If a masking technique falls short of expectations, the DBA must restore the database to its original, unmasked state and apply a new masking procedure with different algorithms.
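Part of the testing step can be automated. The sketch below assumes a masking technique that is expected to change every value while keeping its format. Techniques such as shuffling or averaging would need different checks. The `check_masking` function and the SSN pattern are illustrative assumptions.

```python
import re


def check_masking(original, masked, pattern):
    """Return a list of problems found when comparing a masked column to its source."""
    problems = []
    if len(original) != len(masked):
        problems.append("row count changed")
    expected_format = re.compile(pattern)
    for i, (before, after) in enumerate(zip(original, masked)):
        if before == after:
            problems.append(f"row {i}: value was not masked")
        if not expected_format.fullmatch(after):
            problems.append(f"row {i}: format not preserved")
    return problems


ssn_pattern = r"\d{3}-\d{2}-\d{4}"
print(check_masking(["123-45-6789"], ["987-65-4321"], ssn_pattern))  # []
print(check_masking(["123-45-6789"], ["123-45-6789"], ssn_pattern))
# ['row 0: value was not masked']
```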
Data Masking with Satori
Satori enables dynamic masking over any data platform being accessed, based on the security policies you choose. Policies can be set by identity, by data location, and by data type.