Data Lake and Data Warehouse Security
In this chapter we’ll provide information about Data Classification and Data Cataloging, and cover the following topics:
Protecting data lakes and data warehouses is especially challenging, due to the following factors:
- Data lakes and data warehouses are at the intersection of multiple functions. Data from multiple sources is transformed, and various functions (whether people or apps) process the output with different methods and purposes. This variety makes any security project complicated, as many components, steps, and processes may fail.
- The data may be very versatile, especially in data lakes. Data can be structured (many different types), semi-structured, or unstructured.
- The high rate of data technology innovation, makes securing it an ever-moving target.
- The internal security controls implemented in data warehouses and data lakes are often insufficient and require additional external controls to ensure security.
What Are the Security Concerns in Building a Data Warehouse?
When building a data warehouse, or when securing an existing one, there are several important concerns to address:
- Mapping existing data. You should be able to catalog the data in your warehouse, understand where sensitive data is stored, and identify who has access to it. This precaution is important both in terms of risk reduction and in terms of compliance with different regulations. The dynamic nature of data storage makes this process challenging.
- Implementing access controls. Begin with proper authentication, which usually includes integration with your organization’s SSO control and may require additional restrictive policies (i.e. connection is done only from certain network segments or VPNs, require two factor authentication, etc).
- Proper authorization. Configure which users (who are usually assigned with groups or roles) need access to and what type of access is required. For example, a certain team may need to update data, but other teams only need read access.
- Granularity in the access. In many cases, there is a need for a fine-grained access control to the data warehouse to set different levels of access based on specific columns (e.g. users of specific roles can access the customers’ table but not their PII or payment details), rows, or items (e.g. a team can access only data specific to the customers in its region, not the entire customer base). The modern approach often requires data-based access control (not to be confused with database access control), which allows or restricts access to data based on its type without specific data location limits. For example, with this approach, a certain team may have access to an entire data warehouse but not to PII (as it may be blocked or masked).
- Combating over-privilege. As projects evolve, employees change positions, and configuration issues may lead some of the data warehouse’s users to have more access than they actually require. This incongruity between need and access is a risk which should be mitigated. To solve this problem, it is important to analyze the actual usage of the different data locations compared to the access granted to users. It is important to set workflows around rarely accessed data.
- Auditing and monitoring. Maintaining audit and monitoring queries and jobs running on the data warehouse is essential. This desired state is either default or requires specific configuration as a function of the data warehouse technology. Even when enabled, the query audit or access log may need to be moved to different storage locations and monitored for purging and aggregations.
- Behavior analysis. Access to data should be analyzed for risks such as bad practices or insider threats. For example, abusing a specific account and using it for multiple purposes and/or by different teams is bad practice. An employee unexpectedly pulling large amounts of sensitive data is an example of potential insider threat..
- Continuous security. The highly dynamic nature of data environments requires continuous security, rather than one-time projects. For example, mapping the location of sensitive data may prove futile, as the results may be obsolete by the time you receive them.
Security Policies for Data Warehouses
Security policies in data warehouses address security and compliance gaps in order to reduce the risks involved. Some policies can be implemented on the data warehouse itself (e.g. restricting access to specific resources to specific roles), and others should be handled by external tools. In practice, decoupling the security policies from the data stores themselves may be more efficient and easy to manage.
Examples of security policies for data warehouses:
- A policy ensuring that only users in specific groups or roles have access to sensitive information (such as PII).
- A policy restricting access to amounts or types of data being pulled from outside of the office IP addresses.
- A policy restricting users from accessing data they have not accessed within a given period.
- A policy restricting users only to data from their geographic region.
Implementing Security Controls for Snowflake Data Warehouse
Several common concerns to note when implementing security controls for Snowflake:
- Role hierarchies may be tempting to use but may not prove effective in the long run, as they sometimes make it difficult to understand why specific users or roles have access to various resources. Make sure to plan which roles architecture you want to use for your Snowflake.
- Access logs and query logs are kept for a certain amount of time and may have data latency issues (it takes time for the data to reach the query log). Handling the logs efficiently may mean integrating with SIEM or other systems, or using a specific data warehouse protection product such as Satori.
- Achieving granular access to data may require implementing column and row based access controls. You may implement this to a certain extent by using secure views or by decoupling the security from Snowflake and using Satori to create such restrictions.
For further details about securing your Snowflake data warehouse, read our Snowflake data protection guide.
Implementing Security Controls for AWS Redshift Data Warehouse
Common concerns when implementing security controls for AWS Redshift:
- Access and query logs are very limited (especially in terms of retention time), and you may want to enable logging to an external store (such as a S3 bucket). Doing so may require further processing, ETLs, aggregations, and purging of the data.
- In most cases consider investing the time to integrate Redshift access with an identity management system, such as Okta, to enhance the visibility of user access privileges
For further details about securing your AWS Redshift data warehouse, read our Redshift data protection guide.
Implementing Security Controls for GCP BigQuery
Common concerns when implementing security controls for BigQuery:
- It is important to monitor your BigQuery jobs and to analyze the BigQuery jobs log for insight into the queries running on your BigQuery data warehouse.
- It is important to ensure that users only have access to the specific tables they need (or even to implement row and column based security if possible).
For further details about securing your GCP BigQuery data warehouse, read our BigQuery data protection guide.