Data Nesting refers to storing data in a nested structure. This structure is commonly used in document-based databases and in data formats such as JSON. It differs from traditional data warehouses, which generally store data in tabular form. In recent years, nested storage has become more common, as document-based databases have been popularized by web applications that make extensive use of nested data types.
In general, we can define Data Nesting as data in which a single key holds an unlimited number of observations. That key can itself be nested within a higher-level structure whose members each hold multiple, but not necessarily equal, numbers of observations.
In the context of dimensional modeling, an example of this is data collected from a website. Visit-level statistics, such as the number of visits and visit duration, are stored. There are also attributes that only exist at the visit level, such as the user’s IP address, browser type, and OS. Statistics are also kept per pageview, and pageviews have their own attributes, such as page name, page category, and page URL.
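The visit and pageview levels described above can be sketched as a single nested record. This is a minimal illustration, not a schema from any particular product: every field name and value below (`visitor_id`, `seconds_on_page`, the IP address, and so on) is a hypothetical placeholder chosen for the example.

```python
import json

# Hypothetical visit record. Field names and values are illustrative
# assumptions, not taken from a real system.
visit = {
    "visitor_id": "A",
    "ip_address": "203.0.113.7",  # placeholder from a documentation IP range
    "browser": "Firefox",
    "os": "Linux",
    "visit_duration_seconds": 95,
    # Pageviews are nested under the visit; each visit may hold a
    # different number of them.
    "pageviews": [
        {"page_name": "Home", "page_category": "landing",
         "page_url": "/", "seconds_on_page": 40},
        {"page_name": "Pricing", "page_category": "product",
         "page_url": "/pricing", "seconds_on_page": 55},
    ],
}


def pageview_count(record):
    """Return how many pageviews are nested under one visit."""
    return len(record["pageviews"])


def total_seconds_on_pages(record):
    """Sum a pageview-level statistic across the nested list."""
    return sum(pv["seconds_on_page"] for pv in record["pageviews"])


# The same structure maps directly onto JSON and back.
encoded = json.dumps(visit)
decoded = json.loads(encoded)
```

Visit-level attributes appear once, while the `pageviews` list can grow independently for each visit.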
In the traditional world of a data mart or data warehouse design, a common approach to creating a model to support the analysis of this web data might be to create something that looks like the following (simplified) data model.
- Day 1:
  - Visitor A:
    - Browser OS
    - Machine IP
    - Clicks:
      - Page A
      - Page B
  - Visitor B:
    - Browser OS
    - Machine IP
    - Clicks:
      - Page C
  - Visitor A:
- Day 2:
  - Visitor C:
    - Browser OS
    - Machine IP
  - Visitor D:
    - Browser OS
    - Machine IP
  - Visitor C:
This type of modeling addresses a few challenges that occur when building models for business intelligence.
Benefits of Nested Data
- Reduced duplication of data, because observations can be added without duplicating the structure that contains them. For example, inserting new details about each user’s clicks doesn’t require copying the information about the visitor or the page-level details, meaning queries against the visit fact will perform better.
- A dimensional model allows using a single key column (for example, Browser OS) instead of multiple columns for each detail around the browser and OS (versions, device, etc.). A single key column reduces the storage cost associated with the model.
- This type of model can also improve query performance, especially for commonly used values.
- In document-based databases, nesting reduces operating costs, and the information is stored as file-type documents.
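The reduced-duplication point can be illustrated with a small sketch that adds one click to a nested visitor record and to an equivalent flat table. The helper names (`add_click_nested`, `add_click_flat`) and the attribute values are hypothetical, chosen only for this example.

```python
def add_click_nested(visitor, page):
    """Append one observation; the visitor attributes are not copied."""
    visitor["clicks"].append(page)


def add_click_flat(rows, visitor, page):
    """Flat-table equivalent: every click row repeats the visitor attributes."""
    rows.append({
        "visitor": visitor["name"],
        "browser_os": visitor["browser_os"],
        "machine_ip": visitor["machine_ip"],
        "page": page,
    })


# Hypothetical starting state: Visitor A has already clicked Page A.
visitor_a = {
    "name": "Visitor A",
    "browser_os": "Firefox / Linux",  # placeholder value
    "machine_ip": "203.0.113.7",      # placeholder documentation IP
    "clicks": ["Page A"],
}
flat_rows = [{
    "visitor": "Visitor A",
    "browser_os": "Firefox / Linux",
    "machine_ip": "203.0.113.7",
    "page": "Page A",
}]

add_click_nested(visitor_a, "Page B")
add_click_flat(flat_rows, visitor_a, "Page B")

# Count how many times the IP address is stored in the flat form.
ip_copies_flat = sum(
    1 for row in flat_rows if row["machine_ip"] == visitor_a["machine_ip"]
)
```

In the nested form the IP address is stored once however many clicks are added, while in the flat form it is stored once per click.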
Challenges of Nested Data
- This approach is at odds with the typical way of adding entries to a log file, which generally means writing to the data store as the events occur.
- The process of turning this data into the dimensional model shown above can be expensive and time-consuming.
- The approach can lead to expensive queries, since data will often need to be joined in a complex way. Some queries might require joining billions of page view records with millions of user records, which can ultimately lead to long-running queries with poor performance. The situation might be even worse when applying simple filter, scan, and aggregation operations on a single table.
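The conversion and join steps mentioned in this list can be sketched as follows. This is a simplified illustration under stated assumptions: the `days` structure mirrors the day, visitor, and click hierarchy shown earlier, the attribute values are placeholders, and `to_dimensional` and `join_clicks_to_visitors` are hypothetical helper names rather than functions from any library.

```python
# Nested input mirroring the day -> visitor -> clicks hierarchy.
# Attribute values are placeholders for illustration.
days = {
    "Day 1": {
        "Visitor A": {"browser_os": "os-a", "machine_ip": "ip-a",
                      "clicks": ["Page A", "Page B"]},
        "Visitor B": {"browser_os": "os-b", "machine_ip": "ip-b",
                      "clicks": ["Page C"]},
    },
    "Day 2": {
        "Visitor C": {"browser_os": "os-c", "machine_ip": "ip-c",
                      "clicks": []},
        "Visitor D": {"browser_os": "os-d", "machine_ip": "ip-d",
                      "clicks": []},
    },
}


def to_dimensional(nested):
    """Split nested data into a visitor dimension and a click fact table."""
    visitor_dim = {}
    click_fact = []
    for day, visitors in nested.items():
        for name, attrs in visitors.items():
            visitor_dim[name] = {
                "browser_os": attrs["browser_os"],
                "machine_ip": attrs["machine_ip"],
            }
            # One fact row per click; visitors without clicks add no rows.
            for page in attrs["clicks"]:
                click_fact.append({"day": day, "visitor": name, "page": page})
    return visitor_dim, click_fact


def join_clicks_to_visitors(click_fact, visitor_dim):
    """Re-attach visitor attributes to each click row (a lookup join)."""
    return [{**row, **visitor_dim[row["visitor"]]} for row in click_fact]


visitor_dim, click_fact = to_dimensional(days)
joined = join_clicks_to_visitors(click_fact, visitor_dim)
```

Every pass over the nested data and every lookup in the join scales with the number of click rows, which gives a rough sense of why the same operations become costly at billions of records.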
Support for nested data in data platforms makes modeling and storage decisions simpler while driving improvements in query performance. Although these advantages will continue to drive adoption, challenges remain, such as those related to using this type of data in traditional analysis and business intelligence scenarios.
Cloud Data Security with Satori
Satori, the DataSecOps platform, gives companies the ability to enforce security policies from a single location across all databases, data warehouses, and data lakes. Such policies can include data masking, data localization, row-level security, and more.