Guide: Data Management

Introduction To Data Meshes

Data mesh is one of the latest innovations in data architecture. It aims to solve the scalability challenges inherent to the traditional data lake approach. When implemented effectively, the data mesh approach enables teams to take full ownership of their data, curate high-quality data-as-a-product packages, and efficiently deliver these to data consumers. This process improves cross-organizational data flow and allows organizations to extract greater value from this data.

Is data mesh all that it is cracked up to be? How would you implement it? Continue reading to find out.

What Is a Data Mesh?

Data mesh is the latest architectural shift in the data world. Some see it as an integral part of the data democratization process because it allows various teams to access relevant, valuable data in real time. Simply put, it is a self-service design that is domain-oriented.

With a data mesh, software engineering best practices and techniques are paired with the lessons learned in developing robust and resilient internet-scale applications. This novel approach allows organizations to efficiently manage data in a manner that is easily scalable.

An effective data mesh allows all teams within an organization to take complete ownership of their data and package it in a useful way for other teams to access. These other teams access the data in a self-service manner, extracting what they need, when they need it, in real time. In essence, the data in your data mesh is domain-oriented and serves as a useful product available to other domains.

This approach contrasts starkly with the traditional siloed data approach. In the past, data producers, usually operating in multi-disciplinary teams, were disconnected from data consumers and from the platform engineers who process their data. These highly specialized data and ML (machine learning) platform engineers would then process the data and package it for teams of multi-disciplinary data consumers. Since these teams have no real interactions or collaborations, the process lacks cross-functional understanding, and many potential insights that could be extracted from the data are lost. Additionally, the data is often packaged in a manner that is incompatible with the data consumer’s platform and needs, leading to frustration and unnecessary time spent on converting data to a useful format.

Data meshes are controlled by global governance and open standards, enabling (and hopefully ensuring) interoperability. This feature overcomes the challenge of data incompatibility between producers and consumers since the data is delivered as a consumer product. In essence, data producers curate the data package so that consumers get exactly what they need in a format that is useful to them.

At the nuts and bolts level, data infra engineers create a platform underpinning all of these data-sharing activities. This platform includes storage, data pipelines, catalogs, access control, and other functions necessary for effective data sharing.

Note that a data mesh is not a traditional siloed data storage facility, where data transfer is cumbersome and the system is not easily scalable. Rather, data meshes are meant to scale quickly, both in depth (i.e. number of users and amount of data) and breadth (i.e. cross-organizational reach). They can easily accommodate the constant increase in data sources and the ongoing changes in the data landscape. Additionally, well-constructed data meshes can seamlessly keep up with the speed at which these changes occur, as well as with the great variety of consumer data and use cases. 

Why Shift to a Data Mesh?

The domain data product is the primary focus in a data mesh, with the data lake and pipeline taking a background role as an implementation detail. Here, data forms an ecosystem of products that integrate well and consist of multiple nodes meshed together to form a coherent whole.

The shift away from a centralized data lake allows for local control of relevant data, both on the consumer side and on the backend. It allows for better visualization of business reporting and improved data sharing practices. When harnessed and developed to their full potential, data meshes help companies achieve greater interdepartmental collaboration and make higher-quality data available more quickly across the organization.

Data meshes are inherently scalable, enabling an organization’s data ecosystem to be rapidly expandable and flexible. This agility can give organizations a leading edge in today’s competitive, data-driven world. The traditional monolithic approaches, such as data lakes, served us well in the past but are now becoming inadequate given the growing complexity and sheer volume of data needs. Data meshes allow organizations to move away from the rigidity and lack of scalability inherent in these traditional approaches.

Since the data mesh approach ensures that high-quality data is produced, consumers can analyze and thus react to data trends far more rapidly than before. This improves business agility and, by implication, the bottom line.

Primary Benefits of the Data Mesh Approach

  1. Increased team flexibility, as teams operate independently using the self-service approach. This flexibility improves time-to-market and decreases the IT backlog.
  2. Teams produce high-quality data that is standardized and easily accessible.
  3. Domain experts and product owners manage data while encouraging greater collaboration between IT and business teams.
  4. Teams can provide data faster since the self-service architecture takes care of complex details. These include identity administration, data monitoring and storage, and other semantics.
  5. Faster, data-driven business decision-making, which reduces lead times and expenditure.

Overall, the data mesh approach increases flexibility and speeds up access to valuable data-driven insights which, in turn, increases profitability.

Main Aspects Underpinning an Effective Data Mesh

Four main aspects go into a well-functioning data mesh, and all of them require a mature understanding of data principles across the organization.

Domain-Oriented Data Ownership

Business teams in digital organizations are typically aligned with the business domain in which they operate. These teams must be empowered to own, govern, and share the data that they produce. Effective regulations enable this data sharing to occur consistently, minimizing the time between data production and value delivery. For this approach to be practical, a solid data understanding is essential.

Data as a Product

Viewing data as a product instead of as an asset ensures higher data quality. Here, data is the responsibility of the product owner who must deliver data that is coherent and self-contained.

Self-Service Data Infrastructure

Self-service infrastructure enables developers to create analytical data products faster. In this case, the time taken to gain insights from the data is reduced while ensuring compliance with organizational data regulations.

Federated Computational Governance

Setting and preserving global controls while facilitating local flexibility is essential to effective data mesh operations. With this approach, data product owners are responsible for compliance with standards that are centrally managed to ensure robust and consistent security policies.
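
As a rough illustration of "global controls with local flexibility", the following Python sketch applies centrally defined checks to a data product descriptor and lets a domain add its own checks on top. The descriptor fields, policy names, and the `violations` helper are hypothetical and are not taken from any specific governance tool.

```python
from typing import Callable, Optional

Descriptor = dict[str, object]
Policy = Callable[[Descriptor], bool]

# Centrally managed policies that every data product must satisfy (assumed examples).
GLOBAL_POLICIES: dict[str, Policy] = {
    "has_owner": lambda d: bool(d.get("owner")),
    "pii_is_masked": lambda d: not d.get("contains_pii") or d.get("masked") is True,
}


def violations(descriptor: Descriptor,
               local_policies: Optional[dict[str, Policy]] = None) -> list[str]:
    """Return the names of all failed policies, sorted alphabetically."""
    # Global policies are merged last, so a domain cannot override them by name.
    checks = {**(local_policies or {}), **GLOBAL_POLICIES}
    return sorted(name for name, check in checks.items() if not check(descriptor))


# A domain adds its own rule on top of the global ones.
sales_local = {"has_refresh_schedule": lambda d: "schedule" in d}
orders = {"name": "orders", "owner": "sales", "contains_pii": True, "masked": False}

result = violations(orders, sales_local)
```

In this sketch the domain stays free to add stricter rules, while the global rules always apply regardless of what the domain defines.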

Common Data Mesh Architectures

There are numerous architectures available for practical data meshes, but they are all underpinned by the same main principles:

  1. Domain-oriented data ownership
  2. Data as a product
  3. Self-service data infrastructure
  4. Federated computational governance

Data Types

To build an effective data mesh, you must first understand what type of data you are working with. Data can essentially be divided into two categories: operational data and analytical data.

Operational data keeps the business running. This data, stored in databases, underpins business capabilities and enables the organization to perform its daily functions. It typically has a transactional nature, serves the needs of business applications, and maintains the current state of the business. 

By contrast, analytical data allows you to view the state of the business over time. This data is meant to show retrospective trends or predict future ones. It is used to train ML (machine learning) models and forms the basis for analytical reports. 

Within traditional data architectures, these two data fields exist in parallel, and integration between these two planes is complex and often nearly impossible.
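
To make the distinction concrete, here is a minimal Python sketch. The class names (`OperationalStore`, `AnalyticalLog`) and the stock figures are hypothetical; the point is only that the operational side keeps the current state, while the analytical side keeps a history that can be aggregated over time.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class OperationalStore:
    """Holds only the current state, as a transactional database would."""
    stock: dict[str, int] = field(default_factory=dict)

    def apply(self, part: str, delta: int) -> None:
        # Update the current value; the previous state is not retained.
        self.stock[part] = self.stock.get(part, 0) + delta


@dataclass
class AnalyticalLog:
    """Append-only history of changes, used for trends and reporting."""
    events: list[tuple[date, str, int]] = field(default_factory=list)

    def record(self, day: date, part: str, delta: int) -> None:
        self.events.append((day, part, delta))

    def net_change_by_month(self, part: str) -> dict[str, int]:
        totals: dict[str, int] = {}
        for day, event_part, delta in self.events:
            if event_part == part:
                key = day.strftime("%Y-%m")
                totals[key] = totals.get(key, 0) + delta
        return totals


operational = OperationalStore()
analytical = AnalyticalLog()
for day, part, delta in [
    (date(2024, 1, 5), "bolt", 100),
    (date(2024, 1, 20), "bolt", -30),
    (date(2024, 2, 3), "bolt", -20),
]:
    operational.apply(part, delta)
    analytical.record(day, part, delta)

current_stock = operational.stock["bolt"]
monthly_trend = analytical.net_change_by_month("bolt")
```

The operational store can answer "how many bolts are in stock now?", while only the analytical log can answer "how did stock change month by month?".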

Analytical data is further divided into two categories: data warehouses and data lakes. Data lakes serve the data science field, while data warehouses underpin business intelligence reporting and analytics. These two fields are often incompatible and cause many challenges for data scientists and IT teams.

In a data mesh, these differences are identified and planned for accordingly. Data mesh architecture takes the data’s nature and topology into account while accounting for the quirks and habits of the data consumer. An effective mesh connects the diverging data types in an inverted model and topology, using a domain-based approach rather than the traditional siloed technology stack approach.

Domain Ownership

Dividing data into the various domains in which they originate is one of the principles that underpins an effective data mesh. In most organizations, it is relatively simple to find the dividing lines since the organization is typically split into teams according to its business functions. 

The data managed by each team covers a wide range of aspects, all connected to the team’s business function. While managing and packaging current data is essential, this task also requires past data since analyzing historical trends is essential in training ML models to predict future trends.

For this principle to work effectively, the data (both analytical and operational) must be organized in domains. Each domain must have complete autonomy over its data and the code releasing that data to other domains within the organization. For example, users must be able to schedule new production orders in a manufacturing facility and access logs of past production runs.

In real-world situations, domains can be interdependent. To use the manufacturing example again, the production domain will likely interact with the procurement domain to determine the stock of parts needed. In turn, the sales domain would interact with the production domain to communicate demand levels. These domains are all interdependent but have complete ownership and responsibility over their data.
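
The manufacturing example can be sketched in a few lines of Python. Everything here is illustrative: the `Domain` class, the product names, and the numbers (`PARTS_PER_WIDGET`, `ON_HAND`) are assumptions made up for the sketch. Each domain publishes its own data products and reads other domains' products only through a read interface.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    # Data products this domain owns and publishes, keyed by product name.
    products: dict[str, list[dict]] = field(default_factory=dict)

    def publish(self, product: str, rows: list[dict]) -> None:
        self.products[product] = rows

    def read(self, product: str) -> list[dict]:
        # Consumers receive copies, so they cannot alter the owner's data.
        return [dict(row) for row in self.products[product]]


sales = Domain("sales")
production = Domain("production")
procurement = Domain("procurement")

sales.publish("demand_forecast", [{"item": "widget", "units": 120}])

# Production consumes the sales product and publishes its own.
PARTS_PER_WIDGET = 4  # hypothetical bill-of-materials figure
production.publish(
    "parts_required",
    [
        {"part": "bolt", "qty": row["units"] * PARTS_PER_WIDGET}
        for row in sales.read("demand_forecast")
    ],
)

# Procurement consumes the production product to decide what to order.
ON_HAND = {"bolt": 300}  # hypothetical stock level
procurement.publish(
    "purchase_orders",
    [
        {"part": row["part"], "qty": row["qty"] - ON_HAND.get(row["part"], 0)}
        for row in production.read("parts_required")
        if row["qty"] > ON_HAND.get(row["part"], 0)
    ],
)
```

The domains depend on each other's output, yet each one only ever writes to its own products.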

Data as a Product

In a data mesh, the number of data producers increases dramatically compared to the traditional data lake or data warehouse approach. When treated as a product, the quality of this data improves, rendering it useful and valuable. It is no longer what Gartner calls “dark data” that is unused and simply takes up storage space. Treating analytical data as a product implies that the data consumers are treated as customers, thus improving the delivered product.

To this end, the data pack should be discoverable, secure, explorable, comprehensible, and trustworthy. Here, the domain that produces this data should take full ownership of it, ensuring that it meets or surpasses these criteria. Measurable outcomes such as data quality, decreased lead time, and data user satisfaction should be used to evaluate the efficacy of this approach.

This approach necessitates inter-domain collaboration since the data producer should thoroughly understand who the data consumer is, how they consume the data, and which native methods they employ to this end. With this information, the producer can establish an interface that suits the data consumer’s needs. Within each domain, there should be dedicated data product developer roles that are responsible for building, maintaining, and serving data products within the domain. It is possible to have multiple data products per domain. In cases where data products do not naturally fit within existing business domains, the organization can form new teams specifically designed to service those data products.

The data product provides access to analytical data as a product by encapsulating three structural components:
  1. Code, which is subdivided into three sections:
    • Pipelines that transform data received from upstream products or the domain’s operational system into useful information.
    • APIs that provide access to data, semantics, syntax, and other metadata.
    • Governance controls that enforce access control, compliance, etc.
  2. Data and metadata, which are the underlying analytical and historical data in a polyglot form.
  3. Infrastructure, which enables the product’s code to be built, deployed, and run.
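
The three structural components listed above can be sketched as a single Python object. This is a simplified illustration under stated assumptions: the `DataProduct` class, its field names, and the role-based check are hypothetical, and real infrastructure is reduced to a plain description.

```python
from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]


@dataclass
class DataProduct:
    name: str
    # 1. Code: a pipeline, an access API (the `read` method), and governance.
    pipeline: Callable[[list[Row]], list[Row]]
    allowed_roles: set[str]
    # 2. Data and metadata.
    metadata: dict[str, str]
    # 3. Infrastructure, reduced here to a plain description.
    infrastructure: dict[str, str]
    _data: list[Row] = field(default_factory=list, repr=False)

    def ingest(self, upstream: list[Row]) -> None:
        # The pipeline turns upstream or operational data into the served data.
        self._data = self.pipeline(upstream)

    def read(self, role: str) -> list[Row]:
        # Governance is enforced at the point of access.
        if role not in self.allowed_roles:
            raise PermissionError(f"role {role!r} may not read {self.name}")
        return list(self._data)


def completed_runs_only(rows: list[Row]) -> list[Row]:
    return [row for row in rows if row["status"] == "done"]


production_runs = DataProduct(
    name="production_runs",
    pipeline=completed_runs_only,
    allowed_roles={"analyst"},
    metadata={"owner": "production", "schema": "run_id:int, status:str"},
    infrastructure={"storage": "object-store", "schedule": "daily"},
)
production_runs.ingest([
    {"run_id": 1, "status": "done"},
    {"run_id": 2, "status": "failed"},
])
rows = production_runs.read("analyst")
```

Calling `production_runs.read("guest")` would raise `PermissionError`, since that role is not in `allowed_roles`.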

Self-Service Data Platform

The platform through which the data product can be accessed requires considerable infrastructure which is necessarily complex and specialized. For this reason, the infrastructure is not easily replicated across domains. The platform used to create these data products should be a high-level abstraction of this infrastructure, giving teams what they need to take ownership of their domains. At the backend, this software would manage the life cycle of data products, provision resources, and manage sundry complexities.

Achieving this goal is where the challenge lies. Data platforms used in different domains vary greatly, and, up until now, these platforms did not interact with each other. Now, they must connect and correspond to facilitate data flow across various pipelines in the organization.

On the software side, this system would operate in various planes where each plane represents a collection of functions within the overarching architecture. Here is an example of planes that would typically be required:

  • Data infrastructure provisioning. A low-level plane accessed by advanced data product developers where provisioning for the infrastructure underpinning the self-service platform occurs.
  • Data product developer experience. This plane functions as the primary interface for data product developers. It is a high-level abstraction of the complexities underpinning the data extraction process.
  • Data mesh supervision. This plane provides a broad overview of the mesh with global control, search, and discovery capabilities.
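
One way to picture how these planes might layer on top of each other is the following Python sketch. The class and method names are hypothetical, and provisioning is only simulated with strings. The developer experience plane hides the provisioning calls, and the supervision plane searches across everything that has been registered.

```python
class InfrastructurePlane:
    """Low-level provisioning of storage and pipelines (simulated here)."""

    def __init__(self) -> None:
        self.resources: list[str] = []

    def provision(self, kind: str, owner: str) -> str:
        resource_id = f"{kind}:{owner}"
        self.resources.append(resource_id)
        return resource_id


class DeveloperExperiencePlane:
    """High-level interface that hides the provisioning details."""

    def __init__(self, infra: InfrastructurePlane) -> None:
        self.infra = infra
        self.catalog: dict[str, dict[str, str]] = {}

    def create_data_product(self, name: str, domain: str) -> None:
        # One call from the developer triggers several low-level steps.
        storage = self.infra.provision("storage", name)
        pipeline = self.infra.provision("pipeline", name)
        self.catalog[name] = {
            "domain": domain,
            "storage": storage,
            "pipeline": pipeline,
        }


class SupervisionPlane:
    """Mesh-wide search and discovery over registered data products."""

    def __init__(self, dev: DeveloperExperiencePlane) -> None:
        self.dev = dev

    def search(self, term: str) -> list[str]:
        return sorted(name for name in self.dev.catalog if term in name)


infra = InfrastructurePlane()
dev = DeveloperExperiencePlane(infra)
mesh = SupervisionPlane(dev)

dev.create_data_product("production_runs", domain="production")
dev.create_data_product("sales_orders", domain="sales")
found = mesh.search("sales")
```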

Data Mesh vs. Data Lake

A data lake is a repository or system that stores original data in its raw or natural form. This data is usually found as object blobs and files and should typically be mined first in order to be useful. It is typically stored in one physical location, such as on a secure server in the cloud. Initially, data lakes were meant to be used as a single domain instead of the multi-domain approach taken in a data mesh which views data as interconnected across the entire system.

Data lakes are monolithic in that all data-related activities relate to a single, centralized platform. While this system is sufficient for small operations, it often creates a bottleneck once the organization’s data needs escalate.

When the data mesh functions effectively, your company most likely will no longer need a centralized data lake. With the data mesh, the distributed logs and storage containing the original data are available as products in the form of immutable datasets, all of which are addressable and quickly found. Note that, in cases where the original data’s format must be changed, it may be necessary to create a local data lake or hub within the domain where the data is held.
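
As an illustration of what "immutable and addressable" can mean in practice, the sketch below uses content addressing: the address of a dataset is a hash of its contents, so a stored version can never change under its address. This is one possible technique chosen for the example, not something the data mesh approach prescribes, and the `DatasetStore` class is hypothetical.

```python
import hashlib
import json


class DatasetStore:
    """Stores datasets under an address derived from their content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, rows: list[dict]) -> str:
        # Serialize deterministically so equal data yields an equal address.
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        address = hashlib.sha256(payload).hexdigest()
        # Existing content is never overwritten.
        self._blobs.setdefault(address, payload)
        return address

    def get(self, address: str) -> list[dict]:
        return json.loads(self._blobs[address])


store = DatasetStore()
v1 = store.put([{"run_id": 1, "status": "done"}])
v2 = store.put([{"run_id": 1, "status": "done"}, {"run_id": 2, "status": "done"}])
```

Here a changed dataset gets a new address (`v2`), while `v1` keeps pointing at the original rows.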

Some data lake principles are still used in a data mesh, ensuring that the original data can still be explored and analyzed in source-oriented domain products. Tools traditionally associated with data lakes are also still utilized for internal implementation or as a part of the shared data infrastructure.

One way to think about the interconnectedness of the data mesh ecosystem is to equate data lakes and the data mesh to interconnected lakes and streams on a continent, as was done by Jeffrey T Pollock. In this metaphor, data flows in rivers and streams and is sometimes gathered in lakes, and all of these components together form the data ecosystem. Data products are produced from data gathered in the system. This data does not have to be mined in a data lake; it could be, but it could also be gathered from streams and rivers. This means that data meshes could include data lakes, but they do not have to.

Data Mesh vs. Data Fabric

Data fabric is an architecture approach that standardizes data management across multiple endpoints, including hybrid multi-cloud environments, on-premises, and edge devices. This architecture includes data services and addresses the increasing complexity of data management. When implemented successfully, the data fabric approach ensures that any data from any source can be combined, shared, and accessed. This process is governed effectively to ensure minimal or no disruptions to the end-users.

The data mesh approach is similar to that of the data fabric. While it follows the same underpinning logic as the data fabric (robust, flexible data management), the data mesh takes this approach a few steps further. Unlike a data fabric, the data mesh does not limit its focus to the data or IT teams. Instead, it facilitates an organizational shift by enabling all multi-disciplinary teams to take ownership of their data. As with the data fabric, the processes involved in data mesh are closely governed, ensuring that high-quality data is shared in a manner that all stakeholders can quickly access.

Takeaways

The innovative data mesh approach aims to solve the scalability challenges inherent to traditional data ecosystems. This domain-driven approach allows for cross-organizational data flow, facilitating high-quality data with decreased time-to-value metrics. Here, data producers take full ownership of their data and deliver it as data-as-a-product to consumers. Four main aspects underpin this approach:

  1. Domain-oriented data ownership
  2. Data as a product
  3. Self-service data infrastructure
  4. Federated computational governance

The data mesh approach decreases lead times, which allows for more agile business decision-making, enhances profitability, cuts costs, and increases market share.

Satori And Your Data Mesh