Managing data assets is a complex, time-consuming task that requires significant effort to orchestrate well. Companies and organizations want to leverage their data to get the most out of their business intelligence. In the context of data engineering, units within companies and organizations tend to fall into two camps: data production and data consumption.
The term data producer commonly refers to the user interface, automation, service, or device that gathers data; producers are considered the root source of data. In many scenarios, multiple systems and processes may produce data for the same entity simultaneously and act as both data producers and consumers. For example, a set of records might be updated and used by the same systems that generated it in the first place, or a marketing system may get customer data from a sales system even though an e-commerce system is the actual producer of that data. Examples of data producers include websites, transaction-processing systems, external vendors, and customer relationship management systems.
As mentioned before, a single system can be both a data producer that creates data and a data consumer that uses it. Although this might seem obvious, in an organization with many systems and data sources it is a crucial concept for data analysis and architecture. For example, data consumers may create copies of data, transform it, and pass it on to other systems, which complicates matters by entangling dependencies. It is therefore helpful to identify explicitly what is producing the data and what is consuming it.
Doing so becomes a challenge as companies and organizations produce more data than they can efficiently manage. This data bloat can leave producer systems with growing repositories filled with duplicate files, resulting in data that is neither scalable nor flexible. In this situation, producer systems become detached from the downstream uses of the data in analytics and reporting. As data is produced, it falls to consumer teams to make sense of it, make it usable, and attempt to get the producers to clean it up when they find issues.
This approach has never been efficient, but it has become almost untenable as data volumes have increased. Streaming data, big data, unstructured data, IoT devices, and the fast-paced adoption of machine learning mean that data is growing far too quickly for data teams alone to address its quality. As a result, the need arises to hold data producers accountable for both making their data available to downstream teams and ensuring its quality in the process.
To address these challenges, it has become common to establish agreements between data producer and data consumer teams that outline the roles and responsibilities of each. A producer contract may include agreements on the following.
Examples of Data Producer and Consumer “Contracts”
- Recency/Timeliness: Establish how long after data is produced the producers must make it available in the data repositories. This will depend on capabilities and business needs.
- Data Growth: Data systems should consider the size and volume of the data and communicate expectations for future storage capacity.
- Communication Management: Establish agreements on how data quality issues will be communicated to stakeholders.
- Sensitive Data Treatment: Establish rules for dealing with sensitive data such as personally identifiable information, and accommodations for data protection regulations.
- Data Catalogs: Ensure data producer systems provide metadata on data assets so that stakeholders can adequately understand them.
- Schemas: Establish agreements on shared data type schemas so the data system can be centrally managed to scale.
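To make the idea concrete, several of these clauses can be checked programmatically at the point where producer data lands. The sketch below is a minimal, hypothetical illustration: the contract structure, field names, and thresholds are invented for the example and do not describe any particular platform's implementation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical producer contract: agreed field types (schema clause),
# a maximum delivery delay (recency/timeliness clause), and fields that
# require masking (sensitive-data clause).
CONTRACT = {
    "schema": {"customer_id": int, "email": str, "signup_date": str},
    "max_delay": timedelta(hours=1),
    "sensitive_fields": ["email"],
}

def validate_record(record, produced_at, contract, now=None):
    """Return a list of contract violations for a single record."""
    now = now or datetime.now(timezone.utc)
    violations = []
    # Schema check: every agreed field must be present with the agreed type.
    for field, expected_type in contract["schema"].items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"wrong type for {field}")
    # Timeliness check: data must arrive within the agreed delay.
    if now - produced_at > contract["max_delay"]:
        violations.append("recency violation: data delivered too late")
    return violations

record = {"customer_id": 42, "email": "a@example.com", "signup_date": "2023-01-01"}
print(validate_record(record, datetime.now(timezone.utc), CONTRACT))  # → []
```

A check like this would typically run in the producer's pipeline before data is published, so that violations are caught and reported by the producer rather than discovered downstream by consumer teams.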
These agreements prompt data producers to consider how their data is used and ensure engagement and understanding on both sides. They also aim to let data teams spend less time on data quality checks and collection and more time driving business initiatives and analytics.
Cloud Data Security with Satori
Satori, the DataSecOps platform, gives companies the ability to enforce security policies from a single location across all databases, data warehouses, and data lakes. Such security policies include data masking, data localization, row-level security, and more.