Data warehouses were the de facto approach to data management for many years, until the rise of the data lake. Built on cloud storage systems that make it cheap to store unstructured data, the data lake seeks to overcome the limitations that data warehouses face when dealing with unstructured data. In this approach, processing systems read the raw files in the lake, transform them, and ultimately load the results into a data warehouse. The drawback was that a data warehouse was still needed to “park” the results, so its fixed costs remained. Moreover, the transformations applied to files stored in the data lake did not guarantee ACID transactions, which limited this data management architecture to specific types of applications.
What is a Data Lakehouse?
A data lakehouse is a new data management architecture that seeks to combine the benefits of data lakes and data warehouses. A data lakehouse provides a flexible and cost-efficient platform that ensures ACID transactions while supporting the development of business intelligence, analytics, and machine learning applications. All of these capabilities run on the same data management system.
This data management approach leads to data platforms that are scalable and cost-efficient, allowing users to create different applications without needing to access multiple systems.
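To make the ACID claim more concrete, here is a minimal sketch of one way a table layer could add atomic commits on top of plain files: writers add immutable data files, then publish them by creating the next numbered entry in a commit log. The `ToyTable` class, the `_log` directory, and the JSON file layout are all hypothetical names chosen for illustration; this is not the implementation of any particular lakehouse product, and it ignores durability and partially written log entries.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Toy table: immutable data files plus a numbered commit log (illustrative only)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        """Return the highest committed version, or -1 for an empty table."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, rows, expected_version):
        """Write a data file, then publish it as version expected_version + 1.

        Raises FileExistsError if another writer already committed that
        version; the caller can re-read the table and retry.
        """
        new_version = expected_version + 1
        # Unique data file names mean competing writers never overwrite
        # each other's data, even when one of them loses the commit race.
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        log_path = os.path.join(self.log_dir, f"{new_version:08d}.json")
        # Mode "x" fails if the log entry already exists, so only one
        # writer can claim a given version number.
        with open(log_path, "x") as f:
            json.dump({"add": data_name}, f)
        return new_version

    def read(self, version=None):
        """Return all rows visible at a version (default: the latest)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in range(version + 1):
            with open(os.path.join(self.log_dir, f"{v:08d}.json")) as f:
                entry = json.load(f)
            with open(os.path.join(self.root, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows


if __name__ == "__main__":
    table = ToyTable(tempfile.mkdtemp())
    table.commit([{"id": 1}], expected_version=table.latest_version())
    stale = table.latest_version()
    table.commit([{"id": 2}], expected_version=stale)
    try:
        # A second writer still holding the stale version loses the race.
        table.commit([{"id": 3}], expected_version=stale)
    except FileExistsError:
        print("conflict: re-read the table and retry")
    print(table.read())
```

Because readers only follow the log, a data file written by a failed commit is never visible, and reading an older version gives a consistent snapshot of the table at that point.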
Main Capabilities of a Data Lakehouse
- It allows for concurrent reading and writing of data stored in raw file formats like Avro and Parquet.
- Integrated data catalogs provide schema and data type support and better data governance.
- Data consumers can directly access data in its raw format.
- Separation of concerns between storage and compute assets.
- It has standardized storage file formats that allow for direct queries without the need for loading or transformation.
- Streams of data can be loaded directly into structured tables.
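Two of the capabilities above, catalog-backed schemas and direct queries over stored files, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: `ToyCatalog` and its methods are hypothetical, and JSON Lines files stand in for Parquet or Avro so that the sketch needs only the Python standard library.

```python
import json
import os
import tempfile


class ToyCatalog:
    """Hypothetical catalog mapping table names to a schema and a file location."""

    def __init__(self):
        self.tables = {}

    def register(self, name, path, schema):
        """Record where a table's files live and which column types it expects."""
        os.makedirs(path, exist_ok=True)
        self.tables[name] = {"path": path, "schema": schema}

    def append(self, name, rows):
        """Validate every row against the schema, then write one new file."""
        table = self.tables[name]
        schema = table["schema"]
        for row in rows:
            if set(row) != set(schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for column, expected_type in schema.items():
                if not isinstance(row[column], expected_type):
                    raise TypeError(f"column {column!r} expects {expected_type.__name__}")
        # Nothing is written unless the whole batch passed validation.
        file_no = len(os.listdir(table["path"]))
        file_path = os.path.join(table["path"], f"part-{file_no:05d}.jsonl")
        with open(file_path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def scan(self, name, predicate=lambda row: True):
        """Yield matching rows straight from the stored files, with no load step."""
        table = self.tables[name]
        for file_name in sorted(os.listdir(table["path"])):
            with open(os.path.join(table["path"], file_name)) as f:
                for line in f:
                    row = json.loads(line)
                    if predicate(row):
                        yield row


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.register("orders", tempfile.mkdtemp(), {"id": int, "amount": float})
    catalog.append("orders", [{"id": 1, "amount": 9.5}, {"id": 2, "amount": 20.0}])
    for order in catalog.scan("orders", lambda row: row["amount"] > 10):
        print(order)
```

The point of the design is that the files remain the single copy of the data: the catalog only adds metadata and validation, and queries read the files in place instead of first loading them into a separate warehouse.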
Advantages of a Data Lakehouse Architecture
The advantages of a data lakehouse generally involve enhanced management of unstructured data. This architecture makes it possible to consume and extract insights from data types such as text, images, and video using the same compute and storage systems that support the tabular data stored in structured tables and views. Handling a single system simplifies management and reduces cost.
The immediate advantages of having a single system for both storage and compute are:
- Improved data governance and accessible exploration of data assets.
- Simplified data platform structure.
- Reduced data handling, movement, and redundancy.
- Direct access for data analysts to data in raw format.
- Cost-efficient storage and on-demand computing power.
Data lakehouses seek to provide a unified solution for organizations that want to use any type of data, whether structured or unstructured. The architecture combines the benefits of data lakes and data warehouses cost-efficiently and simplifies the development of analytics and machine learning applications.
Cloud Data Security with Satori
Satori, the DataSecOps platform, gives companies the ability to enforce security policies from a single location, across all databases, data warehouses, and data lakes. Such policies include data masking, data localization, row-level security, and more.