Data lakes are repositories in which organizations can store data regardless of its structure. A data lake can hold structured, semi-structured, and unstructured data alike, at whatever scale is needed. These data assets can later be processed, cleaned, and aggregated to support analytics, ranging from dashboards and visualizations to real-time processing of streaming data.
Depending on the type of organization, data lakes are often combined with data warehouses, where the structured output of these processes is ultimately stored. Data warehouses are optimized to run analytics on relational, tabular data. Data must conform to a predefined structure and schema before it can be stored in the warehouse; once loaded, it can be queried efficiently for reporting and analysis. Moreover, data loaded into a data warehouse can be combined with other sources using join operations to create more detailed data sets.
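As a minimal sketch of the schema-first model described above, the following uses Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names, columns, and sample rows are hypothetical and chosen only for illustration; a production warehouse would use a dedicated engine, but the idea of declaring the schema before loading and then joining tables is the same.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory database whose schema is declared before any data is loaded."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,
            region      TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            amount      REAL NOT NULL
        );
        """
    )
    # Hypothetical sample rows; every row must fit the declared columns.
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EU"), (2, "US")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5)],
    )
    conn.commit()
    return conn


def revenue_by_region(conn):
    """Join orders with customers to produce a more detailed, aggregated data set."""
    return conn.execute(
        """
        SELECT c.region, SUM(o.amount)
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
        """
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_region(build_warehouse()))
```

Because the schema is fixed up front, a row with a missing amount would be rejected at load time rather than discovered later during analysis.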
Data lakes differ from typical data warehouses in that the data is collected from business activities and from non-relational, document-based sources such as mobile applications, IoT sensors and devices, and other unstructured data streams. The captured data has no predefined structure, so users can store it without first defining a structure or a data type schema.
Data consumers can later process the raw data stored in a data lake, cleaning and aggregating it so that it can be loaded into data warehouses or transformed into queryable file formats that data warehouses natively support. This data can also be used directly, without first loading it into a relational database, to create visualizations and reports or to train machine learning models. Most machine learning systems that rely on unstructured data such as audio, video, or images need a data lake to store the training data and the model checkpoint files, because these cannot be stored in a structured, tabular format.
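The processing step described above can be sketched roughly as follows. This assumes, purely for illustration, that the raw data is a set of newline-delimited JSON records from hypothetical IoT sensors; the field names (device, temp_c) and the sample records are invented. The structure is applied only when the data is read, and records that cannot be interpreted are skipped rather than rejected at write time.

```python
import json
from collections import defaultdict

# Hypothetical raw records as they might sit in a data lake: fields vary,
# types are inconsistent, and one line is not valid JSON at all.
RAW_EVENTS = """\
{"device": "sensor-a", "temp_c": 21.5}
{"device": "sensor-b", "temp_c": "19.0", "firmware": "1.2"}
not valid json
{"device": "sensor-a", "temp_c": 22.5}
{"device": "sensor-b"}
"""


def clean(raw_text):
    """Yield (device, temperature) pairs, skipping records that cannot be used."""
    for line in raw_text.splitlines():
        try:
            record = json.loads(line)
            yield record["device"], float(record["temp_c"])
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Malformed line, missing field, or non-numeric temperature.
            continue


def average_by_device(raw_text):
    """Aggregate the cleaned records into one average temperature per device."""
    totals = defaultdict(lambda: [0.0, 0])
    for device, temp in clean(raw_text):
        totals[device][0] += temp
        totals[device][1] += 1
    return {device: total / count for device, (total, count) in totals.items()}


if __name__ == "__main__":
    print(average_by_device(RAW_EVENTS))
```

The resulting per-device averages are tabular and could then be loaded into a warehouse table or written to a structured file format.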
Although data lakes can be hosted in on-premises data centers, the trend is to run them as a resource offered by cloud infrastructure providers, configured to meet the security standards required by regulation or by the intended use.
One risk of data lakes is that they turn into data swamps, the term for unmanaged data lakes where data sits unprocessed and is either inaccessible or provides little or no value to data operations.