In data engineering, Big Data refers to large and diverse sets of information that flow in and out of systems. The term also implies:
- That the volume of information is ever-increasing
- That the nature and schema of the data change over time
- That there is great variety in the types of information flowing at any given moment
Big Data as a term came into wide use in the early 2000s, when the amount of information generated online began to increase exponentially. This was driven by broader internet access, the digitalization of business activities, and growth in computing power, which made it possible to capture and process this ever-increasing flow of information.
The information is generally considered to be in one of the following structures:
- Structured data, for example, numeric records of customer sales activity within an e-commerce company, flowing between the application, the user, and a payment system.
- Unstructured data, for example, IoT systems capturing images and audio from collectors, which are later aggregated in a database.
- Semi-structured data, for example, the JSON format. Semi-structured data does have a structure, but it often contains nested hierarchies and fields that may or may not exist in a given record.
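To make the semi-structured case concrete, here is a minimal sketch using Python's standard `json` module. The records and field names are hypothetical; the point is that a field such as `phone` may exist in one record and not another, so access must be defensive:

```python
import json

# Two hypothetical customer records: "phone" exists only in the first,
# and "address" is a nested hierarchy -- both hallmarks of semi-structured data.
records = [
    '{"id": 1, "name": "Alice", "phone": "555-0100", "address": {"city": "Lisbon"}}',
    '{"id": 2, "name": "Bob"}',
]

parsed = [json.loads(r) for r in records]

# Fields may or may not exist, so use .get() with a default instead of
# direct indexing, which would raise KeyError on the second record.
phones = [rec.get("phone", "unknown") for rec in parsed]
print(phones)  # ['555-0100', 'unknown']
```

A schema-on-read approach like this is what distinguishes semi-structured processing from a rigid relational schema, where a missing column would be an error.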
The main difference between structured and unstructured data flows lies in how the data is stored and how it is processed. Most data warehouses can efficiently handle large volumes of structured information. Problems arise when the volume and speed of incoming information increase across a myriad of sources. At that point, most data engineers and architects turn to streaming platforms such as publish/subscribe (Pub/Sub) messaging systems and distributed processing engines such as Spark.
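The publish/subscribe pattern behind those streaming platforms can be sketched in-process with Python's standard `queue` module. This is only an illustration of the decoupling between producers and consumers; real services such as Google Cloud Pub/Sub or Apache Kafka add durability, partitioning, and delivery guarantees that this sketch does not:

```python
import queue

# A topic is a buffer that decouples producers from consumers:
# publishers push events without knowing who will read them.
topic = queue.Queue()

def publish(event):
    """Producer side: append an event to the topic."""
    topic.put(event)

def consume(n):
    """Consumer side: drain n events in arrival (FIFO) order."""
    return [topic.get() for _ in range(n)]

# Hypothetical sales events flowing through the topic.
for i in range(3):
    publish({"event_id": i, "type": "sale"})

events = consume(3)
print(len(events))  # 3
```

The design point is the indirection: because producers and consumers only share the topic, either side can scale or fail independently, which is what makes the pattern suitable for high-volume, high-velocity flows.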
The number of systems producing and consuming data has grown over time, making the monitoring and control of these systems more complicated, due to the number of concurrent events and of interacting systems. Aggregating and analyzing these monitoring logs can itself become a Big Data problem, since the number of devices and systems sending and receiving data is high.
Companies whose operations involve data flows at this scale generally require specialized platforms, such as those mentioned above, to process and store the information. They also need dedicated working processes and best practices to ensure that the data is processed correctly. Such measures are necessary because portions of a flow may be wrongly processed or go missing, and given the volume of information this can be hard to detect, resulting in inaccurate data being delivered to stakeholders.
The period during which wrong or missing data is present is generally called data downtime. It can be reduced by running data quality checks that validate the data from different perspectives at each stage of processing.
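Such quality checks can be sketched as small functions run against each batch at every pipeline stage. The check names, thresholds, and records below are illustrative assumptions, not taken from any particular framework:

```python
def check_no_missing(rows, field):
    """Completeness check: return records where a required field is absent or None."""
    return [r for r in rows if r.get(field) is None]

def check_row_count(rows, expected_min):
    """Volume check: detect portions of the flow that were silently dropped."""
    return len(rows) >= expected_min

# A hypothetical batch in which one record was wrongly processed.
batch = [
    {"order_id": 1, "amount": 9.99},
    {"order_id": 2, "amount": None},  # amount lost somewhere upstream
]

bad = check_no_missing(batch, "amount")
ok_volume = check_row_count(batch, expected_min=2)
print(len(bad), ok_volume)  # 1 True
```

Running checks like these at each stage, rather than only at the end, narrows down where a flow went wrong and so shortens the data downtime described above.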