Data profiling is the process of reviewing source data to understand its structure, content, and interrelationships, and to identify potential insights users can derive from it. It is typically a complex practice performed by data-savvy users before and during data ingestion into a data warehouse; the data is analyzed and processed before it enters the pipeline.
As companies and organizations move their data infrastructure to the cloud, tasks like data ingestion become easier to perform. Cloud data warehouses, data management tools, and ETL services offer integrations with many data sources, but the same self-service approach is generally not available for data profiling.
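As a minimal sketch of what profiling a column before ingestion can look like, the snippet below computes basic descriptive statistics over a hypothetical source column (the field name and sample values are illustrative, not from any specific dataset or tool):

```python
import statistics

# Hypothetical sample of a source column as it might arrive before ingestion.
order_totals = [19.99, 5.00, None, 42.50, 5.00, 130.00, None, 7.25]

def profile_column(values):
    """Return a basic profile: row count, nulls, min/max/mean, distinct values."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "null_ratio": (len(values) - len(non_null)) / len(values),
        "min": min(non_null),
        "max": max(non_null),
        "mean": round(statistics.mean(non_null), 2),
        "distinct": len(set(non_null)),
    }

profile = profile_column(order_totals)
print(profile)
# {'count': 8, 'nulls': 2, 'null_ratio': 0.25, 'min': 5.0, 'max': 130.0,
#  'mean': 34.96, 'distinct': 5}
```

A profile like this, produced for every incoming column, is often the first signal that a source needs cleanup before it reaches the warehouse.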
Data Profiling Common Uses
- Data warehouse and BI projects. Here data profiling can uncover data quality issues in the sources and show how they need to be addressed during ETL.
- Data conversion and migration initiatives. Data profiling can identify data quality issues up front so that they can be flagged and handled in the scripts and data integration tools that move data from source to target.
- Source system projects. Data profiling can flag data quality issues, trace the source of the problems, and classify them as the result of incorrect user input, errors in interfaces, data corruption, and so on.
Data profiling is also the act of monitoring data quality and mitigating the issues found. It is a set of tools that organizations and companies can leverage to make better data decisions. It is often a visual assessment that checks data quality by using business rules and analytical algorithms to uncover, understand, and potentially alert on quality issues in the data. The knowledge obtained can then be used to drive data quality initiatives.
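The business-rule checks described above can be sketched as simple predicates evaluated against each record, with failures collected for alerting. The rule names, fields, and sample records below are hypothetical:

```python
# Hypothetical customer records; in practice these would stream from a source.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": 29},
    {"id": 3, "email": "c@example.com", "age": -5},
]

# Illustrative business rules: each maps a rule name to a pass/fail predicate.
rules = {
    "email_has_at_sign": lambda r: "@" in r["email"],
    "age_is_plausible": lambda r: 0 <= r["age"] <= 120,
}

def run_rules(records, rules):
    """Return (record id, failed rule name) pairs suitable for alerting."""
    violations = []
    for record in records:
        for name, check in rules.items():
            if not check(record):
                violations.append((record["id"], name))
    return violations

issues = run_rules(records, rules)
print(issues)  # [(2, 'email_has_at_sign'), (3, 'age_is_plausible')]
```

Real profiling tools add scheduling, thresholds, and alert routing on top of this basic loop, but the core idea is the same: codified business rules evaluated continuously against incoming data.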
Common Projects in a Data Profiling Process
- Collecting descriptive statistics like min, max, count, and sum.
- Visualizing data types and schemas, as well as obtaining metadata about data assets.
- Classifying data assets with keywords, descriptions, or categories.
- Performing data quality assessments and gauging the risk of performing joins on the data.
- Assessing metadata accuracy.
- Identifying distributions, obtaining information about foreign keys, uncovering dependencies, and performing inter-table analysis.
- Identifying and correcting data quality issues in source data before migration into the target database.
- Identifying data quality issues that can be mitigated during ETL, and flagging whether additional processing is required.
- Identifying unanticipated business rules, hierarchical structures, and foreign and primary key relationships.
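Two of the tasks above, discovering key relationships and gauging the risk of joins, can be sketched with plain set logic. The tables and column names below are illustrative assumptions:

```python
# Illustrative parent/child tables; column names are assumptions.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 99},  # references a missing customer
]

def is_candidate_key(rows, column):
    """A column is a candidate key if its values are unique and non-null."""
    values = [row[column] for row in rows]
    return None not in values and len(values) == len(set(values))

def orphaned_foreign_keys(child_rows, fk_column, parent_rows, pk_column):
    """Return child values that reference no row in the parent table.

    Any orphans found here would surface as dropped or dangling rows
    when the two tables are joined, which is exactly the join risk
    profiling tries to flag before ETL.
    """
    parent_keys = {row[pk_column] for row in parent_rows}
    return [row[fk_column] for row in child_rows
            if row[fk_column] not in parent_keys]

print(is_candidate_key(orders, "order_id"))  # True
print(orphaned_foreign_keys(orders, "customer_id", customers, "customer_id"))
# [99]
```

Run across all column pairs, checks like these are how profiling tools propose undocumented primary/foreign key relationships in source systems.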
The need for data profiling will continue to grow as data warehouses and data platforms must interact with an increasingly diverse set of sources and ever-larger volumes of data. It enables companies and organizations to automatically clean, optimize, and prepare data for analysis as it arrives at target data repositories.
Cloud Data Security with Satori
Satori, the DataSecOps platform, gives companies the ability to enforce security policies from a single location across all databases, data warehouses, and data lakes. Such security policies can include data masking, data localization, row-level security, and more.