Metadata

Metadata is information about the data assets held in a data repository. In the context of data warehouses and data lakes, it covers table schemas and data types, column information, the time of the latest update, the source of the data, categorization, and any other information that describes a data asset.

The purpose of metadata is to provide descriptive information about data assets and their context. It improves the organization and understanding of data, supports categorization, and creates a source of truth that lets an organization search for and define the data assets it holds.

 

Typical Components of Metadata

  • Title and description of the data assets.
  • Tags and categories that describe the assets or give them context.
  • The source of the data, as well as the date-time of creation and modification.
  • Operations and transformations applied to the data, along with when and by whom they were performed.
  • Information about access and permissions.
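
As an illustration only, the components above could be grouped into a single record per data asset. The sketch below uses Python dataclasses; the class names, field names, and sample values (`AssetMetadata`, `Operation`, the "orders" table) are hypothetical and do not come from any particular catalog product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Operation:
    """One operation or transformation applied to a data asset."""
    description: str
    performed_by: str
    performed_at: datetime


@dataclass
class AssetMetadata:
    """Hypothetical metadata record covering the components listed above."""
    title: str
    description: str
    source: str
    created_at: datetime
    modified_at: datetime
    tags: list[str] = field(default_factory=list)
    operations: list[Operation] = field(default_factory=list)
    # Maps a user or group name to an access level such as "read" or "write".
    permissions: dict[str, str] = field(default_factory=dict)

    def record_operation(self, description: str, user: str, when: datetime) -> None:
        """Append an operation and update the modification timestamp."""
        self.operations.append(Operation(description, user, when))
        self.modified_at = when


# Sample values are invented for the example.
orders = AssetMetadata(
    title="orders",
    description="One row per customer order.",
    source="hypothetical ERP export",
    created_at=datetime(2023, 1, 5, tzinfo=timezone.utc),
    modified_at=datetime(2023, 1, 5, tzinfo=timezone.utc),
    tags=["sales", "pii"],
    permissions={"analysts": "read", "data-eng": "write"},
)
orders.record_operation(
    "deduplicated rows", "alice", datetime(2023, 2, 1, tzinfo=timezone.utc)
)
```

Keeping the operation history on the record itself means the "who and when" details stay attached to the asset they describe.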

Organizations can store this information in the same data warehouse or lake that holds the data, or in a separate repository. Repositories dedicated to metadata are sometimes called Data Catalogs, as they offer structured details about data assets along with descriptive information and context. Data Catalogs can later be used for Data Discovery initiatives and to keep tighter, cleaner control over the data assets.

Metadata can be used to enforce data governance policies and to comply with regulations such as GDPR, which govern how data is used and how long an organization can retain it. Metadata can also be used for Data Quality checks, which aim to ensure that data complies with quality standards. These standards come from both the data engineering perspective (checking that data types and schemas match expectations) and the business perspective (ensuring that business rules are followed and that values are in line with previously observed values).
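
The checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a governance or data quality tool: the expected schema, the 365-day retention period, and the "amounts must be positive" rule are all invented for the example.

```python
from datetime import date, timedelta

# Hypothetical metadata for one table: expected column types and a retention period.
expected_schema = {"order_id": int, "amount": float, "country": str}
retention_days = 365


def schema_violations(row: dict, schema: dict) -> list[str]:
    """Describe each column that is missing or has an unexpected type."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"{column}: missing")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    return problems


def business_rule_violations(row: dict) -> list[str]:
    """Example business rule (an assumption): order amounts must be positive."""
    amount = row.get("amount")
    if isinstance(amount, (int, float)) and amount <= 0:
        return ["amount: must be positive"]
    return []


def past_retention(created_on: date, today: date, days: int) -> bool:
    """Return True when the record is older than the retention period."""
    return today - created_on > timedelta(days=days)


good_row = {"order_id": 1, "amount": 9.5, "country": "DE"}
bad_row = {"order_id": "1", "amount": -2.0}

print(schema_violations(good_row, expected_schema))
print(schema_violations(bad_row, expected_schema))
print(business_rule_violations(bad_row))
print(past_retention(date(2022, 1, 1), date(2023, 6, 1), retention_days))
```

The schema check corresponds to the data engineering perspective, the rule check to the business perspective, and the retention check to a governance policy driven by creation dates stored as metadata.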

 

Types of Metadata

  • Descriptive metadata: Information about the source of the data asset. Organizations can use it for data discovery initiatives to shed light on the data assets that a company or organization might be storing.
  • Structural metadata: Provides information about the structure of each data asset, how assets are linked to one another, and their types, versions, and other characteristics.
  • Administrative metadata: Descriptive data about the management of the data asset. It holds information about the resource type, which permissions are applied to it, and how and when it was created.
  • Reference metadata: Provides an overview of the contents and quality of the data from the statistical perspective. Reference metadata holds high-level information about numeric values, such as the percentage of missing values and statistical measures such as mean, mode, and standard deviation.
  • Statistical metadata: Describes the process by which the data asset was collected and how it was processed.
  • Legal metadata: Includes information about the system that produced the data, who the copyright owner is, which public license applies, and any other information that might be relevant from the legal perspective.
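
The summary figures mentioned under reference metadata (percentage of missing values, mean, mode, standard deviation) can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the function name `summarize_column` and the sample values are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics


def summarize_column(values: list) -> dict:
    """Summarize one numeric column; None marks a missing value.

    Assumes the column is non-empty and has at least two non-missing
    values, which the sample standard deviation requires.
    """
    present = [v for v in values if v is not None]
    return {
        "missing_pct": 100 * (len(values) - len(present)) / len(values),
        "mean": statistics.mean(present),
        "mode": statistics.mode(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
    }


# Invented sample column with one missing value out of five.
summary = summarize_column([10, 12, None, 12, 14])
print(summary)
```

Storing such a summary alongside the asset lets a later run be compared against earlier values without rescanning historical data.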

Although it may seem obvious, data teams can use each metadata category to ensure good data quality and governance from a different angle. The categories listed here are not exhaustive, and a given piece of metadata is not necessarily bound to just one category.

 
