Data Discovery: The Future of Data Catalogs for Data Lakes

Over the past few years, data lakes have emerged as a must-have for the modern data stack. But while the technologies powering our access and analysis of data have matured, the mechanics behind understanding this data in a distributed environment have lagged behind.

Here’s where data catalogs fall short and how data discovery can help ensure your data lake doesn’t turn into a data swamp.

One of the first decisions data teams must make when building a data platform (second only perhaps to “why are we building this?”) is whether to choose a data warehouse or a data lake to power storage and compute for their analytics.

While data warehouses provide structure that makes it easy for data teams to efficiently operationalize data (i.e., gleaning analytic insights and supporting machine learning capabilities), that structure can make them inflexible and expensive for certain applications.

On the other hand, data lakes are highly flexible and customizable to support a wide range of use cases, but with that greater agility comes a host of other issues related to data organization and governance.

As a result, data teams going the lake or even lakehouse route often struggle to answer critical questions about their data, such as:

  • Where is my data?
  • Who has access to it?
  • How can I use this data?
  • Is this data up to date?
  • How is this data being used by the business?

And as data operations mature and data pipelines become increasingly complex, traditional data catalogs often fall short of answering these questions.

Here’s why some of the best data engineering teams are rethinking their approach to building data catalogs, and what data lakes need instead.

Data Catalogs Can Drown in a Data Lake

Although exceptionally flexible and scalable, data lakes lack the organization necessary to facilitate proper metadata management and data governance. Image courtesy of Adrian on Unsplash.

Data catalogs serve as an inventory of metadata and provide information about data health, accessibility, and location. They help data teams answer questions about where to look for data, what the data represents, and how it can be used. But if we don’t know how that data is organized, all of our best-laid plans (or pipelines, rather) are useless.

In a recent article, Seshu Adunuthula, Director of Data Platforms at Intuit, aptly asked readers: “does your data lake resemble a used book store or a well-organized library?”

The question is an increasingly relevant one for modern data teams. As companies lean toward lakes, they’re often compromising the organization and order implicit in storing data in the warehouse. Data warehouses force data engineering teams to structure or at least semi-structure their data, which makes it easy to catalog, search, and retrieve based on the needs of business users.

Historically, many companies have used data catalogs to enforce data quality and data governance standards, relying on data teams to manually enter and update catalog information as data assets evolve. In a data lake, data is distributed, making it difficult to document as it evolves over the course of its lifecycle.

Unstructured data is problematic for data catalogs because it’s not organized, or if it is, it’s often not declared as organized. A manual approach may work for structured or semi-structured data curated in a data warehouse, but in the context of a distributed data lake, manually enforcing governance for data as it evolves does not scale without some measure of automation.

The Past: Manual and Centralized Catalogs

Understanding the relationships between disparate data assets as they evolve over time is a critical, but often lacking, dimension of traditional data catalogs. While modern data architectures, including data lakes, are often distributed, data catalogs usually are not, and they treat data like a one-dimensional entity. Unstructured data doesn’t have the kind of pre-defined model most data catalogs rely on to do their job, and it must go through multiple transformations to be usable.

Still, companies need to know where their data lives and who can access it, and be able to measure its overall health, even when it is stored in a lake instead of a warehouse. Without visibility into data lineage, teams will continue to spend valuable time on firefighting and troubleshooting when data issues arise further downstream.

What Data Engineers Need From a Data Catalog

Data discovery can replace or supplement modern data catalogs by providing distributed, real-time insights about data across different parts of the data stack, all while abiding by universal governance and accessibility standards. Image courtesy of Barr Moses.

Traditional data catalogs can often meet the demands of structured data in a warehouse, but what about data engineers navigating the complex waters of a data lake?

While many data catalogs have a UI-focused workflow, data engineers need the flexibility to interact with their catalogs programmatically. They use catalogs for managing schema and metadata, and need an API-driven approach so they can accomplish a wide range of data management tasks.
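As a rough illustration of what programmatic access might look like, the sketch below models a tiny in-memory catalog in Python. The `Catalog` class, its method names, and the table and column names are all hypothetical, chosen only for this example; this is not the API of any particular catalog product.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metadata for one table: column name -> type, plus free-form tags."""
    schema: dict
    tags: set = field(default_factory=set)


class Catalog:
    """Hypothetical in-memory catalog, used only to illustrate an API-driven workflow."""

    def __init__(self):
        self._tables = {}

    def register(self, name, schema, tags=()):
        # Create or overwrite the entry for a table.
        self._tables[name] = TableEntry(schema=dict(schema), tags=set(tags))

    def add_column(self, name, column, dtype):
        # Record a schema change without going through a UI.
        self._tables[name].schema[column] = dtype

    def find_by_tag(self, tag):
        # Return the names of tables carrying the tag, sorted for stable output.
        return sorted(n for n, t in self._tables.items() if tag in t.tags)

    def schema(self, name):
        # Return a copy so callers cannot mutate catalog state by accident.
        return dict(self._tables[name].schema)


catalog = Catalog()
catalog.register("raw.events", {"user_id": "string", "ts": "timestamp"}, tags=["raw"])
catalog.register("analytics.daily_users", {"day": "date", "users": "bigint"}, tags=["curated"])
catalog.add_column("raw.events", "country", "string")
```

Because every operation is a plain function call, the same steps could run from an ingestion job or a CI check rather than being clicked through by hand.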

Moreover, data can enter a lake through multiple points of entry, and engineers need a catalog that can adapt to and account for each one. Unlike warehouses, where the data will be cleaned and processed before entry, data lakes take in raw data without any assumptions of end-to-end health.

In a lake, storing data is cheap and flexible, but that makes knowing what you have and how it’s being used a real challenge. Data can be stored in a variety of formats, such as JSON or Parquet, and data engineers interact with data differently depending on the job to be done. They may use Spark for aggregation jobs or Presto for reporting or ad-hoc queries, meaning there are many opportunities for broken or bad data to cause failures. Without lineage, those failures within a data lake can be messy and hard to diagnose.

In a lake, data can be interacted with in many ways, and a catalog has to be able to provide an understanding of what’s being used and what’s not. When traditional catalogs fall short, we can look to data discovery as a path forward.

The Future: Data Discovery

Data discovery is a new approach rooted in the distributed, domain-oriented architecture proposed in Zhamak Dehghani and Thoughtworks’ data mesh model. In this framework, domain-specific data owners are held accountable for their data as products and for facilitating communication between distributed data across domains.

Modern data discovery fills the voids where traditional data catalogs fell short in five key ways:

Automation to scale across the lake

Using machine learning, data discovery automates the tracing of table and field-level lineage, mapping upstream and downstream dependencies. As your data evolves, data discovery ensures that your understanding of your data and how it’s being used does, too.
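To make “mapping upstream and downstream dependencies” concrete, here is a minimal Python sketch that walks a table-level lineage graph. It assumes the edges have already been extracted by some other process (it does not attempt the automated tracing itself), and the table names and the `DOWNSTREAM` mapping are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables built directly from it.
DOWNSTREAM = {
    "raw.events": ["staging.sessions", "staging.clicks"],
    "staging.sessions": ["analytics.daily_users"],
    "staging.clicks": ["analytics.daily_users", "analytics.click_funnel"],
    "analytics.daily_users": [],
    "analytics.click_funnel": [],
}


def downstream_of(table, edges):
    """Return every table that directly or indirectly depends on `table`."""
    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def upstream_of(table, edges):
    """Return every table that `table` directly or indirectly reads from."""
    # Invert the edges, then reuse the same breadth-first traversal.
    reversed_edges = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream_of(table, reversed_edges)
```

Calling `downstream_of("raw.events", DOWNSTREAM)` lists everything a bad load of the raw table could affect, while `upstream_of` answers the reverse question when a report looks wrong.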

Real-time visibility into data health

Unlike a traditional data catalog, data discovery provides real-time visibility into the data’s current state, as opposed to its “cataloged” or ideal state. Since discovery encompasses how your data is being ingested, stored, aggregated, and used by consumers, you can glean insights such as which data sets are outdated and can be deprecated, whether a given data set is production-quality, or when a given table was last updated.
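One of these insights, spotting data sets that have not been updated recently, can be sketched in a few lines of Python. The timestamps, table names, and the 30-day threshold below are assumptions made up for the example; in practice the last-updated times would come from storage or query metadata.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_updated, now, max_age):
    """Return names of tables whose last update is older than `max_age`."""
    return sorted(name for name, ts in last_updated.items() if now - ts > max_age)


# Hypothetical last-update timestamps for three tables.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last_updated = {
    "analytics.daily_users": datetime(2024, 1, 9, 23, 0, tzinfo=timezone.utc),
    "analytics.click_funnel": datetime(2023, 11, 2, tzinfo=timezone.utc),
    "staging.sessions": datetime(2024, 1, 3, tzinfo=timezone.utc),
}

# Tables untouched for more than 30 days are candidates for review or deprecation.
candidates = stale_tables(last_updated, now, max_age=timedelta(days=30))
```

A fixed threshold is the simplest possible rule; a real system would more likely learn each table's normal update cadence and flag deviations from it.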

Data lineage for understanding the business impact of your data

This flexibility and dynamism make data discovery an ideal fit for bringing lineage to data lakes, allowing you to surface the right information at the right time and draw connections between the many possible inputs and outflows. With lineage, you can resolve issues more quickly when data pipelines do break, since frequently unnoticed issues like schema changes will be detected and related dependencies mapped.
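As a small, hedged example of what detecting a schema change could involve, the Python sketch below compares two snapshots of a table's columns. The `schema_diff` helper and the column names are hypothetical, and the snapshots are assumed to have been captured elsewhere.

```python
def schema_diff(old, new):
    """Compare two column -> type mappings and report what changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c] != new[c]),
    }


# Hypothetical snapshots of the same table taken on two consecutive days.
yesterday = {"user_id": "string", "ts": "timestamp", "country": "string"}
today = {"user_id": "bigint", "ts": "timestamp", "region": "string"}

changes = schema_diff(yesterday, today)
has_breaking_change = bool(changes["removed"] or changes["retyped"])
```

Treating removed and retyped columns as breaking, and added columns as benign, is a design choice for this sketch; pairing the result with lineage information would show which downstream tables need attention.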

Self-service discovery across domains

Data discovery also enables self-service, allowing teams to easily leverage and understand their data without a dedicated support team. To ensure this data is trustworthy and reliable, teams should also invest in data observability, which uses machine learning and custom rules to provide real-time alerting and monitoring when something does go wrong in your data lake or pipelines downstream.

Governance and optimization across the lake

Modern data discovery allows companies to understand not just what data is being used, consumed, stored, and deprecated over the course of its lifecycle, but also how, which is critical for data governance and lends insights that can be used for optimizations across the lake.

From a governance perspective, querying and processing data in the lake often occurs using a variety of tools and technologies (Spark on Databricks for this, Presto on EMR for that, etc.), and as a result, there often isn’t a single, reliable source of truth for reads and writes (like a warehouse provides). A proper data discovery tool can serve as that source of truth.

From an optimization perspective, data discovery tools can also make it easy for stakeholders to identify the most important data assets (the ones constantly being queried!) as well as those that aren’t used, both of which can provide insights for teams to optimize their pipelines.
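The idea of separating heavily queried assets from unused ones can be sketched with a simple count over a query log. The log format (one list of tables read per query), the table names, and the `usage_report` helper are assumptions for illustration only.

```python
from collections import Counter


def usage_report(all_tables, query_log):
    """Rank tables by how often they are read and list the ones never read.

    `query_log` is a list of queries, each given as the list of tables it read.
    """
    counts = Counter(table for tables in query_log for table in tables)
    ranked = counts.most_common()  # [(table, read_count), ...], busiest first
    unused = sorted(set(all_tables) - set(counts))
    return ranked, unused


# Hypothetical inventory and query history.
all_tables = [
    "analytics.daily_users",
    "analytics.click_funnel",
    "staging.sessions",
    "tmp.backfill_2021",
]
query_log = [
    ["analytics.daily_users"],
    ["analytics.daily_users", "staging.sessions"],
    ["analytics.daily_users"],
    ["staging.sessions"],
]

ranked, unused = usage_report(all_tables, query_log)
```

The tables at the top of `ranked` are the ones worth the most monitoring effort, and the `unused` list is a starting point for deprecation reviews.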

Distributed Discovery for the Data Lake

As companies continue to ramp up their ingestion, storage, and utilization of data, technology that facilitates greater transparency and discoverability will be key.

Increasingly, some of the best catalogs are layering in distributed, domain-specific discovery, giving teams the visibility required to fully trust and leverage data at all stages of its lifecycle.

Personally, we couldn’t be more excited for what’s to come. With the right approach, maybe we can finally drop the “data swamp” puns?

Interested in learning how to scale data discovery across your data lake? Reach out to Barr Moses, Scott O’Leary, and the rest of the team.

To stay up to date with all the latest news and trends in building distributed data architectures, be sure to join the Data Mesh Learning Slack channel.