什么是数据网格-以及如何不将其网格化

Ask anyone in the data industry what’s hot these days and chances are “data mesh” will rise to the top of the list. 但是什么是数据网格,为什么要建立一个? 好奇的人想知道.

在…的年代 自助服务商业智能, nearly every company considers themselves a data-first company, but not every company is treating their data architecture with the level of democratization and scalability it deserves.

Your company, for one, views data as a driver of innovation. Your boss was one of the first in the industry to see the potential in Snowflake and Looker. Or maybe your CDO spearheaded a cross-functional initiative to educate teams on data management best practices and your CTO invested in a data engineering group. Most of all, however, your entire data team wishes there were an easier way to manage the growing needs of your organization, from fielding the never-ending stream of ad hoc queries to wrangling disparate data sources through a central ETL pipeline.

Underpinning this desire for democratization and scalability is the realization that your current data architecture (in many cases, a siloed data warehouse or a data lake with some limited real-time streaming capabilities) may not be meeting your needs.

Fortunately, teams seeking a new lease on data need look no further than a data mesh, an architecture paradigm that’s taking the industry by storm.

什么是数据网格?

Much in the same way that software engineering teams transitioned from 单片应用程序到微服务体系结构, the data mesh is, in many ways, the data platform version of microservices.

由Zhamak Dehghani首先定义, a ThoughtWorks consultant and the original architect of the term, a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, 自助设计. 借用埃里克·埃文斯的理论 领域驱动设计, a flexible, scalable software development paradigm that matches the structure and language of your code with its corresponding business domain.

Unlike traditional monolithic 数据基础设施s that handle the consumption, storage, 转换, 并在一个中央数据湖中输出数据, 数据网格支持分布式, domain-specific data consumers and views “data-as-a-product,,每个域处理自己的数据管道. The tissue connecting these domains and their associated data assets is a universal interoperability layer that applies the same syntax and data standards.

Instead of reinventing Zhamak’s very thoughtfully built wheel, we’ll boil down the definition of a data mesh to a few key concepts and highlight how it differs from traditional data architectures.

(If you haven’t already, however, I highly recommend reading her groundbreaking article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh或者观看马克斯·舒尔特的技术演讲 为什么Zalando转向了数据网格. 你不会后悔的).

图片的文章
在高水平上, a data mesh is composed of three separate components: data sources, 数据基础设施, and domain-oriented data pipelines managed by functional owners. Underlying the data mesh architecture is a layer of universal interoperability, 反映域无关的标准, 以及可观察性和治理. (图片由蒙特卡罗数据提供.)

面向领域的数据所有者和管道

Data meshes federate data ownership among domain data owners who are held accountable for providing their data as products, while also facilitating communication between distributed data across different locations.

While the 数据基础设施 is responsible for providing each domain with the solutions with which to process it, 域的任务是管理摄取, cleaning, and aggregation to the data to generate assets that can be used by business intelligence applications. Each domain is responsible for owning their ETL pipelines, but a set of capabilities applied to all domains that stores, catalogs, 并对原始数据进行访问控制. Once data has been served to and transformed by a given domain, the domain owners can then leverage the data for their analytics or operational needs.

自助服务功能

Data meshes leverage principles of domain-oriented design to deliver a self-serve data platform that allows users to abstract the technical complexity and focus on their individual data use cases.

正如Zhamak所概述的, one of the main concerns of domain-oriented design is the duplication of efforts and skills needed to maintain data pipelines and infrastructure in each domain. 为了解决这个问题, the data mesh gleans and extracts 域无关 数据基础设施 capabilities into a central platform that handles the data pipeline engines, storage, 和流媒体基础设施. Meanwhile, each domain is responsible for leveraging these components to run custom ETL pipelines, giving them the support necessary to easily serve their data as well as the autonomy required to truly own the process.

通信的互操作性和标准化

Underlying each domain is a universal set of data standards that helps facilitate collaboration between domains when necessary — and it often is. It’s inevitable that some data (both raw sources and cleaned, transformed, and served data sets) will be valuable to more than one domain. 支持跨域协作, 数据网格必须在格式上标准化, governance, 可发现性, 和元数据字段, 在其他数据特征中. Moreover, 很像个人微服务, each data domain must define and agree on SLAs and quality measures that they will “guarantee” to its consumers.

为什么要使用数据网格?

直到最近, many companies leveraged a single data warehouse connected to myriad business intelligence platforms. Such solutions were maintained by a small group of specialists and frequently burdened by significant technical debt.

In 2020, the architecture du jour is a data lake with real-time data availability and stream processing, 以摄取为目标, enriching, 转换, 从一个集中的数据平台提供数据服务. For many organizations, this type of architecture falls short in a few ways:

  • A central ETL pipeline gives teams less control over increasing volumes of data
  • 每个公司都变成了数据公司, different data use cases require different types of 转换s, 把重物放在中央平台上

这样的数据湖导致数据生产者断开连接, 不耐烦的数据消费者, 更糟糕的是, a backlogged data team struggling to keep pace with the demands of the business. Instead, 面向领域的数据架构, 像数据网格, give teams the best of both worlds: a centralized database (or a distributed data lake) with domains (or business areas) responsible for handling their own pipelines. 正如Zhamak所说, data architectures can be most easily scaled by being broken down into smaller, 面向领域的组件.

图片的文章

Data meshes provide a solution to the shortcomings of data lakes by allowing greater autonomy and flexibility for data owners, facilitating greater data experimentation and innovation while lessening the burden on data teams to field the needs of every data consumer through a single pipeline.

Meanwhile, the data meshes’ self-serve infrastructure-as-a-platform provides data teams with a universal, 域无关, 并且通常采用自动化的方法进行数据标准化, 数据产品谱系, 数据产品监控, alerting, logging, 以及数据产品质量度量(换句话说, 数据收集与分享). 综上所述, these benefits provide a competitive edge compared to traditional data architectures, which are often hamstrung by the lack of data standardization between both ingestors and consumers.

啮合还是不啮合:这是个问题

Teams handling a large amount of data sources and a need to experiment with data (换句话说, transform data at a rapid rate) would be wise to consider leveraging a data mesh.

We put together a simple calculation to determine if it makes sense for your organization to invest in a data mesh. 请回答每个问题, below, 用一个数字,把它们加在一起,得到一个总数, 换句话说, 你的数据网格分数.

  • 数据源数量. 您的公司有多少数据源?
  • 数据团队的规模. How many data analysts, data engineers, and product managers (if any) do you have on your data team?
  • 数据域的数量. How many functional teams (marketing, sales, operations, etc.)依赖你的数据源来驱动决策, 你们公司有多少种产品, 以及有多少数据驱动的功能正在被构建? 添加的总.
  • 数据工程的瓶颈. How frequently is the data engineering team a bottleneck to the implementation of new data products on a scale of 1 to 10, 1是“从不”,10是“总是” ?
  • 数据治理. How much of a priority is data governance for your organization on a scale of 1 to 10, with 1 being “I could care less” and 10 being “it keeps me up all night”?

数据网格的分数

In general, 你的分数越高, the more complex and demanding your company’s 数据基础设施 requirements are, and in turn, the more likely your organization is to benefit from a data mesh. 如果你的分数在10分以上, then implementing some data mesh best practices probably makes sense for your company. 如果你的分数在30分以上, 那么您的组织就处于数据网格的最佳位置, 加入这场数据革命将是明智的.

以下是如何分解你的分数:

  • 1–15: Given the size and unidimensionality of your data ecosystem, you may not need a data mesh.
  • 15–30当前位置贵公司正在迅速成熟, and may even be at a crossroads in terms of really being able to lean into data. We strongly suggest incorporating some data mesh best practices and concepts so that a later migration might be easier.
  • 30 or above: Your data organization is an innovation driver for your company, and a data mesh will support any ongoing or future initiatives to democratize data and provide self-service analytics across the enterprise.

As data becomes more ubiquitous and the demands of data consumers continue to diversify, we anticipate that data meshes will become increasingly common for cloud-based companies with over 300 employees.

图片的文章
图片由模因生成器提供.net.

不要忘记可观测性

The vast potential of using a data mesh architecture is simultaneously exciting and intimidating for many in the data industry. In fact, some of our customers worry that the unforeseen autonomy and democratization of a data mesh introduces new risks related to data discovery and health, 以及数据管理.

考虑到数据网格的相对新奇性, 这是一个合理的担忧, but I would encourage inquiring minds to read the fine print. 非但没有引入这些风险, 数据网格实际上是强制的 可扩展的,自服务的可观察性到您的数据.

事实上,域名不能真正做到 own 如果他们的数据没有可观察性. According to Zhamak, such self-serve capabilities inherent to any good data mesh include:

  • 对静止和运动中的数据进行加密
  • 数据产品版本控制
  • 数据产品模式
  • Data product discovery, catalog registration, and publishing
  • 数据治理和标准化
  • 生产数据血统
  • 数据产品监控、警报和日志记录
  • 数据产品质量度量

当包装在一起, these functionalities and standardizations provide a robust layer of observability. The data mesh paradigm also prescribes having a standardized, scalable way for individual domains to handle these various tenants of observability, allowing teams to answer these questions and many more:

  • 我的数据是新鲜的吗?
  • 我的数据坏了吗??
  • 如何跟踪模式更改?
  • What are the upstream and downstream dependencies of my pipelines?

如果你能回答这些问题, you can rest assured that your data is fully observable — and can be trusted.

有兴趣学习更多关于数据网格的知识? 除了扎马克和麦克斯的资源, check out some of our favorite articles about this rising star of data engineering:

你的公司正在构建一个数据网格吗? Reach out Barr Moses和Lior Gavish 用你的经验,技巧和痛点. 推荐一个正规滚球网站希望收到你的来信!