How to Build Your Data Reliability Stack

On July 26, 2004, Google, a five-year-old startup at the time, faced a serious problem: their application was down.

For several hours, users across the United States, France, and the United Kingdom were unable to access the popular search engine. As engineers scrambled to resolve the issue and uncover its root cause, the company's 800 employees and millions of users were left in the dark. By midday, after a lengthy and intensive investigation, a few panicked engineers determined that the MyDoom virus was to blame.

In 2021, an outage of that length and scale is considered rather anomalous, but 15 years ago, these types of software outages weren’t out of the ordinary. After leading teams through several such experiences over the years, Benjamin Treynor Sloss, a Google engineering manager at the time, decided there had to be a better way to manage and prevent these dizzying fire drills, not just at Google but across the industry.

Inspired by his early career building data and IT infrastructure, Sloss codified his learnings as an entirely new discipline — site reliability engineering (SRE) — focused on optimizing the reliability of maintaining and operating software systems, like Google's search engine.

According to Sloss and others paving the way forward for the discipline, SRE is about automating away the need to worry about edge cases and unknown unknowns (like buggy code, server failures, and viruses). Ultimately, Sloss and his team wanted engineers to find a way to automate the maintenance of the company's quickly growing code base while ensuring they were covered when systems did break.

“SRE is a way of thinking and approaching production. Most of the engineers who develop a system can also serve as SREs for that system,” he said. “The question is: can they take a complex, maybe not well-defined problem and come up with a scalable, technically reasonable solution?”

If Google had the right processes and systems in place to anticipate and prevent downstream issues, outages wouldn't just be easy to fix with minimal impact to users; they could be prevented altogether.

Data is software and software is data

Nearly 20 years later, data teams are faced with a similar fate. Like software, data systems are becoming increasingly complex, with multiple upstream and downstream dependencies. Ten or even five years ago, it was normal and accepted to manage your data in silos, but now, teams and even entire companies work with data, demanding a more collaborative and failure-proof approach to data management.

Over the past few years, we've witnessed the widespread adoption of software engineering best practices by data engineering and analytics teams to close this gap, from open source tools like dbt and Apache Airflow that streamline data transformation and orchestration, to cloud-based data warehouses and lakes like Snowflake and Databricks.

Fundamentally, this shift toward agile principles relates to how we conceptualize, design, build, and maintain data systems. Gone are the days of siloed dashboards and reports that were generated once, rarely used, and never updated; now, to be useful at scale, data must also be productized: maintained and managed for consumption by end users across the entire company.

For data to be treated like a software product, it must also be as reliable as one.

The rise of data reliability 

Tweet courtesy of the ever-hilarious Seth Rosen, co-founder of TopCoat Data.

In short, data reliability is an organization's ability to deliver high data availability and health throughout the entire data life cycle, from ingestion to the end products: dashboards, ML models, and production datasets. In the past, attention was paid to solving different pieces of this puzzle in isolation, from testing frameworks to data observability, but this approach is no longer sufficient. As any data engineer will tell you, data reliability (and the ability to truly treat data like a product) isn't achieved in silos.

A schema change to one data asset can affect fields in multiple tables, even those downstream of a Tableau dashboard. A missing value can send your finance team into hysterics when they query Looker for new insights. And when 500 rows suddenly turns into 5,000, it's usually a sign something's wrong — you just don't know where or how the table broke.

Even with the wealth of data engineering tools and resources that exist, data can break for millions of different reasons, from operational issues and code changes to problems with the data itself.

After speaking with hundreds of data teams and through my own experience, I've found that roughly 80 percent of data issues aren't covered by testing alone. Image courtesy of Lior Gavish.

From my own experience and after talking to hundreds of teams, most data engineers today discover data quality issues in one of two ways: testing (the ideal outcome) or angry messages from downstream stakeholders (the likely outcome).

In the same way SREs manage application downtime, today's data engineers must focus on reducing data downtime — periods of time when data is inaccurate, missing, or otherwise erroneous — through automation, continuous integration and continuous deployment of data models, and other agile principles.

In the past, we’ve discussed how to build a quick and dirty data platform; now, we're building on that design to reflect the next step in the journey toward good data: the data reliability stack.

Here’s how and why to build one.

Introducing: the data reliability stack

Today, data teams are tasked with building scalable, high-performance data platforms that can meet the needs of cross-functional analytics teams by storing, processing, and piping data to generate accurate and timely insights. But to get there, we also need the right methods to ensure that raw data is usable and trustworthy. To that end, some of the best teams are building data reliability stacks as part of their modern data platforms.

In my opinion, the modern data reliability stack is made up of four distinct layers: testing, CI/CD, data observability, and data discovery, each representing a different step in your company's data quality journey.

The data reliability stack is made up of four layers: testing, CI/CD, data observability, and data discovery. Image courtesy of Lior Gavish.

Testing

Testing plays a crucial role in discovering data quality issues before data ever enters a production pipeline. With testing, engineers anticipate that something might break and write logic to detect the issue preemptively.

Data testing is the process of validating an organization's assumptions about its data, either before or during production. Writing basic tests that check for things like uniqueness and not_null is how organizations can validate the fundamental assumptions they make about their source data. It's also common for organizations to ensure that data is in the correct format for their teams and that it meets their business requirements.

Some of the most common data quality tests include: 

  • Null values – are any values unknown (NULL)? 
  • Volume – did I get any data at all? Did I get too much or too little? 
  • Distribution – is my data within an accepted range? Are the values within a given column in-range? 
  • Uniqueness – are any values duplicated? 
  • Known invariants – is profit always the difference between revenue and cost, and do other well-known facts about my data hold? 

From my own experience, two of the best tools for testing data are dbt tests and Great Expectations (the latter a more general-purpose tool). Both are open source and allow you to catch data quality issues before they land in the hands of your stakeholders. While dbt is not a testing solution per se, its out-of-the-box tests work well if you're already using the framework to model and transform your data.
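The common tests listed above can be sketched in plain Python. This is a minimal illustration of the checks themselves, not the API of dbt or Great Expectations; the column names and thresholds are hypothetical.

```python
# Hypothetical batch of order records; column names are illustrative.
rows = [
    {"order_id": 1, "revenue": 120.0, "cost": 80.0, "profit": 40.0},
    {"order_id": 2, "revenue": 65.0,  "cost": 50.0, "profit": 15.0},
    {"order_id": 3, "revenue": 90.0,  "cost": 70.0, "profit": 20.0},
]

def check_not_null(rows, column):
    """Null values: no unknown (None) entries in the column."""
    return all(r.get(column) is not None for r in rows)

def check_volume(rows, min_rows, max_rows):
    """Volume: did we get a plausible amount of data?"""
    return min_rows <= len(rows) <= max_rows

def check_range(rows, column, lo, hi):
    """Distribution: every value falls inside an accepted range."""
    return all(lo <= r[column] <= hi for r in rows)

def check_unique(rows, column):
    """Uniqueness: no duplicated values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_invariant(rows):
    """Known invariant: profit == revenue - cost for every row."""
    return all(abs(r["profit"] - (r["revenue"] - r["cost"])) < 1e-9 for r in rows)

results = {
    "not_null": check_not_null(rows, "order_id"),
    "volume": check_volume(rows, min_rows=1, max_rows=10_000),
    "distribution": check_range(rows, "revenue", 0, 1_000_000),
    "uniqueness": check_unique(rows, "order_id"),
    "invariant": check_invariant(rows),
}
print(results)  # every check passes for this batch
```

In dbt, the first four checks map to the built-in `unique`, `not_null`, and `accepted_values`-style schema tests; the point here is only how little logic each assertion requires once you know what assumption you're testing.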

Continuous Integration (CI) / Continuous Delivery (CD)

CI/CD is a key component of the software development life cycle that ensures, via automation, that new code deployments are stable and reliable as updates are made over time.

In the context of data engineering, CI/CD refers not only to continuously integrating new code, but also to continuously integrating new data into the system. By detecting issues at an early stage, ideally as soon as code is committed or new data is merged, data teams can achieve a faster, more reliable development workflow.

Let’s start with the code part of the equation. Just like traditional software engineers, data engineers benefit from using source control, for example GitHub, to manage their code and transformations so that new code is properly reviewed and versioned. A CI/CD system, for example CircleCI or Jenkins (open source), with a fully automated testing and deployment setup, creates more predictability and consistency in how code is deployed. This should all sound very familiar. The added complexity data teams run into is understanding how code changes affect the datasets they output. That’s where emerging tools like Datafold come in, allowing teams to compare the data output of a new piece of code against that of a previous run. By catching unexpected data discrepancies early in the process, before code is deployed, higher reliability is achieved. While this process requires representative staging or production data, it can be highly effective.
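The idea of a data diff in CI can be sketched in a few lines. This is a toy illustration of the concept, not Datafold's actual API; the function name and record shapes are made up for the example.

```python
def diff_datasets(old_rows, new_rows, key):
    """Compare two versions of a dataset keyed by a primary key.

    Returns the keys that were added, removed, or whose values changed:
    the kind of summary a data-diff step in CI can surface before a
    code change is deployed.
    """
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}

# Output of the currently deployed transformation code...
old = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
# ...versus the output of the branch under review.
new = [{"id": 1, "total": 10}, {"id": 2, "total": 25}, {"id": 3, "total": 5}]

report = diff_datasets(old, new, key="id")
print(report)  # {'added': [3], 'removed': [], 'changed': [2]}
```

A CI job would fail (or flag for review) when `removed` or `changed` is unexpectedly large, turning "this refactor shouldn't change the output" from a hope into an assertion.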

There is another family of tools designed to help teams deliver new data, rather than new code, more reliably. With LakeFS or Project Nessie, teams can use git-like semantics to stage data before releasing it for downstream consumption. Imagine creating a branch with newly processed data, and only committing it to the main branch if it’s deemed good! In conjunction with testing, data branching is a very powerful way to stop bad data from reaching downstream consumers.
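A toy model makes the branch-then-promote workflow concrete. This is a simplified sketch of the idea behind git-like data versioning, not the API of LakeFS or Project Nessie; the class and validation function are invented for illustration.

```python
class DataBranches:
    """Toy model of git-like data branching: write to a branch,
    validate, and only then promote to main."""

    def __init__(self, initial_rows):
        self.branches = {"main": list(initial_rows)}

    def branch(self, name, source="main"):
        # A new branch starts as a snapshot of its source.
        self.branches[name] = list(self.branches[source])

    def write(self, name, rows):
        self.branches[name].extend(rows)

    def merge(self, name, validate):
        """Promote a branch to main only if validation passes."""
        candidate = self.branches[name]
        if not validate(candidate):
            return False  # bad data never reaches downstream consumers
        self.branches["main"] = list(candidate)
        return True

repo = DataBranches([{"id": 1}])
repo.branch("ingest-batch")
repo.write("ingest-batch", [{"id": 2}, {"id": None}])  # a bad row sneaks in

no_null_ids = lambda rows: all(r["id"] is not None for r in rows)
merged = repo.merge("ingest-batch", validate=no_null_ids)
print(merged, repo.branches["main"])  # False [{'id': 1}] — main stays clean
```

The payoff is that validation failures quarantine the batch on its branch instead of corrupting the dataset consumers are already querying.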

Data Observability

Testing and version control in the pre-production stages of your data pipelines are important first steps toward data reliability, but what happens when data breaks in production, and beyond?

Beyond the elements of the data reliability stack that handle data quality before transformation and modeling, data engineering teams need to invest in end-to-end, automated data observability solutions that can detect data issues in near real time. Similar to DevOps observability solutions (i.e., Datadog and New Relic), data observability uses automated monitoring, alerting, and triage to identify and evaluate data quality issues.

Image courtesy of Barr Moses.

Data observability is measured across five key pillars of data health and reliability: freshness, distribution, volume, schema, and lineage:

  • Freshness: Is the data recent? When was the last time it was generated? What upstream data is included and/or omitted?
  • Distribution: Is the data within accepted ranges? Is it the right format? Is it complete?
  • Volume: Has all the data arrived? Was data duplicated by accident? How much data was removed from a table?
  • Schema: What is the schema, and how has it changed? Who made changes to the schema and for what reasons?
  • Lineage: For a given data asset, what are the upstream sources and downstream assets it impacts? Who generates the data, and who relies on it to make decisions?

Data observability accounts for the other 80 percent of data downtime that teams can't anticipate (the unknown unknowns), not just detecting and alerting on data quality issues, but also providing root cause analysis, impact analysis, field-level lineage, and operational insights about your data platform.
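The monitoring half of observability often boils down to learning what "normal" looks like from history and alerting on deviations. Here is a minimal sketch of such a statistical check on the volume pillar; the row counts and z-score threshold are illustrative, not taken from any particular tool.

```python
import statistics

def volume_anomaly(history, latest, z_threshold=3.0):
    """Flag the latest row count if it deviates too far from recent history.

    A minimal sketch of the kind of check a data observability tool
    automates across freshness, volume, and distribution.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean  # history is flat: any change is suspicious
    z = abs(latest - mean) / stdev
    return z > z_threshold

# Daily row counts for a table over the past week (hypothetical).
history = [10_120, 9_980, 10_050, 10_210, 9_950, 10_080, 10_000]

print(volume_anomaly(history, 10_100))  # an ordinary day
print(volume_anomaly(history, 500))    # likely broken ingestion
```

The same pattern, baseline from history plus a deviation threshold, applies to freshness (hours since last update) and distribution (share of nulls in a column); production tools add seasonality handling and alert routing on top.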

With these tools, your team will be able not only to resolve issues, but also to prevent similar problems from occurring in the future through historical and statistical analysis of your data's reliability.

Data discovery

While not traditionally considered part of the reliability stack, we believe data discovery is critical. One of the most common ways teams create unreliable data is by overlooking assets that already exist and creating new datasets that significantly overlap with, or even contradict, existing ones. This creates confusion among consumers in the organization about which data is most relevant to a given business question, and diminishes trust and perceived reliability. It also creates significant data debt for data engineering teams. Without good discovery in place, teams find themselves maintaining dozens of different datasets describing the same dimensions or facts. The complexity alone makes further development a challenge and high reliability extremely difficult.

While data discovery is an extremely challenging problem to solve, data catalogs have made strides in democratization and accessibility. For instance, we’ve witnessed how solutions like Atlan, data.world, and Stemma can address these two concerns.

Simultaneously, data observability solutions can help eliminate many of the common perceived reliability issues and data debt challenges that stand in the way of data discovery. By pulling together metadata, lineage, quality indicators, usage patterns, and human-generated documentation, data observability can answer questions such as: What data do I have that describes our customers? Which of those datasets should I trust the most? How can I use that dataset? Who are the experts who can help answer questions about it? In short, these tools give data consumers and producers a way to find the dataset or report they need, avoiding duplicative work.

Democratizing this information and making it accessible to anyone who uses or creates data is key to solving reliability.

The future of data reliability

As data becomes ever more critical to the day-to-day operations of modern businesses and powers more digital products, the need for reliable data will only increase, as will the technical requirements for ensuring this trust.

Still, while your stack will get you some of the way there, data reliability isn't solved with technology alone. The strongest approaches also incorporate cultural and organizational shifts that prioritize governance, privacy, and security, three areas of the modern data stack poised to accelerate over the next few years. Another powerful tool for data reliability is prioritizing service-level agreements (SLAs) and other measures of how often issues arise relative to the expectations you've agreed on with your stakeholders, yet another tried-and-true best practice gleaned from software engineering. These metrics will be critical as your organization begins treating data like a product, and its data teams less like financial analysts and more like product and engineering teams.
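Tracking an SLA like this reduces to simple arithmetic once incidents are logged. A minimal sketch, where the incident durations and the 99.5% target are hypothetical numbers chosen for the example:

```python
def uptime_percent(incident_minutes, period_days):
    """Compute data uptime over a period.

    `incident_minutes` lists the duration of each data downtime incident
    (inaccurate, missing, or otherwise erroneous data) in the period.
    """
    total_minutes = period_days * 24 * 60
    downtime = sum(incident_minutes)
    return 100 * (total_minutes - downtime) / total_minutes

# Three incidents over a 30-day period, against an illustrative 99.5% target.
uptime = uptime_percent([45, 120, 15], period_days=30)
print(f"{uptime:.2f}%", "meets SLA" if uptime >= 99.5 else "misses SLA")
```

Even a rough number like this changes the conversation with stakeholders: instead of debating whether the data "feels" reliable, teams can agree on a target and measure against it each month.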

In the meantime, here's wishing you no data downtime, and plenty of uptime!

Interested in learning about data reliability? Reach out to Lior Gavish or the rest of the Monte Carlo team for more information.