How to Approach Incident Management for Data Teams

As data systems become increasingly distributed and companies ingest more and more data, the opportunities for things to go wrong (and for incidents to occur) only increase. For decades, software engineering teams have relied on a multi-step process to identify, triage, resolve, and prevent issues from taking down their applications.

As data operations mature, it is time to treat data downtime, in other words, periods of time when data is missing, inaccurate, or otherwise erroneous, with the same diligence, particularly when it comes to building more reliable and resilient data pipelines.

While there is not much literature on how data teams should handle incident management for their data, we can already draw on a wealth of resources from our friends in software development. With a few adjustments, these techniques have become invaluable tools for some of the best data engineering teams.

When it comes to building a data incident management workflow for your pipelines, the four key steps are: incident detection, response, root cause analysis (RCA) & resolution, and a blameless post-mortem.

In this article, we will walk through each of these steps and share relevant resources data teams can use when setting up their own incident management strategy.

Incident Detection

Integrating automated data quality monitoring into key workflows is foundational to a successful incident management process, and allows your team to specify who is investigating the issue and its status (i.e., pending, investigating, resolved). Image courtesy of Monte Carlo.

It goes without saying that, first and foremost, you should test your data before it goes into production. Still, even with the most robust tests and checks in place, bad data will slip through the cracks and make its way into production before you can say "broken data pipeline."

When data downtime strikes, the first step is incident detection. Incidents can be detected through data monitoring and alerting, which can be implemented manually on your data pipelines and triggered based on specific thresholds, or layered as part of a data observability solution that fires automatically and routinely based on historical data patterns and custom rules.
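To make the threshold-based approach concrete, here is a minimal sketch of a manual freshness check. The table, column, and six-hour threshold are illustrative assumptions, and `conn` stands for any DB-API style connection to your warehouse.

```python
# A minimal sketch of a manual, threshold-based freshness check. The table,
# column, and six-hour threshold are illustrative; `updated_at` is assumed to
# be stored as a timezone-aware UTC timestamp.
from datetime import datetime, timedelta, timezone

FRESHNESS_THRESHOLD = timedelta(hours=6)

def is_fresh(conn, table: str = "analytics.orders") -> bool:
    """Return True if the table updated within the threshold, False otherwise."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT MAX(updated_at) FROM {table}")
        (last_update,) = cur.fetchone()
    lag = datetime.now(timezone.utc) - last_update
    return lag <= FRESHNESS_THRESHOLD
```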

A key component of data monitoring is anomaly detection, or the ability to identify when the pillars of data health (i.e., volume, freshness, schema, and distribution) veer from the norm. Anomaly detection is most valuable when implemented end-to-end (across your warehouses, lakes, ETL, and BI tools) as opposed to only in a specific silo of your data ecosystem. Good anomaly detection will also tune its algorithms to reduce white noise and false positives.
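As a simple illustration of the volume pillar, the sketch below flags a day whose row count deviates sharply from recent history. The source of the history (e.g., warehouse metadata) and the three-sigma threshold are assumptions; production-grade anomaly detection would also model trend and seasonality.

```python
# A minimal sketch of volume anomaly detection: flag today's row count when it
# sits more than three standard deviations from recent history.
from statistics import mean, stdev

def is_volume_anomaly(history: list, todays_count: int, z_threshold: float = 3.0) -> bool:
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return todays_count != mu  # perfectly flat history: any change is suspicious
    return abs(todays_count - mu) / sigma > z_threshold

# Example: roughly 1M rows a day for a month, then only 200k arrive today.
history = [1_000_000 + day * 500 for day in range(30)]
print(is_volume_anomaly(history, 200_000))  # True -> open an incident
```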

Incident detection can be integrated into data engineering and analytics workflows, ensuring that all data stakeholders and end users are informed when issues arise, through the appropriate communication channels (Slack, email, SMS, carrier pigeon…).
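For example, a detected incident could be pushed to a shared channel with a sketch like the one below, which posts to a Slack incoming webhook. The webhook URL and message fields are placeholders rather than a prescribed integration; a real setup would pull secrets from a vault and route alerts per team or dataset.

```python
# A minimal sketch of routing a detected incident to a shared Slack channel via
# an incoming webhook. URL and message fields are placeholders.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_incident(table: str, issue: str, status: str = "Investigating") -> None:
    message = {
        "text": f":rotating_light: Data incident on `{table}`\n"
                f"*Issue:* {issue}\n*Status:* {status}"
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

notify_incident("analytics.orders", "Freshness breach: no update in 8 hours")
```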

Suggested Resources: 

Response

As an incident commander, being able to zero in on data quality issues makes it easier to understand what went wrong in your data pipelines. Image courtesy of Monte Carlo.

Good incident response begins and ends with effective communication, and fortunately, much of it can be prepared ahead of time and delivered automatically through the appropriate PagerDuty and Slack workflows when the time comes.

Data teams should invest time in writing runbooks and playbooks to standardize incident response. While runbooks tell you how to use different services and describe the common issues they run into, playbooks provide step-by-step processes for handling incidents. Both will provide links to code, documentation, and other materials that can help teams understand what to do when critical pipelines break.

An important component of any good runbook? Delegating roles in the event of an outage or incident.

Traditional reliability engineering programs rely on an on-call process that delegates specific roles depending on the service, often segmented by hour, day, or week. In addition to "incident responders," there is typically an "incident commander" responsible for assigning tasks and synthesizing information as responders and other stakeholders work to resolve the issue.

The incident commander is also responsible for communicating with upstream and downstream consumers who might be affected, i.e., those that work with the data products powered by the broken pipeline.

When data pipelines break, end-to-end lineage is a valuable tool for understanding upstream and downstream dependencies so the relevant parties can be notified before bad data affects the business. Image courtesy of Monte Carlo.

With business context, metadata is a powerful tool for understanding which teams are affected by a given data downtime incident; coupled with automated, end-to-end lineage, communicating the upstream and downstream relationships between these affected assets can be a painless and speedy process, saving teams hours of manual mapping.
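A rough sketch of how automated lineage makes this quick: given an edge map from each asset to its direct downstream dependencies (the graph and asset names here are hypothetical stand-ins for what a catalog or observability tool would expose), a simple traversal returns everything a broken table touches.

```python
# A minimal sketch of walking lineage to find every asset downstream of a
# broken table. Asset names and edges are hypothetical.
from collections import deque

LINEAGE = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["analytics.daily_revenue", "analytics.churn_features"],
    "analytics.daily_revenue": ["bi.exec_dashboard"],
}

def downstream_assets(broken_table: str) -> set:
    affected, queue = set(), deque([broken_table])
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(downstream_assets("raw.orders"))
# e.g. {'staging.orders_clean', 'analytics.daily_revenue',
#       'analytics.churn_features', 'bi.exec_dashboard'} (order may vary)
```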

Once data downtime occurs, it is important to communicate its impact to upstream and downstream consumers, both those who work on the data and those who use it. With the right approach, much of this can be baked into automated workflows using PagerDuty, Slack, and other communication tools.
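For instance, the same detection logic can open a PagerDuty incident automatically. The sketch below assumes PagerDuty's Events API v2 and an integration routing key for your data service; the key and field values are placeholders.

```python
# A minimal sketch of opening a PagerDuty incident via the Events API v2.
# Routing key and field values are placeholders.
import json
import urllib.request

PAGERDUTY_ROUTING_KEY = "YOUR_INTEGRATION_ROUTING_KEY"  # placeholder

def trigger_pagerduty(summary: str, source: str, severity: str = "error") -> None:
    event = {
        "routing_key": PAGERDUTY_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    request = urllib.request.Request(
        "https://events.pagerduty.com/v2/enqueue",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

trigger_pagerduty("Freshness breach on analytics.orders", source="data-observability")
```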

Suggested Resources: 

Root Cause Analysis & Resolution

Understanding which data sources contributed to a broken data pipeline is critical for root cause analysis. Image courtesy of Monte Carlo.

In theory, root causing sounds as easy as running a few SQL queries to segment the data, but in practice, the process can be quite challenging. Incidents can manifest in non-obvious ways across an entire pipeline and impact multiple tables, sometimes hundreds of them.

For instance, one common cause of poor data quality is freshness, i.e., when data is unusually out of date. Such an incident can be the result of any number of causes, including a job stuck in a queue, a time out, a partner that did not deliver its dataset on time, an error, or an accidental scheduling change that removed jobs from your DAG.

In our experience, we’ve found that most data problems can be attributed to one or more of these events: 

  • An unexpected change in the data feeding into the job, pipeline or system
  • A change in the logic (ETL, SQL, Spark jobs, etc.) transforming the data
  • An operational issue, such as runtime errors, permission issues, infrastructure failures, schedule changes, etc.

Quickly pinpointing the issue at hand requires not just the proper tooling, but a holistic approach that considers how and why each of these three sources could break.
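One way to make that holistic approach routine is a triage helper that walks the three sources in order. The sketch below is purely illustrative: `row_count_delta`, `last_commit_touching`, and `recent_job_failures` are hypothetical stubs standing in for real queries against your warehouse, your version control history, and your orchestrator's metadata.

```python
# A sketch of triaging the three usual suspects in order. The helpers are
# hypothetical stubs; replace them with your own data sources.
def row_count_delta(table: str) -> float:
    """Stub: fractional change in row count versus the previous load."""
    return 0.62  # illustrative value

def last_commit_touching(job: str, within_hours: int) -> bool:
    """Stub: whether the job's transformation code changed recently."""
    return False  # illustrative value

def recent_job_failures(job: str) -> bool:
    """Stub: whether the orchestrator reports recent runtime or permission errors."""
    return False  # illustrative value

def triage(table: str, job: str) -> str:
    # 1. Unexpected change in the data feeding the job?
    if abs(row_count_delta(table)) > 0.5:
        return "Investigate an upstream data change"
    # 2. Change in the logic transforming the data?
    if last_commit_touching(job, within_hours=24):
        return "Investigate a recent logic change (ETL, SQL, Spark)"
    # 3. Operational issue in the environment running the job?
    if recent_job_failures(job):
        return "Investigate the operational environment (runtime, permissions, schedule)"
    return "No obvious culprit; move on to deeper root cause analysis"

print(triage("analytics.orders", "orders_daily_load"))  # -> upstream data change
```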

Information related to data releases can be automatically collected and sorted by relevance. Image courtesy of Monte Carlo.

As software (and data) systems grow more complex, it becomes increasingly difficult to pinpoint the exact reason, or root cause, behind an outage or incident. Amazon's 5-Whys approach provides a helpful framework through which to contextualize RCA:

  • Identify the problem
  • Ask why the problem happened and record the cause
  • Determine whether the cause is a root cause:
    • Could the cause have been avoided? 
    • Could the cause have been detected before it happened? 
    • If the cause was a human error, why was that possible? 
  • Repeat the process, treating each cause as the problem. Stop when you are confident that you have found the root causes (a sketch of recording this chain follows below). 
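A lightweight way to keep the chain of whys from living only in someone's head is to record it as structured data that can be shared and revisited. The field names below are illustrative, not a standard schema.

```python
# A minimal sketch of recording a 5-Whys chain as structured data.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Why:
    cause: str
    avoidable: bool
    detectable_earlier: bool

@dataclass
class FiveWhys:
    problem: str
    chain: list = field(default_factory=list)

    def ask_why(self, cause: str, avoidable: bool, detectable_earlier: bool) -> None:
        self.chain.append(Why(cause, avoidable, detectable_earlier))

    def root_cause(self) -> Optional[str]:
        return self.chain[-1].cause if self.chain else None

rca = FiveWhys(problem="Revenue dashboard showed stale numbers")
rca.ask_why("The nightly orders job never ran", avoidable=True, detectable_earlier=True)
rca.ask_why("An accidental scheduling change removed the job from the DAG",
            avoidable=True, detectable_earlier=True)
print(rca.root_cause())
```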

Rarely does a system break for a single reason. As data engineers work to reduce manual toil, smarter processes, tests, data freshness checks, and other solutions should be able to identify issues before they surface downstream. When they don't, it's a strong indication that these failsafes are inadequate. As in the world of software engineering, automated solutions, such as data observability and end-to-end monitoring, are your best bet in the fight against data downtime.

To get started, we've identified four steps data teams must take when conducting RCA on their data pipelines: 

  1. Look at your lineage: To understand what's broken, you need to find the most upstream nodes of your system that exhibit the issue; that's where the problem started and where the answer lies. If you're lucky, the root of all evil occurs in the very dashboard showing the problem and you'll quickly pin down what went wrong.
  2. Look at the code: Reviewing the logic that creates the table, or even the particular field or fields implicated in the incident, will help you come up with plausible hypotheses about what's wrong.
  3. Look at your data: After steps 1 and 2, it's time to look at the data in the table more closely for hints about what might be wrong. A promising approach here is to examine how other fields in the table's anomalous records can provide clues about where the data anomaly occurs (see the sketch after this list).
  4. Look at your operational environment: Many data issues are a direct result of the operational environment that runs your ETL/ELT jobs. A look at logs and error traces from your ETL engines can provide some answers. 
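To illustrate step 3, the sketch below segments today's records by another field to see where missing values concentrate. The table, columns, and segment field are hypothetical, and the query can run over any DB-API connection to your warehouse.

```python
# A minimal sketch of "look at your data": segment the anomalous records by
# another field (here, source_system) to see where null values concentrate.
# Table and column names are hypothetical.
SEGMENTATION_QUERY = """
SELECT source_system,
       COUNT(*) AS total_rows,
       SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts
FROM analytics.orders
WHERE order_date = CURRENT_DATE
GROUP BY source_system
ORDER BY null_amounts DESC
"""

def segment_null_amounts(conn) -> None:
    with conn.cursor() as cur:
        cur.execute(SEGMENTATION_QUERY)
        for source_system, total_rows, null_amounts in cur.fetchall():
            print(f"{source_system}: {null_amounts}/{total_rows} rows missing amount")
```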

(If you haven’t read it already, check out Francisco Alberini’s article on how data engineers can conduct root cause analysis — it’s well worth the read).

Once you've discovered that something is wrong, understood its impact, determined the root cause, and communicated next steps with the appropriate stakeholders, it's time to resolve the issue. This could be as easy as pausing your data pipelines or models and re-running them, but since data can break for millions of reasons, it often involves a fair amount of troubleshooting. 

Suggested Resources: 

Blameless post-mortem 

A blameless post-mortem typically includes a deep dive into the issue and clear next steps to prevent similar incidents from affecting future pipelines. Image courtesy of Kaleidico on Unsplash.

One of my friends, a site reliability engineer with over a decade of experience firefighting outages at Box, Slack, and other Silicon Valley companies, told me I couldn't write an article about incident management without making this point abundantly clear: 

"For every incident, the system is what's at fault, not the person who wrote the code. Good systems are built to be tolerant of faults and of humans. It's the system's job to allow you to make mistakes." 

When it comes to data reliability and DataOps, the same ethos rings true. Pipelines should be fault-tolerant, with processes and frameworks in place to account for both the known unknowns and the unknown unknowns in your data pipeline. 

Regardless of the type of incident that occurred or what caused it, data engineering teams should conduct a thorough, cross-functional post-mortem after they have resolved the issue and completed root cause analysis. 

Here are a few best practices: 

  • Treat everything as a learning experience: To be constructive, post-mortems must be blameless (or, if not, blame-aware). It's natural to try to assign "blame" for an incident, but it rarely helps build trust among colleagues or foster a collaborative culture. By reframing the experience around the goal of "learning and improvement," it becomes easier to proactively take the organizational steps (creating better workflows and processes) and technical steps (justifying investment in new tooling) necessary to eliminate data downtime. 
  • Use this as an opportunity to assess your readiness for future incidents: Update your runbooks and tune your monitoring, alerting, and workflow management tools. As your data ecosystem evolves (adding new third-party data sources, APIs, and even consumers), this step will become critical when it comes to incident prevention. 
  • Document each post-mortem and share it with the broader data team: In software engineering, documenting what went wrong, how systems were affected, and the root cause is often an afterthought. But documentation is just as important as the other steps in the incident management process, because it prevents knowledge gaps from accumulating if the engineers with tribal knowledge leave the team or are unavailable to help.
  • Revisit SLAs: In a previous article, I explained why data teams need to set SLAs for their data pipelines. In a nutshell, service-level agreements (SLAs) are a method many companies use to define and measure the level of service a given vendor, product, or internal team will deliver, as well as potential remedies if they fail to do so. As data systems mature or change, it's important to continually revisit your SLAs, service-level indicators (SLIs), and service-level objectives (SLOs); a simple freshness SLO check is sketched below. SLAs that made sense six months ago probably don't any more; your team should be the first to know about these changes and communicate them to downstream consumers.
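The sketch below checks a freshness SLI against an SLO over a trailing window. The hourly measurement cadence, 60-minute threshold, and 99% target are illustrative assumptions rather than recommended values.

```python
# A minimal sketch of checking a freshness SLI against an SLO target.
# SLI: share of hourly checks where data landed within the threshold.
# SLO: that share must stay at or above the target (e.g., 99%).
def freshness_slo_met(hourly_lags_minutes: list,
                      sli_threshold_minutes: float = 60.0,
                      slo_target: float = 0.99) -> bool:
    within = sum(1 for lag in hourly_lags_minutes if lag <= sli_threshold_minutes)
    return within / len(hourly_lags_minutes) >= slo_target

# 30 days of hourly freshness measurements, with a handful of slow loads.
lags = [15.0] * 715 + [180.0] * 5
print(freshness_slo_met(lags))  # True: 715/720 ≈ 99.3% of checks within the hour
```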

At the end of the day, post-mortems are just as important for data teams as they are for software engineers. As our field continues to advance (we're in the decade of data, after all), understanding how and why data downtime occurs is the only way we can continuously improve the resilience of our systems and processes. 

Suggested Resources 

++++

To quote our SRE forebears: hope is not a strategy.

But armed with these best practices, you can turn incident management from an "ask around and hope for the best" exercise into a well-oiled, highly reliable machine.

Interested in learning more about how the data teams at Vimeo, Compass, Eventbrite, and other leading companies are preventing broken pipelines at scale with data observability? Reach out to Barr Moses and the rest of the Monte Carlo team.