Data Observability: The Next Frontier of Data Engineering

To keep pace with data’s clock speed of innovation, data engineers need to invest not only in the latest modeling and analytics tools, but also in technologies that can increase data accuracy and prevent broken pipelines. The solution? Data observability, the next frontier of data engineering and a pillar of the emerging Data Reliability category.

As companies become increasingly data driven, the technologies underlying these rich insights have grown more and more nuanced and complex. While our ability to collect, store, aggregate, and visualize this data has largely kept up with the needs of modern data teams (think: domain-oriented data meshes, cloud warehouses, data visualization tools, data modeling solutions), the mechanics behind data quality and integrity have lagged. 

No matter how advanced your analytics dashboard is or how heavily you invest in the cloud, your best-laid plans are all for naught if the data it ingests, transforms, and pushes downstream isn’t reliable. In other words, “garbage in” is “garbage out.” 

Before we address what Data Reliability looks like, let’s address how unreliable, “garbage” data is created in the first place. 

How good data turns bad 

After speaking with several hundred data engineering teams over the past 12 months, I’ve noticed there are three primary reasons why good data turns bad: 1) a growing number of data sources in a single data ecosystem, 2) the increasing complexity of data pipelines, and 3) bigger, more specialized data teams.

More and more data sources

Today, companies use anywhere from dozens to hundreds of internal and external data sources to produce analytics and ML models. Any one of these sources can change in unexpected ways and without notice, compromising the data the company uses to make decisions. 

For example, an engineering team might make a change to the company’s website, thereby modifying the output of a data set that is key to marketing analytics. As a result, key marketing metrics may be wrong, leading the company to make poor decisions about ad campaigns, sales targets, and other important, revenue-driving projects.

Increasingly complex data pipelines

Data pipelines are increasingly complex with multiple stages of processing and non-trivial dependencies between various data assets. With little visibility into these dependencies, any change made to one data set can have unintended consequences impacting the correctness of dependent data assets. 

Something as simple as a change of units in one system can seriously impact the correctness of another, as in the case of the Mars Climate Orbiter. The NASA space probe crashed because one piece of software produced outputs in imperial units while the navigation software expected SI units, bringing the orbiter too close to the planet. Like spacecraft, analytic pipelines can be extremely vulnerable to the most innocent changes at any stage of the process.
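One defensive pattern is to make units explicit at every pipeline boundary and fail loudly on anything unrecognized, rather than silently assuming a unit. The sketch below illustrates the idea; the unit names and conversion factors are illustrative assumptions, not a prescribed API:

```python
# A minimal sketch: tag every numeric reading with an explicit unit and
# normalize to SI on ingestion. Unknown units raise instead of being guessed.
CONVERSIONS_TO_NEWTON_SECONDS = {
    "newton_seconds": 1.0,
    "pound_force_seconds": 4.44822,  # lbf·s → N·s (the Orbiter-style mismatch)
}

def to_si_impulse(value: float, unit: str) -> float:
    """Convert an impulse reading to SI units, refusing to guess on unknown units."""
    try:
        factor = CONVERSIONS_TO_NEWTON_SECONDS[unit]
    except KeyError:
        raise ValueError(f"Unknown unit {unit!r}; refusing to guess") from None
    return value * factor
```

Failing at the boundary turns a silent downstream corruption into an immediate, attributable error.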

Bigger, more specialized data teams

As companies increasingly rely on data to drive smart decision making, they are hiring more and more data analysts, scientists, and engineers to build and maintain the data pipelines, analytics, and ML models that power their services and products, as well as their business operations. 

Miscommunication or insufficient coordination is inevitable and will cause these complex systems to break as changes are made. For example, a new field added to a data table by one team may cause another team’s pipeline to fail, resulting in missing or partial data. Downstream, this bad data can lead to millions of dollars in lost revenue, erosion of customer trust, and even compliance risk.
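A lightweight guard against this failure mode is to diff a table’s actual schema against what downstream consumers expect before their jobs run. The field names below are hypothetical, chosen only to illustrate the check:

```python
# A minimal sketch: detect schema drift so the owning team can be alerted
# before a downstream pipeline fails on an unexpected or missing field.
def diff_schema(expected: set, actual: set) -> dict:
    """Return fields added or dropped relative to the expected schema."""
    return {
        "added": sorted(actual - expected),
        "dropped": sorted(expected - actual),
    }

expected = {"user_id", "event_ts", "revenue"}
actual = {"user_id", "event_ts", "revenue", "discount_code"}  # new upstream field
drift = diff_schema(expected, actual)
# A non-empty diff is a signal to notify consumers, not necessarily an error.
```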

The good news about bad data? Data engineering is going through its own renaissance and we owe a big thank you to our counterparts in DevOps for some of the key concepts and principles guiding us towards this next frontier. 


The next frontier: data observability

An easy way to frame the effect of “garbage data” is through the lens of software application reliability. For the past decade or so, software engineers have leveraged targeted solutions like New Relic and Datadog to ensure high application uptime (in other words, working, performant software) while keeping downtime (outages and laggy software) to a minimum. 

In data, we call this phenomenon data downtime. Data downtime refers to periods of time when data is partial, erroneous, missing, or otherwise inaccurate, and it only multiplies as data systems become increasingly complex, supporting an endless ecosystem of sources and consumers.

By applying the same principles of software application observability and reliability to data, these issues can be identified, resolved, and even prevented, giving data teams confidence in their data to deliver valuable insights.

Below, we walk through the five pillars of data observability. Each pillar encapsulates a series of questions which, taken together, provide a holistic view of data health. They may look familiar to you. 

  • Freshness: is the data recent? When was it last generated? What upstream data is included or omitted?
  • Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
  • Volume: has all the data arrived?
  • Schema: what is the schema, and how has it changed? Who made these changes, and for what reasons?
  • Lineage: for a given data asset, what are the upstream sources and downstream assets it impacts? Who are the people generating this data, and who relies on it for decision making?
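As a concrete sketch, two of these pillars — freshness and volume — can be expressed as simple checks over table metadata. The thresholds and field names below are illustrative assumptions, not a prescribed implementation:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_loaded_at: datetime, max_staleness: timedelta) -> bool:
    """Freshness: was the table updated recently enough?"""
    return datetime.now(timezone.utc) - last_loaded_at <= max_staleness

def check_volume(row_count: int, expected_min: int) -> bool:
    """Volume: did all the data arrive?"""
    return row_count >= expected_min

# Example: a table loaded an hour ago, against a 6-hour staleness budget.
fresh = check_freshness(datetime.now(timezone.utc) - timedelta(hours=1),
                        max_staleness=timedelta(hours=6))
enough = check_volume(row_count=98_500, expected_min=95_000)
```

In practice these thresholds would be learned from historical behavior rather than hand-set, as discussed below.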

A robust and holistic approach to data observability requires the consistent and reliable monitoring of these five pillars through a centralized interface that serves as a single source of truth about the health of your data.

An end-to-end Data Reliability Platform allows teams to explore and understand their data lineage, automatically mapping upstream and downstream dependencies, as well as the health of each of these assets.
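Under the hood, lineage is naturally modeled as a directed graph that can be walked to find every asset impacted by a change. The asset names below are hypothetical, used only to illustrate the traversal:

```python
from collections import deque

# A minimal sketch: downstream dependencies as an adjacency list.
downstream = {
    "raw_events": ["stg_events"],
    "stg_events": ["marketing_dashboard", "ml_features"],
    "ml_features": ["churn_model"],
}

def impacted_assets(source: str) -> list:
    """Breadth-first walk returning every asset downstream of a changed one."""
    seen, queue, order = set(), deque([source]), []
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

A change to `raw_events` would surface `stg_events`, both of its consumers, and the model that depends on them — exactly the blast radius a team needs before shipping a change.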

An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data at rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.

Such a solution also requires minimal configuration and practically no threshold-setting. It uses ML models to automatically learn your environment and your data. It uses anomaly detection techniques to let you know when things break. And it minimizes false positives by taking into account not just individual metrics, but a holistic view of your data and the potential impact from any particular issue.
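To make the “no threshold-setting” idea concrete, here is a deliberately simple stand-in for those learned models — a 3-sigma rule over a metric’s history, such as daily row counts. This is an assumption-laden sketch, far simpler than the ML described above:

```python
import statistics

def is_anomalous(history: list, latest: float, sigmas: float = 3.0) -> bool:
    """Flag the latest observation if it deviates more than `sigmas`
    standard deviations from the historical baseline — no hand-set threshold."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) > sigmas * stdev

daily_row_counts = [100_400, 101_200, 99_800, 100_900, 99_600]
# A day with a tenth of the usual rows stands out against the learned baseline.
```

A production system would layer seasonality, trend, and cross-metric context on top of this to keep false positives down.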

This approach provides rich context that enables rapid triage and troubleshooting and effective communication with stakeholders impacted by Data Reliability issues. Unlike ad hoc queries or simple SQL wrappers, such monitoring doesn’t stop at “field X in table Y has values lower than Z today.”

A data catalog brings all metadata about a data asset into a single pane of glass, so you can see lineage, schema, historical changes, freshness, volume, users, queries, and more within a single view.

Perhaps most importantly, such a solution prevents data downtime incidents from happening in the first place by exposing rich information about data assets across these five pillars so that changes and modifications can be made responsibly and proactively.

What’s next for data observability? 

Personally, I couldn’t be more excited for this new frontier of data engineering. As data leaders increasingly invest in Data Reliability solutions that leverage data observability, I anticipate that this field will continue to intersect with some of the other major trends in data engineering, including data mesh, machine learning, cloud data architectures, and the platformization of data products.

Interested in pioneering the field of data observability with Monte Carlo? Apply for a role on our team!