蒙特卡罗 Brings 数据可观测性 to Data Lakes with New Databricks Integration

As companies leverage more and more data to drive decision-making and maintain their competitive edge, it’s crucial that this data is accurate and reliable. 与蒙特卡罗的新Databricks集成, teams working in 数据的湖泊 can finally trust their data through end-to-end data observability and automated lineage of their entire data ecosystem. 

在过去的几年里, 数据的湖泊 have emerged as a must-have for the modern data stack. They often offer more flexibility and customization than traditional data warehouses, but data engineers know there’s usually a tradeoff when it comes to data organization and governance. And those issues can be costly: research from the last few years suggests that companies are 浪费了数百万美元的收入 而数据团队 浪费了他们将近50%的时间 fixing broken pipelines and other data quality issues. 

这个问题的核心是 数据停机时间, those occasions when data is missing, stale, incomplete, or otherwise inaccurate. And it’s why we’re excited to announce that we’re bringing data observability—the solution to 数据停机时间—to 数据的湖泊 with the new Databricks integration for 蒙特卡罗.

什么是数据可观察性?

Inspired by the proven best practices of application observability in DevOps, data observability is an organization’s ability to fully understand the health of the data in their system. 数据可观测性, 就像它的DevOps对手一样, 使用自动监控, 报警, and triaging to identify and evaluate data quality issues. 

At 蒙特卡罗, we look at data observability across five pillars: 

  • 新鲜-你的数据表是最新的
  • 分布—whether your data falls into expected, acceptable ranges
  • 体积-数据表的完整性
  • 模式-改变数据的组织
  • 血统——上游资源, 下游ingestors, and interactions with your data across its entire lifecycle

推荐一个正规滚球网站的 可观察性数据平台 uses machine learning to infer and learn what an organization’s data looks like in order to proactively identify 数据停机时间, 评估其影响, 通知那些负责修理的人, 并能更快地分析和解决根本原因. 

数据湖可观测性的独特挑战

For teams that use Databricks to manage their 数据的湖泊 and run ETL and analysis, 数据质量问题尤其具有挑战性. Data lakes almost always contain larger data sets, often with massive amounts of unstructured data. They also usually require many components and technologies that need to work together, opening up more potential opportunities for pipelines to break. And while data engineers working in other tech stacks can leverage data testing tools like dbt and Great Expectations, scaling these solutions for the large datasets typical of 数据的湖泊 can prove challenging. 

The consequences of data quality issues in 数据的湖泊 can be significant, 特别是当涉及到机器学习时. ML是一个庞大的数据湖应用程序, but if the data feeding those models isn’t accurate and trusted, 输出将受到影响. 作为毫升领袖 吴恩达最近说, “The full cycle of a machine learning project is not just modeling. 而是寻找正确的数据, 部署它, 监控它, 将数据反馈[到模型], showing safety—doing all the things that need to be done [for a model] to be deployed. 这不仅仅是在测试中取得好成绩, which fortunately or unfortunately is what we in machine learning are great at.”

Ensuring data quality is paramount for any ML practitioner—and really, 适用于任何数据驱动的组织. 现在, 与蒙特卡罗的Databricks集成, reducing or even eliminating 数据停机时间 across a data lake is possible. 

蒙特卡罗的Databricks集成是如何工作的

推荐一个正规滚球网站的 new integration makes it possible for data teams working in Databricks to layer automated monitoring and 报警 on top of their data tech stack, 包括在他们的数据湖中. 推荐一个正规滚球网站的 integration is designed to easily scale to environments with hundreds of thousands of tables, 以及任何大小的数据集. 蒙特卡罗也提供自动化, 可扩展数据沿袭, delivering a holistic map of an organization’s data across its entire lifecycle that teams can use to quickly identify and address root causes and potential impacts of 数据停机时间. 还有可以玩滚球的正规appSOC-2认证, Databricks customers can rest assured that their data is kept secure and all best practices will be met. 

像砖创始人之一 Matei Zaharia告诉推荐一个正规滚球网站 最近, “AI and machine learning really should have been called something like ‘data extrapolation’, because that’s basically what machine learning algorithms do by definition: generalize from known data in some way, 经常使用某种统计模型. 所以如果你这么说的话, then I think it becomes very clear that the data you put in is the most important element.”

We couldn’t be more excited to see what’s in store for the future of 数据的湖泊. 和推荐一个正规滚球网站新的Databricks集成, the data powering this future just became a lot more reliable and trustworthy. 

Ready to achieve end-to-end data observability that encompasses your data lake? Reach out to 伊塔Bleier to learn more about how 蒙特卡罗 and Databricks work together.