What’s In Store for the Future of the Modern Data Stack?

Bob Muglia, former CEO of Snowflake, discusses what’s next for the tools and technologies that power data analytics and engineering.

A few weeks ago, I had the opportunity to chat with Bob Muglia, former CEO of Snowflake and one of the pioneers of the modern data stack, to learn about his predictions for the future of our industry.

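If Maxime Beauchemin is the father of data engineering, Muglia is certainly the father of the analytical cloud database, bringing to market one of the most popular solutions in the space: Snowflake. Launched in 2014, the company popularized one of the most transformative technologies of the decade.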

Before we begin, however, it’s important to understand what exactly we mean by the modern data stack:

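  • It’s cloud-based
  • It’s modular and customizable
  • It’s best-of-breed first (choosing the best tool for a specific job rather than an all-in-one solution)
  • It’s metadata-driven
  • It runs on SQL (at least for now)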

With these basic concepts in mind, let’s dive into Bob’s predictions for the future of the modern data stack. (For the broader conversation, be sure to check out my interview below.)

1. Data lakes and data warehouses will become indistinguishable

Data warehouses have been around for decades, and they took a leap forward in 2013 when Amazon Redshift introduced the first cloud-based warehouse. In recent years, more customizable and flexible data lakes have become increasingly popular, and companies have had to evaluate whether a data warehouse or a data lake is the right choice for their business. That delineation won’t last for long, according to Bob.

“Increasingly, data lakes and data warehouses are coming together into one coherent thing,” said Bob. “I really think they’ll be very indistinguishable in five years. It’s really whether you’re looking at it as a file, or whether you’re looking at it as a relational table. That’s the right abstraction to think of. There are times when files are valuable, particularly when it comes to interchange, but most of the operations you want to perform are actually done in a relational architecture. And so this idea of a data lake and data warehouse are coming together.”

2. Analytics will merge with SQL-based systems within data platforms

This trend toward cohesion extends to analytics, including machine learning.

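“Right now, there are basically five vendors in the industry building the cloud platforms that people are building on,” said Bob. “There’s Snowflake and Databricks, and then the three major cloud vendors (Amazon, Microsoft, and Google) all have their own things. All of these are coherent, and I think you’ll see analytic systems merging into the data platforms. You certainly see that with what Databricks is doing, what Snowflake is doing, and what all the cloud vendors are doing. You’ll see a very complete stack that has analytics, advanced analytics, and machine learning systems, together with SQL-based data management systems.”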

3. Universal standards for governance, lineage, and metrics will begin to emerge

In the near future, Bob hopes to see the industry begin developing standards around data governance, starting in 2022.

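“I see an opportunity to develop some key standards in the modern data stack related to governance, lineage, metrics, things like that,” said Bob. “I think there’s a need to allow interoperability between these platforms and tools, and there’s a need for some standards to exist.”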

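He doesn’t expect it to be easy, however. “Governance is not a simple problem, but it’s an important one, because it’s something the whole world cares about: protecting people’s information matters to everyone,” he said. “It matters to companies in terms of their reputation. It matters in terms of intellectual property rights. There are strong regulatory reasons for it. So this world is evolving and people have to stay on top of it. While the tools of the modern data stack unlock a lot of incredible capabilities for working with data, they also have to be secured and properly governed to make sure that only the people who should have access to data do. I think that while there are some tools available, we’re still very early in this.”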

Bob has one specific recommendation for businesses looking to address governance. 

“Gartner has done a pretty good job in laying out what they call the data fabric. That’s a model worth looking at when it comes to data governance. It’s high-level and abstract, but as someone who works with vendors, it’s a good template for thinking about how to build these things.”

4. Predictive analytics will evolve dramatically

Looking ahead to the next few years, Bob predicts a significant shift in the way predictive analytics work gets done.

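“I think we’ll continue to see predictive analytics evolve,” said Bob. “The current generation of predictive analytics systems is really built around data frames, which you operate on using languages like Python or Scala. And while this is effective (people are doing it, and the tools are improving), I still think we’re at a very, very primitive level. I expect some fairly dramatic improvements in the way machine learning is done over the next five to ten years.”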

5. Knowledge graphs will be in high demand

Specifically, Bob foresees an increase in demand for knowledge graphs, one he believes data platforms will evolve to meet.

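“I think the overall trend is that we’ll start to see knowledge graphs emerge,” said Bob. “The modern data stack will begin to evolve so that knowledge graphs can be built on top of it. That really requires taking the business logic associated with something and embedding it in the database. That’s the difference.”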

6. The next generation of data sharing will require domain-oriented governance within (and between) organizations

Bob thinks that data sharing—both within organizations and between organizations, as commerce—is central to the future of our industry. 

“Data starts with an organization and is created by an organization,” said Bob. “It is an asset that an organization creates, that you can then extract value from and utilize appropriately. And the work that Thoughtworks has done around the data mesh, with its organizing principles for data and its domain-oriented thinking, is right.”

Bob acknowledges that how this is achieved will look different from one organization to the next.

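“Now, if you’re a 50-person company, you probably don’t have a bunch of different data domains,” he said. “But if you’re a very large company, domain-oriented data is conceptually correct. The interesting thing is, what are the mechanisms you use to actually do that? That’s what we did with Snowflake. The idea of data sharing was to build the mechanisms needed to support a domain-oriented governance model. So this idea of domain-oriented governance fits very well in the modern data stack. That’s what data sharing is. If you look at what Snowflake has done with the data exchange inside a company, it allows a company to set up different domains of data expertise, and then share that data with other organizations.”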

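Bob predicts that this growth in domain expertise will lead to more data application development. “People have data, but they also have business knowledge that they want to bring to that data. That means you’re creating a data application: you combine data and business knowledge and build an application that can take autonomous action based on what’s happening in the data. That’s the next generation of data marketplaces and sharing, because different domains will have different expertise.”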

This impact will reverberate far beyond the data industry itself. 

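“If you have a small regional bank, being able to get analytical skills from a boutique organization that focuses on that industry is incredible,” said Bob. “You may not have the data scientist capabilities inside your organization, but you can rent them through an organization that’s providing it. I think we’ll see thousands of companies building analytical services targeted at vertical industries where they have expertise they can apply and bring to businesses. That’s the next generation of data sharing.”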

Bottom line: 2022 is the year to focus on solving the problem of productization

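In Bob’s view, the modern data stack will continue to create opportunities for data work within and across organizations, increasingly relying on teams to think about data as a product. As more companies adopt domain-oriented architectures and share or sell data to other businesses, the need for consistent, industry-wide standards of data trust and reliability will become more pressing.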

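“The modern data stack itself is still relatively nascent,” said Bob. “Whether it’s the observability and performance of data, or metadata management and governance, there are still a ton of opportunities for improvement and problems for customers to solve.”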

Looking for more insights about how to build a modern data stack?

Download the 2021 Data Platform Trends Report.