It's one thing to say your company is data driven. It's another thing to derive meaningful insights from your data.

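Just ask Matei Zaharia, the original creator of Apache Spark. Since its first release in 2010 by Matei and U.C. Berkeley's AMPLab, Apache Spark has emerged as one of the world's leading open source cluster-computing frameworks, giving data teams a faster, more efficient way to process and collaborate on large-scale data analytics.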

With Databricks, Matei and his team took their vision for scalable, reliable data to the cloud by building a platform that helps data teams more efficiently manage their pipelines and generate ML models. After all, as Matei notes: "Your AI is only as good as the data you put into it."

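We sat down with Matei to discuss the inspiration behind Apache Spark, how the data engineering and analytics landscape has evolved over the past decade, and why data reliability is top of mind for the industry.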

In 2010, as a researcher at U.C. Berkeley, you released Apache Spark, and since then, it's become one of the modern data stack's most widely used technologies. What originally inspired your team to develop this project?

Matei Zaharia (MZ): Ten years ago, there was a lot of interest in using large data sets for analytics, but you generally had to be a software engineer and have knowledge of Java or other programming languages to write applications for them. Before Apache Spark, there was MapReduce, with Hadoop as an open source implementation, for processing and generating big data sets, but my team at Berkeley's AMPLab wanted to find a way to make data processing more accessible to users beyond full-time software engineers, such as data analysts and data scientists.

Specifically for data scientists, one of the first functionalities we built was a SQL engine on top of Spark that allowed users to combine SQL with functions that you could write in another programming language like Python.
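The idea of combining SQL with a function written in Python can be illustrated on a small scale. The sketch below is not Spark code: it swaps in Python's built-in sqlite3 module as a stand-in engine, and the orders table, its columns, and the normalize_country function are hypothetical names chosen only for illustration.

```python
import sqlite3


def normalize_country(raw):
    """Hypothetical Python function: map free-form country text to a code."""
    aliases = {"united states": "US", "usa": "US", "germany": "DE"}
    if raw is None:
        return None
    return aliases.get(raw.strip().lower(), "UNKNOWN")


def orders_by_country(rows):
    """Load (country, amount) rows into an in-memory table, then
    aggregate them with a SQL query that calls the Python function."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (country TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        # Register the Python function so SQL can call it by name.
        conn.create_function("normalize_country", 1, normalize_country)
        cursor = conn.execute(
            "SELECT normalize_country(country) AS code, SUM(amount) "
            "FROM orders GROUP BY code ORDER BY code"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("USA", 10.0), ("united states", 5.0), ("Germany", 7.5)]
    print(orders_by_country(sample))
```

The point of the pattern is that the grouping and summing stay in declarative SQL, while the messy cleanup logic lives in an ordinary Python function that an analyst can write and test separately.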

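Another early goal for Apache Spark was to make it easy for users to build on existing big data computations by designing the programming interface to be modular and open to open source libraries, so that users could easily combine multiple libraries in an application. This led to hundreds of open source packages being built for Spark.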

What encouraged you and your team at U.C. Berkeley to turn your research project into a solution for enterprise data teams?

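Early in the research project, there were some technology companies that were interested in using Apache Spark, for example, Yahoo!, which employed one of the largest teams using Hadoop at the time, and several startups. So we were excited to see whether we could support their needs and, through that collaboration, come up with ideas for new research problems, since this was still a new field.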

Thus, we spent a lot of time early on making Apache Spark accessible for the enterprise. Then, in 2013, the core research team for this project was finishing up our PhDs and we wanted to continue working on this technology. We decided that the best way to get it to a lot of people in a sustainable and easy-to-use way was through the cloud, and Databricks was born.

Seven years later, your vision has come true. In 2020, more and more data teams are taking to the cloud to build their data platforms. What are some considerations enterprise data teams should keep in mind when designing their data stacks?

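First, it's important to consider who you want to have access to your data platform. Who is going to use what it produces, and what kind of governance tools do you need? If you don't have permission to actually use the data, or if you need another team to write a data engineering job just to run a simple SQL query, then you can't access existing data or share the results with other stakeholders at the company easily, and that becomes a problem.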

Another question is: what are your goals for data availability? Whether you're just building a simple report or a machine learning model or anything in between, you want them to keep updating over time. Ideally, you wouldn't be spending lots of time firefighting downtime in your applications.

So, assessing the features you need to meet the data availability requirements of data users at your company is super important when designing your platform. This is where data observability comes in.

In the past, you've said that "AI is only as good as the data you put into it." I couldn't agree more. Could you elaborate?

I would even go so far as to say AI or machine learning should really have been called something like "data extrapolation," because that's basically what machine learning algorithms do by definition: generalize from known data in some way, often using some kind of statistical model. So if you put it that way, then I think it becomes very clear that the data you put in is the most important element, right?

Now, more and more AI research is being published highlighting how little code you need to run this or that new model. If everything else you do to train the model is standard, then that means the data you put into the model is extremely important. To that end, there are a few important aspects of data accuracy and reliability to consider.

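For example, is the data you put in actually correct? It could be wrong because of the way you collected it, or because of a software bug. The problem is that when you're producing, you know, a terabyte of data or a petabyte of data to put into one of these training applications, you often need to check it piece by piece to see whether it's valid.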

Similarly, does the data you put in cover a diverse enough set of conditions, or are you missing critical real-world conditions where your model needs to do well?

To that end, what are some steps that data teams can take to achieve highly reliable data?

At a high level, I've seen a few different approaches, and they can all be combined. One of them is as simple as just having a schema and expectations about the type of data that will go into a table or into a report. For example, in the Databricks platform, the main storage format that we use is something called Delta Lake, which is basically a more feature-rich version of Apache Parquet with versioning and transactions on your tables. We can enforce what kind of data goes into a table.
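To make the notion of a schema with expectations concrete, here is a minimal sketch in plain Python. It does not use Delta Lake or any Databricks API; the Table class, SchemaMismatchError, and the column names are hypothetical and exist only to show the general idea of rejecting a bad row at write time rather than discovering it downstream.

```python
from dataclasses import dataclass, field


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the table's declared schema."""


@dataclass
class Table:
    """Hypothetical in-memory table that enforces its schema on write."""
    schema: dict                      # column name -> expected Python type
    rows: list = field(default_factory=list)

    def append(self, row):
        """Append a row only if its columns and value types match the schema."""
        if set(row) != set(self.schema):
            raise SchemaMismatchError(
                f"expected columns {sorted(self.schema)}, got {sorted(row)}"
            )
        for column, expected_type in self.schema.items():
            value = row[column]
            if not isinstance(value, expected_type):
                raise SchemaMismatchError(
                    f"column '{column}' expects {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
        # Store a copy so later changes to the caller's dict cannot
        # alter data that has already been validated.
        self.rows.append(dict(row))


if __name__ == "__main__":
    users = Table(schema={"user_id": int, "email": str})
    users.append({"user_id": 1, "email": "a@example.com"})
    try:
        users.append({"user_id": "2", "email": "b@example.com"})
    except SchemaMismatchError as err:
        print("rejected:", err)
```

Failing loudly on write keeps the table itself trustworthy, so every reader can rely on the declared types without repeating the same checks.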

Another data quality approach I've seen is running jobs that inspect the data once it's produced and raise alerts. You ideally want an interface where it's very easy to generate custom checks and where you can centrally see what's happening, like a single pane of glass for the health of your data.
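A job that inspects data after it is produced and raises alerts might look roughly like the sketch below. It is an illustration only, written in plain Python rather than against any particular data quality tool; the check helpers, the check names, and the sample rows are all hypothetical.

```python
def check_not_null(column):
    """Build a check that reports rows where `column` is missing or None."""
    def check(rows):
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing:
            return f"{missing} row(s) have no value for '{column}'"
        return None
    return check


def check_min_rows(minimum):
    """Build a check that reports when fewer than `minimum` rows arrived."""
    def check(rows):
        if len(rows) < minimum:
            return f"expected at least {minimum} rows, found {len(rows)}"
        return None
    return check


def run_checks(rows, checks):
    """Run every named check and collect alert messages in one place."""
    alerts = []
    for name, check in checks.items():
        message = check(rows)
        if message is not None:
            alerts.append(f"[{name}] {message}")
    return alerts


if __name__ == "__main__":
    produced = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
    ]
    checks = {
        "email_present": check_not_null("email"),
        "volume": check_min_rows(1),
    }
    for alert in run_checks(produced, checks):
        print(alert)
```

Because each check is just a function that returns a message or None, adding a custom check is a one-line change to the dictionary, and all alerts surface through a single list.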

The last thing I'd note has to do with how you design your data pipelines. Basically, the fewer data copies, ETL, and transport steps you have, the more likely your system is to be reliable because there are just fewer things that can go wrong. For example, you can take a Delta Lake table and treat it as a stream and have jobs that listen to the changes so you don't need to replicate the changes into a message bus. You can also query these tables directly with a business intelligence tool like Tableau so you don't have to copy the data into some other system for visualization and reporting.

Data reliability is a fast-evolving space, and I know Monte Carlo is doing a lot of interesting things here.

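When you co-founded Databricks, the cloud was still in its early days. Now, many of the best data companies are building for the cloud. What led to the rise of the cloud-based modern data stack?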

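I think it's a matter of timing and ease of adoption. For most enterprises, the cloud makes it much easier to adopt technology at scale. With the cloud, you can buy, install, and run a highly reliable data stack yourself without expensive setup and management costs.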

Management is one of the hardest parts of maintaining a data pipeline, but with cloud data warehouses and data lakes, it's built in. So with a cloud service, you're buying more value than just a bunch of bits on a CD-ROM that you have to install on your servers. The quality of management you have directly impacts how likely you are to build critical applications on it. If you've got something that's going down for maintenance every weekend and it has to be down for a week to upgrade, you're less likely to want to use it.

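On the other hand, if you have something that the cloud vendor is managing, something with super high availability, then you can actually build these more critical applications. Finally, on the cloud, it is also much faster for vendors to release updates to customers and get immediate feedback on them.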

It means the cloud vendor has to be very good at updating something live without breaking workloads, but for the user, it basically means you get better software sooner: imagine that you could access software today that you would only have gotten one or two years into the future from an on-premise vendor.


Learn more about Matei's research and Apache Spark.

Interested in learning more about data reliability? Reach out to Barr Moses and the Monte Carlo team.