A Quick Guide to Building Your Data Platform

One of the most frequent questions we get from customers is "How do I build my data platform?"

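For most organizations, building a data platform is no longer a nice-to-have but a necessity, and many companies stand out from the competition because of their ability to glean actionable insights from their data.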

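Still, justifying the budget, resources, and time required to build a data platform from scratch is easier said than done. Every company is at a different stage in its data journey, which makes it harder to decide which parts of the platform to invest in first. Like any new solution, you need to 1) set expectations around what the product can and can't deliver and 2) plan for both long-term and short-term ROI.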

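To make things a little easier, we've outlined the six must-have layers you need to include in your data platform, along with the order in which many of the best teams choose to implement them.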

Introducing: the 6-layer data platform

6 Layers of Data Platform
The modern data platform is composed of six foundational layers: data ingestion, data storage & processing, data transformation & modeling, business intelligence & analytics, data observability, and data discovery. Image courtesy of Monte Carlo.

Second to "How do I build my data platform?", the most frequent question I get is "Where do I start?"

It goes without saying that building a data platform isn't a one-size-fits-all experience, and the layers (and tools) we discuss only scratch the surface of what's available on today's market. The "right" data stack will look vastly different for a 5,000-person e-commerce company than it will for a 200-person startup in the FinTech space, but there are a few core layers that all data platforms must have in one shape or another.

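Keep in mind: just as you can't build a house without a foundation, frame, and roof, at the end of the day you can't build a true data platform without these six layers. How you build the platform, however, is entirely up to you.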

Below, we share what a "basic" data platform looks like and list some popular tools in each space (you're likely using several of them already):

Data Ingestion

The first layer? Data ingestion.

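Data can't be processed, stored, transformed, or applied unless it's been ingested first. In almost any modern data platform, data needs to be ingested from one system into another. As data infrastructure becomes increasingly complex, data teams face the challenging task of ingesting structured and unstructured data from a wide variety of sources. This is often referred to as the extraction and loading stage of Extract Transform Load (ETL) and Extract Load Transform (ELT).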

Data ingestion
Data ingestion tools, like Fivetran, make it easy for data engineering teams to port data to their warehouse or lake. Image courtesy of Fivetran.

Below, we outline some popular tools in the space:

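  • Fivetran – A leading enterprise ETL solution that manages data delivery from the data source to the destination.
  • Singer – An open-source tool for moving data from any source to any destination.
  • – A cloud-based, open-source platform that allows you to rapidly move data from any source to any destination.
  • Airbyte – An open-source platform that easily allows you to sync data from applications.
  • Apache Kafka – An open-source event streaming platform to handle streaming analytics and data ingestion.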

Even with the wide range of ingestion tools available on today's market, some data teams choose to write custom code to ingest data from internal and external sources, and many organizations even build their own custom frameworks to handle this task.
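As a rough illustration of what such custom ingestion code involves, here is a minimal sketch of the extract and load steps using only the Python standard library. The `RAW_CSV` export, the `raw_orders` table, and the in-memory SQLite database (standing in for a warehouse) are all assumptions made for the example, not part of any tool named above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (stands in for an API or file drop).
RAW_CSV = """order_id,customer,amount
1,acme,19.99
2,globex,5.00
3,acme,42.50
"""


def extract(raw_text):
    """Extract: parse the source export into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def load(conn, rows):
    """Load: land the rows untransformed in a raw table (the E and L of ELT)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()
    # Return the row count so the caller can verify the load.
    return conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")  # in-memory SQLite stands in for a warehouse
loaded = load(conn, extract(RAW_CSV))
print(f"loaded {loaded} rows")
```

Even a sketch this small hints at what a team takes on when it builds in-house: schema handling, retries, and incremental loads all have to be written and maintained by hand.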

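Orchestration and workflow automation, featuring tools like Apache Airflow, Prefect, and Dagster, often folds into the ingestion layer, too. Orchestration takes ingestion a step further by taking siloed data, combining it with other sources, and making it available for analysis.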

I would argue, though, that orchestration can be (and should be) woven into the platform after you handle the storage, processing, and business intelligence layers. After all, you can't orchestrate without an orchestra of functioning data!
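To make the idea of orchestration concrete, the sketch below runs a small pipeline in dependency order using Python's standard-library `graphlib`. It is not the API of Airflow, Prefect, or Dagster; the task names and the `run_pipeline` helper are hypothetical and only illustrate the core idea that a task runs after everything upstream of it has finished.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "refresh_dashboard": {"transform_sales"},
}


def run_pipeline(dag, tasks):
    """Run every task only after all of its upstream dependencies have run."""
    completed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        completed.append(name)
    return completed


run_log = []
# Each placeholder task just records its own name when called.
TASKS = {name: (lambda n=name: run_log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, TASKS)
print(order)
```

Real orchestrators add scheduling, retries, and alerting on top of this ordering logic, which is part of why they are usually adopted rather than rebuilt.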

Data Storage and Processing

After you build your ingestion layer, you need a place to store and process your data. With companies moving their data landscapes to the cloud, cloud-native data warehouses, data lakes, and even data lakehouses have taken over the market, offering more accessible and affordable options for storing data relative to many on-prem solutions.

Whether you choose to go with a data warehouse, a data lake, or some combination of both is entirely up to the needs of your business. Recently, there has been a lot of discussion about whether to use open source or closed source solutions (the Snowflake and Databricks marketing teams really bring this to light) when it comes to building your data stack.

Regardless of what side you take, you simply cannot build a modern data platform without investing in cloud storage and compute.

Data cloud
Snowflake, a cloud data warehouse, is a popular choice among data teams when it comes to quickly scaling up a data platform. Image courtesy of Snowflake.

Below, we highlight some leading options in today's cloud warehouse, lake, or [insert your own variation here] landscape:

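  • Snowflake – The original cloud data warehouse, Snowflake provides a flexible payment structure for data teams, as users pay separate fees for computing and storing data.
  • Google BigQuery – Google's cloud warehouse, BigQuery, provides a serverless architecture that allows for quick querying due to parallel processing, as well as separate storage and compute for scalable processing and memory.
  • Amazon Redshift – Amazon Redshift, one of the most widely used options, sits on top of Amazon Web Services (AWS) and easily integrates with other data tools in the space.
  • Firebolt – A SQL-based cloud data warehouse that claims performance 182 times faster than other options, as the warehouse handles data in a lighter way thanks to new techniques for compression and data parsing.
  • Microsoft Azure – Microsoft's cloud computing entrant in this list is common among teams that heavily leverage Windows integrations.
  • Amazon S3 – An object storage service for structured and unstructured data, S3 gives you the resources to build a data lake from scratch.
  • Databricks – Databricks, the Apache Spark-as-a-service platform, has pioneered the data lakehouse, giving users the option to leverage both structured and unstructured data while offering the low-cost storage features of a data lake.
  • Dremio – Dremio's data lake engine provides analysts, data scientists, and data engineers with an integrated, self-service interface for data lakes.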

Data Transformation and Modeling

Data transformation and modeling are often used interchangeably, but they are two very different processes. When you transform your data, you take raw data and clean it up with business logic to get it ready for analysis and reporting. When you model data, you are creating a visual representation of data for storage in a data warehouse.
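As a small illustration of the transformation half of that distinction, the sketch below cleans a few raw rows and applies one business rule before aggregating. The `RAW_ORDERS` rows and the rule that only completed orders count toward revenue are assumptions invented for the example.

```python
from decimal import Decimal

# Hypothetical raw rows as they might land in a warehouse: strings, messy values.
RAW_ORDERS = [
    {"order_id": "1", "customer": " Acme ", "amount": "19.99", "status": "complete"},
    {"order_id": "2", "customer": "globex", "amount": "5.00", "status": "cancelled"},
    {"order_id": "3", "customer": "ACME", "amount": "42.50", "status": "complete"},
]


def transform(rows):
    """Clean raw rows and apply an assumed business rule: only completed orders count."""
    cleaned = []
    for row in rows:
        if row["status"] != "complete":
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),  # normalize customer names
            "amount": Decimal(row["amount"]),  # exact decimal arithmetic for money
        })
    return cleaned


def revenue_by_customer(rows):
    """Aggregate cleaned rows into an analysis-ready total per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], Decimal("0")) + row["amount"]
    return totals


revenue = revenue_by_customer(transform(RAW_ORDERS))
print(revenue)
```

In practice this kind of logic is usually expressed in SQL inside the warehouse, but the shape is the same: raw input, explicit business rules, analysis-ready output.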

Data transformation
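dbt, which sports a vibrant open source community, gives data analysts fluent in SQL the ability to easily transform and model data for use by the business intelligence layer of the platform. Image courtesy of Monte Carlo.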

Below, we share a list of common tools that allow data engineers to transform and model their data:

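  • dbt – Short for data build tool, dbt is the open source leader for transforming data once it's loaded into your warehouse.
  • Dataform – Now part of Google Cloud, Dataform allows you to transform raw data from your warehouse into something usable by BI and analytics tools.
  • SQL Server Integration Services (SSIS) – Offered by Microsoft, SSIS allows your business to extract data from a wide variety of sources and transform it, so that you can later load it into your destination of choice.
  • Custom Python code and Apache Airflow – Before the rise of tools like dbt and Dataform, data engineers commonly wrote their transformations in pure Python. While it might be tempting to continue using custom code to transform your data, it does increase the chances of errors, as the code is not easily replicable and must be rewritten every time a process takes place.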

The data transformation and modeling layer turns data into something a little more useful, readying it for the next stage in its journey: analytics.

Business Intelligence (BI) and Analytics

The data you have collected, transformed, and stored does your business no good if your employees can't use it.

If the data platform were a book, the BI and analytics layer would be the cover, replete with an engaging title, visuals, and a summary of what the data is actually trying to tell you. In fact, this layer is often what end-users think of when they picture a data platform, and for good reason: it makes data actionable and intelligent, and without it, your data lacks meaning.

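BI & Analytics

Tableau is a leading business intelligence tool that gives data analysts and scientists the ability to build dashboards and other visualizations that power decision making.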

Below, we outline some popular BI solutions among top data teams:

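  • Looker – A BI platform optimized for big data that allows members of your team to easily collaborate on building reports and dashboards.
  • – Often referred to as a leader in the BI industry, it has an easy-to-use interface.
  • Mode – A collaborative data science platform that incorporates SQL, R, Python, and visual analytics in one single UI.
  • Power BI – A Microsoft-based tool that easily integrates with Excel and provides self-service analytics for everyone on your team.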

This list is by no means exhaustive, but it will get you started on your search for the right BI layer for your stack.

Data Observability

Data Observability
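Data observability gives teams a holistic view of how trustworthy their data is across the five key pillars of observability, including freshness, schema, and lineage (pictured above). Image courtesy of Monte Carlo.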

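With data pipelines becoming increasingly complex and organizations relying on data to drive decision-making, the need for the data being ingested, stored, processed, analyzed, and transformed to be trustworthy and reliable has never been greater. Simply put, organizations can no longer afford data downtime, i.e., periods when data is partial, inaccurate, missing, or erroneous.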

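By applying the same principles of application observability and infrastructure design to our data platforms, data teams can ensure data is usable and actionable. In our opinion, it's often worse to make decisions based on bad data than to have no data at all.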

Your data observability layer must be able to monitor and alert on the following pillars of observability:

  • Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
  • Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
  • Volume: has all the data arrived?
  • Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
  • Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?
Detecting Anomalies
Data observability will alert data engineering teams to anomalies that affect critical data sets, reducing white noise and mapping incidents against historical data. Image courtesy of Monte Carlo.
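Three of the pillars listed above (freshness, volume, and schema) lend themselves to simple rule-based checks, sketched below. This is an illustration under stated assumptions, not how any particular observability product works: the function names, the 24-hour freshness window, and the 50 percent volume tolerance are all hypothetical, and production tools typically learn thresholds from historical data instead of hard-coding them.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, now, max_age):
    """Freshness: True if the table was updated within the allowed window."""
    return now - last_updated <= max_age


def check_volume(row_count, history, tolerance=0.5):
    """Volume: True if the row count is within +/- tolerance of the historical mean."""
    expected = sum(history) / len(history)
    return abs(row_count - expected) <= tolerance * expected


def check_schema(expected, actual):
    """Schema: report columns that were added, removed, or changed type."""
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(
            c for c in expected if c in actual and expected[c] != actual[c]
        ),
    }


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)  # 30 hours earlier

alerts = []
if not check_freshness(last_load, now, timedelta(hours=24)):
    alerts.append("freshness")
if not check_volume(row_count=120, history=[1000, 1100, 900]):
    alerts.append("volume")
drift = check_schema(
    {"id": "int", "amount": "float"},
    {"id": "int", "amount": "text", "note": "text"},
)
if any(drift.values()):
    alerts.append("schema")
print(alerts, drift)
```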

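An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data at rest, without requiring you to extract data from your data store. This approach ensures you meet the highest levels of security and compliance requirements and can scale to the most demanding data volumes.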
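Tracking downstream dependencies comes down to walking a lineage graph. The sketch below assumes a hypothetical `LINEAGE` mapping from each asset to the assets that read from it, and uses a breadth-first traversal to list everything affected when one upstream table breaks. Real tools derive this graph automatically from query logs and metadata; the hand-written dictionary here is only for illustration.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["fct_sales"],
    "stg_customers": ["fct_sales"],
    "fct_sales": ["revenue_dashboard", "churn_model"],
}


def downstream(graph, asset):
    """Breadth-first walk returning every asset affected by `asset`."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream(LINEAGE, "raw_orders")
print(impacted)
```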

Data Discovery

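Data discovery
The best data discovery solutions will provide an automated, dynamic overview of table and asset owners, connections, query logs, and other metadata that lends rich understanding and context to your data. Image courtesy of Monte Carlo.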

When building a data platform, most leaders are tasked with choosing (or building) a data catalog, but in our opinion, this approach is no longer sufficient.

Don't get me wrong: data catalogs are important, and modern data teams need a reliable, scalable way to document and understand critical data assets. But as data becomes more complex and real-time, the processes and technologies underlying this layer of the platform need to evolve, too.

Where many traditional data catalogs fall short (e.g., often manual, poor scalability, lack of support for unstructured data, etc.), data discovery picks up the slack. If data catalogs are a map, data discovery is your smartphone's navigation system, constantly being updated and refined with the latest insights and information.

At a minimum, data discovery should address the following needs:

  • Self-service discovery and automation: Data teams should be able to easily leverage their data catalog without a dedicated support team. Self-service, automation, and workflow orchestration for your data tooling remove silos between stages of the data pipeline and, in the process, make it easier to understand and access data. Greater accessibility naturally leads to increased data adoption, reducing the load for your data engineering team.
  • Scalability as data evolves: As companies ingest more and more data and unstructured data becomes the norm, the ability to scale to meet these demands will be critical for the success of your data initiatives. Data discovery leverages machine learning to gain a bird's eye view of your data assets as they scale, ensuring that your understanding adapts as your data evolves. This way, data consumers can make more intelligent and informed decisions instead of relying on outdated documentation or, worse, gut-based decision making.
  • Real-time visibility into data health: Unlike a traditional data catalog, data discovery provides real-time visibility into the data's current state, as opposed to its "cataloged" or ideal state. Since discovery encompasses how your data is being ingested, stored, aggregated, and used by consumers, you can glean insights such as which data sets are outdated and can be deprecated, whether a given data set is production-quality, or when a given table was last updated.
  • Support for governance and warehouse/lake optimization: From a governance perspective, querying and processing data in the lake often involves a variety of tools and technologies (Spark on Databricks for this, Presto on EMR for that, etc.), and as a result, there often isn't a single, reliable source of truth for reads and writes (like a warehouse provides). A proper data discovery tool can serve as that central source of truth.
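As a sketch of the "real-time visibility" point above, the example below flags deprecation candidates from table metadata. The `CATALOG` entries, the field names, and the 90-day staleness threshold are all assumptions made for illustration; a real discovery layer would collect this metadata automatically from the warehouse and its query logs.

```python
from datetime import date

# Hypothetical table metadata a discovery layer might collect automatically.
CATALOG = [
    {"table": "fct_sales", "last_updated": date(2024, 1, 1), "queries_30d": 412},
    {"table": "tmp_export_2019", "last_updated": date(2019, 6, 3), "queries_30d": 0},
    {"table": "dim_customers", "last_updated": date(2023, 12, 30), "queries_30d": 87},
]


def deprecation_candidates(catalog, today, stale_after_days=90):
    """Flag tables that are both stale and unused by consumers."""
    return [
        entry["table"]
        for entry in catalog
        if (today - entry["last_updated"]).days > stale_after_days
        and entry["queries_30d"] == 0
    ]


candidates = deprecation_candidates(CATALOG, today=date(2024, 1, 2))
print(candidates)
```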

Data discovery empowers data teams to trust that their assumptions about data match reality, enabling dynamic discovery and a high degree of reliability across your data infrastructure, regardless of domain.

Build or buy your 6-layer data platform? It depends.

Building a data platform is not an easy task, and there is a lot to take into consideration along the way. One of the biggest challenges our customers face is deciding whether they should build certain layers in-house, invest in SaaS solutions, or explore the wide world of open source.

Our answer? Unless you're Airbnb, Netflix, or Uber, you generally need to include all three.

There are pros and cons to each of these solutions, but your decision will depend on many factors, including but not limited to:

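  • The size of your data team. Data engineers and analysts already have enough on their plates, and requiring them to build an in-house tool might cost more time and money than you think. Simply put, lean data teams do not have the time to get new team members up to speed with in-house tools, let alone build them. Investing in easily configurable, automated, or popular solutions (i.e., open source or low-code/no-code SaaS) is becoming increasingly common among non-Uber/Airbnb/Netflix data teams. Diogo Ribeiro, Vice President of Analytics at ThousandEyes, a leading cloud intelligence platform, summarized this trade-off well: in his view, in-house tooling is worthwhile when your data engineering team has enough bandwidth to build applications on top of your data. However, if data engineers are spending most of their time building and maintaining data pipelines, it may make more sense to buy a solution that reduces their load and frees them up for more interesting work.
  • The amount of data your organization stores and processes. When choosing a solution, it is important to select one that will scale with your business. Chances are, it doesn't make sense for a lone data analyst at a 20-person company to adopt a $10K-per-year transformation solution if a few lines of code would do the job.
  • Your data team's budget. If your team has a limited budget but many hands, then open source options might be a good fit for you. However, keep in mind that you are typically on your own when it comes to setting up and implementing open source tools across your data stack, frequently relying on other members of the community or the project creators themselves to build out and maintain features. When you take into account that only about 2 percent of projects see growth after their first few years, you have to be careful with what you fork.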

Regardless of which path you choose, building out these layers in the right order will give you the foundation to grow and scale and, most importantly, deliver insights and products your company can trust.

After all, sometimes the simplest way is the best way.

Did we miss anything? Reach out to Barr Moses or Lior Gavish with any comments or suggestions.

And if you're interested in learning more about Data Observability, reach out to the rest of the Monte Carlo team.