数据目录到底怎么了?

好像每次我刷新推特的时候, a new startup launches “the world’s greatest data catalog ever.这是令人兴奋的! 

If a company is able to build the next best catalog since sliced bread, the data world will surely breathe a collective sigh of relief. And don’t get me wrong: lots of innovation is happening here and clear advancements are being made. Integrations to support data engineers and software developers working directly in data governance reports and dashboards – check. Data science workbooks to foster greater collaboration – check. ML支持自动数据分析检查.

But the reason why data catalogs are so top of mind right now isn’t because we’re happy with them. 因为他们有身份危机. 

一个数据工程师或分析师走进一家酒吧……

想象你走进你最喜欢的酒吧. 推荐一个正规滚球网站称之为数据挖掘. 这是当地运动队的海报, 提基火把(用电火点燃的), 当然), 还有一个宽敞的舞池. 

你走到酒保面前.

“这将是什么?”她问. 

你说:“请给我喷点Aperol喷雾剂。. It’s been awhile since you’ve had one (pre-pandemic, maybe?), but you remember it tasting great, particularly on hot days like today. 

酒保拿起一个杯子放在你面前. 

“配料在吧台后面. 在它.” 

听起来很熟悉? Probably not, but in the context of data, maybe this approach to “self-service” will ring a bell. 

Benn Stancil, co-founder and Chief Analytics Officer of Mode, 写了一篇文章 recently that waxed poetic about the challenges with self-service data tools. 

根据Stancil, “​​The more questions people can theoretically self-serve, 他们就越少能自食其力. 随着您添加更多的选项, 自用工具不再像疯狂的Libs, and start looking like a blank document that requires people to write their own stories in their entirety. While that’s what analysts want, it’s not what everyone wants.”

While Stancil was talking about the “的意见ated simplicity” of metric extraction vs. a one-size-fits-all approach to measuring data, we can apply this same lens to data catalogs. Too many 选项, too few 的意见s on what is actually required to make them successful. 

作为一个例子, 他指出,英语教学提供商提供的课程有限, clear definition of what they can (and can’t) offer to data engineers: easy, 快速数据摄入.

现在, in 2021, data catalogs are at a similar crossroads: try to be everything for everyone or do one to two things really, 很好. 

选择你自己的冒险:数据目录版

简·奥斯丁的话, “it is a truth universally acknowledged that a data engineer in good fortune, 必须需要一个数据目录.” 

在过去,我写过 数据目录是如何失败的 有三个关键原因: 

  • 自动化需求增加: Traditional data catalogs and governance methodologies typically rely on data teams to do the heavy lifting of manual data entry, holding them responsible for updating the catalog as data assets evolve. 这种方法不仅耗费时间, but requires significant manual toil that could otherwise be automated, freeing time up for data engineers and analysts to focus on projects that actually move the needle.
  • 随数据变化而伸缩的能力: Data catalogs work well when data is structured, but in 2021, that’s not always the case. As machine-generated data increases and companies invest in ML initiatives, 非结构化数据变得越来越普遍, 占所有新产生数据的90%以上. 
  • 缺乏分布式架构: Despite the distribution of the modern data architecture (see: 数据网格) and the move towards embracing semi-structured and unstructured data as the norm, most data catalogs still treat data like a one-dimensional entity. 当数据被聚合和转换时, 它流经数据堆栈的不同元素, 几乎不可能记录下来.

And I shared why teams need to think more creatively about data catalogs by applying principles of data discovery. 简而言之, 数据发现指的是拥有特定于领域的数据, dynamic understanding of your data based on how it’s being ingested, 存储, 聚合, 由一组特定的消费者使用. 数据发现是推荐一个正规滚球网站能力的核心, 作为数据从业者, to make sense of what we’re working with and communicate this “sense” to our stakeholders. 

那么,好的数据发现的结果是什么呢? 这取决于你问谁. 

我建议你检查所有适用的选项: 

  • 数据质量
  • 数据治理 & 合规
  • 协作
  • 理解
  • 讨论
  • 可视化
  • 安全
  • 可靠性
  • 报道
  • 可用性
  • 世界和平

这个列表让人不知所措. This isn’t to say that a great data catalog can’t check multiple boxes. 他们可以——而且确实这样做了. 但如果推荐一个正规滚球网站没有明确的目标, how can we possibly track how we’re measuring up against them? 

Here are some measurements we’ve seen used to track data catalog performance.  同样,检查所有适用的选项:

  • 数据的准确性
  • 数据新鲜度
  • 使用量度
  • 访问数据的速度
  • 编目数据的数量

But there’s something missing here: these metrics track “solution-based” outcomes, but will these actually tell you whether the data is 有用的? 什么是可靠的? 还是值得信赖的? 这正是数据目录经常丢失的地方. 

Modern data catalogs are all-too-frequently without a clear identity: in other words, a user story.

数据目录能找到自己的方式吗? 

In a past life, one of our former colleagues spent two years building a data dictionary no one used. 为什么? When his team was done, the requirements were stale and the solution was no longer relevant. 

Unfortunately, his experience is often the norm, not the exception. 而产品愿景为任何好的解决方案铺平了道路, more powerful technologies developed and outcomes are had when we build to solve actual customer problems. 现在, with data needs moving at the speed of light no matter where you look, 这种以客户为中心的方法比以往任何时候都更加重要.

Data catalogs are incredibly important as they are a literal index of how we measure the world. 但推荐一个正规滚球网站不认为他们真的会 有用的 直到它们被设计成有目的性的. 

但也许这只是推荐一个正规滚球网站的问题……无论如何, we’re eager to see how the great data catalog identity crisis pans out.

你是? 

Do you know what in the world is going on with data catalogs? 推荐一个正规滚球网站都是耳朵. 接触 巴尔摩西 or 戈登黄.