数据治理是最重要的 对我的许多客户来说, 特别是根据GDPR, CCPA, 新型冠状病毒肺炎, any number of other acronyms that speak to the increasing importance of data management when it comes to protecting user data.

在过去的几年里, 数据目录 have emerged as a powerful tool for data governance,我非常高兴. 随着公司数字化和数据操作民主化, 这对数据栈的所有元素都很重要, 从仓库到商业智能平台, 现在, 目录, 参与合规最佳实践.

But are 数据目录 all we need to build a robust data governance program?


类似于图书馆的实体目录, 数据目录 serve as an inventory of metadata 和 give investors the information necessary to evaluate data accessibility, 健康, 和位置. 这样的公司Alation, Collibra, Informatica tout solutions that not only keep tabs on your data, but also integrate with machine learning 和 automation to make data more discoverable, 协作, 现在, 符合组织的, 全行业的, 甚至是政府法规.

Since 数据目录 provide a single source of truth about a company’s data sources, it’s very easy to leverage 数据目录 to manage the data in your pipelines. Data 目录 can be used to store metadata that gives stakeholders a better underst和ing of a specific source’s lineage, 从而灌输对数据本身更大的信任. 另外, 数据目录 make it easy to keep track of where personally identifiable information (PII) can both be housed 和 sprawl downstream, as well as who in the organization has the permission to access it across the pipeline.


So, what type of data catalog makes the most sense for your organization? 让你的生活轻松点, I spoke with data teams in the field to learn about their data catalog solutions, 把它们分成三个不同的类别:内部, 第三方, 和开源.


一些B2C公司,我说的是 airbnb, 网飞公司, 超级 of the world — build their own 数据目录 to ensure data compliance with state, 国家, even economic union (I’m looking at you GDPR) level regulations. The biggest perk of in-house solutions is the ability to quickly spin up customizable dashboards, 去你的团队最需要的地方.

Uber的Databook可以让数据科学家轻松搜索表格. 图片由 超级工程.

而内部工具可以快速定制, 随着时间的推移, 这样的黑客行为会导致缺乏可见性和协作性, 特别是在理解数据沿袭方面. 事实上, one data leader I spoke with at a food delivery startup noted that what was clearly 失踪 from her in-house data catalog was a “single pane of glass.” If she had a single source of truth that could provide insight into how her team’s tables were being leveraged by other parts of the business, 确保合规很容易.

在这些战术考虑之上, spending engineering time 和 resources building a multi-million dollar data catalog just doesn’t make sense for the vast majority of companies.


自2012年成立以来, Alation has largely paved the way for the rise of the automated data catalog. Now, there are a whole host of ML-powered 数据目录 on the market, including Collibra, Informatica, 和其他人, many with pay-for-play workflow 和 repository-oriented compliance management integrations. 一些云提供商, 像谷歌, AWS, 和Azure, also offer data governance tooling integration at an additional cost.

在我和数据领导者的谈话中, one downside of these solutions came up time 和 again: usability. While nearly all of these tools have strong collaboration features, one Data Engineering VP I spoke with specifically called out his 第三方 catalog’s unintuitive UI.

如果数据工具不容易使用, how can we expect users to underst和 or even care whether they’re compliant?


In 2017, Lyft became an industry leader by open sourcing their data discovery 和 metadata engine, 阿蒙森以著名的南极探险家命名. 其他开源工具,如 Apache阿特拉斯, 玛格达CKAN, 提供类似的功能, all three make it easy for development-savvy teams to fork an instance of the software 和 get started.

阿蒙森, an open source data catalog, gives users insight into schema usage. 图片由 米哈伊尔·伊万诺夫.

而这些工具中的一些 允许团队标记元数据 用于控制用户访问, this is an intensive 和 often manual process that most teams just don’t have the time to tackle. 事实上, a product manager at a leading transportation company shared that his team specifically chose not to use an open source data catalog because they didn’t have off-the-shelf support for all the data sources 和 data management tooling in their stack, 使数据治理更具挑战性. In short, open source solutions just weren’t comprehensive enough.

仍然, there’s something critical to compliance that even the most advanced catalog can’t account for: 数据停机时间.


最近,我发达 一个简单的指标 为帮助测量的客户 数据停机时间, 换句话说, 当你的数据是部分的时候, 错误的, 失踪, 或者不准确. 当应用于数据治理时, 数据停机时间 gives you a holistic picture of your organization’s data reliability. 没有数据可靠性来增强可发现性, it’s impossible to know whether or not your data is fully compliant 和 usable.

Data 目录 solve some, but not all, of your data governance problems. 开始, 减轻治理缺口是一项艰巨的任务, it’s impossible to prioritize these without a full underst和ing of which data assets are actually being accessed by your company. Data reliability fills this gap 和 allows you to unlock your data ecosystem’s full potential.

另外, 没有实时血统, it’s impossible to know how PII or other regulated data sprawls. Think about it for a second: even if you’re using the fanciest data catalog on the market, your governance is only as good as your knowledge about where that data goes. If your pipelines aren’t reliable, neither is your data catalog.

由于它们的特点互补, 数据目录数据可靠性解决方案 work h和-in-h和 to provide an engineering approach to data governance, 不管你需要用到什么缩写词.

Personally, I’m excited for what the next wave of 数据目录 have in store. 相信我:这不仅仅是数据.

如果你想了解更多,联系 巴尔摩西.