How to Solve the “You’re Using THAT Table?!” Problem

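As companies increasingly rely on data to power decision making and drive innovation, it’s important that this data is timely, accurate, and reliable. When you consider that only a small fraction of the over 7.5 septillion (7,700,000,000,000,000,000,000) GB of data generated worldwide every day is usable, keeping tabs on which data assets are important has only gotten harder. In this article, we’ll introduce “Key Assets,” a new approach taken by the best data teams to surface your most important data assets for quick and reliable insights.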

Have you come up with the weirdest ways to name “relevant” tables, like “IMPT” or “USE_THIS_V2”? Have you been three-quarters of the way through a data warehouse migration only to discover that you don’t know which data assets are right and which ones are wrong? Is your analytics team lost in a sea of spreadsheets, with no life vests in sight?

If you answered yes to any of these questions, you’re not alone. Over the past few years, I’ve spoken with hundreds of data teams both energized and overwhelmed by the potential of their company’s data assets, and charged with maintaining an ever-evolving ecosystem of them.

We call this the “You’re Using THAT Table?!” problem, and it’s more common than you think.

Here are three signs you might be experiencing it:

You’re migrating to a new data warehouse

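Like buffalo crossing the Serengeti, migrating to a new data warehouse can be a chaotic and tedious process, one that makes your team aware of data (or buffalo) that is no longer fit or usable. Image courtesy of Jane Rix on Shutterstock.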

As data teams increasingly make the shift from on-prem data warehouses to Snowflake, Redshift, and other cloud warehouses, or from one cloud warehouse to another, the ability to know which data is valuable and which data can go the way of the dodo becomes ever more important.

Unfortunately, data validation and cross-referencing are often handled manually, which is costly, time-consuming, and hard to scale. One customer, a data team leader at a global financial services company currently migrating to Snowflake, revealed that they “are manually mapping tables in Redshift to reports in Tableau so we know what to migrate over to Snowflake and Looker.”

Often, when data teams migrate from Redshift to Snowflake, they end up resorting to a manual gap analysis between copies of the same tables in both data warehouses because “knowing which reports are downstream from the table will help us identify and prioritize the migration, or deprecate a table that is no longer needed.”
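To make that gap analysis concrete, here is a minimal Python sketch of what an automated version might look like. Everything in it is an assumption for illustration: the `TableInfo` record, the sample inventories, and the rule that a table with no downstream reports is a deprecation candidate. In practice the inventories would come from each warehouse’s metadata and your BI tool’s lineage, not hard-coded lists.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Hypothetical inventory record for one table in a warehouse."""
    name: str
    row_count: int
    downstream_reports: tuple[str, ...] = ()


def gap_analysis(source: list[TableInfo], target: list[TableInfo]) -> dict[str, list[str]]:
    """Bucket every source table by what still has to happen to it."""
    target_by_name = {t.name: t for t in target}
    result: dict[str, list[str]] = {
        "missing_in_target": [],
        "row_count_mismatch": [],
        "deprecation_candidates": [],
    }
    for table in source:
        if not table.downstream_reports:
            # Assumption: nothing reads from it, so it may not need to move at all.
            result["deprecation_candidates"].append(table.name)
            continue
        copy = target_by_name.get(table.name)
        if copy is None:
            result["missing_in_target"].append(table.name)
        elif copy.row_count != table.row_count:
            result["row_count_mismatch"].append(table.name)
    return result


# Made-up inventories standing in for the old and new warehouses.
old_warehouse = [
    TableInfo("orders", 1000, ("Revenue dashboard",)),
    TableInfo("customers", 250, ("Churn report",)),
    TableInfo("payments", 900, ("Revenue dashboard",)),
    TableInfo("orders_v2_tmp", 40),
]
new_warehouse = [
    TableInfo("orders", 1000),
    TableInfo("customers", 240),
]

report = gap_analysis(old_warehouse, new_warehouse)
for bucket, names in report.items():
    print(bucket, names)
```

The point of the sketch is the shape of the output: a short list of tables to move, a short list to double-check, and a short list to retire, instead of a spreadsheet someone maintains by hand.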

Your company’s data analysts and data scientists don’t know what data to use

A second and just as common pain point for data teams is not knowing what data is most useful, let alone usable.

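If you and your team have asked any of the following questions, you may be in the middle of a mystery that even Scooby-Doo couldn’t solve. Here are some common data discovery questions. Maybe they’ll resonate: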

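  • What data should I use?
  • I can’t find the data I need. What should I do?
  • It’s hard to understand what our “important data” is… help?
  • Who is using this table? Does this data even matter?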

When people ask these questions often, your company clearly lacks data trust and data discovery, which takes a toll on its ability to leverage data as a competitive advantage.

You have a lot of “data debt”

Data debt is more than just costly; it erodes user trust and leads to poor decision making. Image courtesy of baranq.

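Like technical debt, data debt refers to data assets that are outdated, inaccurate, or otherwise taking up valuable storage space in your data warehouse. It’s an all-too-common occurrence and bogs down even the most advanced data teams, making it hard to surface relevant insights in a timely manner.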

What does data debt look like in practice? Here are three strong indicators:

  • You have outdated data lurking in your warehouse (including stale or inaccurate tables and legacy data types) that teams might use by mistake.
  • You get alerts about different jobs and system checks failing but they’re ignored because “it’s always been like that.” 
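  • You’ve updated your tech stack, migrated to Snowflake, and are using the latest tools, but are no longer using the same data formats, data tables, or even data sources.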

Introducing: Key Assets

Data teams are tasked with creating visibility into the business via data-driven insights, but when it comes to their own operations, they’re often flying blind. Instead, teams need a single view into the health of their data: Key Assets, which identify the most critical data tables and datasets in your data warehouse.

Fortunately, the best data reliability and discoverability (catalog) solutions are already incorporating Key Assets into their products. By leveraging machine learning, these solutions intelligently map your company’s data assets while at rest, without requiring the extraction of data from your data store, to generate a “Key Assets dashboard.”

Key assets might be:

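  • Tables that are frequently queried by many people
  • Datasets that are heavily used by ETL processes to derive other datasets
  • Tables that feed many or frequently used dashboards
  • External sources that have important downstream dependencies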

But how can teams identify their key assets? Among other variables, I suggest that teams look for tables and datasets that are:

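  • Frequently accessed (i.e., AVG_READS_PER_DAY)
  • Frequently updated (i.e., AVG_WRITES_PER_DAY)
  • Used by a high number of users
  • Updated/used regularly (i.e., < 1-5 days since latest update)
  • Leveraged by a high number of ETL processes
  • Highly connected, in other words read from or written to by many other data assets
  • Experiencing a high data incident rate (over days/weeks/months)
  • Queried recently/frequently by BI tools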
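Several of the signals above can be derived from nothing more than a query log. The Python sketch below shows one way that might look. The `QueryEvent` record, the sample events, and the `TOTAL_USERS` and `DAYS_SINCE_LAST_UPDATE` names are assumptions made for illustration (only `AVG_READS_PER_DAY` and `AVG_WRITES_PER_DAY` come from the list); a real log would be pulled from your warehouse’s query history.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QueryEvent:
    """Hypothetical row from a warehouse query log."""
    table: str
    user: str
    day: date
    is_write: bool


def usage_metrics(events: list[QueryEvent], window_days: int, today: date) -> dict[str, dict]:
    """Aggregate per-table usage metrics over a log covering `window_days` days."""
    stats = defaultdict(lambda: {"reads": 0, "writes": 0, "users": set(), "last_write": None})
    for event in events:
        s = stats[event.table]
        s["users"].add(event.user)  # counts every distinct user, readers and writers alike
        if event.is_write:
            s["writes"] += 1
            if s["last_write"] is None or event.day > s["last_write"]:
                s["last_write"] = event.day
        else:
            s["reads"] += 1
    return {
        table: {
            "AVG_READS_PER_DAY": s["reads"] / window_days,
            "AVG_WRITES_PER_DAY": s["writes"] / window_days,
            "TOTAL_USERS": len(s["users"]),
            # None means no write was seen inside the window.
            "DAYS_SINCE_LAST_UPDATE": (
                None if s["last_write"] is None else (today - s["last_write"]).days
            ),
        }
        for table, s in stats.items()
    }


# Made-up log covering one week.
today = date(2024, 1, 8)
events = [
    QueryEvent("orders", "ana", date(2024, 1, 6), False),
    QueryEvent("orders", "ben", date(2024, 1, 7), False),
    QueryEvent("orders", "etl", date(2024, 1, 7), True),
    QueryEvent("orders_old", "ana", date(2024, 1, 2), False),
]
metrics = usage_metrics(events, window_days=7, today=today)
for table, values in metrics.items():
    print(table, values)
```

Note that the sketch only reads metadata about queries, never the rows in the tables themselves.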

Additionally, Key Assets should include an “Importance Score” for each individual data asset. This score is a composite of key metrics about data usage that indicate which assets matter most to your organization. The higher the score, the more likely the asset is a significant resource for your team.
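As a rough illustration of a composite score, here is a minimal Python sketch. It is not how any particular product computes its score: the choice of min-max normalization, the weights, the metric names, and the sample numbers are all assumptions picked to keep the example small.

```python
from __future__ import annotations


def importance_scores(
    metrics: dict[str, dict[str, float]], weights: dict[str, float]
) -> dict[str, float]:
    """Weighted sum of min-max normalized metrics, scaled to 0-100."""
    scores = {table: 0.0 for table in metrics}
    total_weight = sum(weights.values())
    for name, weight in weights.items():
        values = [m[name] for m in metrics.values()]
        low, high = min(values), max(values)
        spread = high - low
        for table, m in metrics.items():
            # If every table has the same value, the metric adds nothing.
            normalized = (m[name] - low) / spread if spread else 0.0
            scores[table] += 100 * weight * normalized / total_weight
    return {table: round(score, 1) for table, score in scores.items()}


# Made-up usage metrics for three tables.
table_metrics = {
    "orders": {"avg_reads_per_day": 120, "total_users": 30, "downstream_dashboards": 8},
    "customers": {"avg_reads_per_day": 60, "total_users": 15, "downstream_dashboards": 4},
    "orders_v2_tmp": {"avg_reads_per_day": 0, "total_users": 0, "downstream_dashboards": 0},
}
# Hypothetical weights; tune them to what matters in your organization.
metric_weights = {"avg_reads_per_day": 0.5, "total_users": 0.3, "downstream_dashboards": 0.2}

scores = importance_scores(table_metrics, metric_weights)
for table in sorted(scores, key=scores.get, reverse=True):
    print(f"{table}: {scores[table]}")
```

Because the score is relative to the other tables in the warehouse, it is best read as a ranking rather than an absolute measure.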

While simple, this rendering of a Key Assets dashboard features search functionality that lets users look for particular assets, while also making clear which assets are significant and which ones can be deprecated or cleaned up based on statistics such as the average number of table reads per day and the total number of table users.

Unlocking data trust and discovery with Key Assets

As data architectures become increasingly siloed and distributed, Key Assets can help you optimize data discovery and restore trust in your data in the following ways, among many others:

Facilitating smoother warehouse migrations

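Leading a data warehouse migration is a task that is both exciting and daunting. More often than not, data teams are forced to handle data validation manually. With Key Assets, teams can automatically identify which tables are in use and relied upon, and which can be deprecated, making the process much faster.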

Making it easier to find important data for intelligent decision making

Key Assets makes it easy to search for and understand what data matters to your organization, both through its “Importance Score” and by measuring various elements of data asset usage.

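It’s likely that analysts across your company are making v1, v2, v3, and v4 of every dataset under the sun (or rather, in your warehouse); finding and knowing which ones are actually relevant and important will make all the difference when you’re putting together critical analysis. Bonus points if users can also search for specific data assets. Key Assets supports both.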

Reducing data debt

Key Assets makes it easier to clean up “junk” tables and pipelines, allowing you to reduce data debt in your data warehouse or lake by highlighting which data tables are widely used and which ones are outdated or even inaccurate. Traditional methods of data debt reduction rely heavily on code-intensive integrations (i.e., open source) or ad-hoc SQL queries wrapped around workflow orchestration tools. Key Assets provides an easier, faster way to get at these metrics and more.
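For a sense of what the underlying check might look like, here is a small Python sketch that flags cleanup candidates based on when each table was last read. The 90-day threshold, the table names, and the `last_read_by_table` mapping are assumptions for illustration; the real signal would come from your warehouse’s access history.

```python
from __future__ import annotations

from datetime import date, timedelta


def cleanup_candidates(
    last_read_by_table: dict[str, date | None], today: date, max_idle_days: int = 90
) -> list[str]:
    """Return tables that were never read, or not read within `max_idle_days`."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name
        for name, last_read in last_read_by_table.items()
        if last_read is None or last_read < cutoff
    )


# Made-up access history: None means no read was ever recorded.
last_read_by_table = {
    "orders": date(2024, 3, 30),
    "orders_v1": date(2023, 6, 1),
    "orders_v2": None,
}

stale = cleanup_candidates(last_read_by_table, today=date(2024, 4, 1))
print(stale)
```

A list like this is a starting point for a conversation with table owners, not an instruction to drop anything automatically.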

Achieving end-to-end data observability

End-to-end data observability, from ingestion to analytics, is a must-have for any serious data engineering team. By understanding where your important data lives and how it’s being used at all stages of the pipeline, you can ignore tables and datasets that have been deprecated and surface the key ones.

Automatically generating Key Assets with machine learning

In my opinion, a smart yet secure Key Assets dashboard should be generated automatically, leveraging machine learning algorithms that learn and infer your data assets by taking a historical snapshot of your data ecosystem, without requiring access to the data itself.

Increasing data reliability by eliminating data downtime

An automatically generated single source of truth such as Key Assets is a logical way to understand how tables are utilized, preventing the effects of data downtime from showing up in your data pipelines.

With Key Assets, users can search for and identify which data assets need to be closely monitored for likely abnormalities or issues, and which can be set aside for the time being. Such a solution can help teams automatically mute noisy alerts for outdated data and keep tabs on only the data assets that are actively used by the business.
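Here is a minimal Python sketch of that alert routing, assuming each table already has an importance score. The `Alert` record, the threshold of 50, and the decision to treat unscored tables as unimportant are all assumptions made for the example; a real setup might prefer to notify on unknown tables instead.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert raised by a data quality check."""
    table: str
    message: str


def route_alerts(
    alerts: list[Alert], importance: dict[str, float], threshold: float = 50.0
) -> tuple[list[Alert], list[Alert]]:
    """Split alerts into ones worth a notification and ones to mute."""
    notify: list[Alert] = []
    muted: list[Alert] = []
    for alert in alerts:
        # Assumption: a table with no known score is treated as unimportant.
        if importance.get(alert.table, 0.0) >= threshold:
            notify.append(alert)
        else:
            muted.append(alert)
    return notify, muted


# Made-up scores and alerts.
importance = {"orders": 92.0, "orders_v1": 4.5}
alerts = [
    Alert("orders", "freshness check failed"),
    Alert("orders_v1", "row count dropped"),
    Alert("scratch_tmp", "schema changed"),
]

notify, muted = route_alerts(alerts, importance)
print("notify:", [a.table for a in notify])
print("muted:", [a.table for a in muted])
```

Muted alerts are kept in a separate list rather than discarded, so they can still be reviewed if a table’s importance changes.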

I don’t know about you, but I can’t wait to see the “You’re Using THAT Table?!” problem become a thing of the past.
