Stop Treating Your Data Engineer Like A Data Catalog

Data trust starts and ends with communication. Here’s how best-in-class data teams are proactively certifying tables as approved for use across their organizations.

Say it with me: data engineers are not data catalogs. 

You would be hard pressed to find “answering multiple Slack messages every week about which tables are good to use for this report” in their job description, but it happens anyway.

Data analysts are not psychics. Yet they are often placed in the position of having to intuit if the data being piped is trustworthy.

This misalignment has arisen as data teams are pushed to move faster, weave themselves across the data mesh, and enable increasingly self-service data platforms.

Without a data certification program in place, data teams don’t often know where to go to get the best data. Image courtesy of Chokniti Khongchum on Shutterstock, purchased for use with Standard License.

It’s the data team’s equivalent of the classic document version control issues that have plagued knowledge workers for decades. What starts as a tight pitch deck evolves into:

  • A million people making and sharing ad-hoc slides;
  • Massaging content on those slides until it becomes an echo of its original intent; and
  • Creating copies labeled V6_Final_RealFinal.

The same thing happens across the data team. Everyone is trying to do the right thing (i.e., support your stakeholders, generate insights, pipe more data, etc.), but everyone is moving fast.

One day you look up and notice you have 6 different models with slight variations essentially doing the same thing…and no one knows which one is most up-to-date or even which field to use.

This creates real operational problems downstream including:

  • Inefficient cycles of redundant “traffic control;” 
  • Lower data quality;
  • Time spent resolving problems created from analysts using improper/problematic data;
  • Lower data trust across the organization; and
  • Increased data downtime.

When organizations don’t trust their data or have lower data reliability, they often pad the margins of error in their forecasts.

As Peloton’s recent production halt highlighted, poor forecasting can be especially problematic during the pandemic, when uncertainty across demand, supply chains, and the overall business environment is at an all-time high.

More data discovery, more problems

We have written previously about data discovery, a new approach to understanding the health of your distributed data assets in real-time, and it’s an essential part of the solution.

Data discovery provides distributed, real-time insights about data across different domains, all while abiding by a central set of governance standards. Image courtesy of Barr Moses.

Data discovery provides a domain-specific, dynamic understanding of your data based on how it’s being ingested, stored, aggregated, and used by a set of specific consumers.

As with a data catalog, governance standards and tooling are federated across these domains (allowing for greater accessibility and interoperability), but unlike a data catalog, data discovery surfaces a real-time understanding of the data’s current state as opposed to its ideal or “cataloged” state.

It is especially useful when teams take a distributed approach to governance that holds different data owners accountable for their data as products, which allows data-savvy users throughout the business to self-serve from those products.

But as data becomes more accessible, how can downstream stakeholders determine what data sets have been served, changed, and approved by a given domain’s data team?

How can one domain be sure a common set of data quality standards, ownership, and communication processes are being upheld across the organization?

One of my customers, a leading media company with a mature data organization, was facing exactly these questions. As a result, we have been working with them and several others to implement a data certification program.

What is data certification?

Data certification is the process by which data assets are approved for use across the organization after having met mutually agreed-upon SLAs, or service level agreements, for data quality, observability, ownership/accountability, issue resolution, and communication.

Similar to the concepts of data quality, data validation, or data verification, data certification layers on critical processes that align people, frameworks, and technology to central business policies.

Data certification requirements vary based on the needs of the business, the capacity of the data engineering team, and the availability of data, but typically incorporate the following features:

What is data certification? Here is one set of criteria that a media company is using to certify data sets using Monte Carlo.

Data certification programs increase scalability by leveraging a consistent approach applied across multiple domains. They also increase efficiency by facilitating more trustworthy exchanges of information between domains with clear lines of communication.

Here’s how it works.

7 steps to implementing a data certification program

Step 1: Build out your data observability capabilities

Achieving data observability–an organization’s ability to fully understand the health of the data in their system–is an important first step in the data certification process.

Not only do you need insight into your current performance to set a baseline, but you also need a systemic end-to-end approach for proactive incident discovery, alerting, and triaging.

Monte Carlo’s Monitors page automatically surfaces anomalies, schema changes, deleted tables, and rule breaches. Image courtesy of Monte Carlo.

If anything within the pipeline breaks–and it will break–you will be the first to know. This head start, along with a detailed understanding of the data ecosystem, will reduce time to detection and resolution by pinpointing where errors occur.

Knowing what systems and data sets have a tendency to create the largest or most frequent problems downstream also helps inform the process of writing effective data SLAs (Step 4).

Additionally, understanding the upstream dependencies of your most important tables or reports helps data teams understand what data to give the most attention.

The bottom line is that, to be considered certified, a table or data set should be closely monitored for anomalies (ideally by monitors that continuously learn and evolve via machine learning).
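To make "monitored for anomalies" concrete, here is a minimal sketch of a volume monitor. It swaps in a plain z-score threshold for the machine learning approach mentioned above, so treat it as a stand-in rather than how any particular tool works. The function name, threshold, and row counts are all hypothetical.

```python
from statistics import mean, stdev


def is_volume_anomaly(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than `z_threshold` standard deviations
    away from the mean of `history` (a list of recent daily row counts)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [10_120, 10_340, 9_980, 10_210, 10_050, 10_400, 10_150]

print(is_volume_anomaly(daily_row_counts, 10_300))  # an ordinary day
print(is_volume_anomaly(daily_row_counts, 2_000))   # a sudden drop
```

A fixed threshold like this is easy to reason about but does not adapt to seasonality or growth, which is the gap that learned monitors are meant to close.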

Step 2: Determine your data owners

Each certified data asset should have a responsible party across its lifecycle from the ingestion to analytics layer. 

Monte Carlo allows owners to be assigned to tables along with other tags. Image courtesy of Monte Carlo.

Some data teams may choose to implement a RACI (responsible, accountable, consulted, informed) matrix; others may build ownership directly into the specific SLA along with the expected communication procedures and resolution times.

Step 3: Understand what “good” data looks like

By asking your business stakeholders the “who, what, when, where, and why,” you can understand what data quality means to them and which data is actually the most important.

This will enable you to develop key performance indicators such as:

  • Freshness:
    • Data will be refreshed by 7:00 am daily (great for cases where the CEO or other key executives are checking their dashboards at 7:30 am).
    • Data will never be older than X hours.
  • Distribution:
    • Column X will never be null.
    • Column Y will always be unique.
    • Field X will always be equal to or greater than field Y.
  • Volume:
    • The size of table X will never decrease.
  • Schema:
    • No fields will be deleted on this table.
  • Lineage:
    • 100% of the data populating table X will have upstream sources and downstream ingestors mapped and include relevant metadata.
  • Data downtime (or availability):
    • Monte Carlo defines data downtime as the number of incidents multiplied by (the time to detection + time to resolution). An example of a data downtime SLA could be: table X will have fewer than Y hours of downtime a year.
    • SLAs that measure each of the components of data downtime can be more actionable. Examples include: we will reduce our incidents X%, time to detection X%, and time to resolution X%.
  • Query speed:
    • Our friends at Locally Optimistic suggest: “Average query run time is a good place to start, but you may need to create a more nuanced metric (e.g., X% of queries finish in
  • Ingestion (great for keeping external partners accountable):
    • Data will be received by 5am each morning from partner Y. 
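The data downtime definition in the list above reduces to simple arithmetic. The sketch below assumes time to detection and time to resolution are tracked as per-incident averages in hours. That reading, the function name, and the figures are assumptions for illustration only.

```python
def data_downtime_hours(incidents, avg_hours_to_detect, avg_hours_to_resolve):
    """Data downtime = number of incidents x (time to detection + time to resolution)."""
    return incidents * (avg_hours_to_detect + avg_hours_to_resolve)


# Hypothetical year for table X: 12 incidents, each taking on average
# 4 hours to detect and 6 hours to resolve.
downtime = data_downtime_hours(incidents=12, avg_hours_to_detect=4, avg_hours_to_resolve=6)
sla_limit_hours = 150  # hypothetical "fewer than Y hours a year" target

print(downtime)                    # 120
print(downtime < sla_limit_hours)  # True
```

Because the three inputs multiply and add, halving any one of them has a visible effect on the total, which is why SLAs on the individual components tend to be more actionable.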

This process also enables you to configure granular alerting rules tailored to what matters most to the business.
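As a rough illustration, several of the KPIs listed above can be written as small checks. This is a minimal sketch rather than any vendor's implementation: the sample rows, the column names (`order_id`, `gross`, `net`), and the thresholds are hypothetical, and a real table would be queried rather than held in memory.

```python
from datetime import datetime, timedelta


def check_freshness(last_loaded_at, now, max_age_hours):
    """Freshness: data is never older than `max_age_hours`."""
    return now - last_loaded_at <= timedelta(hours=max_age_hours)


def check_not_null(rows, column):
    """Distribution: the column is never null."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows, column):
    """Distribution: the column is always unique."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_greater_or_equal(rows, left, right):
    """Distribution: field `left` is always >= field `right`."""
    return all(row[left] >= row[right] for row in rows)


def check_volume(previous_row_count, current_row_count):
    """Volume: the table never shrinks."""
    return current_row_count >= previous_row_count


# Hypothetical sample data standing in for a real table.
rows = [
    {"order_id": 1, "gross": 120.0, "net": 100.0},
    {"order_id": 2, "gross": 80.0, "net": 80.0},
    {"order_id": 3, "gross": 50.0, "net": 45.0},
]

results = {
    "freshness": check_freshness(
        last_loaded_at=datetime(2022, 3, 1, 6, 30),
        now=datetime(2022, 3, 1, 9, 0),
        max_age_hours=4,
    ),
    "order_id_not_null": check_not_null(rows, "order_id"),
    "order_id_unique": check_unique(rows, "order_id"),
    "gross_gte_net": check_greater_or_equal(rows, "gross", "net"),
    "volume": check_volume(previous_row_count=2, current_row_count=len(rows)),
}

failed = [name for name, passed in results.items() if not passed]
print(failed or "all checks passed")
```

Keeping each KPI as its own named check makes it straightforward to alert on exactly the rule that failed.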

Step 4: Set clear SLAs for your most important data sets

Setting SLAs (service level agreements) for your data pipeline is a major step towards increasing your data reliability and essential to a data certification program. SLAs need to be specific, measurable, and achievable.

Not only do SLAs describe an agreed-upon standard of service, they define the relationship between parties. In other words, they outline who is responsible for what during normal operations as well as when issues occur.

Brandon Beidel, a Senior Data Scientist with Red Ventures, suggests that an effective SLA is realistic. Simply saying “having reliable data at all times” is too vague to be useful; instead, Brandon notes, teams should set focused SLAs.

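“Good SLAs are specific and detailed. They will describe why it’s important to the business, what the expectations are, when those expectations need to be met, how they will be met, where the data lives, and who is affected by it.”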

Beidel includes within his SLAs how the team should respond if the SLA isn’t met.

For example, “the data in table X will be refreshed every day by 7:00 am” will transform into, “Team Z will ensure the data in table X will be refreshed every day by 7:00 am. Within two hours of an anomaly alert, the team will verify, communicate with those affected, and begin a root cause analysis of the issue. Within one business day a ticket will be created and the wider team will be updated on the progress made toward resolution.”
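One way to keep an SLA like this specific is to write it down as structured data. The sketch below is illustrative only: the class and field names are hypothetical, "one business day" is simplified to 24 hours, and the freshness check assumes the refresh timestamp falls on the day being evaluated.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class FreshnessSLA:
    table: str
    owner_team: str
    refresh_deadline: time     # e.g. 7:00 am
    triage_window: timedelta   # verify, notify, begin root cause analysis
    ticket_window: timedelta   # ticket created, wider team updated

    def is_met(self, refreshed_at: datetime) -> bool:
        """True if the refresh landed at or before the daily deadline."""
        return refreshed_at.time() <= self.refresh_deadline

    def response_deadlines(self, alerted_at: datetime) -> dict:
        """When each response step is due, counted from the alert time."""
        return {
            "triage_by": alerted_at + self.triage_window,
            "ticket_by": alerted_at + self.ticket_window,
        }


sla = FreshnessSLA(
    table="table_x",
    owner_team="team_z",
    refresh_deadline=time(7, 0),
    triage_window=timedelta(hours=2),
    ticket_window=timedelta(days=1),  # simplification of "one business day"
)

print(sla.is_met(datetime(2022, 3, 1, 6, 45)))
print(sla.response_deadlines(datetime(2022, 3, 1, 7, 5)))
```

Writing the owner and response windows next to the expectation itself keeps the "who does what, by when" part of the agreement from drifting into tribal knowledge.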

To achieve this level of specificity and organization, teams should align early – and often – with stakeholders to understand what good data looks like.

That includes within the data team as well as the business. A good SLA needs to be informed by the realities of how the business operates and how your users consume the data. 

I take a slightly different approach and differentiate between what I consider the SLA of “table x will be updated by 7am” and the SLO (Service Level Objective) of “we will aim to meet this SLA 99% of the time.”
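The SLA versus SLO distinction can be sketched as a simple ratio: record each day whether the SLA was met, then compare the share of successful days against the objective. The function names and the sample history below are hypothetical.

```python
def slo_attainment(daily_sla_results):
    """Fraction of days on which the SLA was met."""
    if not daily_sla_results:
        raise ValueError("need at least one day of results")
    return sum(daily_sla_results) / len(daily_sla_results)


def meets_slo(daily_sla_results, target=0.99):
    """True if the SLA was met on at least `target` of the days."""
    return slo_attainment(daily_sla_results) >= target


# Hypothetical history: table x was updated by 7am on 199 of the last 200 days.
history = [True] * 199 + [False]

print(slo_attainment(history))  # 0.995
print(meets_slo(history))       # True
```

Separating the two lets the SLA stay a crisp yes-or-no statement while the SLO carries the tolerance for the occasional miss.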

However you decide to approach it, I’d recommend against boiling the ocean. Most of my customers are implementing their data certification programs as “go forward” first and cleaning up older assets in a second wave.

Monte Carlo Incident IQ can help data teams understand which domains leverage which data sets. Image courtesy of Monte Carlo.

In fact, many of the best data teams will start by certifying the most critical tables and data sets: the ones that add the most value to the business or have the most query activity, users, or dependencies.

Some are also implementing tiers of certification–bronze, silver, gold–that convey different levels of service and support.

Step 5: Develop your communication and incident management processes 

Where and how will alerts be sent to the team? How will next steps and progress be communicated internally and externally? 

While this may seem like table stakes, clear and transparent communication is essential to creating a culture of accountability.

Many teams opt to have alerts and incident triage discussions take place in Slack, PagerDuty, or Microsoft Teams. This enables rapid coordination while giving full transparency to the wider team as part of a healthy incident management process.

It’s also important to consider how to communicate major outages to the rest of the organization.

For example, if an alert turns out to be a huge production outage, how does the on-call engineer inform the rest of the company? Where do they make that announcement and how frequently do they provide updates?

Step 6: Determine a mechanism to tag the data as certified

At this point, you have created SLAs with measurable objectives, transparent ownership, clear communication processes, and strong issue resolution expectations. You have the tools and proactive measures in place to empower your teams to be successful.

The final step is to certify and surface the approved data assets for your stakeholders. 

I recommend decentralizing the certification process. After all, the certification process is designed to help make teams faster and more scalable. Centralized rules, enacted at the domain level, will achieve these goals and avoid creating too much red tape.

For the certification process, data teams will tag, search, and leverage their tables appropriately using data discovery solutions, a homegrown tool, or some other form of data catalog.
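For teams taking the homegrown route, the tagging step can be sketched as a small gate that only applies a "certified" tag once the earlier steps are in place. Everything here is hypothetical: the `TableRecord` fields, the three requirements, and the tag name are illustrative assumptions, not the behavior of any particular catalog or discovery tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TableRecord:
    name: str
    owner: Optional[str] = None   # Step 2: a responsible party
    has_sla: bool = False         # Step 4: an agreed SLA
    is_monitored: bool = False    # Step 1: observability in place
    tags: set = field(default_factory=set)


def missing_requirements(table):
    """List which certification prerequisites the table still lacks."""
    missing = []
    if not table.owner:
        missing.append("owner")
    if not table.has_sla:
        missing.append("sla")
    if not table.is_monitored:
        missing.append("monitoring")
    return missing


def certify(table):
    """Tag the table as certified only if nothing is missing."""
    if missing_requirements(table):
        return False
    table.tags.add("certified")
    return True


ready = TableRecord("table_x", owner="team_z", has_sla=True, is_monitored=True)
not_ready = TableRecord("table_y", owner="team_z")

print(certify(ready), ready.tags)
print(certify(not_ready), missing_requirements(not_ready))
```

Because the rules live in one function but are applied per table, each domain can run the same gate on its own assets, which matches the "centralized rules, enacted at the domain level" idea above.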

Step 7: Train your data team and downstream consumers

Sending proactive alerts to the proper channels when data issues arise is a critical step of the data certification process. Image courtesy of Monte Carlo.

Of course, tagging tables as certified doesn’t guarantee analysts will stay in bounds. The team will need to be trained in the proper procedures, which will need to be enforced as necessary.

Fine-tuning the level of alerts and communication is important as well.

Occasionally receiving alerts that don’t require action is healthy. For example, you may have a table that grows significantly in size, but it was expected because the team added a new data source.

Nothing is broken and in need of fixing, but it’s still helpful for the team to know. After all, “expected” behavior to one person might still be newsworthy and critical to another member of the team – or even another domain.

However, alert fatigue is real. If the team is starting to ignore alerts, it can be a sign to optimize your approach by either adjusting your monitors or bifurcating communication channels to better surface the most important information.
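Splitting communication channels can be as simple as routing by severity. The sketch below is one possible arrangement: the severity levels, channel names, and alert shape are hypothetical, and it only decides where a message should go rather than calling any messaging API.

```python
from collections import defaultdict

# Hypothetical channel names; substitute your own Slack, PagerDuty,
# or Microsoft Teams targets.
ROUTES = {
    "critical": "#data-incidents",  # needs action now
    "warning": "#data-alerts",      # needs a look soon
    "info": "#data-fyi",            # expected changes worth knowing about
}


def route_alert(alert, routes=ROUTES, default_channel="#data-alerts"):
    """Pick a channel from the alert's severity, with a safe default."""
    return routes.get(alert.get("severity"), default_channel)


def group_by_channel(alerts):
    """Collect alert messages under the channel each should be sent to."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[route_alert(alert)].append(alert["message"])
    return dict(grouped)


alerts = [
    {"severity": "critical", "message": "table_x missed its 7:00 am refresh"},
    {"severity": "info", "message": "table_y grew 3x after a new source was added"},
    {"severity": "warning", "message": "null rate rising in column_z"},
]

for channel, messages in group_by_channel(alerts).items():
    print(channel, messages)
```

Expected-but-notable events, like the table growth example above, land in a lower-urgency channel instead of competing with real incidents.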

When it comes to your data consumers, don’t be shy! You have put in place an incredibly robust system for data quality aligned to their needs. Help them move from a subjective to an objective understanding of how your team is performing and start giving them the vocabulary to be part of the solution.

It’s all about proactive communication

Data certification can be a beautiful process to see in action. The data engineer tags the table as certified along with the owner of the data set, and surfaces it within the data warehouse for an analyst to grab and use in their dashboard. And voilà! No more (or at least, a whole lot less) data downtime.

At its core, this process underscores that without the proper processes and culture in place, certifying reliability and building organizational trust in your data is extremely difficult. Technology will never be a replacement for good data hygiene, but it certainly helps.

If you want to know more about what it takes to implement a data certification program, reach out to Will and the rest of the Monte Carlo team.