One of our customers recently posed this question related to data quality:
I would like to set up an OKR for ourselves [the data team] around data availability. I’d like to establish a single KPI that would summarize availability, freshness, quality.
What’s the best way to do this?
I can’t tell you how much joy this request brought me. As someone who is obsessed with data availability (yeah, you read that right: instead of sheep, I dream about null values and data freshness these days), this is a dream come true.
Why does this matter?
If you’re in data, you’re either currently working on a data quality project or you just wrapped one up. It’s the law of bad data — there’s always more of it.
Traditional methods of measuring data quality are often time- and resource-intensive, spanning several variables, from accuracy (a no-brainer) and completeness to validity and timeliness (in data, there’s no such thing as being fashionably late). But the good news is there’s a better way to approach data quality.
Data downtime — periods of time when your data is partial, erroneous, missing, or otherwise inaccurate — is an important measurement for any company striving to be data-driven. It might sound cliché, but it’s true — we work hard to collect, track, and use data, but so often we have no idea if the data is actually accurate. In fact, companies frequently end up having excellent data pipelines, but terrible data. So what’s all this hard work to set up a fancy data architecture worth if at the end of the day, we can’t actually use the data?
Measuring data downtime with the simple formula below will help you determine the reliability of your data, giving you the confidence necessary to use it or lose it.
Overall, data downtime is a function of:
- Number of data incidents (N): This factor is not always in your control given that you rely on data sources “external” to your team, but it’s certainly a driver of data uptime.
- Time-to-detection (TTD): In the event of an incident, how quickly are you alerted? In extreme cases, this quantity can be measured in months if you don’t have the proper methods for detection in place. Silent errors caused by bad data can result in costly decisions, with repercussions for both your company and your customers.
- Time-to-resolution (TTR): Following a known incident, how quickly are you able to resolve it?
By this method, a data incident refers to a case where a data product (e.g., a Looker report) is “incorrect,” which could be a result of a number of root causes, including:
- All/parts of the data are not sufficiently up-to-date
- All/parts of the data are missing/duplicated
- Certain fields are missing/incorrect
Here are some examples of things that are not a data incident:
- A planned schema change that does not “break” any downstream data
- A table that stops updating as a result of an intentional change to the data system (deprecation)
Bringing this all together, I’d propose the right KPI for data downtime is:
Data downtime = Number of data incidents × (Time-to-Detection + Time-to-Resolution)
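To make the KPI concrete, here is a minimal Python sketch that reads it as N × (TTD + TTR). It assumes TTD and TTR are averages across incidents, in which case the product equals the total time the data was down. The `Incident` class, its field names, and the `data_downtime` function are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """One data incident. All field names here are hypothetical."""
    started_at: datetime   # when the data first became incorrect
    detected_at: datetime  # when the team was alerted
    resolved_at: datetime  # when the data was correct again

    @property
    def time_to_detection(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_resolution(self) -> timedelta:
        return self.resolved_at - self.detected_at


def data_downtime(incidents: List[Incident]) -> timedelta:
    """Compute N x (average TTD + average TTR) for a list of incidents."""
    n = len(incidents)
    if n == 0:
        return timedelta(0)
    avg_ttd = sum((i.time_to_detection for i in incidents), timedelta(0)) / n
    avg_ttr = sum((i.time_to_resolution for i in incidents), timedelta(0)) / n
    return n * (avg_ttd + avg_ttr)


# Two made-up incidents: 4h to detect + 2h to resolve, and 10h + 1h.
t0 = datetime(2024, 1, 1)
example = [
    Incident(t0, t0 + timedelta(hours=4), t0 + timedelta(hours=6)),
    Incident(t0, t0 + timedelta(hours=10), t0 + timedelta(hours=11)),
]
print(data_downtime(example))  # 17:00:00
```

Tracking the three timestamps per incident, rather than a single duration, keeps TTD and TTR separately visible, so you can tell whether detection or resolution is driving the total.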
(If you want to take this KPI a step further, you could also categorize incidents by severity and weight uptime by level of severity, but for the sake of simplicity, we’ll save that for a later post.)
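As a rough illustration of what severity weighting might look like, the sketch below multiplies each incident's downtime (TTD + TTR) by a weight for its severity level. The severity labels, the weight values, and the `weighted_downtime` function are all assumptions made up for this example; the post does not define them, and weighting downtime rather than uptime is just one possible reading.

```python
from datetime import timedelta

# Hypothetical severity weights; pick values that fit your own team.
SEVERITY_WEIGHTS = {"low": 0.5, "medium": 1.0, "high": 2.0}


def weighted_downtime(incidents):
    """Sum weight x (TTD + TTR) over (severity, ttd, ttr) tuples."""
    total = timedelta(0)
    for severity, ttd, ttr in incidents:
        total += SEVERITY_WEIGHTS[severity] * (ttd + ttr)
    return total


# Made-up incidents: one high-severity (4h + 2h), one low-severity (10h + 2h).
example = [
    ("high", timedelta(hours=4), timedelta(hours=2)),
    ("low", timedelta(hours=10), timedelta(hours=2)),
]
print(weighted_downtime(example))  # 18:00:00
```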
With the right combination of automation, advanced detection, and seamless resolution, you can minimize data downtime by reducing TTD and TTR. There are even ways to reduce N, which we’ll discuss in future posts (spoiler: it’s about getting the right visibility to prevent data incidents in the first place).
Measuring data downtime is the first step in understanding the quality of your data, and from there, ensuring its reliability. With fancy algorithms and business metrics flying all over the place, it’s easy to overcomplicate how we measure this. Sometimes, the simplest way is the best way.
If you want to learn more, reach out to Barr Moses.