
SE Radio 507: Kevin Hu on Data Observability

Kevin Hu, CEO and co-founder of the startup Metaplane, chatted with SE Radio’s Priyanka Raghavan about data observability. Starting from basics such as defining terms and weighing key differences and similarities between software and data observability, the episode explores components of data observability, biases in data algorithms, and how to deal with missing data. From there, the discussion turns to tooling, what a good data engineer should look for in data observability tools, Metaplane’s offerings, and challenges in the area and how the field might evolve to solve them.


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Priyanka Raghavan 00:00:16 Hello everyone. This is Priyanka Raghavan for Software Engineering Radio. Today, listeners will be treated to the topic of data observability, and to lead us through this we have with us our guest Kevin Hu, who’s the co-founder and CEO at Metaplane. It’s a data observability startup, which focuses on helping teams find and fix data-quality problems. Prior to this, he researched the intersection of machine learning and data science at MIT, where he earned a PhD. Kevin has written many articles on data observability in a variety of popular, as well as scientific publications. So, welcome to the show, Kevin.

Kevin Hu 00:01:04 Such a pleasure to talk with you today. I’m a long-time listener of SE Radio, and everyone on my team is a listener too, so hopefully I can make them proud today. It’s such a pleasure to be here.

Priyanka Raghavan 00:01:14 Great. Is there anything else you would like listeners to know about yourself before we get into the show?

Kevin Hu 00:01:21 I think you did a great job with the introduction and we’ll touch on this during the show, but I would love to start by saying data teams have so much to learn from software teams, that if you have a data team at your company, chances are that a lot of the best practices that you have developed as an engineer could also help them deploy more effective and more resilient data for your stakeholders internally.

Priyanka Raghavan 00:01:48 So let’s jump into observability and some definitions before we get into data observability. The first thing I wanted to ask you is something basic, but let’s start from the top. How would you define observability in your words?

Kevin Hu 00:02:06 Observability is the degree of visibility you have into your system. That is the colloquial definition that we use in data observability and that software observability / DevOps observability tools like Datadog, SignalFx, and Splunk have developed. It really descends from the physical-science discipline of control theory, where there is a concept called the controllability of a system: given the inputs, can you manipulate and understand the state of that system? The mathematical dual, the corresponding concept, is observability: given the outputs of a system, can you infer the state of that system? So that is the rigorous definition from which our more colloquial definition is derived.

Priyanka Raghavan 00:02:54 Why do you think it is necessary to have a view of the system, the centralized view, which everyone seems to be striving towards? Why is that necessary?

Kevin Hu 00:03:07 It’s necessary because systems are complicated. As software engineers, we have so many systems working independently of each other, interacting with each other, that when something goes wrong, which it inevitably will, it’s very, very time consuming to understand what the implications of that incident might be and what the root cause might be. And because it’s difficult to understand, it costs a lot of time for you, time that is hard to get back. And it costs trust in the people who rely on the systems that you develop. So, let’s go back 10 or 20 years, when it was more common to deploy software systems without any sort of telemetry. Make a Rails app, put it on an EC2 box, put a heartbeat check there and call it a day. I won’t say I never did this, but a lot of people did do this. The only way that you knew something went wrong in your system was degraded or broken performance for your users, and that is not acceptable. Over the past decade, with the rise of tools like Datadog, we have the visibility so that your team can be proactive and get ahead of breakages. That’s why it’s important: it helps you stay proactive and maintain a lot of trust in your system.

Priyanka Raghavan 00:04:27 I’d like to revisit the physics definition that you gave in your first answer. So, we have this concept of entropy in physics, which has a pretty close connection to control theory and information theory. What I was wondering is: how does the uncertainty of an outcome relate to observability?

Kevin Hu 00:04:49 Great question. Observability has very deep roots in physics. We’ll talk about entropy, and we can go into the other route in just a second. Entropy is the measure of the amount of information in a system. At least in the information-theoretic definition, it is the number of bits, in other words, the number of yes-or-no questions that must be answered for you to fully understand a system. So, in a very simple system, for example a gas at thermal equilibrium in a box, you don’t need many yes-or-no questions to fully describe that system. When it becomes more dynamic, when it starts turning into your software infrastructure, you actually need many yes-or-no answers to fully understand the state of that system. That is part of the reason why observability is important: our systems tend to become more entropic over time.
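
[Editor’s note: to make the “number of yes-or-no questions” framing concrete, here is a minimal, illustrative Python sketch of Shannon entropy in bits. The probability values are made up for illustration and are not from the episode.]

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits: the average number of yes/no questions
    needed to pin down the state of a system with these state probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# A simple, predictable system (one state dominates) needs few questions...
print(entropy_bits([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits

# ...while a sprawling system with many equally likely states needs many more.
print(entropy_bits([1 / 64] * 64))  # 6.0 bits
```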

Kevin Hu 00:05:44 It’s almost like the second law of thermodynamics, where entropy only increases; that also applies to manmade systems, unless you’re pulling it back, say you have that one person on your team who is a real stickler for refactoring. As systems become more and more entropic, the surface area of breakage increases. That’s why you need observability, or at least some increased degree of visibility: to fight against the forces of entropy. And not all of it is under your control, or your fault, either, on a data team. Right? For example, if you centralize a lot of data in an analytic data store like Snowflake, you can be very disciplined about the data sets that you create. But if you open that up to your end users and they start using a business intelligence tool like Looker, they can start exploding the number of dependencies on your system.

Kevin Hu 00:06:39 So entropy can emerge in many different forms, but I love the fact that you brought that up, because if you go to observability and its roots in control theory, believe it or not, this takes us all the way back to the 17th century, I believe, and Christiaan Huygens. He was a Dutch physicist, a contemporary of Isaac Newton, who discovered Saturn’s rings. He created this device. He was from the Netherlands, and the Netherlands are famous for windmills. The problem with windmills, which were used at the time to grind grain, is that there’s an optimal speed at which the millstone rotates to grind grain into the right shape and size. But wind is variable speed, right? You can’t control the speed of the wind. So Huygens developed this device called the centrifugal governor, which is almost like an ice skater: when they bring out their arms, they slow down.

Kevin Hu 00:07:37 And when they bring in their arms, they speed up. It’s the same concept, but applied to a physical system: using this device, the speed of the millstone becomes much more controlled. Fast forward a few hundred years to James Clerk Maxwell, who many of your listeners may know as the father of electromagnetism, right, Maxwell’s equations, the four equations that govern them all. He developed control theory to describe how a centrifugal governor works. He was trying to understand: given the inputs into this spinning machine, what are the dynamics of that machine, and vice versa for observability. And that’s really the lineage we trace all the way down to today, where ultimately you have these highly complex systems that we want to understand in simpler terms, right? Highly entropic, but give us something that we can actually use to summarize the system. That’s where the three pillars of software observability come in; we’ve heard of metrics, traces, and logs. With these three, you can understand, arbitrarily, the state of a software system at any point in time. And it’s also where the four pillars of data observability come into play.

Priyanka Raghavan 00:08:55 In Episode 455, we did talk about software telemetry, and in fact, they talked about these traces, logs, and metrics under the umbrella terminology of software observability and telemetry. In data observability, you told me about four pillars. What are they? Could you just briefly touch upon that?

Kevin Hu 00:09:16 For sure. Well, before that: even though data is ultimately produced by either a human interacting with a machine or a machine producing data, which is then manipulated and presented through machines, data does have critical differences from the software world. There are some properties that make it so we can’t take the concepts wholesale; we rather have to use them as inspiration. With that in mind, the way we think of the four pillars of data observability is this. Priyanka, if you had to describe the data at the company you work at, what would you say? You might say, okay, well, if I have a table in a database, I can describe a distribution, for example the distribution of the number of sales, right? This number has a certain mean value, there’s a min and a max. And here’s a list of a bunch of customers, right? Here are the regions they’re from.

Kevin Hu 00:10:14 The number of regions, which columns are PII: these sorts of descriptive measures are what we call metrics, right? They’re metrics about your data. Then you might also say: for this customers table, these are the columns and the column types, that’s the schema; this is the last time it was updated, the frequency with which it’s updated, the number of rows. We call this the metadata, the external metadata. And the reason we draw a distinction between these two is because you can change the internal metrics without changing the external metadata and vice versa: the sales numbers can change without us necessarily needing more rows, and if the schema changes, that doesn’t necessarily change the statistical properties. But then you might say, okay, this is just one table. Data is all connected. Ultimately, going back to the sources, it’s a human putting a number into your machine, or it is a machine producing some data, and everything else is derived from some operation applied to those ultimate sources or some derived table thereof.

Kevin Hu 00:11:21 And that is called lineage. And that’s a pretty unique property to the data world where they did it come from somewhere, right. And multiple levels of resolution. So to speak where you can say this table is a result of joining these two parent tables, or this column is the result of this operation applied to your two parent tables, or even like this one data point is the result of another operation. So it’s important to try the lineage over time. And lastly, it’s important to understand the relationships between your data and external world, where your company, you might be using a tool like Five Trend or Airbyte to pull data from an application like Salesforce into your database. And ultimately your data might be consumed by an operations analyst, who wants to understand what the state of my process is currently. And data is ultimately meant to be used. So, and logs kinds of encodes that information. So, to back up a little bit, you have two pillars describing the data itself, metrics and metadata, and two pillars describing relationships, lineage and logs.

Priyanka Raghavan 00:12:37 Great. This is fantastic. But before I dive deep into each of these areas, I want you to tell me about, say, the similarities between data and software observability. Listening to what you just said, I can understand that one similarity is that it lets you get to the root cause of an issue. Is there anything else?

Kevin Hu 00:13:02 The biggest similarity, you’re totally right, is the job to be done. One of the major use cases of an observability tool is incident management: to tell you when something potentially bad has occurred, and to give you the information you need to both identify the root cause, like you mentioned, and identify the potential impact. In the software world you might use traces, right? Like time-correlated or request-scoped logs. And in the data world, you might use lineage. So, it does the same job there. And ultimately it’s for the same overarching purpose, which is to save you time and to increase trust in your system.

Priyanka Raghavan 00:13:48 If there was one thing that you could say, which is the difference between data and software observability, is it this thing with the lineage that you talk about? Is that the difference, or are there more things?

Kevin Hu 00:13:58 There are more things. Just to go through some of the more common differences that we’ve seen: there’s a common saying that you should treat your software like cattle, not pets. And, you know, I don’t necessarily condone how we treat cattle, but basically treat your software as interchangeable. If something isn’t working right, treat it as ephemeral, treat it as stateless as possible, just take it down, spin it back up. You can’t do that in the data world, where if your ETL process is broken, you can’t just spin it down and spin it back up and now everything is fine. Because now you have bad data in your system, or missing data in your system. So you have to backfill everything that is bad or missing. That’s why I would consider data not like cattle, but more like thoroughbred race horses, where the lineage really matters.

Kevin Hu 00:14:51 You can’t just kill it; you have to really trace everything that’s been going on. One corollary of the fact that data has these lingering consequences is that if there’s a data incident, the negative impact compounds over time, right? Every second that passes, the amount of bad or missing data goes up and up and up. So it’s critical to minimize the time to identify and the time to resolve issues in the data world. Of course, it’s very case dependent, it depends on how the data is used, but I think that’s one really critical difference. Another difference is the absence of playbooks in the data world. As engineers, we have playbooks to diagnose and fix issues, but on a data team there are none. If a bug occurs, you get some duplicate rows, it affects your churn numbers, and everything breaks from there. That’s something we want to change by introducing data observability, and something we think will change, but we’re not quite there yet.

Priyanka Raghavan 00:15:58 So those are things that you can learn from the software observability space, that is, how you can self-heal, I guess, is what you’re saying. What I’m not very clear about is: if there is missing data, where you said you have to go back in time and try to figure out what happened, how do you get it back? How do you do that? How do you fill in missing data?

Kevin Hu 00:16:18 Interpolation might be an answer in certain cases. I think it really depends; the number of ways that data can go wrong is similar to the number of ways that software can go wrong. There’s an infinite number, right? It’s the whole Tolstoy quote about how all happy families are alike, but every unhappy family is unhappy in its own way. So, say you get missing data, for example, because your ETL process failed for a day. One way to fix that, hopefully, is if Salesforce is the system of record and that data still exists there, where you can spin it back up and extend the window that you’re replicating into your database, and then you can call it a day. In another situation you have streaming data, let’s say your users are using Segment and that is being popped into your data warehouse, or you have a Kafka stream, an event stream, and it goes down for a day. You might have to do some interpolation, because you’re not going to get that data back unless some other system is storing it for you. So, it’s really case dependent, which is why root cause analysis is so important.
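
[Editor’s note: a minimal pandas sketch of the two options Kevin describes, backfilling from the source versus interpolating a gap. The event counts and dates are hypothetical.]

```python
import pandas as pd

# Hypothetical daily event counts with a one-day gap caused by a pipeline outage.
daily = pd.Series(
    [1020, 980, 1105, None, 1090, 1010],
    index=pd.date_range("2022-04-01", periods=6, freq="D"),
    name="events",
)

# Option 1: backfill from the source system if it still holds the raw data
# (e.g., re-run the extract with a wider replication window) -- preferred.
# Option 2: interpolate, accepting that the value is an estimate, not real data.
estimated = daily.interpolate(method="time")
print(estimated[pd.Timestamp("2022-04-04")])  # 1097.5, a linear estimate
```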

Priyanka Raghavan 00:17:26 One last question I want to ask before we deep dive into the pillars: is there a rule of thumb on how many metrics you should collect to analyze the data? The reason I ask is because in software observability we also find that if you have too many metrics, it’s mind boggling and you forget what you’re looking for; you’re just overwhelmed by the metrics. So, is there a rule of thumb that data engineers should have at least so many, or is there no limit on that?

Kevin Hu 00:17:57 I think the industry is still trying to arrive at the right level. I personally like reverse engineering from the number of alerts that you, as a data observability user, get in whatever channel you use, Slack or email or PagerDuty, because that’s ultimately what matters: what does the tool draw your attention to? Behind the scenes, it doesn’t matter so much how many metrics or pieces of metadata are being tracked over time. We’ve found that it depends on the size of the team, but a nice sweet spot might be anywhere between three to seven alerts per day at most. Once it goes beyond that, you start tuning it out, right? Your Slack channel is already going crazy; anything above and beyond a handful a day is too much. Now, to go back to your question, what does that mean for the number of metrics that you track?

Kevin Hu 00:19:01 It means that we have to strike a nice compromise between tracking as much as we can, because like we mentioned before, the surface area is key. Anything can go wrong, especially when there are so many dependencies, so we want to track at least the freshness and the volume of every table that you have, if feasible. That also means that if we do track everything, our models have to be really on point. The anomaly detection cannot over-alert you, and the UI needs to be able to synthesize all the alerts in a way that isn’t overwhelming and just gives you what you need at that point in time to make a decision about triage, essentially: is this worth my time? That’s where the quality of the tool comes in, and it doesn’t have to be a commercial tool, of course. It could also be something that you build internally or open source, but that’s where a lot of the finesse comes in.
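
[Editor’s note: a hedged sketch of the trade-off described here: track many metrics, flag departures from recent history, but cap what actually reaches the alert channel. The z-score check, the `severity` field, and the daily budget of seven are illustrative assumptions, not a description of any specific tool’s models.]

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a metric value that departs sharply from its recent history (simple z-score)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid division by zero
    return abs(latest - mean) / stdev > z_threshold

def triage(candidate_alerts, budget_per_day=7):
    """Track everything, but only surface the highest-severity handful per day."""
    ranked = sorted(candidate_alerts, key=lambda a: a["severity"], reverse=True)
    return ranked[:budget_per_day]

# Illustrative usage: a metric that has hovered around 1.0 suddenly jumps.
history = [1.00, 1.02, 0.99, 1.01, 1.00]
print(is_anomalous(history, 1.45))  # True: a sharp departure from recent behavior
```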

Priyanka Raghavan 00:19:57 I think that is a very good answer, because I think the tooling also helps in fine tuning your way of looking at things and maybe your focus areas as well.

Kevin Hu 00:20:06 Right. I just wanted to draw an analogy to a security tool, where ideally your vulnerability scanner scans everything, right? It scans the whole surface area of your API, but it doesn’t cry wolf too many times. It doesn’t send you too many false positives. So, it’s the same balance there.

Priyanka Raghavan 00:20:24 It’s a good analogy, yeah, that the false positives are not through the roof, because that’s also something that you work with, right? You tune the tool to say, hey, this is really a false positive, so don’t show it next time. Then your alerts also get a little better because you work with it over time.

Kevin Hu 00:20:40 For sure. And thankfully we don’t work in a space like cancer diagnosis or self-driving cars; false positives in our world are okay. You just can’t have too many of them. And you want to make sure that the users, the engineers who are actually doing the work, feel like their agency and time are being respected. So, if you’re going to send me a false alert, at least make it something reasonable that I can give good feedback on, and then you can learn from that over time. You’re totally right.

Priyanka Raghavan 00:21:12 Great. So maybe now we can deep dive into the pillars of data observability. The first thing I want to talk about is what you called metadata, which is the data about the data. Can you explain that, give me some examples, and how you would use that for observability?

Kevin Hu 00:21:31 The most foundational tests describe the external characteristics of data: for example, the number of rows, i.e., volume tests, the schema, and the freshness. The reason this is important is because it is the most tied to end-user value. To give you an example, oftentimes when people use data there is some time sensitivity to it. If your CFO is looking at a dashboard and it is one week behind, it doesn’t matter if the data was correct last week; we needed it to be correct today. And that’s actually a great example of the most common issue that Metaplane and every data observability tool helps identify, which is freshness issues, right? Time is of the essence here. It’s all relative to the task at hand, but you need to make sure that it is within a tolerable bound, right?

Kevin Hu 00:22:30 If you need it to be real-time, ensure that it’s real-time; if you need it to be fresh up to a week, ensure that it’s fresh up to a week. The second most common issue that we find is schema changes. When we write SQL or when we create tools, there’s some assumption that the schema is consistent. I don’t mean schema just in terms of the columns and the tables and their names and types, but even within a column, right? What are the enums, what values would you expect? And because there are so many dependencies, when an upstream schema changes, things can really, really break. This can happen through Salesforce updating its schema, or a product engineer changing the name of an event in Amplitude, for example, which I’ve definitely done. It’s not intentional that you break downstream systems, but it’s hard to know if you don’t know what the impact is.

Kevin Hu 00:23:30 And the third category of this sort of external metadata is the volume. You’d be very surprised how frequently this comes up, for a whole variety of reasons: a table you expected to grow at a million rows a day suddenly gains only a hundred thousand rows. This is a good example of a silent data bug, as we like to call it. How the heck would you have known? No one’s checking this table all the time, and it’s very difficult to know both that it occurred and what the potential impact is. There’s a whole universe of root causes, but this happens quite a bit in production systems.
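
[Editor’s note: a minimal Python sketch of the three checks named in this part of the conversation, freshness, schema change, and volume. The function names, thresholds, and example values are illustrative assumptions.]

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded_at, max_staleness=timedelta(hours=24)):
    """Freshness: has the table been updated within the tolerable bound?"""
    return datetime.utcnow() - last_loaded_at <= max_staleness

def check_schema(current_columns, expected_columns):
    """Schema: report added, dropped, or retyped columns (column -> type dicts)."""
    added = set(current_columns) - set(expected_columns)
    dropped = set(expected_columns) - set(current_columns)
    retyped = {c for c in set(current_columns) & set(expected_columns)
               if current_columns[c] != expected_columns[c]}
    return {"added": added, "dropped": dropped, "retyped": retyped}

def check_volume(rows_added_today, expected_daily_rows, tolerance=0.5):
    """Volume: a table expected to grow ~1M rows/day that gains only 100K is suspicious."""
    return abs(rows_added_today - expected_daily_rows) <= tolerance * expected_daily_rows

# Hypothetical values for a single table
print(check_freshness(datetime.utcnow() - timedelta(days=7)))       # False: a week stale
print(check_schema({"id": "int", "email": "text"}, {"id": "int"}))   # 'email' column added
print(check_volume(100_000, 1_000_000))                              # False: silent data bug
```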

Priyanka Raghavan 00:24:12 I had read in a lot of blogs and seen literature about the dimensions of the metadata. I think they talked about timeliness. So, would you group these characteristics of the data together, and then that’s what you track?

Kevin Hu 00:24:27 Great point about the dimensions of metadata. Data observability really descends from information quality research, in tandem with software observability. There’s a huge, amazing literature from the 1990s and 2000s from pioneers like Richard Wang and Diane Strong that describes what it means to have high-quality data. They’ve identified, like you mentioned, many dimensions of data quality, such as the timeliness of the data or referential integrity. And they’ve also identified a nice taxonomy with which you can think about all these dimensions and metrics. So just to step back a little bit: there are dimensions of data quality, which are really categories of why things are important. Timeliness as a dimension really answers why timing is important. Why is the data in my warehouse not up to date, right? Why does my dashboard take so long to refresh?

Kevin Hu 00:25:33 But once you decide to measure that dimension, it becomes a metric. If your data is not up to date, you might measure the lag between when your dashboard was last accessed and when your data was last refreshed; or if your dashboard is taking a long time to refresh, you might measure the latency between your ETL process and when that dashboard, or the underlying data, is actually being materialized. So it’s the high-level concept and then how it’s actually measured. And there’s a huge list of these dimensions and measures that you can think of. Is the data accurate? Does it actually describe the real world? Is the data internally consistent? Not only does it satisfy referential integrity, but you can’t pull data out of one table and out of another table and have them result in two different numbers. And is it complete, right?

Kevin Hu 00:26:28 Does every piece of data that we expect to exist actually exist? These are what we think of as intrinsic dimensions of data quality, where even if the data is not being used, you can still measure the accuracy, completeness, and consistency, and it still matters. That’s in contrast with the extrinsic dimensions, where you need to start from a task that the data helps drive, right? Some extrinsic dimensions might include: is the data reliable, does your user regard it as true? That’s related to how timely the data is, like you mentioned before. And is it relevant at all? You can have a lot of data for a product use case, but if you really need to use it for a sales use case, it doesn’t really matter how good it was. And that is considered part of data quality.
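
[Editor’s note: a small illustrative sketch of turning the timeliness dimension into a metric, the lag between when a dashboard is viewed and when its data was last refreshed, as described above. The timestamps are made up.]

```python
from datetime import datetime

def timeliness_lag(dashboard_viewed_at, data_refreshed_at):
    """Timeliness as a metric: how stale was the data at the moment someone looked at it?"""
    return dashboard_viewed_at - data_refreshed_at

lag = timeliness_lag(
    dashboard_viewed_at=datetime(2022, 4, 8, 9, 0),   # CFO opens the dashboard
    data_refreshed_at=datetime(2022, 4, 1, 6, 30),    # last successful refresh
)
print(lag)  # 7 days, 2:30:00 -- a week behind, exactly the failure mode described above
```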

Priyanka Raghavan 00:27:24 Okay, interesting. The relevance of the data, that is an important factor. That makes a lot of sense, and I guess it’s maybe even something software observability can learn from data observability.

Kevin Hu 00:27:35 Yeah, it’s really a two-way street, because ultimately there are two different roles that do two different things. I do think the data quality research is very thorough, and now it’s really coming to fruition because data is increasingly used for critical use cases. If your reporting dashboard is down for a day, sometimes that’s okay. But if the data is being used to train machine learning models that impact a customer’s experience, or to decide how you allocate ad spend, for example, that can be costly.

Priyanka Raghavan 00:28:12 We talked about the timeliness and relevance of the data. I also wanted to ask: in software observability, when we log data, we have this concept that we really need to be careful about PII and private data and things like that. I’m assuming that’s even more so in data observability. I was also thinking about the Netflix documentaries we’ve all watched, where the data we collect contributes to bias and things like that. Does that play into data observability? Can you talk a little bit about that?

Kevin Hu 00:28:44 Yeah, there’s another field emerging called machine learning observability, which kind of picks up where data observability stops. Frequently a data observability tool might go up to the features, right, the input features used to train a machine learning model, but unless you’re storing model performance and characteristics about the features within the warehouse, that’s about as far as it can go. There’s a whole category of tools emerging to understand the performance of machine learning models over time, both in terms of how the training performance departs from the test performance, but also to understand important qualities like bias. And that is definitely a part of data quality, right? Sometimes bias can be introduced because the data is simply not correct in some dimension: maybe it’s not timely, maybe it’s not relevant, maybe it was transformed incorrectly. But data can also be incorrect for non-technical reasons.

Kevin Hu 00:29:49 By that I mean the data in the warehouse being used by your model can be fully technically correct, and yet, if it doesn’t satisfy some important assumptions about the real world, it still may not be a very high-quality data set, or produce a high-quality model as a result. There’s a lot of great work, including work by a great friend of mine, Joy Buolamwini, on algorithmic bias, and shout out to the Algorithmic Justice League. Facial recognition is increasingly deployed in the world, both in public settings and in private settings, right? You look at your iPhone, or you have to submit something to the IRS. Thankfully she helped put an end to that. But the point is that these algorithms don’t work as well for everyone, and ideally, if something is rolled out at such a scale, we want it to work as well for one group as it does for another. So that is a hundred percent a part of data quality, and a good example of how data quality isn’t just the quality of the data in your warehouse. It goes all the way back to how it’s even being collected.

Priyanka Raghavan 00:31:03 That’s very interesting, and it got me thinking about another point. Could there be a scenario where someone maliciously modifies the data? Is that something the tool can pick up, or something built into the framework of these tools?

Kevin Hu 00:31:17 If it affects the underlying distribution, a tool like ours would be able to detect when that distribution changes drastically. But oftentimes it’s more subtle than that, like these sorts of adversarial data poisoning attacks, where small changes to the input features cause drastic changes in the behavior of the model, at least in certain edge cases, and that’s often very difficult to detect. I know there’s a lot of great academic research trying to address this problem. I don’t want to overstate our capabilities or the state of the art in industry today, but I’d be skeptical that we’d be able to catch everything, especially some of the most impactful attacks.

Priyanka Raghavan 00:32:03 Okay. So it’s probably at the infancy stage, and there’s a lot more research happening in this area, is what you’re saying?

Kevin Hu 00:32:09 Exactly.

Priyanka Raghavan 00:32:10 Also, in terms of data observability, let’s talk about the other aspect, right? We’ve talked about data quality and a little bit about the metrics and the metadata. Let’s also talk more about the logs. In software observability, when you look at the logs, it’s about the interaction between two systems. In data observability, I was reading that it also captures the interaction between humans and the system, right? Can you tell us how that is?

Kevin Hu 00:32:40 Whether it’s a sales rep inputting the contract size of a deal, or a customer inputting their NPS score or interacting with your site, data comes from people when it doesn’t come from a machine, and there are humans that touch data all along the value chain, or the life cycle of data within a company: from the data collection, to the ETL system that was manually triggered, for example, to pull it into a data warehouse, to the data team writing transformation scripts, for example in dbt, to transform it from a raw table to a metric that is actually relevant to the end user. And then it’s also consumed by humans at the end, right? Whether it’s looking at a business intelligence tool like Looker or Tableau to see how these ultimately aggregated numbers change over time, or it’s sent back into Salesforce to help a sales rep make a decision, along every step of the process there is a human involved.

Kevin Hu 00:33:47 And the reason that’s important is to understand the impact. So, for example, if a table goes down for a day, does that matter if it’s not used by anyone? It doesn’t really matter. But if it’s being used by the CFO that day at the board meeting, you better bet it’s important that the table is up and fresh. The data itself doesn’t tell you this, right? You need aggregated log data to understand what the downstream impact is as well as what the root cause might be. I know I’m a broken record about downstream impact and upstream root cause, but that’s what it always comes back to. Just hearing about an incident, okay, that’s useful, but it’s the what’s-next that’s important. And for the root cause, let’s say that table is not fresh again.

Kevin Hu 00:34:34 What could it possibly be? Maybe a colleague on the data team merged a bad PR that broke an upstream table your current table depends on. Well, it’s important to know who merged that PR and what the context around that decision was. Maybe there was an invalid input in a source system, someone input a negative value for a sales number and it somehow violated some assumption along the way. It’s important to know what that was, too. Because ultimately, yes, you are trying to solve the issue at hand, but you also want to prevent it from happening in the future, and unless you have a real, diagnosed root cause, it’s difficult to do that. And because people are involved at every step of the way, you need that information.
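
[Editor’s note: a hedged sketch of how aggregated access logs answer the triage question raised here, “does anyone actually use this table?”, before treating an incident as top severity. The log format and table names are hypothetical.]

```python
from datetime import datetime, timedelta

# Hypothetical aggregated query/access log.
access_log = [
    {"table": "analytics.revenue", "user": "cfo_dashboard", "at": datetime(2022, 4, 8, 8, 55)},
    {"table": "analytics.revenue", "user": "ops_analyst",   "at": datetime(2022, 4, 7, 17, 3)},
    {"table": "analytics.scratch", "user": "data_engineer", "at": datetime(2022, 1, 2, 11, 0)},
]

def recent_consumers(table, log, window=timedelta(days=7), now=datetime(2022, 4, 8, 12, 0)):
    """Who relied on this table recently? An unused table going stale may not warrant a page."""
    return sorted({e["user"] for e in log if e["table"] == table and now - e["at"] <= window})

print(recent_consumers("analytics.revenue", access_log))  # ['cfo_dashboard', 'ops_analyst']
print(recent_consumers("analytics.scratch", access_log))  # [] -- lower-priority incident
```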

Priyanka Raghavan 00:35:19 So this ties into what you call the lineage of the data, as well as the relationships of the data, right?

Kevin Hu 00:35:26 Exactly. Let’s be super concrete now: say this is a table that ultimately describes the churn rate of your customers. There are so many dependencies of that table, starting with the immediate dependencies, like the number of renewals versus the number of churns over time. But then you go one level above that: what impacts the number of renewals? Well, it’s the number of customers that you have at all, and maybe some event or some classification about whether or not they’ve churned. But who determines what a customer is? Maybe that’s a combination of the data in Salesforce with the data that you have in your transactional database. Oh, but who determines what a customer is in Salesforce: is it someone that has already submitted a contract, or someone that has made a booking? Reality is surprisingly detailed, and I know there’s a Hacker News post from a few years ago saying that as you zoom in, there’s more and more to discover. That is as true in data as it is everywhere else.

Kevin Hu 00:36:26 There are assumptions; it’s turtles all the way down. Let me give you two worlds for a second, where you have that customer churn rate table. If it goes down and you don’t have lineage, what do you do? Well, what people do today is rely on their tribal knowledge: oh, I know this is the parent table and these are the assumptions that are in place, so let me check those out. Oh, but shoot, maybe I forgot something here. And I know that a colleague is working on this other upstream table; let me loop them in for a second. There’s a lot of guesswork, and it’s very time consuming. The Holy Grail is for you to have that whole map there for you, without having to maintain it yourself. Personally, I don’t think it’s possible to be 100% correct there, but oftentimes you don’t need to be 100% correct. You just need to be helpful. And that’s why lineage is important: it helps you answer those yes/no questions very, very quickly.
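
[Editor’s note: a minimal sketch of the lineage “map” described here, a dependency graph that can be walked downstream for impact and upstream for root-cause candidates. The tables and edges are made up for illustration.]

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: parent table -> child tables derived from it.
edges = {
    "raw.salesforce_accounts": ["staging.customers"],
    "raw.app_events":          ["staging.renewals"],
    "staging.customers":       ["analytics.customer_churn_rate"],
    "staging.renewals":        ["analytics.customer_churn_rate"],
    "analytics.customer_churn_rate": ["dashboard.executive_kpis"],
}

# Invert the graph so we can also walk upstream toward root causes.
parents = defaultdict(list)
for parent, children in edges.items():
    for child in children:
        parents[child].append(parent)

def walk(start, graph):
    """Breadth-first traversal: with `edges` it yields downstream impact,
    with `parents` it yields upstream root-cause candidates."""
    seen, queue = set(), deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen

print(walk("analytics.customer_churn_rate", parents))  # upstream: staging + raw sources
print(walk("staging.customers", edges))                # downstream: churn table, exec dashboard
```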

Priyanka Raghavan 00:37:27 Okay, that’s interesting. And I think it also makes it clearer to me why it’s important to find the root cause and the impact, the major things we’ve talked about at this juncture.

Kevin Hu 00:37:42 Put that on my tombstone along with my birthdate, because whatever year I die, that’s the impact.

Priyanka Raghavan 00:37:49 This is great. So let’s just move on to maybe some of the tooling around this data. So can’t you do all of this in Datadog?

Kevin Hu 00:37:58 You can, but it’d be hard. We use Datadog internally; I spend a lot of my day in Datadog and it is an amazing tool. But as software engineers, we know the importance of having the right integrations, the right abstractions, and the right workflows in place. You can stretch Datadog to do this: for instance, you’re monitoring the mean of a column in a table. But let’s say you want to monitor the freshness of every table in your database. That starts becoming a little bit tricky, right? And time consuming. You can do it, and I’m confident that the listeners of this podcast would be able to do it, but it’s much easier when a tool does that for you. And let’s say you want to understand the BI impact, right? Integrate with Looker or Tableau or Mode or Sigma to understand the lineage of this table downstream.

Kevin Hu 00:38:53 As far as I can tell, Datadog does not support those integrations. Maybe you can write a custom integration, and again, every listener here could do that, but do you really want to? Let someone take care of that for you. And lastly, the workflows: this process of identifying, triaging, and finally resolving data quality issues has a somewhat particular workflow, and it varies by team because, like we said, there are no playbooks. That’s something data observability tools also help with. So my answer is yes, you can do it, but personally, I don’t think you should want to do it.

Priyanka Raghavan 00:39:32 If I were to rephrase that question and ask what the key components are that a data engineer should look for when picking a data observability tool, what would you say?

Kevin Hu 00:39:43 Integrations is number one. If it doesn’t integrate with the tools that you have, don’t bother, right? It’s not worth your time. Thankfully, a lot of teams are centralizing on a common set of tools like Snowflake and Databricks, for example, but end-to-end coverage is really important here. So, if it doesn’t support what you care about, don’t bother. I also think the same about the types of tests you’re concerned with: no one knows your company’s data better than you do as a data engineer, and you know, from the last few times there were issues, what those issues were. If a tool that you are evaluating, or even considering building, doesn’t support the issues that have happened and that you think will happen, it’s probably not worth your time either. And the last thing is how much time, how much investment, is required from you.

Kevin Hu 00:40:41 And I mean that out of total respect: engineers have so much on their plates, right? Work might not even be the number one, two, or three thing on your to-do list. It might be: I need to pay my mortgage, I need to take care of my parents or take care of my kids, and then work is somewhere on that list. And the number one thing on the work list might be: shoot, I need to deliver this data to a stakeholder, I need to work on hiring. Very far down that list might be observability. So I think it’s very important for a tool to be as easy to implement and easy to maintain as possible, because vendors like me can go and shout about the importance of data observability all day, but ultimately it has to help your life.

Priyanka Raghavan 00:41:28 So the learning curve should be very easy, is what you’re saying; that’s also one of the big factors for picking a tool.

Kevin Hu 00:41:35 Learning curve, implementation, maintainability, extensibility, all of these are important.

Priyanka Raghavan 00:41:41 Let’s come to Metaplane. What does your tool do for data observability? Apart from what I’ve seen, can you tell us about these things, like the integrations? I’m guessing that’s something you concentrate on.

Kevin Hu 00:41:55 Yeah. Metaplane, we call it the Datadog for data, so to speak. It plugs into your databases, like Snowflake, and transactional databases, like Postgres; plugs into data transformation tools like dbt; plugs into downstream BI tools like Looker. And we blanket your database with tests and automatically create anomaly detection models that alert you when something might be going wrong, for example freshness or schema or volume changes. And then we give you the potential downstream impact and the potential upstream root causes.

Priyanka Raghavan 00:42:36 Does your tool also work as software as a service? Is that the same model?

Kevin Hu 00:42:43 It is the same model where teams generally implement Metaplane in less than 10 minutes. They provision the right roles and users and plug in their credentials and then we just start monitoring for them automatically. And after a certain training period, then we start sending alerts to the destinations that they care about.

Priyanka Raghavan 00:43:07 I have to ask you this question, not only for Metaplane but generally: with any data observability tool, you are collecting a lot of data. One of the things we’ve seen with software observability tools is that suddenly people say, please cut down on the data, there’s this huge cost, this big bill that has to be paid, so we have to reduce the logging. Is that something that you help with as well? Do these data observability tools also help you reduce your cost while still logging enough to know about the root cause and impact?

Kevin Hu 00:43:39 Well, we’ll keep saying this until the day we die: ultimately, we don’t think that data observability should cost more than your data, in the same way that data should probably not cost more than your AWS bill. As a result, we try to really minimize the amount of time that we spend querying your database, both to limit the overhead you incur by bringing on an observability tool and to make a pricing and packaging model that makes sense for teams, both in terms of the dollars you pay at the end of the month, which should be an order of magnitude less than Snowflake, and how it scales over time. Because we want users to create as many tests as possible; it catches more errors and gives more peace of mind, and we don’t want you thinking, oh shoot, I only want to create these four tests on these four important things, because if I create more than that, my costs start exploding. That’s not what we want at all. So, we try to make a model that makes sense there.

Priyanka Raghavan 00:44:42 Is that also something for the data observability space, where you give customers, or the tooling provides, some feedback on how to reduce cost? Is that something that will happen in the future?

Kevin Hu 00:44:53 You’re laying out our roadmap. We are working on that. It is a tricky problem, but something that we are actually rolling out in beta right now is analyzing the logs, the query logs, and analyzing the data that exists, and trying to suggest both tables that aren’t being used and could be deleted and tables that are being used frequently and could be refactored, but also identifying which queries are being run and which are the most expensive, and how you can change your warehouse parameters to optimize spend. There’s a lot of work for us to do in that direction, and we have all of the metadata we need to do it. We just have to present it in the right way.
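
[Editor’s note: a hedged, generic sketch of the kind of query-log analysis described here, flagging unused tables and cost hot spots. The log rows, column names, and credit figures are hypothetical and not tied to any specific warehouse’s system views or to Metaplane’s implementation.]

```python
import pandas as pd

# Hypothetical warehouse query log: which tables each query touched and what it cost.
query_log = pd.DataFrame([
    {"query_id": 1, "table": "analytics.revenue", "credits": 0.8, "ran_at": "2022-04-08"},
    {"query_id": 2, "table": "analytics.revenue", "credits": 6.5, "ran_at": "2022-04-08"},
    {"query_id": 3, "table": "analytics.scratch", "credits": 0.1, "ran_at": "2022-01-02"},
])
query_log["ran_at"] = pd.to_datetime(query_log["ran_at"])

all_tables = {"analytics.revenue", "analytics.scratch", "analytics.old_backup"}

# Candidates for deletion: tables never touched in the log window.
unused = all_tables - set(query_log["table"])
print(unused)  # {'analytics.old_backup'}

# Cost hot spots: tables ranked by total credits consumed by queries against them.
print(query_log.groupby("table")["credits"].sum().sort_values(ascending=False))
```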

Priyanka Raghavan 00:45:35 There’s this other job title, which has been around now for a few years but came up during this software observability boom phase, which is the DevOps engineer. Because if your data is not available, you get a call at midnight or whatever, PagerDuty and everything’s buzzing. I’m assuming it’s the same thing for data observability: a new set of jobs for people just doing this work?

Kevin Hu 00:46:04 There’s a new trend emerging, I guess, called DataOps, right? That is an exact one-to-one inspiration, or copy, of DevOps for the data world. There’s an open question of how big data teams can get within an organization, right? Will there be roughly as many people on the data team as there are on the software engineering teams? There are arguments for both yes and no. I think that if data teams generally don’t become the size of software teams, DataOps as a job might be taken on by existing roles like data engineers, analytics engineers, and the heads of data, of course. But at larger companies with sufficiently large data teams, we are seeing roles emerge that kind of play the role of DataOps, like data platform managers, data product leads, data quality engineers. This is emerging at the larger companies; I’ve yet to see it at smaller companies.

Priyanka Raghavan 00:47:05 Finally, if I were to ask you to summarize what is the biggest challenge you see in the data observability space and is there a magic bullet to solve it?

Kevin Hu 00:47:17 The biggest challenge is extending data quality beyond the data team. Ultimately data is produced outside of the data team and is consumed outside of the data team, and data teams themselves don’t produce any data, right? We call Snowflake the source of truth, while frankly it’s not the source of any truth, because Snowflake does not produce data. Being able to extend the visibility that observability tools bring to data teams to the non-data teams, I think, is a huge challenge, because it bumps into questions of data literacy. If I tell my CFO that the data is not fresh, do they know what that means? Or when a software engineer is making a change to an event name and I say, this is the downstream lineage, is that the right way to say it? So, I think that is an open question, but it’s ultimately where we have to go, because our goal here is trust, and the data needs to be trusted not only by the data team, but by literally everyone within an organization for it to be used.

Priyanka Raghavan 00:48:31 Interesting. So I’m hearing trust in the data, as well as maybe more learning of the key terminology so that everybody is speaking the same language, is what you’re saying.

Kevin Hu 00:48:44 Definitely meeting other people where they are, and trying not to bash them over the head with terms that only make sense in your own discipline. That’s a difficult problem, and it’s a human problem. No one tool can solve it; it can only make it a little bit easier.

Priyanka Raghavan 00:48:59 Yeah. This has been great chatting with you, Kevin. Is there a place where listeners can reach you? Is it on Twitter or is it on LinkedIn?

Kevin Hu 00:49:07 Yeah, I’m Kevin Z E N G H U, Kevin Zeng Hu, on Twitter and LinkedIn. You can also go to Metaplane.dev and try it out, or send me an email at kevin@metaplane.dev. I love talking about all things data observability, and I’d love to hear your feedback.

Priyanka Raghavan 00:49:24 Great. I’ll put this in the show notes and can’t thank you enough for coming on the show, Kevin. It’s been great having you.

Kevin Hu 00:49:31 Such a pleasure talking with you and thank you for the wonderful questions.

Priyanka Raghavan 00:49:35 This is Priyanka Raghavan for Software Engineering Radio. Thanks for listening. [End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
