SE Radio 429: Rob Skillington on High Cardinality Alerting and Monitoring

Rob Skillingon discusses the difficulty of scaling monitoring and alerting to high dimensional spaces, as are typically found in modern applications at scale. High cardinality versus high dimension spaces. The episode begins with a review of monitoring, metrics, metadata, and alerting. How do things change at scale? Filtering signals out of high-dimension data sets. When to aggregate versus when does aggregation hide the signal. Alerting on signals that occur along only a single or a small number of dimensions. The architecture of high dimension systems. Time series databases. Examples from Uber. The M3DB open source project.

Show Notes

Transcript

Transcript brought to you by IEEE Software

Robert Blumen 00:00:52 For software engineering radio. This is Robert Blumen. I’m joined today by Rob Skillington. Rob is the CTO and co-founder across here previously. He was a technical lead on the observability team at Uber for M three and probably as the creator of the open source and three platform. Problem. Welcome to software engineering radio.

Rob Skillington 00:01:20 Thank you for having me, Robert.

Robert Blumen 00:01:31 It’s a, it’s great to be here and great to get a chance to talk Rob today. We’re going to be talking about monitoring and alerting at scale, before we get into the at-scale scale part, I’d like to do a brief overview of monitoring and alerting, starting with a monitoring. What is monitoring anyway?

Rob Skillington 00:01:50 Yeah. Great question. You know, at one point I, uh, I refer to it as keeping computers running in a very generic term. Uh, but the, that more technical, you know, kind of term is really had just thinking about how software runs, making sure that it’s, that it’s up, that it’s healthy when you’re talking about a service and, you know, kind of all the parts of, of really measuring the health and performance of a system, whether that’s, you know, I, a product piece of software or an online service or an embedded piece of hardware, you know, monitoring really means the same thing everywhere.

Robert Blumen 00:02:21 It’s, it’s about being able to observe that, uh, and measure the, the health and performance of, of, of a system of question. How does alerting fit into that?

Rob Skillington 00:02:24 Yeah, so alerting is definitely, you know, been around for quite some time. Nagios is a open source, you know, uh, system that Hawks back to, uh, definitely more than a decade old, probably multiple decades old at this point. And you know, that that really is a, is a great piece of software for a monitoring online services and has been for quite some time, but that’s kind of really where we came from. And, you know, while it’s obviously super, still used on it and relevant, it’s still, I guess like the, the way that most things were done before and in an online sense, which is like, you run a bit of code that probes a system and that probe returns a result.

Rob Skillington 00:03:17 And then you kind of have this like check or fail view of the weld, sorry, pass or fail, rather in terms of whether a system is up or down. And then, you know, you kind of add more and more probes, so you can kind of pray more things. And then you have this kind of possible fail view of a lot of independent parts of a bigger system that helps, you know, you kind of make a decision on whether you think something is healthy or not. And that, that, that that’s, uh, you know, has been very popular for a very long time. Um, and then today, you know, we’re seeing a lot of time series based monitoring going on as well as looking at, you know, like what’s happening with the log output of obvious systems and kind of analyzing that data. But, you know, fundamentally it’s moving, I guess, away from pro based to really contain many other more complicated kind of signals coming out of these systems as well, such as time series based data and, and, uh, and log analytics.
Robert Blumen 00:04:16 I understood that probe would be, we’re going to check something and either maybe it doesn’t come back at all, or it was back with a bad result. That’s a fail, whereas time series, you could have a more sophisticated condition that looked at a series of points over a five minute window, and you might have a single bad data point in there, but that wouldn’t necessarily be a fail.

Rob Skillington 00:04:42 Right? Exactly. It lets you look at things more volumetrically. So, you know, if you were measuring the electricity output of an electrical component and that was speaking binary, maybe between zero and five volts, like you, it’s a lot more granular. You don’t just say zero one. Um, you know, you see exactly how, how many volts are coming out of that component and it, and the time series is yeah. Kind of like a graph of, of that signal

Robert Blumen 00:05:11 What happens when there’s a fail?

Rob Skillington 00:05:13 So, you know, when you’re talking about time series analysis, you have, there’s usually a lot of thresholds that are put in place. So it’s, it’s either you have some time series that you’re analyzing, perhaps it’s, I’m summing across, uh, different actually multiple compound time series to, you know, present a final time series that you would put a threshold on. But yeah, usually it’s about declaring that the fact that you want to make sure a certain time series either doesn’t go below a certain value or doesn’t go above another value. And then, you know, when it fails or fails that precondition, that you’re trying to hold the time series to then typically on call engineers, who would be on call for a system, uh, would get paged. So that’s either, you know, using something like some kind of service like PagerDuty, which, which allows you to set up a roster.

Rob Skillington 00:06:08 And then, you know, when an actual event happens, that requires investigation, you would kind of have your system tell PagerDuty about a PagerDuty. You would find the person on call for that roster and then kind of page their phone by giving them a call or, you know, ping their app or send them some kind of message to let them know that they need to acknowledge an incident, uh, needs to be looked at and, and, uh, resolved. But of course, you know, you can do that with, with Nagios probates checks as well, your profiles send a message to some system and have that system notify a human to go look at something.

Robert Blumen 00:06:42 The purpose of that then is to involve a human in the operations of the system to possibly fix something that won’t fix it.

Rob Skillington 00:06:51 Right. Yeah. And a lot of the time it’s, uh, you know, and this is actually a super interesting topic. We could talk at length, but it’s, you know, it’s really, um, this is where the, the human element part of this is fascinating because, you know, you really want to alert on things that actually matter. You could probably set up a million alerts, uh, if you, uh, you know, wanted your assistant to exactly a reacting and perform in some way, but really, you know, and it comes down to how’s your, your actual system behavior performing because that’s what you really care about. You don’t really care that it took an extra half a second sometimes to do something else. If that doesn’t really matter to your, to what your end system is actually doing as a unit of work. So it’s, uh, it’s a lot about kind of, uh, setting parameters on things you care about and then, um, involving a human, when that’s necessary, you know, you can use alerts as well, though for also auto remediation or automation, for instance, like you could set up a in fact,

I know a lot of auto scaling has done this way, right?

Rob Skillington 00:07:58 Like we set up an alert that when either CPU or some other threshold is exceeded, you fire a web hook, and then, uh, some system that proceeds that web hook will go and perhaps scale up your cluster, if it’s kind of within a certain, you know, you obviously need to put in some protection mechanisms there, you wouldn’t want to infinitely scale it up, but then, you know, you, it allows for automation to actually react to a systems to react, to, to change that that’s going on in the underlying system based on signals that your monitoring picks up on. So most of the time it’s evolving human sometimes. So it’s also automation, uh, kind of dealing with the output of one of those alerts or alarms monitoring.

Robert Blumen 00:08:42 We’re going to be talking more about concept of a metric. Tell us what that is.

Rob Skillington 00:08:48 Yeah. So great. That’s really fascinating question. I mean, um, in terms of what, you know, day to day we work with in the monitoring world and metric is, is really a type of event that you are tracking. I guess that’s pretty applicable to most other subcategories of computing as well. You know, it’s really, when I think about what makes up a metric in the monitoring world, it is a schemer, it’s a type of event that happens. And it’s the different dimensions that separate the types of, uh, events. So I’ve to give a concrete kind of example here, if we think about HTP requests, to me, the HTP requests, returning a response, that’s a metric, cause that’s the event that’s happening. You’re returning an, uh, an HTP request to an end user. And then, so that’s the event.

Robert Blumen 00:09:56 And then on that metric, you have a dimension such as what is the end point that was sort of, so what was the URL patent that the hand that your handler got called on? What was the status code?

Rob Skillington 00:10:11 So it was that in the 200 range, the 500 range or the 400 range, and then things like what is the service software that served that request? So all those are dimensions on an underlying metric, which is describing an event, but, you know, I will say that it’s not always events a lot of time as well. It can be volumetric signals. So CPU usage network through Manhattan, how many, how many megabytes, so megabits per second, you’re sending over every network and memory usage. All of those are very volumetric. They’re not really events per se. They’re kind of like a point in time data point that you measure

Robert Blumen 00:10:36 Was the metric, this collection of dimensions and values, or is there a single value that’s the metric? And then all these other dimensions are, let’s call it metadata

Rob Skillington 00:10:49 Or labels. Yeah. So the way that we really like to kind of structure it in terminology is, uh, thinking about, you know, the metric being, uh, kind of the, yeah, that type of thing you’re measuring, so I’ve bent or, or some sampling. Um, and then the, uh, combination of dimensions makes up a time series. So for instance, like going back to that HTP request, uh, example we have, for instance, I single time series that is part of an HTP request. Response is an HTP request response that was served a 200 status code on the slash health, uh, URL. So that’s, that’s one time series and other times series would be, um, you know, uh, the same, same metric name. And then the dimensions would be 500. So internal server error being returned from the, uh, the path slash health. So times series are the underlying different events, streams

Rob Skillington 00:11:54 That are identified by their unique dimension values. And then events is really the category series being classified as meaning one thing that you’re measuring or one of one class of events or, or component that you’re measuring, if that makes sense.

Robert Blumen 00:12:11 Say, we apologize to the listeners, Rob has just told me it started hailing where he is, and he’s moved into another location further away from the window. So she’d be better for the rest of the podcast back to where we were, if you have these metrics and they have dimensions, which have names, how do you define in a computer system, the alert on top of these metric events?

Rob Skillington 00:12:43 Yeah. So that’s a great question. Basically, you can think of when you’re kind of building an alert, you know, a lot of the time you want to see what the historical failure or pass and fail rate is going to look like. So, and you can actually imagine with, you know, when we were talking about those pro based events, that’s a bit harder to do this because you don’t have any signals coming out of the system. You’re literally writing code that may pass or fail based on what happens with the probe. Um, whereas with times here is data, you know, and, and looking at metrics, when you create an alert, you’re fundamentally adding a threshold to basically, uh, capture whether, you know, a, an alarm would go off or not based on signals you already have. Um, then you can see in a graph. So typically what you do in, you know, you author an alert is, uh, you go and look at, uh, all the different signals that are coming out of your system, uh, in a graph, maybe you apply some functions to it.

Rob Skillington 00:13:51 So, you know, maybe there’s some spikes that you want to kind of smooth out because you don’t mind if, uh, you know, CPU jumps to 90% just for a short burst. So you might want to do something like average over time, or like a moving average. And then, you know, so you can actually apply transforms on top of these time series as well. And then finally, you know, you kind of put in your threshold for where you may want to actually alert on that transformed, uh, signal that’s coming from your system. So typically what you do, you know, when you hear you’re writing these alerts, as you start by graphing and visualizing the different metrics and underlying time series that you’re kind of want to alert off of, and then you come up with a graph that makes sense to you logically on, on the type of thing you’re trying to do.

Rob Skillington 00:14:45 So you may, for instance, it’s very easy to say, I don’t want any internal server errors to be returned by otherwise, tell me about it. Um, and so, you know, very simple, I alert would just be take the number of times you return a internal server era for a given set of HTP end points. And, uh, whenever that exceeds, you know, zero or some reasonable threshold, if you know, that’s something that you need to do given the behavior of your system and then set that threshold. And then finally, depending on the monitoring system you’re doing, you’re using, you may go to a conflict file from if you’re using premier atheists, which is a popular, open source monitoring tool and save that rule to the, to the config file and then either redeploy it, your system, or, or cause a reload to occur by calling an end point or, you know, infusing a vendor, you’ll probably go to a UI or use some command line utility to, to kind of update and alert definition. So that’s typically the life cycle we see when you’re kind of all through these alerts.

Robert Blumen 00:15:50 Want to wrap up our general discussion about monitoring and alerting and talk more about our main point, which is at the scale, what does it mean to have monitoring and alerting at scale are the challenges that start to emerge. I could see if you had a small number of alerts you possible for one person to know everything about it, but what starts to break when you get into larger systems?

Rob Skillington 00:16:20 Yeah, good question. So the main thing, you know, at ad scale that tends to happen is your system gets more complex, you know, whether that’s, uh, your, your product getting more complex or, uh, fundamentally you’re scaling out the team and, uh, you know, the team to support some product line needs to refactor the code in such a way that that is more complex, but typically, you know, one of the two are happening and, uh, you have, uh, just a lot more code pause now to track. So in regards to kind of monitoring that, you need to think about all the different things, um, that can go wrong. I would start with really though, just defining what your SLA is, are, you know, what are your service level agreements? W what is the things that when you break a contract, uh, you will get a, uh, customer or user, or, or your manager upset?

Rob Skillington 00:17:24 What, what will upset them essentially? Uh, what do those contracts look like? And then essentially putting alerts on those, and then, you know, in a more complex system, uh, what tends to happen is that, you know, you have the layer that interacts with the end user, but then you have layers below as well. So you’ll have like a data layer and that data layer may have a contract with other developers inside the company that it should return reasonably within 50 milliseconds or something like that. I record about a user or a record about some, some object. Um, and so you really think of it like laying down your monitoring in terms of layering on the layers of an onion. And so, uh, you know, at the most core layer, you, you do want some alerts as well and add the medium. And then, you know, at the, the most important are the SLS and the contracts that you have that you actually have at the very edge of your system.

Rob Skillington 00:18:19 But you want to know ahead of time when, uh, things closer to the lower level, start to misbehave, because that can cause a cascading failure. And that kind of impacts all of your, um, your product or, you know, or system at the, at the higher layers. So, uh, typically what tends to happen when you scale out is you need to cover more ACP, end points and heat to cover more, uh, serve versions that you’re tracking are in production. You probably have more either instant VM instances or machines or containers running your software. So you have many more points of failure. And, you know, you want to basically make sure that when you do get an alert that goes off, you can drill down exactly to which one of those points was a failure. And so fundamentally, you’re just adding dimensions to a lot of these metrics, whether it’s because you have low complex product or because your infrastructure is more complex.

Rob Skillington 00:19:16 And then, uh, it really comes down to, you know, how do you deal with that level of complexity and scale of the, the monitoring data? And do you do that in a way that’s sane or does it become unmanageable? And it’s just a mess of alerts and no one really can understand what’s going on because, you know, you’re, you’re not doing it in a clean, consistent manner across your two systems. So that’s some of the things to really think about and some of the, um, common things that become problems very quickly. And, uh, you know, if your tools are working against you and you, you can’t just kind of collect more monitoring data and you can’t, you know, if that’s problematic or, or costing effective because it’s too expensive, um, or perhaps, you know, the there’s just frictions there. That is, that is hard to use.

Rob Skillington 00:20:04 Maybe your query language for creating these time series is super obtuse and not well-documented, which is why we love open source software in this space. You know, I’m, I’m an open metrics contributor, which is a standard here in this space. And, you know, I think that, uh, you really want your tools to work with you, and you want there to be some knowledge and nomenclature. That’s common to what you’re doing amongst the team. And you know, that, that, depending on the tool you’re using, whether that’s some very niche specific vendor tool, or some very common popular, open source tool, or, or a mix of the two, you know, you really want to think about that as well as you start to scale up your monitoring needs of points there.

Robert Blumen 00:20:48 Some of the main takeaways for me is as the system scales up, you simply have more metrics because there are more things to monitor on and more things to alert on. And then you have a lot more data which is gonna create load on the monitoring tools. And if you have a database in your monitoring system, you mentioned that this manifest has more dimensions. And what I think you’re talking about is let’s say that all your services are HTTP, and if the service which originated the metric, if that is a dimension, then that manifests as more values in the given dimension. Did I get that right?

Rob Skillington 00:21:33 Right. Yeah. And, and especially if you’re, you know, playing in a Coobernetti’s world or some kind of container system, it’s not even just a service, it’s a service. And then the underlying, uh, process ID. Cause of course, you know, if you run 200 instances on a single service on really small containers, when something goes wrong, if it’s hyperlocalized to just one of them, you kind of need that dimension of which container was running it, uh, to be able to, to localize the problem and kind of, uh, quickly go from an alert to debugging your application.

Robert Blumen 00:22:10 We’re veering into this topic, which I wanted to cover the idea in monitoring of high cardinality and high dimension monitoring, which is something you, you told me that you wanted to talk about. I’ll let you talk about that now. What, what do you mean by that and how does that relate to scale?

Rob Skillington 00:22:31 Yeah. Great question. Um, and thank you for, yeah. Let me kinda kind of bring it up a little bit to what I was kind of talking about previously with, with what happens, you know, as you start to scale up or a complex system or a engineering team that’s working on, on a system that, that starts to accrue a lot of complexity in it, what typically tends to happen is it starts to look like a bowl of spaghetti from the outside. Um, if you don’t, if you can’t really see what’s going on and, uh, you know, a lot of people start with like reading the code, uh, right. Uh, but at a certain level, and, and even with three or four engineers or developers, you know, you, you can reach a point where that’s just inefficient essentially. And honestly, you know, when you’re encountering a failure in a system, the last thing you want to do is go read the code.

Rob Skillington 00:23:26 You want to mitigate the problem for us. Right. And, uh, logs is typically, you know, be very useful here, but with the, you know, kind of change, which is happening out there, that’s seeing a lot more things run on a firmo environments in cloud. And you know, more people are using lots and lots of containers instead of a few VMs. And people are also using distributed systems more frequently, where they would have had just one monolithic system, microservices are becoming more popular. I think you have to be really careful with how far you go in that direction. And I have stories on that, but it, a technical, you know, really what, I’m the kind of trend that I’m talking about here though, is that, uh, essentially you really need some direction at a higher level than just thinking, like, I have a problem. I need to go look at the logs or read the code.

Rob Skillington 00:24:25 Like that’s kind of Verde, it’s just takes so long to perform that analysis that, you know, when you’re trying to mitigate a problem or understand even just how a larger system is working at a higher level, you really need a lot of signals coming out of that system. That, that makes sense to you and makes sense to others without having to go and deep dive on a specific set of logs or, or try to understand how a code flow, you know, a code part is working based on the code. So high dimensionality metrics is really about giving you those signals that at higher and higher levels of granularity, so that when you do trigger an alert, it’s not just telling you that, you know, you’ve got a spike in error, right? It’s telling you, you’ve got a spike in error rate. There’s a specific HCP end point that’s exhibiting the problem.

Rob Skillington 00:25:16 And only happens with the new version of the code you rolled out. And only with clients that is, you know, if we talk about, um, an example where you’re using a mobile application and serving traffic to a mobile application, it’s only, you know, you’re only seeing an error, right error rate with the new code, with a mobile client of a specific version or operating system. So all those extra caveats there helps you get to your root cause much faster, the having those extra dimensions and those extra signals that gives you depth into the problem or how, you know, this, the system is just behaving in general is super helpful. And that, and that’s, that’s really what we, when you keep adding those dimensions, that’s what causes high cardinality cause high cardinality is really talking about the number of unique times series. He’s not talking about the metrics. You might still be tracking the same type of things, but the number of underlying time series explodes because you want to slice and dice on a high number of dimensions.

Robert Blumen 00:26:56 If I averaged everything together, it might look fine. But what you’re saying is if I was able to see this given metric of a latency and a group, it based on user agent, the end point it’s hitting, whether it’s a mobile or fixed code version, and let’s add three more things, and you might see that out of that big high dimension space, that one of them is behaving very badly in that set of users is having a terrible experience, which I would not see if I averaged it all together.

Rob Skillington 00:27:33 Right. Exactly. And then, you know, I think a concrete example here is, you know, at Uber, we would have these type of things where a certain telecom providers would go dark with the routing to our backend. And sometimes it was the routing to our backend and not all internet, all internet websites. So it’s like, you know, if you’re on 18 T and you’re in some part of the U S um, that’s experiencing a routing issue to a network server that’s hosting Uber, um, you may be impacted, and that’s a terrible user experience. Like if we don’t want, you know, especially at, at the company, we wouldn’t want specific, both uses of the system to experience a hyper outage like that for potentially days at a time. We need to know about that, you know, and, and when you just have user reports from the wild, it’s really hard to correlate that.

Rob Skillington 00:28:29 So even just by adding a dimension of what telecom provider, it looks like an IP address was coming from, as well as like, uh, you know, certain things around where the mobile client was pinging in from that helped a lot to try and get to a root cause much quicker and understand a root cause. And then, you know, when you think about like Netflix, that they talk about this problem as well, where they’re, they know they can know, they can tell that a specific iOS version in Brazil, watching a certain title is not able to play back a MP four file that they’ve encoded slightly differently to all the rest. So that level of diagnostics is kind of required in a, in this wall that’s growing like more complex and, and also just honestly makes for a lot easier to actually see what’s happening yet in that when you were talking about averaging everything out, like if you just looked at how many playback failures happened for that one TV show on Netflix, you might not see the fact that this one specific iOS version in Brazil is experiencing a hyperlocal problem. It might just look like a less than one or 2% who cares. And, and, um, but that’s really actually an underlying problem that can be fixed if you know what the root causes, and you can kind of drill down quickly. I think

Robert Blumen 00:29:53 Now you haven’t said this, but I am inferring this from what you said, where the cardinality is a number of different values you can have on a metric. And then the dimension is the size of the overall space. When you multiply all these card analogies together, did I extract that correctly from what you said?

Rob Skillington 00:30:13 They probably think about it just as metric made up of many times series influenced by the dimensions. So like multiplying the unique dimension values together gives you number of times series for a metric. And then the cardinality is really just the number of unique time series. That’s really the same thing. So it’s, it’s, uh, that’s how I like to think about it.

Robert Blumen 00:30:38 Okay. And if we’re going to be talking about high cardinality, could you give us some numbers or orders of magnitude in a organization like an Uber or a Netflix? What are we looking at here for how many unique time series, how many distinct values are there? I could guess for countries you could have dozens or a hundred give us some concrete numbers. Yeah,

Rob Skillington 00:31:01 Yeah, yeah, definitely. Yeah. I’d love to do that. Uh, yeah. If we look at even like, go back to our ICP example, you know, we have, um, five status codes, perhaps like HTP 200, 400, 500, 300, and maybe even, or none of those. And then, uh, so those are categories. They’re not even status codes. You might have like 20 sentence codes, but let’s just talk about five for now. Cause that’s probably five that are important to you. Um, and then say you’re monitoring a hundred end points and then say your monitor, like maybe your product, uh, or, or system behaves differently depending on which country the user is in. So that’s maybe say you operate in a hundred different countries, and then you think about like serve a version. So you may have probably 10, it depends on your system, but maybe you’ve got 10 different versions of your, of your server version rolled out depending on, you know, a Canary or a staging environment or, or a different region that you’re operating.

Rob Skillington 00:32:02 Maybe you serving traffic out of different cloud regions. So, so that’s kind of like status code five by end points, a hundred say, and then by another hundred for the cities, um, and then say you run in six regions. So that’s, and I’ve always relied on the calculator, sir, don’t mind me while I multiply live the number here, you know, so we’re talking, that’s already a very large number. We’re talking about 300 thousands of time series there. So that’s six regions, five different status codes, a hundred end points. And, Oh, sorry, we’re talking about 30,000. I put in 10, a hundred for different server, but 10 server versions is much more realistic. So that’s 30,000 times series. You can imagine if you have, if you want to monitor on anything else, like a user client agent, like whether it’s Chrome or Safari or whether it’s iOS or Android on mobile, that’s, that’s going to multiply that 30,000 again.

Rob Skillington 00:32:58 So you very quickly get to these hundreds of thousands, millions. Now, granted that is a pretty high cardinality metric, but you kind of, you know, in, in these, uh, worlds, we, we exist in, especially when you add container label to that, you might operate over 200 containers and those container containers might rotate as well. So if you’re looking back over time, you know, you’ve got, if you deployed your 200 containers twice, now there’s 400 different containers that you could possibly be looking at the results from over, uh, over a time window that you look at. So yeah, these numbers multiply very quickly, even for small companies, Netflix, you know, at monitor Rommer, uh, which is a monitoring conference held in Portland, Oregon. You know, they, they mentioned that their system had 3 billion unique time series in it. And that was kind of a level of commonality.

Rob Skillington 00:33:49 They were working with Uber. There were, there were tons of experimentation going on. So we had tons of dimensions, you know, search algorithms, fluctuated, dynamically, and not kind of dependent on which region you’re in. So we actually had 11 billion times series before, you know, I left the company to start Chronosphere and it’s an extreme challenge. I think when you’re, when you read like the software developers become addicted to this stuff, as well, like at Uber, we had more than half of the engineering team daily daily visit with unique sessions. Uh, our, our monitoring tool am three and interacting with the graphs or setting up alerts or checking in on an experiment than where they were running. So, you know, our internal Google analytics showed it was probably the most popular tool and most visited next to our deployment system at the company.

Robert Blumen 00:34:41 I can see how having this high dimensionality wouldn’t enable you to surface problems that might be obscure if you average things together, but do you have the reverse problem? If something was affecting all user agents in six countries across five code versions, you don’t want to get, I can’t do that number in my head, but you don’t want to get 600 alerts when there is one problem. Do you have a way of grouping together so that you can collapse dimensions when it make sense to do that?

Rob Skillington 00:35:15 Most definitely that that is a common problem that, you know, you’ve identified there that is, is hard to get, right? I will say like, it depends on the tool you’re using, but a lot of tools can do this really wrong and you will get sick. You will get your phone blowing up. So the typical way that, you know, Prometheus’s, which is a popular, open monitoring and learning tool does this is, is, is really, um, you know, a scalable longterm storage and backend for, for permittees these metrics. But the typical way, you know, Prometheus’s wild, you would do. This is, uh, you set up your notification mechanism, the end points you want to be notified on. So one would be like PagerDuty. And then you essentially say that you want to grip things by always by alert name. So you author an alert and you, unless you, if you don’t group by an alert name, then you want to get signal yes or no weather alerts going off.

Rob Skillington 00:36:12 So first we want our group by alert names. So you get an alert for each type of alert and then, and then you basically make sure that, you know, 600 cities are being affected or a hundred end points. Uh, you’re only going to get one alert for that. And so it’ll grip everything else will get automatically grouped by, but then essentially you will want to retain that underlying cardinality data on the metrics so that when you see the alert, it can show you all the different combinations of, of the dimensions that are currently exceeding the threshold. So you set and, uh, that helps weather the storm very quickly, but there’s, there’s plenty of tools out there that don’t do that. And you will see an explosion like you’re talking about. And then, you know, you could go even further. If you do want to get that broken down, that alert broken down by different regions. Uh, you can do that as well and wrap the different regions to different people. So maybe there are different people on call for different things based on one of the dimensions of, of the metric that you’re monitoring,

Robert Blumen 00:37:14 Where I want to get to in the time we have left is Rob. You’ve done a great job so far at setting up the problem and what I want to get to in the time we have left a little bit about how you solve it in the monitoring platform. So let’s briefly just spend a few minutes outlining either you could pick Prometheus’s or a generic monitoring tool that combines features of the common products say, what is a monitoring tool look like? And then we’re going to ask, how do you solve this problem of having a 6 million alerts without spending a, um, excessive amount of cost on it? Let’s talk about what is a monitoring system architecture.

Rob Skillington 00:38:01 Yeah, that’s a great question. And yeah, I think, you know, the general conception of, of how to do this, uh, is, is in flux. Like we we’re living in the software development world. Um, it’s not like civil engineering is not a, uh, you know, I, I really well oiled set of processes to do this. I will say there’s obviously a, you know, a lot of well-defined ways to do it, but it’s, it’s, it is changing more than in an industry like civil engineering, where you build something in a given way. And, you know, that build that bridge is going to stay up with monitoring. It is, you know, one of the more popular ways of doing it, especially in our console today is using something like promethium, which will, you know, collect signals from your applications. You develop as well as a it’ll collect signals from the machines that your applications run on.

Rob Skillington 00:38:59 So to get CPU, memory, and network signals, uh, from it, and those, you know, so, and honestly, most generic monitoring systems will do this as well. This is how they do it kind of differ. And then essentially, you know, once you have that stream of information going from your application, as well as your infrastructure to a single place, that’s when you run into the problem of, okay, well now how do I actually, uh, add capacity to the signals that I’m collecting? And just before we dive into that, you know, that one of the first things you want to solve though, is how do you lower the friction for you to add a signal to the system that you set up? So you want to use client libraries, uh, that are popular, that are used by many people and, you know, have some, some kind of backing that, that had maturity to them.

Rob Skillington 00:39:51 Um, so the Prometheans client library, for instance, works in every language under the sun permeate. This also has this concept of exporters. So if you want to monitor a system that you don’t own, there’s likely and export or for that. So, you know, there’s things like in engine X and port, uh, exporter, uh, I read this export or on my sequel export or things like that. So you want to work out all your touch points and how to get signals into the system, lower that friction as much as possible, whether you’re using a vendor or yet something like Prometheus’s in open source. And then once the old kind of being collected in one place, you need to be able to use a system that can scale out. And so, you know, depending on what type of organization, uh, you you’re currently, you know, working for, or what kind of system you’re developing, you’ll you’ll have different needs, right?

Rob Skillington 00:40:44 So if you’re, if you never expanded from, if your system never gets complex enough to exceed, you know, a, a million or a million times series also, or, or even fewer, or even not, you know, a little bit more than that one single promethium server is going to be absolutely fine for you. But we, you know, we just did that mass on that ICP example, we go to 30,000 very quickly. So you could imagine actually exceeding a few million pretty quickly. Um, and, you know, Prometheus has certain ways where you can continue to use promethium, even once you’ve exceeded a single service capacity. You, you kind of separate out what you monitoring, they go to separate Prometheus’s instances. Um, and then when you’re looking at graphs, you, you kind of like select which promethium server you want to get your results back from, but then, you know, you may just want to centralize all of this.

Rob Skillington 00:41:34 Cause that’s kind of talking about separating out the, the, the metrics to be collected by different things, and then kind of glowing systems on top to, to look at them, um, and switching between them. But there’s a lot of popular things, you know, opensource that help you actually, uh, send, uh, metrics to a central place either most, most of the time it’s through Prometheus’s, if you’re in the premiere these ecosystem. So you kind of deploy these for Mesias service and separate what they’re going to, uh, monitor. And then they all send DiUS will send data to a central longterm storage or a central monitoring backend. Um, and that’s where, you know, him three, which is an open source project that, that I, the open source part of it, I kind of worked on at Uber and, and that’s been, uh, you know, available in open source is 2018.

Rob Skillington 00:42:24 That’s, that’s, uh, one of those central backends. Uh, there’s also the Venus, which is another open source project in this space. And you know, that one, that famous project will back up its metrics to two S3 or some other kind of objects store in the cloud, whereas like three is more or an actual database. So, you know, it’ll look like a database and give you access to, to metrics that it has on disc. So, you know, depending on your needs, you may lean to one centralized backend versus another. And, uh, you know, I really think of these as guiding functions, right? Like when you’re getting Sonata, Prometheus’s works great out of the, out of the box. Um, you know, when you’re eclipsing one permit the server, you can still extend it a little bit further with just a few little tricks. Um, and then when you go up even higher than that, and lots of you’ve got developers that are, uh, that are kind of scaling and you don’t want to any one team to have to manage her and monitoring infrastructure, you would look at something like Fain arse, if, if you’re, you know, one something that to get you started, um, and just kind of historically back up the metrics and have access to them, maybe at higher latency and all that, but, but still there with it, but with having the metrics in an object store, or you look at something like M three, if you want super high reliability by replicating the monitoring data across availability zones in a cloud region.

Rob Skillington 00:43:47 So, you know, , that was, that was really important to us. Uh, an hour of gap of monitoring could have meant we, you know, that could have been such a brand impacting event that we might as well have packed up our bags and going home. Right. So, you know, it really depends about like what kind of playing field you’re currently playing in and, and what makes sense to you, depending on, on, on what you’re doing to which tool you reach for and what vendor you reach for some vendors are more expensive and easy to use. Others are cost effective and focused on scale. And then others, some are, have proprietary lock-in languages. Others will be more promethium than open, friendly, um, and use industry standards. So that’s kind of the things that go on in my head when I’m talking, when you think about how to like, really think about this and dissect what your needs are, depending on what you’re doing.

Robert Blumen 00:44:43 All right. So can we cap that you have something which is in compiled in, or very close to the stuff you’re monitoring that can pull the metrics out, and then you need to transmit them somewhere to a central back and a new store, at least a certain amount of data in some kind of persistent store that keeps values either as long as you want, or for as long as you can afford. Cause it could be quite a lot of data. That’d be the main architectural highlights.

Rob Skillington 00:45:14 Yeah. I think that’s, that’s really what you need to start off with is what’s going to be frictionless for, to use and yeah. And which central backend do you want to, to, to kind of store it in and what are your requirements and kind of, uh, you know, budget as well for, for that

Robert Blumen 00:45:33 Want to get into the programming or engineering problem of, we talked earlier about you define alerts, alert is some code which operates on a time series, and then it produces a value which might be pass fail, or it might be a value compared to a threshold. And then that’s, pass-fail that determines whether the alert fires or not. Then we were talking about this cardinality where you’ve got a big high dimension space in that if any one single cell in this big high dimension matrix is going into a fail state, you need to generate an alert. So now from a soccer injury standpoint, how, how do you code that up and how much memory and computation does it take to have 30,000 time series that you’re simultaneously evaluating against this stream of 30,000 different incoming metric streams?

Rob Skillington 00:46:31 Yeah, I mean, that’s a great question. It’s it can vary a lot different, different monitoring workloads, even with the same level of cardinality can be very different depending on like the time series functions you’re applying on them. So for instance, you know, if you want to do a time series query that sums all internal server errors up by, by city, that resulting, uh, matrix output is only the amount of cities that you run into. That’s a, you know, uh, a lower number of time series, but if you want to then sum them up, not in group by not just by city, but you want to then group them up by, um, iOS or iOS or Android version or something else, then that’s a even higher resulting matrix app would. So, you know, just to, to lay down some, some basic numbers to get an idea, if you’re kind of collecting 10,000 times series per which to be fair, if you’re already going to actually take a, a point in time sample every 30 seconds, you’re probably even with like a cardinality of 30,000, you’re not, it’s not going to be 10,000 per second, right.

Rob Skillington 00:47:41 It’s going to be a thousand per second. So if you’re talking about 10,000 per second, various open source backends here, depending on what you’re using, we’ll, we’ll use 10 to 20 gigabytes of memory, perhaps, and depending on how many queries you throw at it. So how many, you know, time series queries you throwing at it, it could probably easily handle like maybe 50 to 100 alerts. Uh, and you’re talking about maybe four or five cores. So you probably want like an iCore machine for that, but that’s, that’s really, you know, talking about it from a promethium point of view is, you know, similar levels of scalability and resource usage. These are highly advanced systems, you know, back in the day, 10,000 per second, would that type of hardware not possible. Um, you know, and there’s a lot that’s gone into that. There’s there was Tia said that Tia said compression algorithm came out of Facebook that allowed for 11 X compression on flood data.

Rob Skillington 00:48:42 And, you know, that’s talking about turning a 16 bytes timestamp value of two, a, a byte, um, numbers into roughly it averages out using this algorithm to 1.4 bytes per data point. So that compression ratio allows you to just do a way more memory. And, you know, if you’re frequently evaluating this and they’re being pulled into memory a lot. So reducing, you know, that memory usage is only possible with algorithms like Tia said. And then you’ve also just got, um, much more sophisticated storage engines now, which, um, you know, as smart and optimize for the right thing, you know, you don’t need acid transactions and monitoring, uh, you, you, uh, you just have high volume. And so like kind of fully understanding that and refining and refining the system to, to kind of hyper optimize for that type of data structure. And that type of collection, um, has enabled us to, to really be able to do, you know, to kind of like get to those sorts of numbers. And, you know, a single M three database node can do hundreds of thousands and, you know, sometimes millions of data points per second, but we’re talking about pretty large machines. Um, w when we’re talking about those kinds of numbers, so, you know, to hundreds of gigabytes of Ram and 16 to 32 cores, that kind of thing,

Robert Blumen 00:49:57 I think there’s a problem with naturally inclined toward some degree of sharding in that if you were say monitoring different regions, you might even have different groups of people, the response in North America, Europe, and Asia, and you could have entirely different monitoring backends, even though they’re monitoring different segments of the same metrics,

Rob Skillington 00:50:20 Right? Yeah, no, it definitely, it’s a, it’s, it’s definitely a thing that leans very it’s natively can be partitioned. And, uh, know, I think that distributed systems in it’s very key is, is about kind of partitioning and how do you, you know, map and reduce to query this type of data and, and how do you even fairly distribute the workload? Like you don’t want some servers working much harder than others, because then you’ve got a micromanagement problem, you know, and then three under the, under the hood, actually we’ll do a, you know, use a fairly good hash function to work out whether the combination dimensions you’re storing on which server it should be stored on. And that helps us actually get, you know, like when, when we talk about server that a cluster, that’s doing 300,000 riots, if you have three nodes, it’ll very evenly be doing about a hundred thousand each per per second. Um, so, you know, there’s a lot that goes into making sure that these systems can scale out horizontally in a way that’s sane without you having to micromanage the storage aspects to it. Premier atheist does tend to be single load. As I kind of mentioned, it’s a bit of a more decentralized storage model, but it is, uh, you can partition yourself. And it is fascinating though, how, how, how to operate these things.

Robert Blumen 00:51:37 You mentioned the open source M three, which you’re involved in, and I covered that in your biography without saying exactly what it is. Give us a proper introduction to M three.

Rob Skillington 00:51:51 Yeah. Uh, no, I can definitely speak more to that. So, so M three is really a, a metrics platform and it’s core, and, you know, it’s primarily a Prometheus’s longterm storage backend. And what I mean by that is, so when we talked about, you know, what do you choose as your, your monitoring backend? Um, these days, it’s not just talking about a single monitoring backend. It may, you may use different tools in your monitoring pipeline, uh, to pop your data one or two or in places, but, you know, with Prometheus’s and then three, that they kind of plugged together nicely, uh, to give you basically a, a single scale-out backend for the underlying data. So you can point all your alerts and your dashboards two and three, but use premiere atheists, you know, to kind of do the frontline, uh, collection of metrics before sending them to a central place.

Rob Skillington 00:52:49 And that, that kind of, so in three is really adding a layer to either an existing metric system, or if you’re using stats to your graphite, which Anthony also supports it, it can ingest that type of metrics data directly. And so entry is really a, what we’d like to think of as a central monitoring, but database, as well as, um, aggregator. So entry actually contains unlike many other monitoring systems, um, and an optional layer of aggregation you can put and perform on the signals that come into the system. And this was extremely, um, helpful at Uber. You know, some of these super high cardinality use cases we’re talking about, we actually didn’t care about the container, uh, sometimes, um, the container labels. So the app, the single container, you know, if we’re actually monitoring on telecom provider, we don’t really care that the telecom provider, the telecom provider will be down for all the containers.

Rob Skillington 00:53:45 And, um, if the telecom is experiencing an outage, so we would actually transform these metrics as I came in to the metrics platform in that central backend, that in three is. And so when we think about M three, it’s really a time series database, that’s horizontally scalable at the bottom. It’s optionally an aggregation, a streaming metrics aggregator, which can do trans so long metrics that are coming in on the way in if you want to, which will help with even, you know, a high level of scale, you know, is really for advanced users. And then it’s the, uh, the query service. Uh, of course that sits on top that allows you to apply all these transforms. And you can kind of actually run all of these in one single process, as you get started with them three, and then separate the rolls out if you want to isolate the ingestion of metrics.

Rob Skillington 00:54:31 And the querying of metrics are that, you know, once huge query doesn’t interrupt your incoming steady stream of metrics, um, things like that. So it’s, it’s really a scale-out, um, multi-component, uh, metrics platform that you could not have stopped getting, you know, getting started with just a single node, but then spin, spin out the different components separately. And, and it was that the center and the core of monitoring Uber, and it’s also the center and the core of the, uh, the monitoring and observability hosted service that we provide a Kronos sphere. And of time before we wrap up, would you like to share with listeners where they can find you or point them to any resources on the internet? Yeah, of course. Um, yeah, we’d love, uh, if anyone’s interested more on this topic, I share my thoughts frequently on it, uh, on my Twitter handle that’s twitter.com forward slash roskilly. So R O S K I L L I P in the show notes. Brilliant. And then just visit, uh, if you’re interested in, in three at all, you can visit, um, and three db.io, and, you know, the Chronosphere, which is a hosted platform on, in three, is that a corner of your.io? And you can find more about rubber there. Thank you for speaking to software engineering radio spend Robert it’s been fantastic. Robert, thank you so much.

[End of Audio]

This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

SE Radio 429: Rob Skillington on High Cardinality Alerting and Monitoring

Join the discussion

More from this show

SE Radio 724: Jure Leskovec on Relational Graph and Foundational Models

SE Radio 723: Dave Airlie on Linux Kernel Maintenance

SE Radio 722: Dwayne McDaniel on the Engineering Challenges of Secrets Management

Menu

Recent posts

Search

Search

SE Radio 429: Rob Skillington on High Cardinality Alerting and Monitoring

Show Notes

Related Links

Transcript

Join the discussion

More from this show

SE Radio 724: Jure Leskovec on Relational Graph and Foundational Models

SE Radio 723: Dave Airlie on Linux Kernel Maintenance

SE Radio 722: Dwayne McDaniel on the Engineering Challenges of Secrets Management

Menu

Recent posts