Episode 548: Alex Hidalgo on Implementing Service-Level Objectives

Published January 25, 2023

Alex Hidalgo, principal reliability advocate at Nobl9 and author of Implementing Service Level Objectives, joins SE Radio’s Robert Blumen for a discussion of service-level objectives (SLOs) and error budgets. The conversation covers the meaning of a service level; service levels and product ownership; the pervasive nature of imperfection; and why trying to be perfect is not cost-effective. They examine service-level indicators (SLIs) and SLOs and how to define each effectively. Hidalgo clarifies differences between SLOs and service-level agreements (SLAs), as well as whether traditional metrics such as CPU and memory are good SLOs. The episode examines how to define error budgets and policies to influence engineering work, how to tell if your project is under or over budget, and how to respond to being over budget, as well as how to derive value from using up excess error budget.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Robert Blumen 00:00:17 For Software Engineering Radio, this is Robert Blumen. Today I have with me Alex Hidalgo. Alex is a site reliability advocate at Nobl9. Prior to his current role, he was director of SRE at Nobl9 and has spent time at Squarespace and Google. Alex is the author of the book Implementing Service Level Objectives, A Practical Guide to SLIs, SLOs, and Error Budgets, published in 2020. And that will be the subject of our conversation today. Alex, welcome to Software Engineering Radio.

Alex Hidalgo 00:00:55 Thanks so much for having me. I’m excited to be here.

Robert Blumen 00:00:57 Alex, do you have anything else to say about your biography that I didn’t already cover?

Alex Hidalgo 00:01:03 One thing I do like to always talk about is the fact that I spent most of my twenties not in the technology industry. I didn’t join Google until I was 28, and I spent most of my twenties working in the service industry front of house and back of house in restaurants. So, server, line cook, bartender, I worked in warehouses, I worked at a furniture company. And the reason I like bringing that up is because, as we’ll get into, service level objectives are all about providing a certain level of service for people. And that’s exactly what you do in all those other industries. And I think that’s one of the reasons the whole approach really kind of stuck with me. And one of the reasons I got so excited about it is because it really spoke to all my experience before I moved into tech.

Robert Blumen 00:01:45 Cool. Well, we will be talking about service-level objectives. Before we dive into that, I want to frame this discussion. If an organization is thinking of adopting the approach that’s outlined in your book, so what problem are they trying to solve when they’re doing that?

Alex Hidalgo 00:02:04 So service-level objectives, at their absolute most basic, are the acceptance that failure occurs, right? You are never going to be 100% reliable, you’re never going to hit 100% of any kind of target. Something at some point in time is going to break; something at some point in time is going to change. And service-level objectives at their most basic are just saying, okay, we understand this. So instead of trying to aim for perfection, let us try to aim for the right amount, right? Pick a reasonable target. SLOs are basically a codified version of ‘don’t let great be the enemy of the good.’ Because if you are attempting to hit 100% of anything, whether that be what I define reliability as or easier things to think about, like error rates and availability for your computer services, if you’re trying to be 100% perfect there, you’re just not going to hit it.

Alex Hidalgo 00:02:53 And if you try to, you’re going to spend way too much, both in your humans who will get burnt out as well as literally finances, right? The amount of money you have to spend to make systems redundant enough and highly available enough to even attempt to hit something like a 100%, it’s just going to cost you too much money. It’s going to cost you too much stress, you’re going to burn your employees out. So, use an SLO-based approach to help you think about what should we really be aiming for? What do our users actually need from us, and how can we keep them happy, the business happy, and our employees happy?

Robert Blumen 00:03:26 If an organization is thinking about adopting the approach outlined in your book, how are they probably doing this now that maybe is not working, to where they need to look at a different way of doing it?

Alex Hidalgo 00:03:38 So, very often there is a push from the top to be as good as possible, and I don’t think there’s anything wrong with potentially striving for excellence, right? SLO-based approaches are not about being lazy, they’re not about like losing sight of trying to be the best you can be, but without explicitly setting targets, without explicitly saying something like, we want to be reliable. Or let me give you like an example, right? You run a retail website of some sort, and users log in, and they add items to a shopping cart, and they are able to check out. And sometimes that’s not going to work. One of those steps is going to fail, right? Maybe a user can’t log in, maybe the shopping cart microservice is flaky and they can’t get that working, right. Or sometimes just like you check out and the vendor you rely upon for your credit card processing is having a problem.

Alex Hidalgo 00:04:33 And at some point in time that’s going to fail. And that’s totally fine. Humans are actually cool with that as long as you don’t fail too often, right? So, what you can do is you can use SLOs to say something like, all right, let’s aim to have 99.9% of all of our checkouts work. So only one in a thousand users will encounter some kind of error. Especially with the understanding the user can then generally just retry and it’ll very often work the second time around. It’s about being realistic about what’s actually possible while also realizing that humans are actually okay with some amount of failure. They can absorb a certain amount of failure. And let that happen instead of spending too much time and burning your employees out by trying to be too good.

Robert Blumen 00:05:15 If I could summarize this then, the approach is about having a realistic and also rigorous discussion about what is the level of service that you can and will provide to your users, keeping in mind the constraints of cost and people’s time and energy.

Alex Hidalgo 00:05:36 Yes, absolutely. It’s about being realistic. It’s about aiming for what you actually need to provide. No one actually needs you to be perfect all the time, right? Like think about visiting a random website. It could be any website, a news website, ESPN to check the sports. It could be Google, it could be whatever it is. Sometimes it doesn’t load, and sometimes that’s because your internet provider’s bad or your wireless connection got flaky. But sometimes it’s because that’s actually on those services, right? And humans are fine with that, right? Like, literally imagine you just had that happen to you. You would just click refresh and as long as it loads again, or as long as it loads in two or three minutes, right? Like, maybe you sometimes have to take a break, you’re like, okay, cool, this website isn’t working right now. As long as you come back in a few minutes and it is working again, then you’re fine with that. You’re not going to abandon that website, you’re not going to abandon that service. So, figure out exactly how much failure your users, your customers, can actually absorb, and aim to be at about that level, or a little bit better I guess. But definitely don’t try to avoid every single failure because then you’re just going to burn yourself out.

Robert Blumen 00:06:42 I’d like to go into a bit more detail about how organizations decide what is that right level, but let’s first get some of the vocabulary down so we can have a more detailed conversation about it. In your book, you talk about the reliability stack with several levels. Let’s go through those levels. The first one being service level indicator, also SLI. What is that?

Alex Hidalgo 00:07:10 So, the absolute basis of all this is that you need to have a measurement that tells you something about what your users are experiencing. And I’d like to take a quick tangent. I’m going to say user a lot. And when I say user, I don’t necessarily mean a human. I don’t necessarily mean a customer. I mean anything that relies on your service, right? That could be another service, it could be a team down the hall from you, it could be a vendor, right? It’s just easier to pick a single term and just say user over and over and over again. But an SLI is a metric, a bit of telemetry that tells you whether or not your users are having a good experience, right? At some level, an SLI has to be able to at some point be split into good or bad, right? At some level you have to decide this measurement is telling us things are okay, or this measurement is telling us things are not okay.

Robert Blumen 00:08:03 Give me an example of an SLI that you used in a product or a project.

Alex Hidalgo 00:08:08 Sure. Very basic SLIs can just be things like error rates and availability levels and latency, right? You want your API response to return within 750 milliseconds, or whatever it might be. But a good example of one I actually set up that I think is a little bit more advanced and very interesting is when I was at Squarespace, I was on the team responsible for our entire Elasticsearch ELK stack, right? So Elasticsearch, Logstash, Kibana. And eventually we got to the point where we were able to write synthetic logs with a certain ID in them and send them through Fluentd into Kafka, which we used as an intermediary. They were then picked off of Kafka by Logstash and indexed into Elasticsearch. And then we were able to query Kibana to see whether or not that log arrived and how long it took.

Alex Hidalgo 00:08:55 And that’s a complicated setup. But by the same token, all we really had to do was insert a log on one side and retrieve it from the other. And then we had this latency measurement that told us how long it took on average for a log message to traverse the entire pipeline. And additionally, if the log message never showed up, we also had an availability measurement. Now, we needed many other measurements at every component along that path in order to tell us exactly where the failure occurred. But that’s a good SLI because it’s telling the user journey. One of the things I always like to talk about when trying to explain what a good SLI is, is that your business likely already has a bunch of them defined. It’s just that they’re in a product manager’s document titled ‘user journeys’ or they are on the business side what they refer to as KPIs or it’s what your QA and testing teams refer to as transactional tests, right? We often already have a good idea of what we need to be measuring for our complex multi-component services. And really, the closer you can get to the user experience, to the user journey, that’s the best SLI that you can possibly produce. Now, I do want to say it’s totally fine if you’re starting your journey and all you’re measuring is latency of a single API endpoint, error rate of a single API endpoint. There’s nothing wrong with that. But you can progress over time and capture more components with individual measurements.

Robert Blumen 00:10:22 Most systems, when you set them up, they give you immediate access to some very detailed metrics like CPU, memory, load average. Are those good SLIs?

Alex Hidalgo 00:10:33 I think those can be important things to ensure that you’re collecting because you can use that data to help you figure out whether or not you had a regression in your code or some other problem in your infrastructure. But an SLI necessarily is supposed to tell you about how things look from the outside, and your CPU can be pegged to a 100% for days, weeks, months of the year. Yet, the actual output that your service is providing to people might be timely, it might be correct. And so, it’s not to say that you shouldn’t measure something like CPU utilization and it shouldn’t… And I don’t mean to say that if you are pegged at a 100% for days, weeks, months at a time that maybe that doesn’t require some kind of investigation. But that’s not an SLI; that’s a different bit of telemetry.

Alex Hidalgo 00:11:23 An SLI says are you operating within the performance constraints that your users require from you? And you can be doing that even if you’re using more memory than you thought; you can be doing that if your pods are OOMing, right? As long as enough other pods in your Kubernetes setup, right? Like however you’re running, it’s actually maybe okay if you’re crash looping every once in a while, as long as the user experience is fine, right? So again, not saying you shouldn’t investigate those things at some point in time, but that’s not what an SLI is. An SLI captures a user experience.

Robert Blumen 00:11:58 Okay, I want to move on to the next level of the reliability stack, the SLO, service-level objective. Tell us about that.

Alex Hidalgo 00:12:08 SLOs are actually way more easy to understand than SLIs, right? Even though we refer to this as like doing SLOs quote-unquote, right? Really the SLIs are the most important part of the whole process. Because if you’re not measuring the right things, the rest of it doesn’t matter. So, as I said earlier, an SLI at some level has to be able to be quantified into good or bad, right? This measurement we took at this moment in time, or this actual measurement of an actual user experience if you have good end-to-end tracing, either was good or it was bad. And you can use good and then total; that’s what a percentage is, right? Like you have a subset of your total, in this case good. And then you take that over your total and you have a percentage. Now, an SLO is simply, and I try to refer to them as SLO targets to kind of differentiate from the overarching term we use to talk about the whole process, the whole reliability stack, all that. Your SLO target is the target percentage for how often you do want to be good.

Alex Hidalgo 00:13:11 So, once you’re able to split your SLI into good and bad and therefore you’re able to calculate good over total, you can say something like, I want 99% of all of my requests to complete within X amount of time. And then you can use that to figure out whether or not you’re meeting your SLO.
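
The good-over-total arithmetic described here can be sketched in a few lines of Python. This is a minimal illustration, not code from the book or the episode: the function names and the request counts are assumptions chosen for the example.

```python
def sli_percentage(good_events: int, total_events: int) -> float:
    """Return good events as a percentage of all events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def meets_slo(good_events: int, total_events: int, target_percent: float) -> bool:
    """True when the measured good-over-total percentage is at or above the target."""
    return sli_percentage(good_events, total_events) >= target_percent


# Hypothetical counts: 9,950 of 10,000 requests completed within the latency threshold.
measured = sli_percentage(9_950, 10_000)
print(f"measured {measured:.2f}%, meets a 99% target: {meets_slo(9_950, 10_000, 99.0)}")
```

The only decision the code needs from the SLI is whether each event counts as good; everything after that is a ratio compared against the SLO target.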

Robert Blumen 00:13:28 Are SLOs always a percentage?

Alex Hidalgo 00:13:30 Generally speaking, yes. An SLO is almost necessarily a percentage because you have to at some point figure out how often you want to be correct. I guess you could say this as four out of five, right? I guess you could use some different language and if that works for you and that works for the tooling or the culture you have, like that works. But, four out of five is still 80% right? So, I think in order to adopt an SLO-based approach, at some level you do have to kind of acknowledge that you’re aiming for some kind of target percentage.

Robert Blumen 00:14:00 If we pick as an example latency of how long it takes to add a product to the shopping cart, then would you do a percentage of, say, the 95th percentile latency is 120 milliseconds and we wanted it to be 100, or do you say 95% of the time the latency is less than 100 milliseconds and you do it based on how frequently you are exceeding the threshold? How do you translate something like a latency into a percentage to make it an SLO?

Alex Hidalgo 00:14:38 I think a lot of that depends on what your telemetry looks like, right? Like a lot of latency measurements, for example, by default in Prometheus, if that’s what you’re using, you’re going to end up with a histogram bucket, right? And so, it’s very easy to pull out the 99th or the 95th percentile, and perhaps that’s your starting point. But there’s not a ton of difference mathematically between aiming for 95% at 120 milliseconds or less versus saying, for the 95th percentile, we want to be 120 milliseconds or less a very high percentage of the time. A lot of it just has to do with understanding what your numbers look like, and how you can interact with them, and how your measurement systems are able to interact with them. But this is a great point to bring up that percentiles of percentiles can be misleading.

Alex Hidalgo 00:15:28 So, people will have been very used to graphing percentiles because they want to ignore the outliers, but SLOs already give you that. So, there’s nothing necessarily wrong with saying, we want the 95th percentile of our shopping cart additions to complete within 120 milliseconds, right? Maybe that gives you a strong signal that does in fact help you understand what your users are currently experiencing. But if possible, sending your raw data, or your P100 data, is I think a better and clearer way to adopt an SLO based approach because you’re already kind of handling or you’re able to handle, if you pick the right target, that kind of long tail that you’re generally trying to ignore by using percentiles in the first place. So, it’s not a wrong approach, but I do encourage people to remember: you’re basically applying a percentage twice, which may hide some outliers that actually are important.
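
A small Python sketch can show what "applying a percentage twice" looks like in practice. Everything here is an illustrative assumption: the latency samples are invented, the ten windows stand in for per-minute aggregation, and the nearest-rank percentile is just one common definition, not necessarily what a given monitoring system computes.

```python
import math


def fraction_within(values, threshold):
    """Share of values at or under the threshold (a good-over-total ratio)."""
    good = sum(1 for v in values if v <= threshold)
    return good / len(values)


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 10 windows of 100 requests each; 4 requests per window are very slow.
windows = [[80] * 90 + [110] * 6 + [400, 900, 2500, 6000] for _ in range(10)]

# SLI built on pre-aggregated data: one p95 value per window, compared with 120 ms.
p95_per_window = [nearest_rank_percentile(w, 95) for w in windows]
good_windows = fraction_within(p95_per_window, 120)

# SLI built on raw data: every request latency compared with 120 ms.
raw_latencies = [latency for w in windows for latency in w]
good_requests = fraction_within(raw_latencies, 120)

print(f"windows with p95 <= 120 ms: {good_windows:.0%}")
print(f"requests <= 120 ms:         {good_requests:.0%}")
```

With these invented numbers, every window's p95 looks healthy, while the raw data shows that 4% of requests missed the threshold. Whether that 4% matters depends on the target you pick, which is the point of letting the SLO target, rather than the percentile, absorb the long tail.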

Robert Blumen 00:16:22 Let’s move on to the third layer of the stack: error budgets. Let’s start with the definition.

Alex Hidalgo 00:16:29 Sure. So, an error budget is basically in a way the inverse of your SLO target, right? So, we’ll again stick with a very simple number. Let’s say you’re aiming for something to be good for your users 99% of the time. What you’re also kind of implicitly saying there is that we are okay with 1% of failure, and that is what your error budget is, right? Your error budget says everything is still okay overall as long as we haven’t had a bad experience at least 1% of the time. And so, your error budget is a way for you to understand in a better way how you’ve operated over time, right? So, an SLO you might be able to say, how do we look right now? How do you look right now? But an error budget is generally defined over a window, very often a fairly lengthy window, right?

Alex Hidalgo 00:17:16 Something like 28 days or 30 days, or I’ve seen a lot of teams like to do 14 days to match their sprint length, but also I’ve seen error budgets all the way as large as like a quarter or a full year even. And what that idea gives you is you can now say okay, we’re aiming to be 99% reliable, right? In whatever way we’ve defined that in our SLI, but how reliable have we been over the last 30 days? And now you can say something like, okay, we’ve been 99.5% reliable over the last 30 days; we’re doing okay. Or you can say, oh, we’ve only been 98% reliable over the last 30 days and our SLO target is 99. That means we’ve burnt through our budget, right? Because that 1% is your budget. And then you can use that data to have a discussion, right? That’s really how I like it best. You can use error budgets for amazing advanced alerting techniques and all sorts of things I really think are much superior to your basic threshold monitoring that most people do. But really, the absolute base is that error budget status, right? How much of your error budget have you burned gives you a signal to figure out do we need to take action right now? Right? How reliable have we been? What does that mean and does that mean we need to change course?
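
The two scenarios in this answer (99.5% and 98% measured against a 99% target) can be worked through with a short Python sketch. The function name and the event counts are assumptions for illustration; the window is simply whatever period the counts were collected over, for example the last 30 days.

```python
def error_budget_status(target_percent: float, good: int, total: int) -> tuple[float, float]:
    """Return (measured percent good, fraction of the error budget consumed) for one window."""
    if total <= 0:
        raise ValueError("total must be positive")
    measured = 100.0 * good / total
    # The budget is the number of bad events the target tolerates in this window.
    allowed_bad = total * (100.0 - target_percent) / 100.0
    if allowed_bad == 0:
        raise ValueError("a 100% target leaves no error budget")
    consumed = (total - good) / allowed_bad
    return measured, consumed


# Hypothetical 30-day window with one million events and a 99% target.
for good in (995_000, 980_000):
    measured, consumed = error_budget_status(99.0, good, 1_000_000)
    state = "within budget" if consumed <= 1.0 else "budget exhausted"
    print(f"measured {measured:.1f}%: {consumed:.0%} of budget used, {state}")
```

At 99.5% measured, half of the 1% budget is spent; at 98%, the budget has been spent twice over. The output is a status to discuss, not an automatic verdict.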

Robert Blumen 00:18:29 Alex, there’s a thing you did in the book that I found quite useful. I think we all have a good idea of what numbers like 99%, 99.9% mean, but you translate that into a certain number of minutes or hours per month. I don’t know if you have those numbers embedded in your memory, but I bet you do. For these different numbers of nines, what does that translate into minutes or hours of downtime in a month or a week?

Alex Hidalgo 00:18:58 You’re going to challenge me to make sure I get this right but, 99.9% is 43 minutes I believe, and the real point is that it adds up very quickly, right? Like people want to be four nines reliable, which means 99.99%, right? And that translates to mere minutes. You want to be 99.999%, the holy grail of five nines, that’s four minutes and 32 seconds a year. So now you translate that to what an on-call shift looks like, right? Like, you translate that and that can be seconds. No human can possibly pick up their pager, especially in the middle of the night, and respond to that and fix those problems, you know. So yeah, I like to translate them into time, not necessarily saying that a time-based approach is superior to just pure numbers or pure occurrences, right? But it’s a good way to show people.

Alex Hidalgo 00:19:52 In my experience, leadership often thinks you can attain many more nines than you actually can. Here’s what that would look like from some kind of availability standpoint. Here’s what that would look like in terms of downtime per year. And when you present the numbers in that way it can often be eye-opening for people to realize, yeah, okay, never mind; this doesn’t make sense. We can’t be five nines, we can’t even be four nines. The redundancy required, the robustness required, the on-call response required, right? Again, let’s never forget about that part, the human element of our sociotechnical systems. It’s a great way to translate things so that people really understand that when they’re asking for 99.99% or even simply 99.9%, that they understand what that actually implies.
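
The translation from nines to allowed downtime is simple arithmetic, and a short Python sketch makes it easy to check. The figures in the conversation were quoted from memory, so treat them as approximate. The sketch assumes a 30-day month and a 365-day year and treats the whole budget as full downtime; the function names are hypothetical.

```python
from decimal import Decimal

SECONDS_PER_DAY = 86_400


def allowed_downtime_seconds(target_percent: str, window_days: int) -> int:
    """Seconds of full downtime a target tolerates over the window, rounded to the nearest second."""
    bad_fraction = (Decimal(100) - Decimal(target_percent)) / Decimal(100)
    return round(bad_fraction * window_days * SECONDS_PER_DAY)


def as_h_m_s(seconds: int) -> str:
    """Format a number of seconds as hours, minutes, and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


for target in ("99", "99.9", "99.99", "99.999"):
    per_month = allowed_downtime_seconds(target, 30)   # assumes a 30-day month
    per_year = allowed_downtime_seconds(target, 365)   # assumes a 365-day year
    print(f"{target:>7}%  month: {as_h_m_s(per_month)}  year: {as_h_m_s(per_year)}")
```

Under these assumptions, 99.9% allows about 43 minutes per month, 99.99% allows a little over 4 minutes per month, and 99.999% allows a little over 5 minutes per year. Targets are passed as strings so that `Decimal` keeps the arithmetic exact.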

Robert Blumen 00:20:40 I have been on call where the company’s policy was outside of business hours, if you get paged, you have 20 minutes, you’re supposed to be online and looking at it within 20 minutes. If you really need to minimize your downtime to less than 43 minutes in a month, then you have to start looking at having people in different time zones around the world who are in the office and at work 24 by seven so you don’t spend that 20 minutes getting somebody out of bed and getting them awake.

Alex Hidalgo 00:21:12 Yeah, exactly. Like if you have a 20-minute response time, which I think is for many services actually pretty reasonable, right? We want to keep our humans healthy. Then you can’t hit 99.9%, which as you pointed out is about 40 minutes a month, right? So, you burnt half your budget just on the allowed response time. So yeah, exactly. Then you got to have a follow-the-sun rotation, you got to have at least two if not three different engineers located all over the world. So now this means, I mean a little bit different in the post-pandemic world, the work from home world, but before that, that means that you need offices in many different countries, and the complexity and the finances involved with even just hitting 99.9% is frankly sometimes absurd, right? Unless you want to have ridiculous, ridiculous response-time requirements.

Alex Hidalgo 00:22:02 But yeah, that’s another great way of kind of looking at these numbers, right? When you think about, yeah, let’s stick with 99.9% equals about 40 minutes per month. Once you also then add the humans into that. Not just what can your computers give your users, but if something’s actually broken, what does that mean for the humans that need to go fix things? It can get absurd very quickly. And one of my big things is that I really try to help convince people you don’t have to be as reliable as you think you do, right? Chances are the users of your services are actually okay with more failure than you think, and find that right target. This is slightly tangential but, like, some of the best SLOs I’ve seen have been very carefully measured over months, if not years, and involve lots of customer feedback and have been set at things like 97.2%, right? Because just via actual study that was the right target. And just using tons of nines — I always like to tell people SLO targets don’t have to have just the number nine; there’s nine other numbers you can use.

Robert Blumen 00:23:04 There’s one other term you hear a lot in this space, which is SLA, which stands for service level agreement. How is that different than an SLO?

Alex Hidalgo 00:23:15 So SLAs have been around for a very long time. I’ve traced their usage back to telcos in the 60s, banks in the 50s even. I found a U.N. document from 1948, so right after the U.N. was even formed, that used the term. And service level agreement is, well, exactly that. It is a promise to someone, generally in a contract, that we will perform in a certain manner a certain amount of the time. And eventually this got adopted by all sorts of computer services and computer, like, service providers. And then in the early 2000s, HP started to adopt the concept of an SLO, right? And what they were trying to do is they were trying to say, okay, we have this SLA, a service level agreement; this is something written into a contract. If we don’t meet this, we owe someone something.

Alex Hidalgo 00:24:03 Either we owe them a credit or we owe them actual money, right? But you exceed, you break your SLA, and that means you’ve broken something in a contract with another entity. An SLO is similar in terms of you measuring your performance against a target, but they were invented to be almost like an early warning system, right? So, you have an SLA, let’s move into the future now, right? We are a modern vendor, we are a B2B SaaS company, something like that, right? And you’ve written into your contract that you will be available 99.5% of the time, and this is written into the contract mostly for lawyers. It’s mostly there, right? And no one actually cares about the money, they don’t actually care about the credit you’ll get, right? That’s not what SLAs exist for even if their language is, here’s some stuff you’ll get in case we don’t perform the way we’re promising. They’re really there for lawyers so lawyers can say okay, we’re breaking our contract now, right? That’s why they really exist. So SLOs are similar to SLAs in the terms that again they measure your performance against a target of some sort. But I don’t love talking about SLAs because I feel like it’s really a different world. SLOs are operational, they’re tactical, and they’re decision-making tools. SLAs are for contracts and so that your customers can get out of the contract if they need to. That’s frankly what they actually exist for in most 2022 applications.

Robert Blumen 00:25:31 If I could pinpoint what I think is distinct about your approach versus what a lot of companies are already doing is the DevOps people will continue to get alerted on infrastructure metrics like CPU or memory because it’s not like those things are no longer important. And as you pointed out, the product managers are tracking these SLIs and they have them in their own spreadsheets or documents. What you’re talking about is the migration of these metrics or concepts that are important to product into the visibility and actual tracking of engineering. Now did I get that right, or is that a correct understanding of what your approach is?

Alex Hidalgo 00:26:19 I think it’s partially correct. I don’t think there’s anything incorrect about what you said, but I do also think that those operational first-level responders can also use SLOs to make their life better, right? They don’t have to get paged on CPU utilization anymore because they can instead get paged: the user experience is bad. Now you may still want to open a ticket in case your CPU utilization is too high for too long because it could still be indicative of something being broken, but you probably shouldn’t be waking someone up at 3:00 AM for high memory if the user experience is still fine, right? If all your customers are still having a great experience or at least a “good enough” experience is what I should really say, don’t page someone. So yeah, again, go investigate those kind of infrastructure metrics if they are telling you something.

Alex Hidalgo 00:27:10 But you can probably do that during working hours if your customers and your users are still doing okay. So yeah, I think part of the approach is to think at the project manager, the product manager level in terms of are we capturing the user experience well? What are the user journeys? And again I want to say users here should include internal users not just paying customers. So, I think that’s a big part of the approach but I do think the infrastructure, the platform-level first-line responders can also use an SLO based approach to ensure they’re not getting paged too often. They can investigate that high CPU at their convenience if everything else is still operating correctly.
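
The page-versus-ticket distinction described here can be written down as a tiny routing rule. This is only a sketch of the idea under simple assumptions: the function name, the two boolean inputs, and the three outcomes are hypothetical and do not correspond to any particular alerting tool.

```python
def route_signal(user_experience_within_slo: bool, infra_metric_elevated: bool) -> str:
    """Decide how urgently to react, putting the user-facing SLO ahead of infrastructure metrics."""
    if not user_experience_within_slo:
        # Users are affected: this is worth waking someone up for.
        return "page"
    if infra_metric_elevated:
        # Users are fine, but high CPU or memory may still deserve a look in working hours.
        return "ticket"
    return "no action"


print(route_signal(user_experience_within_slo=True, infra_metric_elevated=True))
```

In this example the CPU is high but users are within the SLO, so the result is a ticket rather than a page.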

Robert Blumen 00:27:50 Would it be better to say then that you are trying to aim for a shared understanding between product and engineering about what the business goals of the system are and get everybody aligned behind achieving those business goals?

Alex Hidalgo 00:28:04 That’s a big part of it, yes. SLOs, we can talk about how they give you better alerting and all that kind of stuff. But really what they are, they’re a communication tool. They’re better data to help you have better conversations and therefore hopefully make better decisions, right? Like, I’ve repeated that line, I don’t know hundreds of times by now. And that’s what they really, really give you. And because they allow you to have better conversations, that means it’s not just better conversations within your team, that means it’s better conversations across teams, across orgs, across business functionalities, right? It gives you a better way of saying here is what we need to be doing as a business and how can we achieve those goals.

Robert Blumen 00:28:48 Could you give an example of what might have been a worse conversation and then what would the better conversation look like when they had a good SLO in place?

Alex Hidalgo 00:28:59 Yeah, like here’s a real-life story I’ve seen: there was a web application, right? Like, a user-facing internet web app, and it was a fairly simple setup, right? Basically, traffic came in, it was load balanced across a few different kind of web app-y front end situations, and these had to talk to a database. And this database was throwing errors way too often, right? We’re talking about, like 10 to 15%, right? So only 85 to 90% of responses from the database came back correct. And there was no quick way to fix this because this was like an on-prem vendor binary, right? There wasn’t a development team to jump into the code of the actual database to fix it. And so, in the meantime some of the web app engineers had implemented very good retry logic. So, it turns out that, from the user experience it didn’t matter that 10 to 15% of all requests to the database turned out to be errors, but the database management team did not understand this, right?

Alex Hidalgo 00:30:02 So, they thought oh my god everything’s on fire and they set up an on-call rotation that was two 12-hour shifts a day because they were only homed in a single geographic location, and they were burning themselves out trying to do anything they could to keep this thing up and minor configuration tweaks and giving it more memory and giving it more CPU and all that. And unbeknownst to them it wasn’t actually that big of a problem. It needed to be solved one day and everyone knew that, right? Everyone knew that they needed to like upgrade versions and I think get some new hardware. I wasn’t actually on the team, I was adjacent to this team, but no one realized that actually the user journey, right? The people using the web app that needed calls to the database to succeed, that was totally fine. If they had proper SLOs set up that were not just measured but discoverable and used for communication, right? Whether or not it’s your weekly sync or your monthly OpEx review or just simply having a strong culture of SLOs so you can go look at how things are actually performing. That database team wouldn’t have stressed themselves out as much and would’ve realized we can wait for the new hardware to show up. We can wait to install the new version, right? We can wait to do the upgrade. We don’t have to be so worried because, for the users, it’s fine because a web app team solved the problem.
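
A rough calculation shows why retry logic can hide a 10 to 15% database error rate from users. The conversation does not say how many retries the web app performed or whether failures were independent of each other, so both are assumptions in this Python sketch, as is the function name.

```python
def user_visible_failure_rate(per_attempt_error_rate: float, max_attempts: int) -> float:
    """Probability that every attempt fails, assuming attempts fail independently."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return per_attempt_error_rate ** max_attempts


# Hypothetical policy: up to 3 attempts per database call.
for rate in (0.10, 0.15):
    visible = user_visible_failure_rate(rate, 3)
    print(f"per-attempt error rate {rate:.0%} -> user-visible failure rate {visible:.3%}")
```

Under those assumptions, a 15% per-attempt error rate becomes roughly a 0.3% user-visible failure rate. An SLI measured at the database and an SLI measured at the user journey would tell very different stories about the same system. Real failures are often correlated, so the true benefit of retries may be smaller than this estimate.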

Robert Blumen 00:31:18 This story makes me think of another point that you emphasize in your book, which is that these metrics and error budgets help the organization drive how it uses its resources. In this story you told, you had a lot of finite resources going into people either working very long hours or being up late at night trying to fix an issue that had no business value to the company, and yet that time and energy could have been used to, let’s say, develop a new product or add new features. And so, they weren’t making a good decision about how to divide up their labor between ops and stability versus new products and features.

Alex Hidalgo 00:32:02 Yeah, I don’t always love that it was formulated this way in the first SRE book because it was only formulated in this way. But the original kind of definition of how Google-style SLOs were exposed to the world was basically: if you have error budget, ship features; if you don’t, stop shipping and focus on reliability. I think it’s a bit limiting. We can get into all that if you’d like. That’s potentially a very long conversation, but it’s not wrong, right? It is a good way of having better data to balance what are you working on, what should we work on next, right? What do we put into our next sprint? Do we need to assign several additional people on top of our on-call in order to ensure we’re handling our operational tasks best or paying down some tech debt or whatever it might be. We can go into so many different paths here of how you can use this data, but yeah, at their absolute base it’s: work on project work if you have error budget remaining, stop working on project work and go fix things if you’ve run out.

Robert Blumen 00:33:03 Let’s come back to that in a bit. But first I want to talk about how do you decide if you are or are not over your error budget? Is it you’ve got the 43 minutes and if you use up 42 minutes, you’re good, or is it a little more complicated than that?

Alex Hidalgo 00:33:18 It’s a little more complicated than that because at the root of the SLO philosophy is that nothing’s ever perfect, and that means that your measurements and your SLOs and the targets you’ve chosen, they’re not going to be perfect either, right? Maybe you picked the wrong percentage, or maybe your SLI is not actually telling you what’s going on or perhaps you had a true black swan event, right? Maybe you want to reset your error budget, right? If something happened to completely deplete you, but it was because, every once in a while we have one of those major internet backbone outages because — what, like the L3 outage from a few years ago, there was a bad regex that destroyed a whole bunch of BGP tables, right? Like, maybe you don’t want to actually count that against your error budget even if it burned it.

Alex Hidalgo 00:34:04 So, like another example is that same ELK stack I was talking about earlier that I was responsible for at Squarespace, at one point in time we burnt through all of our error budget and we knew we couldn’t actually fix things until we got new hardware. This is similar to the database story, and this was right after the pandemic started, right? So, shipping had just stopped, right? Like, the supply chain just dried up, everything was a mess. And so, hardware that we ordered like March or April, something like that was suddenly not showing up until like August. And we knew we could do very little to raise that particular error budget we had. And so, we could have changed our target to something very low or, there could have been other approaches, but we chose to just ignore that one.

Alex Hidalgo 00:34:49 We’re like, yep, we’re at like 70% and that’s it and we’re not recovering, and that’s fine. We just ignored that one until we got the new hardware and we were able to fix the problems. So yeah, again, you don’t have to be hard-line about it. I don’t think it’s necessarily a bad idea to have an error budget policy, some kind of document that says maybe do this in case you run out of budget, but I don’t know, it’s my favorite term the last few years: It depends, right? It’s better data. Look at the data, have a conversation, figure out whether or not you actually have to take action. Don’t ever be hard-line about anything. I think be meaningful in your decisions, right? Think about what the data’s actually telling you, how does that correlate to your understanding of the world? And then use that to decide what you need to do.
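For readers who want the arithmetic behind the "43 minutes" in the question above, here is a small sketch. It assumes a time-based SLO with a 99.9% target over a 30-day window, which is what yields roughly 43 minutes; the target, the window, and the function names are assumptions for illustration, not something stated in this part of the conversation.

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of allowed unreliability for a time-based SLO.

    Assumption: the SLO is expressed as a fraction of "good" minutes
    over a fixed window of `window_days` days.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    return (1.0 - target) * window_days * 24 * 60


def budget_remaining_minutes(target: float, window_days: int, bad_minutes: float) -> float:
    """Budget left after `bad_minutes` of bad time; negative means overspent."""
    return error_budget_minutes(target, window_days) - bad_minutes


# Hypothetical example: 99.9% over 30 days, with 42 bad minutes so far.
budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining_minutes(0.999, 30, bad_minutes=42.0)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

As the answer above stresses, a number like this is an input to a conversation rather than a verdict: the target, the SLI, or the event that burned the budget may each deserve a second look before anyone acts on it.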

Robert Blumen 00:35:36 About two questions ago, you said the simple-minded approach is if you’ve run out of error budget, you focus on improving reliability, if you have error budget, you focus on features. I think you’ve refined that a bit in the last question. Is there any more nuance you’d like to add as to how the organization responds to the consumption of the error budget?

Alex Hidalgo 00:36:00 Yes, I think that part of it is what I was just kind of saying, right? Like sometimes just ignore the data, right? Because you understand what it’s telling you but it’s not actually relevant right now and maybe it’ll be relevant later. But “error budgets are also for spending” is, I think, a topic we haven’t really talked about, right? If you are running too reliably for too long, that can be a problem as well because let’s imagine your users are totally fine with you running 99% reliable, whatever that means, right? If you start running at 100% for too long, right? Like I say, 100% is impossible. But I’ve also seen services run for a quarter, two quarters, three quarters, right? Where they really are kind of 100% — that’ll never last for all time — but you run at above your SLO for too long and your users are going to start expecting you to continue to run at that level. And now you’ve pinned yourself into a corner, right?

Alex Hidalgo 00:36:56 When entropy occurs, when things return to the mean, which they always do statistically at some point in time, now you’re in trouble because now people are expecting you to be close to 100% when that was never your aim. That’s never how the system was designed, right? Perhaps that 99% SLO was part of the design doc, right? And now you’re having problems, so you want to spend your error budget and you can do that in all sorts of ways. It’s a great indicator of let’s perform chaos engineering, right? Maybe you don’t want to be performing experiments that might break your service if you’ve exceeded your error budget, but it’s a great way to learn about your service if you have a whole bunch of it left. Or one of my favorite stories, very few people get to this, but the Chubby team at Google — Chubby is a distributed lock service, right?

Alex Hidalgo 00:37:42 So basically, it’s a file system (which every Chubby SRE will get mad at me for saying), but it’s a tiny directory-structure-based service where you can get little bits of data out, often useful for service startup time and things like that. And global Chubby, which was a globally available version of it, was not supposed to be relied upon but it ran very well, right? You were allowed to rely upon local Chubby, right? So, each Google data center, each Google cell quote-unquote had its own Chubby instance and relying on that was fine. Global Chubby was just supposed to be for convenience; you were not supposed to rely on it in any hard fashion. And global Chubby ran very well. So often at the end of every quarter, Chubby would have error budget left, sometimes all of their error budget left and what they would then do is, well we’re just going to shut it off.

Alex Hidalgo 00:38:30 We’re going to turn off Chubby for the five minutes of error budget that we still have for this quarter. And even though they would email, right? Like, you would get an email as an engineer at Google saying hey, this Thursday at 3:00 PM we’re going to shut off Chubby and burn the rest of our error budget because we don’t want to be more reliable than we’re telling you we’re aiming to be. And yet, even though this was communicated out and it was documented you should not rely on global Chubby, every single time they did this, something would break. And that’s actually cool, right? If you can get to that point, that means other people are now learning how they’ve written their service incorrectly. I have so many stories, I don’t know how many examples you want me to give of how you can use your error budget status beyond ‘ship features or don’t.’

Alex Hidalgo 00:39:15 But there’s so much there, right? Experimentation is a great example, just turn it off so others can learn is a great example. I also love to use it as a signal of whether or not you should make a decision, right? Like, at one company I was at, there was this failover planned — and failovers at this company running on pure physical hardware were very labor intensive and very difficult and took a lot of people to do and would often be planned out months ahead of time. And it was like a week ahead of time and the prep meeting for it was happening and they were like, okay, we’ve spent three months planning this, this is our thing, we’re excited, we’re going to have the best failover we’ve ever had. And I walked into the room and was like, hey, I don’t want to be a jerk but we’re out of error budget. Like, we had that big incident last week, we can’t afford the chance of doing this right now. And to everyone in the room, I was kind of a wet blanket because they were excited for the thing that they’d been planning for so long. But they realized, yeah, like that’s correct, right? So, use your error budget to make decisions at even a very high level like that. But yeah, that’s a whole separate hour-long conversation we can have at some point in time.
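The uses of error budget described in the last few answers (defer a risky failover when the budget is gone, run experiments or a planned shutoff when plenty is left near the end of the window) can be sketched as a simple advisory check. Everything here is hypothetical: the class, the thresholds, and the labels are illustrative assumptions, and in keeping with the "it depends" theme the output is meant to start a conversation, not to automate a decision.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_minutes: float           # total error budget for the window
    consumed_minutes: float         # bad minutes observed so far
    window_fraction_elapsed: float  # 0.0 at window start, 1.0 at window end

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; zero or negative means overspent."""
        return 1.0 - self.consumed_minutes / self.budget_minutes


def suggest(status: BudgetStatus,
            low: float = 0.10,      # hypothetical "be careful" threshold
            surplus: float = 0.50,  # hypothetical "plenty left" threshold
            late: float = 0.80) -> str:
    """Return an advisory label based on budget status (thresholds are assumptions)."""
    remaining = status.remaining_fraction
    if remaining <= 0.0:
        return "defer-risky-work"   # e.g. postpone a planned failover
    if remaining < low:
        return "be-cautious"
    if remaining >= surplus and status.window_fraction_elapsed >= late:
        return "consider-spending"  # e.g. chaos experiments or a planned shutoff
    return "business-as-usual"


# Hypothetical situations loosely mirroring the stories above.
print(suggest(BudgetStatus(43.2, 50.0, 0.5)))   # budget already overspent
print(suggest(BudgetStatus(5.0, 0.0, 0.95)))    # untouched budget late in the quarter
```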

Robert Blumen 00:40:23 Yeah, I love those stories and they are great stories that really illustrate the point. I would’ve thought the main issue about being too far under your error budget is, well, you’re spending too much on either SREs or you’re over-engineering your system, but you’ve added a lot of color to that understanding with those stories. All right, so to pull something together that I think we’ve touched in and around: you’re having this conversation about what is your SLO, you’ve decided on some good SLIs, you’ve got product input, engineering, and it’s clear enough that your SLO could be too low or too high. How do you drive that conversation about what is the right level that we want to set this SLO at, and how would you over time get feedback into that to where maybe you decide to either increase it or decrease it?

Alex Hidalgo 00:41:22 This is one of the most difficult parts because what you really need is feedback from your users. Sometimes it’s easy, right? Sometimes you’re running an infrastructure service and the teams that actually depend on your service are literally down the hall or may even sit next to you, and it’s very easy for you to discover if they’re having a good time or a bad time using your service. But sometimes, it’s teams removed many organizations away or it’s literal customers and perhaps not B2B SaaS vendor customers who can open tickets, right? If you’re running a B2C business, it’s very difficult to go — like, imagine you’re Amazon, right? Like Amazon, the retail portion, it can be difficult to go find out, like, are people happy with us or not? But you can almost always find other metrics that you can correlate against your SLO performance, right?

Alex Hidalgo 00:42:19 So again, imagine you’re some kind of retail website or no like let’s switch, you’re a streaming service, right? And you’re measuring how long it takes for your shows or movies to buffer before they start playing. And you have picked, to start off with, you want 99% of all your movies to start buffering within 10 seconds. And you set that and you realize you’re starting to exceed that a bit more often than you want to. And then your business side of things realizes our subscriptions are going down, or at least new user count is decreasing in velocity, if not actually being negative yet, you can correlate those things. Once you have everyone on board, everyone understands this is how we’re now measuring things. You can correlate that. You can say, okay, when movies take longer than 10 seconds to buffer and start streaming, too often we’re losing customers or they’re shutting off the movie quicker, right?

Alex Hidalgo 00:43:14 If you’re able to measure that. So, it’s all about being able to take your SLO data and correlating it with other metrics, other telemetry that you may have available — very often business-based metrics — and figure out, okay, how do our KPIs look, right, when our SLOs are performing in this manner or not? That’s kind of advanced and it takes a while to get there. That’s not something you’re going to be able to do on day one if you’re starting with an SLO-based approach. This requires buy-in across business, product, engineering, operations, but you can use other signals to help you figure that out. But, let’s back up a bit, right? It doesn’t have to be that complicated. It can be as simple as interviews with people. It can be as simple as — side note, interviews are better than surveys. People on surveys will generally just click great or bad, right?

Alex Hidalgo 00:43:58 Like even that one-to-five slider, most people just pick one or five and go back and forth. But if you can survey people, interview people it’s time consuming. It’s difficult. Like I said, I think I started this answer off by saying this is one of the most difficult parts of things: finding out what your users actually feel about you. But that’s, yeah, it’s a thing you’ll have to undertake, and if you’re adopting an SLO-based approach, it should hopefully mean you want to care about your users more. That’s what it does, right? It gives you better ways of thinking about the user experience. So therefore, even though it’s not easy and you’re going to have to dedicate new time in order to find out how your users actually feel about things, that’s part of the process. If you want to care about your users, you have to talk to them in one way or another.

Robert Blumen 00:44:45 Does this suggest things like correlating all the information that a business has about user behavior with these SLOs? For example, if a user is unable to add an item to a shopping cart, do they come back later and try again and purchase the items in the shopping cart? Or maybe they abandon the shopping cart, which we don’t know for sure, but it’s possible they decided to go buy the products from a competitor.

Alex Hidalgo 00:45:13 Yeah, that’s exactly the kind of thing you can attempt to use to correlate. I would be careful, unless you have tons and tons of volume, doing that in a kind of automated manner. Because I think you need a lot of data to build appropriate statistical models that can really tell you whether or not that’s what is happening. But this goes back to what I’ve said several times: they’re better data to have better conversations, right? You can at least go to the team that’s able to track that kind of thing and say, hey, shopping cart checkouts have been bad. What are you seeing in terms of whether or not they’re returning? And you can at least infer, right, you can at least make a better decision than if those two teams were not talking at all.
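As a rough illustration of what "correlating SLO data with business metrics" could look like in its simplest form, here is a sketch that computes a Pearson correlation between two weekly series. The numbers below are made up purely for illustration, the metric names are hypothetical, and, in line with the caution above, a handful of data points like this would not support any firm conclusion; at best it prompts a conversation between the teams that own each metric.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up weekly values, for illustration only:
# fraction of streams that took longer than 10 seconds to start buffering...
slow_start_fraction = [0.008, 0.012, 0.010, 0.025, 0.031, 0.015]
# ...and fraction of sessions abandoned within the first minute.
early_abandon_fraction = [0.040, 0.043, 0.041, 0.055, 0.061, 0.046]

r = pearson(slow_start_fraction, early_abandon_fraction)
print(f"correlation between slow starts and early abandonment: {r:.2f}")
```

A high coefficient here would not show that slow starts cause abandonment; it would only suggest the two teams have something worth discussing.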

Robert Blumen 00:45:55 We’re getting close to end of time. I think we’ve hit on most of the main points that were in your book. Is there anything that we haven’t covered that you would like to leave our listeners with?

Alex Hidalgo 00:46:06 I think primarily that when people start thinking about adopting an SLO-based approach, they often think of it as a thing you do, right? Okay, now we have SLOs. Cool. Done. That’s not what any of this is about. There’s a reason I consistently use the term SLO-based approach because that’s what it is. It’s an approach, it’s a philosophy, it’s a different way of thinking about your users, about your services and about your measurements. And that means it’s a thing you do for all time. So, I see too many people who read about SLOs and the shiny SRE books from Google, which I’m not down on by the way. Like I helped with them. But like people read a few chapters in those books and they’re like, cool, we’re going to do SLOs now. And they don’t take the time to internalize. This is a different way of thinking. It’s not just a thing you put on a checklist and then check off later.

Robert Blumen 00:46:59 Alex, this has been a tremendous conversation. Thank you so much for speaking to Software Engineering Radio. We will link to your book in the show notes. Are there any other places on the internet you would like listeners to go if they want to find you or things you’re involved with?

Alex Hidalgo 00:47:16 Yeah, you can find me — for now I’m still on Twitter, we’ll see, but you can find me there @ahidalgosre. So a-h-i-d-a-l-g-o-s-r-e is my handle. And go check out what I’m doing over at Nobl9. We are a company focused entirely on SLOs and helping you do them better.

Robert Blumen 00:47:34 We’ll link to your Twitter also in the show notes. Thank you so much for speaking to Software Engineering Radio.

Alex Hidalgo 00:47:40 Thank you so much for having me. I had a great time.

Robert Blumen 00:47:43 For Software Engineering Radio, this has been Robert Blumen, and thank you for listening.

[End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
