SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering

Filed in Episodes on December 6, 2016

Björn Rabenstein discusses the field of Site Reliability Engineering (SRE) with host Robert Blumen. The term SRE has recently emerged to mean Google’s approach to DevOps. The publication of Google’s book on SRE has brought many of their practices into more public discussion. The interview covers: what is distinct about SRE versus DevOps; the SRE focus on development of operational software to minimize manual tasks; the emphasis on reliability; Dickerson’s hierarchy of reliability; how reliability can be measured; is there such a thing as too much reliability?; can Google’s approach to SRE be applied outside of Google?; Björn’s experience in applying SRE to SoundCloud – what worked and what did not; how can engineers best apply SRE to their organizational situation?; the importance of monitoring; monitoring and alerting; being on call, responding to incidents; the importance of documentation for responding to problems; they wrap up with a discussion of why people from non-computer science backgrounds are often found in DevOps and SRE.

Venue: Internet

Transcript brought to you by innoQ

This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Robert Blumen: [00:01:08.00] For Software Engineering Radio, this is Robert Blumen. Today I am joined by Björn Rabenstein. Björn is a production engineer at SoundCloud, and one of the main Prometheus developers. Previously, he was a site reliability engineer at Google and a researcher in the field of molecular modeling. Björn, welcome to Software Engineering Radio.
Björn Rabenstein: [00:01:33.12] Welcome, thanks for having me.
Robert Blumen: [00:01:35.15] It’s great to have you. Today, Björn and I will be discussing site reliability engineering, and since I’ll be saying that a lot, I will often say SRE. If you hear SRE, that’s what I meant. We will be discussing Björn’s experience and also many ideas from the book Site Reliability Engineering: How Google Runs Production Systems, which came out earlier this year. Björn, would you like to say anything to the listeners about your background that I didn’t already cover?
Björn Rabenstein: [00:02:05.04] You kind of got my life in a nutshell… I had two jobs, two conventional employments in my life: one is the current one at SoundCloud, and I have my three-year anniversary tomorrow, which is kind of neat. Before that I had seven and a bit years at Google as a site reliability engineer, as you said. Previously, I was doing freelancing in the late dotcom bubble. Originally, I’m a scientist and I wanted to just do the proper scientific research – not computer science, but real science. For some reason, I got stuck there and ended up in computing anyway.
Robert Blumen: [00:02:52.02] Björn, I’ve been working in the field of DevOps for about a year now, but I’ve been hearing that term for six or seven years. We’ve already done two shows on that topic – number 247 and 268, on infrastructure as code. I recommend those to the listener as background to this show.
In the last year I started hearing the term SRE – is SRE another name for DevOps, or is there more to it than that?
Björn Rabenstein: [00:03:24.11] That’s a really good question. Definitely the term SRE was invented at Google internally, way earlier. The birth of the term DevOps is like 2008, and correct me if I’m wrong. I’m not sure when they came up with the term SRE at Google. I think Ben Treynor, who’s the head of the whole department was hired in 2003; I was interviewed in 2005 and joined the company in 2006, so it was already history, in a way.
[00:03:59.18] The term existed before, and I still remember I was glad when this term DevOps appeared, because I could finally explain to people what I’m doing. I felt, “This is actually a neat explanation… I’m kind of a DevOps type at Google.” Then I realized it’s causing even more confusion, and I think that’s not so much because there’s such a difference between SRE and DevOps as terms… DevOps covers so many things. People understand a lot of different things when they say DevOps.
[00:04:35.00] I guess you have treated that in the previous shows, I don’t know them in detail so I can’t say… But to be a bit mean, I think many people if they say DevOps they basically mean, “Okay, we have devs and we have ops, and now they should get along well with each other. We call that DevOps.” That is obviously not what it should be, but it will also be very different from the Google SRE view.
[00:05:01.15] On the other hand, especially now that I’m at SoundCloud, we have a particular approach to DevOps, which is more like “Yes, we don’t have dedicated devs and dedicated ops. We are both”, like everybody is both, in a way. That’s another interpretation of DevOps, which is probably not what many people mean when they say it. It’s also different from what Google means if they say SRE.
[00:05:28.15] I guess there’s an overlap, and they phrase it quite nicely in the book. In the introduction there’s a little box, pretty much in the beginning, where they explain the relationship between DevOps and SRE. We can discuss it in a bit more detail, but perhaps that answers your initial question.
Robert Blumen: [00:05:46.03] Yes, it does. We have, as you guessed, done some basic shows on DevOps that covered a lot of the ground, and your answer encapsulates that there may not be a complete agreement among everyone using these terms and what they mean. I’d like to focus, because of this book – in your experience, how does Google define SRE and what is their take on it that is a bit different from the rest of the industry?
Björn Rabenstein: [00:06:15.14] There wasn’t really a one-line definition. As I said, I was so glad when there was a term — DevOps is a catchy term you can tell people if they ask what you are doing. In the book itself they also discuss how difficult it was to explain to outsiders what SRE is, and “Now that we have the book, we can just tell them to read the book.” There’s no one-line explanation, there’s a whole book around it. Obviously, you want something shorter, I guess.
[00:06:46.28] I think it was Jamie Wilkinson (who was an SRE at Google) who was very much into monitoring – we might talk about that later – and I listened to a talk because it was about monitoring; in the introduction he said SRE is if you take software engineers and let them design an operational function. This is important to understand, that SRE is in a way a dedicated ops team, but they use software engineering techniques to approach the problem of ops. Perhaps that catches it. I think in that box in the SRE book they say you could say SRE is a particular implementation of DevOps – with some peculiarities, but in general you could say that; it’s a specific kind of DevOps. Perhaps that catches a bit of it.
Robert Blumen: [00:07:44.14] You’re now talking about how SREs apply a software development discipline to the problem of ops… What type of software do SREs develop and how is it different from what we think of as the application?
Björn Rabenstein: [00:08:01.11] That’s also a very good question – where is the line? In practice, SREs should be involved, in the ideal case, in the development of the application already, because SREs know how to run that application at scale in a maintainable fashion, so they certainly should have contributions there. If it goes wrong… I’ve seen that as well – there’s a product developed, it’s launched, and it just isn’t sustainable in operation, and then SREs come back to you and try to fix it. Ideally, that’s already taken care of when you launch.
[00:08:46.01] On the other hand, there’s a lot of code you write. It’s this infamous “Replace yourself with a small shell script”, but in practice the shell script is not really small, and it’s probably also not a shell script all the time. I’ve seen shell script magicians at Google as well, but that’s more an anti-pattern; if it’s getting too complicated, you don’t take a very big shell script, you take Python, or even another language. There are so many things where you can just automate yourselves out of your job. Whenever you have to do something three times, and it’s redundant and it’s stupid, you should ask the machines to do it, and that’s software development, in a way.
[00:09:29.15] Then there’s infrastructure software, where the line gets really blurry. Let’s say you want to develop a cluster management software, or you want to develop a monitoring system. At Google there were dedicated software engineering teams that developed that software, and they might have had an SRE team – there’s again this line where even the infrastructure development is not per se SRE, but then the line gets blurry. For example the team I was in at Google, they just happened to develop a kind of distributed cron, which was totally infrastructure software, but then it happened that the SRE team developed that software.
[00:10:13.29] It’s a blurry line, and if you go outside of Google, where you are at smaller scales, it’s even blurrier. At SoundCloud, the world looks completely different. I’ve seen different worlds there… In practice, sometimes you draw the line quite clearly, and sometimes it’s really an overlap.
Robert Blumen: [00:10:34.04] I’m going to come back to SoundCloud a bit later, but for now I’d like to stick with Google. One of the main themes in the book is the way Google balances development work with operational and on-call work for its SRE. How do they do that?
Björn Rabenstein: [00:10:52.10] Again, trying to find a one-line thing, I guess that would be the 50% rule, which also applies to many other organizations, or should apply. Essentially, if you’re doing more than 50% of operational work in your work life, you are already in a non-sustainable state. The rule was, as an SRE, you should be able to use at least 50% of your time – if it’s more, even better – on automating tasks, reducing tech [unintelligible 00:11:26.21] and at most 50% would be time spent on call or just fighting fires, fixing things that just broke.
[00:11:40.25] The idea behind that… Your system usually grows. If you have a healthy business, Google or not, you want 2x, 3x, 4x growth every year, and you never grow your engineering population by that factor. If you are doing 50% operational work, next year you will do 100% operational work, so you’re completely there. Another year later you will do 200% operational work; you are running 60 hours/week, and then you burn out and give up, and that’s the end of your business.
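The arithmetic behind the 50% rule can be sketched in a few lines. This is only an illustration, assuming that operational work scales linearly with system size and that headcount stays flat; the function name and parameters are hypothetical, not something from the book or the interview.

```python
def projected_ops_fraction(initial_fraction, growth_factor, years):
    """Project the share of team time spent on operational work.

    Assumptions (simplifications for illustration): operational work
    grows in proportion to system size, and the team does not grow.
    """
    return [initial_fraction * growth_factor ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Start at 50% operational work with 2x growth per year.
    for year, fraction in enumerate(projected_ops_fraction(0.5, 2, 2)):
        print(f"year {year}: {fraction:.0%} of team time on operations")
```

With 2x yearly growth the share goes from 50% to 100% to 200%, which is the progression described above. Automation is what keeps the real curve from following this one.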
Robert Blumen: [00:12:20.04] Trying to synthesize what it is you’re saying and this book is saying about SRE… If I think of generation zero in our awareness of these systems, you’d build an application and you’d give it to some guys and tell them to deploy it and figure out how to run it. Generation one is you start getting feedback from ops that you take into account during development. SRE is where the operational needs are fully integrated into the lifecycle of software for the organization. Would that be a fair statement?
Björn Rabenstein: [00:12:58.25] I guess you could, but it might also depend on a pure Google view, it probably depends on your situation… Back in my days – I don’t know if that’s more equilibrated these days – having SRE support for your product… As a development team, you really wanted SRE support for your product. That was highly sought after, because there were so few SREs relative to, let’s say, classical software engineers. Core products like search, ads and Gmail, they had for sure SRE support, but if you were launching something more niche, then you would be probably completely in charge as a developer for running and deploying that product as well.
[00:13:47.23] Once you have that SRE support, I guess you have this kind of team that deals with things like deploying. There’s also release engineering at Google – I’m also not sure, now that I’ve been away from there for three years, how that developed and how those play with each other… In a way, let’s say there is one flavor of DevOps where developers are just very close to deployment and do it themselves. That would definitely be a slightly different approach from, let’s say, a Google large-scale core product.
[00:14:23.04] Yes, you also are a bit conservative; if you make $2,000/sec, you don’t just want to push a button and then everything goes down and you say, “Oops! Let’s roll back.”
Robert Blumen: [00:14:34.24] Sure, it requires some thought in advance to how you’re going to keep that revenue stream going. DevOps may be a good description of what people are doing or the roots of the discipline. The term SRE contains the word “reliability” – that is another focus of the book. I wanted you to give me your take on Dickerson’s Hierarchy of Reliability.
Björn Rabenstein: [00:15:02.23] Yes, that’s a great one. Dickerson is Mikey Dickerson and the listeners might know him, he is the person who runs the U.S. Digital Services (I hope I got that right). He was the head of my sister team. I was in [unintelligible 00:15:20.04] and we had a sister team over there in California, and he was heading that team, we had a lot of collaboration. At some point he moved on – it’s a longer story, he told that story quite often… Now he is there at the White House, at the U.S. Digital Services, which is kind of an SRE team for the U.S. Government.
[00:15:46.12] Funny enough, if I remember that correctly, only when he had to explain all those principles that were in our DNA, he came up with this hierarchy of reliability, and that’s modeled after the hierarchy of needs by – who was that guy…?
Robert Blumen: [00:16:03.21] Maslow.
Björn Rabenstein: [00:16:05.09] Exactly, yes… The psychologist which is all about, like, the most noble human activities are essentially the tip of the pyramid, and you can’t do self-actualization if you don’t feed yourself, and if you don’t feed yourself, even being safe doesn’t matter. It’s this whole hierarchy where the most basic ones are actually the most important ones, and they are the bottom; in the pyramid they are the huge things, and then you put the smaller and more delicate things on top.
[00:16:36.20] Mikey created that hierarchy for reliability, and the bottom layer is the one I always remember because it’s monitoring, which is my personal, let’s say, special area as a Prometheus developer. But that was also like at SoundCloud, where you come into this growing company that runs into this problem with running a growing, more and more complex system in a reliable fashion, and the first lack, you realize, is monitoring. Without monitoring, nothing else works.
Robert Blumen: [00:17:14.25] I’m looking at a picture of Dickerson’s Hierarchy, so monitoring is the bottom-most layer of that.
Björn Rabenstein: [00:17:20.28] Yes. I have the picture here as well, now that we’re talking about it. I don’t know it by heart, I just know monitoring… But anyway, next is Incident Response, which arguably is really closely connected. Then you have Postmortem/Root Cause Analysis – I just remember the whole idea of blameless postmortems, because in many traditional companies there’s a lot of scaring people and blaming them if something happens. That’s a very important thing at Google, the fact that mistakes are to be learned from and not to fire somebody.
[00:17:59.21] Then it goes on, Testing & Release Procedures – we already talked about that. Capacity Planning, Development and then the tip is the product itself. This is the one thing that people outside see and they think this is everything, but in reality it’s essentially the tip of the iceberg.
Robert Blumen: [00:18:17.12] If I understand correctly the parallel to Maslow’s hierarchy, you can’t take care of the higher levels, you cannot achieve the higher levels until you’ve solidified all of the lower levels, and the foundation in each one enables you to proceed to the next level. Dickerson is observing the same kind of dependencies exist between these layers in the reliability world. Is that correct?
Björn Rabenstein: [00:18:45.20] Exactly. Even back then, when I joined Google, I was getting into this wonderland of science-fiction – time-warp, I’m now ten years in the future and I see how computing is done in the future… But Mikey had this in an extreme fashion – I think it’s called brownfield, where you enter a field that is already pretty established but can’t cope with the modern challenges. Then you have to explain to people and you have to build it up from the base and you start with monitoring.
[00:19:22.25] SoundCloud in my experience was less brownfield because it’s a pretty young and modern organization, but it grew into this larger scale, a lot of users, a lot of complexity, and then monitoring was definitely the thing that paved the grounds to having a reliable site, and then you build up, you put one layer on top of another.
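The "build it up from the base" idea can be sketched as a small data structure. The layer names are the ones listed in the conversation; the helper function and its name are hypothetical and only illustrate that a layer counts for little until everything beneath it is solid.

```python
# Layers of Dickerson's hierarchy, bottom first, as named in the interview.
HIERARCHY = [
    "Monitoring",
    "Incident Response",
    "Postmortem / Root Cause Analysis",
    "Testing & Release Procedures",
    "Capacity Planning",
    "Development",
    "Product",
]


def highest_solid_layer(solid_layers):
    """Walk up from the bottom and stop at the first layer that is missing.

    Returns the topmost layer whose entire foundation is in place,
    or None if even monitoring is not there yet.
    """
    reached = None
    for layer in HIERARCHY:
        if layer not in solid_layers:
            break
        reached = layer
    return reached


if __name__ == "__main__":
    # Capacity planning exists on paper, but the layers below it have gaps.
    print(highest_solid_layer({"Monitoring", "Incident Response", "Capacity Planning"}))
```

In this example the result is "Incident Response": the capacity planning effort does not count yet because postmortems and testing are missing underneath it.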
Robert Blumen: [00:19:50.00] It does make sense to me when I look at it, because without monitoring you have no knowledge of what the system is doing. If you have monitoring, you need to respond when something goes wrong, otherwise it won’t get fixed. Then after the fact you need to do a postmortem to understand why. The next layer is testing – you need to have tests to prevent that same bug or similar bugs from getting back into production. This is all making sense to me… Maybe when we get to the next layer, Capacity Planning – do you have a clear idea why that is the next layer above testing?
Björn Rabenstein: [00:20:28.24] I don’t have a spontaneous clear idea right now, but I guess that’s a pretty high-level function. If your site is on fire all the time and every deploy breaks everything, you don’t even think about capacity planning. It’s super important, of course, like every step in the hierarchy, and arguably the topmost one is the most important because your product is why you are doing all of that. But you can only think about it once the other things are there.
[00:21:02.21] Regarding capacity planning, once your monitoring is in place, you can start to collect metrics long-term, and then you can start to think about capacity planning, you can start to project, you understand your system, how it’s going, how it’s scaling, and only then you can start with meaningful capacity planning.
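As a rough illustration of projecting from long-term metrics, the sketch below fits a straight line through historical usage samples and estimates when a resource runs out. It assumes linear growth, which real systems often violate; the function name and sample data are made up for the example.

```python
def days_until_full(samples, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    `samples` is a list of (day, usage) pairs collected by monitoring.
    A least-squares line is fitted through them (a linear-growth
    assumption). Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_usage = sum(usage for _, usage in samples) / n
    variance = sum((day - mean_day) ** 2 for day, _ in samples)
    if variance == 0:
        raise ValueError("need samples from at least two distinct days")
    covariance = sum((day - mean_day) * (usage - mean_usage) for day, usage in samples)
    slope = covariance / variance
    if slope <= 0:
        return None
    intercept = mean_usage - slope * mean_day
    last_day = max(day for day, _ in samples)
    return (capacity - intercept) / slope - last_day


if __name__ == "__main__":
    # Hypothetical disk usage in GB, sampled every 30 days, on a 1000 GB volume.
    history = [(0, 400), (30, 460), (60, 520), (90, 580)]
    print(days_until_full(history, capacity=1000))
```

The point is less the fit itself than the precondition: without months of stored metrics there is nothing to fit a line through.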
Robert Blumen: [00:21:21.23] I found in the book a very interesting discussion… There are ways of measuring reliability – this question will not be so much about measurement… Let’s assume we can measure reliability by uptime or error rates; the point was that you need to decide how much reliability you want or you want to pay for, and there is such a thing as “too much” reliability. What do they mean by that at Google, when they talk about having too much reliability?
Björn Rabenstein: [00:21:54.16] I heard this quite often, not only at Google; even more so out of it, when I talk to people who haven’t seen that. When you start to set an SLO – not yet an SLA, you are not signing a contract now; you want to find out “What am I aiming for?”…
Robert Blumen: [00:22:18.01] Could you define those two terms, SLO and SLA?
Björn Rabenstein: [00:22:22.25] SLA is the service-level agreement. That’s essentially a legal contract, or an internal [unintelligible 00:22:31.05] in the company contract where you say, “Okay, we’re serving at this level of (whatever) availability or reliability throughput. That’s what we commit to.” If that is broken, between different entities you might be legally liable; this is the contract.
The SLO is something you aim for, “This is what we set as our objective.” The SLI is the service-level indicator – that’s the first thing when you even define what you are doing to measure your service level. Then you have the objective, to know what you are aiming for, what you are designing your system for. Then the last step is you agree, “Okay, this is our system and we formally agree to keep that service level.”
[00:23:18.07] The objective – that’s the part where you could aim too high. People naively think the more reliable a service is, the better. But then you might over-engineer. If you go for an approach where you say, “Okay, just get the latency as small as possible and give me as many nines as possible”, this will influence your design. You get to a point where you over-engineer the system and you invest too much effort, and you also reduce your agility, in a way. That’s the other term where that often plays a role – the error budgeting.
[00:23:56.20] That’s also in the relationship between an SRE team and a developer team. You tell them it’s actually fine to only have three nines or four nines, because that’s our SLO, and you have an error budget you can play with; you can do risky deploys to try something out, as long as you keep an error budget. Once you reach your error budget, then “Okay, that’s it. We have to be more conservative from now on, and we cannot just do more risky deploys.”
[00:24:25.23] It’s worth something to be able to do risky things that might create errors; you also get an advantage from that, and that’s why it’s really important to have an objective.
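The relationship between SLI, SLO and error budget can be made concrete with a short sketch. It assumes a simple request-based availability SLI measured over one window; the class and field names are hypothetical, and real setups usually measure over rolling windows.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    slo: float            # objective, e.g. 0.999 for "three nines"
    total_requests: int   # requests served in the measurement window
    failed_requests: int  # requests that resulted in an error

    @property
    def sli(self):
        """Measured availability: the indicator the objective is set on."""
        return 1 - self.failed_requests / self.total_requests

    @property
    def error_budget(self):
        """How many failed requests the SLO tolerates in this window."""
        return (1 - self.slo) * self.total_requests

    @property
    def budget_remaining(self):
        return self.error_budget - self.failed_requests

    def risky_deploys_allowed(self):
        """Once the budget is spent, the team has to become conservative."""
        return self.budget_remaining > 0


if __name__ == "__main__":
    window = ServiceLevel(slo=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"SLI {window.sli:.4%}, budget left {window.budget_remaining:.0f} requests")
    print("risky deploys allowed:", window.risky_deploys_allowed())
```

With a three-nines objective and a million requests, roughly a thousand failures are tolerated. Raising the objective to five nines shrinks that to about ten, which is one way to see why an overly ambitious SLO costs agility.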
I had a very practical thing at SoundCloud. My first project at SoundCloud was actually suffering from a misunderstanding of the SLO; it was set in the SLO that every request has to be served in a hundred milliseconds. Then we designed the whole system in a way to do that, and it had a ten times larger resource footprint than it could have been.
[00:24:58.28] Later, the same service was implemented again with a different SLO in mind and used way less resources because it had a different SLO. So it really matters. It’s not always better, in a way.
Robert Blumen: [00:25:14.19] We’ve been talking a lot about the Google experience and the Google book, and you’re also currently now at SoundCloud. What did you bring from the SRE philosophy that you’re able to apply in your current position at SoundCloud?
Björn Rabenstein: [00:25:32.09] Yes, it’s kind of fascinating how much it actually is. I only realized that when that book came out. On the testimonials page (the very first page) you can see a quote of myself, how I praise the book as my gospel that I’m preaching at work.
[00:25:53.07] It’s a surprise, because the setup is so different and the whole organization is so different; we are totally not an SRE team, we are doing quite different things, but there are so many lessons you could still apply. It’s nice to have this all codified, and I realized how much of those lessons I’ve applied at SoundCloud. I don’t even know where to start, but it’s interesting how you can use those little concepts and ideas even in a pretty different setup.
Robert Blumen: [00:26:56.01] Why don’t you pick one example of SRE concept that you successfully applied at SoundCloud?
Björn Rabenstein: [00:27:05.08] I’m totally biased to always talk about monitoring… It’s the basis of the reliability hierarchy, according to Mikey. That was even before I joined SoundCloud – there were a bunch of ex-Google SREs that joined SoundCloud for some reason at the same time; it was not even coordinated, it just happened. Their first big problem was, indeed, monitoring. This need wasn’t met, and they had no idea where to even start, where to see what’s going on. That wasn’t only felt by them, but by everybody.
[00:27:42.09] SoundCloud was getting more complex; they started this microservice thing, and things started to scale out and getting complicated and nobody had an insight of what was actually going on. That was, for them, the reason to start a monitoring project – Prometheus – so that they could get monitoring in a way they knew from Google, and that would allow them to apply even the principles of site reliability engineering.
[00:28:12.10] Perhaps one thing we might talk about is on-call, because that’s probably something that many people run into – traditional approaches of alerting… When a server is down, I set up [00:28:25.15] check and if the server doesn’t ping anymore, I wake somebody up to reboot it.
Robert Blumen: [00:28:32.17] You just now mentioned on-call – for listeners who are not SREs, tell us about what on-call is as part of SRE job description.
Björn Rabenstein: [00:28:43.21] On-call – obviously, if you’re on-call, you carry your pager, you get alerts from the machines if they need your intervention. The SRE team at Google is usually doing the first-level on-call support for their product. Let’s say if you’re on ads SRE… Back in my day, actually, we were on-call for everything that had ads; later, when everything grew bigger, we had more specialized teams. For developers that’s great, because they get first-level support for their product.
[00:29:17.00] Developers are still on-call in a secondary or tertiary position, where if the SRE can’t deal with a problem, they might get woken up. But it’s definitely something that’s pretty crucial for SRE. At the same time, remember the 50% rule – you are never supposed to invest more than 50% of your time into those operational concerns.
Also, SREs are not pager monkeys; that’s a common misunderstanding. Also [unintelligible 00:29:45.20] DevOps misunderstanding when you think, “Yes, we have our DevOps team and they are pager monkeys.” So it’s a bit smarter than that, but yes, being on-call is definitely an important part of your job.
Robert Blumen: [00:30:01.09] I interrupted your discussion of how this came about at SoundCloud, you were setting up on-call…
Björn Rabenstein: [00:30:10.14] Yes, that’s the example where monitoring and on-call play into each other. We were there where you basically get a page if a single server dies, and your monitoring system mirrors that idea, because you have your [unintelligible 00:30:26.11] that pings a server; if the server goes down, you get a page. Now you scale and you have thousands of servers, they might not even be physical servers, they might be in the cloud and virtual machines and whatnot, and at some point you realize “If I wake somebody up whenever a server dies, I will wake somebody up all the time”, and that’s something that is not sustainable and you can’t just solve it by having (coming back to that) a team of pager monkeys that are just paid for getting paged all the time.
[00:30:58.08] You also have already designed your system probably in a way that it can sustain single host failure, for example. That’s where you get into an area where you start to think, “Okay, when should I actually wake up a human being?” It’s this metaphor – the computers are essentially the normal, mortal beings and they call the superheroes if they can’t cope; the superheroes are the humans that get paged. So when should a machine call a human in for an emergency? Then you start thinking, “Okay, I need a different view.” A machine has failed – that’s probably a problem, but I have to look at something like, “Is my user experience affected?” If I have a nice cluster management solution, a distributed system that might perfectly cope with single machine failures, my user experience is fine.
[00:31:51.00] At some point, of course, somebody should check out what is going on with that machine, but that can be during work hours. I don’t have to wake somebody up to react within two or three minutes. So you get into this idea of what we call “symptom-based alerting” – you alert if something is wrong with your user experience, or something is imminent to go wrong with your user experience. That’s the best case, when you know for sure it will happen, but it hasn’t happened yet.
[00:32:20.00] The other thing is the cause-based alerting, where you think, okay, a machine goes down, or a rack switch is broken. That might be a reason for a possible failure. Or you alert on something that might or might not be causing a problem in the end. That is a very different paradigm – how to alert, when to wake people up. Now we come back to the hierarchies of reliability – you need this basic level of monitoring. Your monitoring system must be able to tell you that. If your monitoring system is only able to tell you that a machine doesn’t ping anymore, you will not be able to set something up like that.
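The distinction between symptom-based and cause-based alerting can be sketched as a routing decision. This is a simplified illustration, not how Prometheus or any particular alerting system is configured; the function name, the inputs and the 1% threshold are all assumptions chosen for the example.

```python
def route_alert(error_ratio, hosts_down, error_ratio_threshold=0.01):
    """Decide whether to wake a human, file a ticket, or do nothing.

    error_ratio: fraction of user requests currently failing (the symptom).
    hosts_down: number of machines not responding (a possible cause).
    error_ratio_threshold: hypothetical paging threshold for this sketch.
    """
    if error_ratio > error_ratio_threshold:
        # Symptom: the user experience is affected, so page someone now.
        return "page"
    if hosts_down > 0:
        # Cause without symptom: the system is coping, so this can
        # wait for work hours.
        return "ticket"
    return "none"


if __name__ == "__main__":
    # Three machines are dead, but users barely notice: no page needed.
    print(route_alert(error_ratio=0.0005, hosts_down=3))
    # No machine is down, yet 5% of requests fail: wake somebody up.
    print(route_alert(error_ratio=0.05, hosts_down=0))
```

Note that the function needs the user-facing error ratio as an input. A monitoring system that can only report whether a machine answers pings cannot supply it, which is the dependency on the bottom layer of the hierarchy described above.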
Robert Blumen: [00:33:01.23] I want to interject here… You mentioned Prometheus monitoring – we do have a show recorded on Prometheus; it hasn’t been published yet, at the time I’m recording this show. I believe it will be 270, but anyone who’d like to learn more specifically about Prometheus, you can find it on the SE Radio website.
You’ve been talking a lot about what should be monitored, about monitoring different layers of the stack, different types of performance. In the book they discuss the concept of Google’s four golden signals. What are the four signals and why are they very critical things to monitor?
Björn Rabenstein: [00:33:47.23] Yes, they mention those four golden signals in the book. We can read it like the Bible, chapter 6…
Robert Blumen: [00:33:55.13] Verse two…
Björn Rabenstein: [00:33:57.25] Exactly. There are different opinions on what the most important things to monitor are, but there is a huge overlap. Even if there are different opinions and details, you will always come back if you ask the experts… There’s a common core. We can just take them from the book, the four golden signals – I have it here, my Bible is always with me.
[00:34:21.25] The first one is latency – how long does it take to serve a request. Then you have traffic – that’s how many requests you’re serving. Errors – how many of them result in an error. Saturation is how full your service is, meaning what’s your capacity. You might be serving quite fine right now, but if you increase your traffic by 10%, you might start to fail. Then you think about redundancy, and that’s also important for capacity planning, but also if you have the luxury of multiple data centers globally, how do you route traffic, when do you decide that you don’t have enough redundancy and you should do something; can you [unintelligible 00:35:05.17] send the data center into maintenance right now… Things like that.
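A minimal sketch of how the four signals might be computed from one window of request records follows. The record type, the use of median latency, and the idea of expressing saturation as traffic divided by a known capacity are assumptions made for this example; production systems typically track latency distributions and several saturation measures.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_seconds: float
    failed: bool


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize latency, traffic, errors and saturation for one window.

    capacity_rps is an assumed, separately measured limit on the
    requests per second the service can sustain.
    """
    if not requests:
        return {"latency_seconds": 0.0, "traffic_rps": 0.0,
                "error_ratio": 0.0, "saturation": 0.0}
    traffic_rps = len(requests) / window_seconds
    return {
        # Latency: how long it takes to serve a request (median here).
        "latency_seconds": median(r.duration_seconds for r in requests),
        # Traffic: how many requests are being served per second.
        "traffic_rps": traffic_rps,
        # Errors: what fraction of requests result in an error.
        "error_ratio": sum(r.failed for r in requests) / len(requests),
        # Saturation: how full the service is relative to its capacity.
        "saturation": traffic_rps / capacity_rps,
    }


if __name__ == "__main__":
    window = [Request(0.1, False), Request(0.2, False),
              Request(0.3, True), Request(0.4, False)]
    print(golden_signals(window, window_seconds=2, capacity_rps=10))
```

A saturation value close to 1.0 is the situation described above: serving fine right now, but a 10% traffic increase would start to cause failures.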
Robert Blumen: [00:35:11.13] Saturation – that could be at any level. For example, we could have CPU saturation on a single instance, or I/O saturation on a particular storage device, all the way up to what you think is a high-level measure of capacity of a larger system. Is that correct?
Björn Rabenstein: [00:35:34.07] Yes, or pretty profane – how full is your disk?
Robert Blumen: [00:35:37.29] I asked you how did you apply SRE concepts from Google at SoundCloud… Were there things you tried to apply from Google that turned out not to work at SoundCloud because it’s a different job, different organization and conditions are different?
Björn Rabenstein: Those parts were, I would say, mostly not a technical thing. Technical ideas from the SRE world almost always apply to other… If you’re at a minimum scale, things just apply. It’s like a natural law. But there’s a huge organizational and cultural thing about it. I said before how much sought-after SRE support at Google was. Developer teams really wanted SREs, so that gave them a bit of authority. They could say, “If we support your service, your service has to behave in a certain way. If it doesn’t behave that way, then we won’t support it, or we will give the pager back.”
[00:36:46.11] Things like that happen, and in a way that empowers you. That setup just wasn’t there at SoundCloud, for one; at the scale of SoundCloud – although we scale a lot of traffic and serve a lot of features – you couldn’t have a dedicated SRE team for every product we have. SoundCloud is two orders of magnitude smaller in terms of engineering population than something like Google or Facebook, but we don’t have two orders of magnitude fewer products. Perhaps we have a tenth of the products, but you still have a customer-facing thing, a billing thing, something for the users – all those things are there even at a smaller scale, and you cannot just have a dedicated SRE team just for ads.
Robert Blumen: [00:37:43.09] So the ratios are different. You might be a hundred times less on some scales but ten times less on other scales, so you can’t have the same ratios consistent throughout with what you see at Google.
Björn Rabenstein: [00:37:59.12] Yes, that makes a big difference. In this concrete example we couldn’t just say, “Oh, we are building up an SRE team now, and the SRE team will take your pager for the whole company.” That just wasn’t feasible. In reverse, that means there was nothing where the SRE team could say, “If you’re not doing it this and that way, we will just not take the pager” because they were not going to take the pager.
[00:38:25.11] Before my time there, they tried to say, “Okay, we have proper SREs, but they are embedded in the smaller teams we have. We send an SRE into this team and they should use their magic wand to make everything reliable.” That didn’t really work either. There was already an established culture of how things were done; your culture is usually there for a reason and it’s precious, and you cannot just change everything to a different culture.
[00:39:01.06] There needed to be a desperate need to change the culture for some reason, because it’s dysfunctional, but it wasn’t dysfunctional at SoundCloud. We had a precious culture, and in the end things worked out quite differently. In some way nowadays everybody at SoundCloud is an SRE, because we follow this approach where – like I said in the beginning – the devs are also ops. You build it, you run it. So what happened in the end is not that we built up a dedicated SRE team that would work like a Google SRE team, but it was more like that we fostered SRE approaches throughout the engineering organization, and now everybody is in a way a little SRE in the company.
Robert Blumen: [00:39:46.10] You’re talking about the relationship between the SRE team or maybe the SRE role (if it’s not a team) in development – this negotiation about who is going to carry the pager, who owns alerts… How do you find the right division of responsibilities within an organization between what goes to SRE and what goes to developers?
Björn Rabenstein: [00:40:18.08] The Google approach – if you have properly SRE-supported services, then all the alerts go to SREs first, and for important services they might have a secondary rotation in case a page falls through, but only then there’s a tertiary on call where the developer is paged. Ideally, a page never falls through automatically. If there’s something where you think, “Okay, somebody changed something in the code and I don’t understand it. Now I really need a particular developer”, then you page the developers.
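The escalation order described above (SRE primary first, then an SRE secondary, and only then a developer) can be sketched as a simple fall-through loop. This is a hedged illustration only: `route_page`, the tier names, and the `acknowledged` callback are hypothetical and do not describe Google's actual paging system.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def route_page(
    tiers: Sequence[str],
    acknowledged: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Page each on-call tier in order until one acknowledges.

    Returns the tier that acknowledged (or None if the page fell
    through every tier) and the list of tiers that were paged.
    """
    paged: List[str] = []
    for tier in tiers:
        paged.append(tier)
        if acknowledged(tier):
            return tier, paged
    return None, paged


# Hypothetical rotation: developers are only reached if both SRE tiers miss the page.
ROTATION = ["sre-primary", "sre-secondary", "developer"]
```

For example, `route_page(ROTATION, lambda tier: tier == "sre-secondary")` reports that the secondary picked up the page after the primary missed it, and the developer was never paged.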
[00:40:59.23] In my personal case, we were in Europe, in Dublin, and we were in charge of way more different little products than the corresponding team in California. Because we were the smaller team, we were covering more while they were asleep, and we had a lot of different developer teams that we needed to escalate to, if needed – ideally not, of course, we would be able to handle it on our own.
[00:41:30.04] Again, at smaller scales, at the SoundCloud scale, that’s just not going to happen. At SoundCloud there is essentially our team, called production engineering – not site reliability engineering. We don’t carry the pager for any product, we carry the pager for the systems we are in charge of directly, like cluster management, infrastructure stuff. Whatever you develop, whatever you build, you run it, you are also on call for it. It’s a completely different layout of responsibilities.
Robert Blumen: [00:42:01.15] I was recently at a talk by Gene Kim, who’s one of the leading thinkers and writers in the DevOps space. He posted a slide which said, “When we started paging developers at 2 AM, problems got fixed very quickly.” Now, that sounds a little bit cynical and maybe manipulative, but is that a good way for SRE and development to work together?
Björn Rabenstein: [00:42:31.08] If you have to apply that pressure… I guess I would hate it, from a personal perspective. There are three different scenarios we have here in the discussion. One is essentially what that expression is referring to – you have an existing engineering organization, you have ops people that get paged all the time because developers deploy software, throw it over the wall and it doesn’t work… Then you think, “Okay, we just page the developers at 2 AM and they will fix it.”
[00:43:03.22] At Google, the corresponding situation is probably that a piece of software doesn’t really work according to standards, and then SREs as a last resort can threaten to give back the pager and stop supporting that system. At SoundCloud it was, again, a different setup because that pager monkey thing – they might have tried it at one point, but it really didn’t exist when I came there.
[00:43:32.23] “You build it, you run it” was already a reality, so people had more the opposite problem – we had teams that were too small sharing an on-call rotation, because they got really in love with their software, and you had two or three people that were the only ones that knew about that particular service, and they were doing the on-call. Essentially, a person was on call Monday to Wednesday, and the other person was on call Thursday through Sunday. They had no life anymore, but it was their software, they knew everything about it.
[00:44:07.08] It was doing it wrong in the other direction, where the developers were too closely related to that, which couldn’t really happen in a Google [unintelligible 00:44:16.09] because there you hand over your support to an SRE team, so you must be sure everything is documented nicely. Those are reasonably smart engineers, but they have to learn what’s going on here, so you are forced to make it approachable for others. At SoundCloud it was more the problem that this didn’t work at all. But that has changed. We improved a lot there. We have bigger on-call rotations, people transfer that knowledge to other people, so more people are capable of dealing with a certain outage in a certain service.
Robert Blumen: [00:44:53.00] This is something that is necessary for an organization to scale – the knowledge has to be transferred and shared among more people. You just mentioned documentation… Talk about the role of documentation in sharing operational knowledge.
Björn Rabenstein: [00:45:13.20] Yes, that’s definitely a hot topic, starting from designing a new product or a new iteration on an existing one. Google has a really strong culture of writing design docs, reviewing design docs in an informal fashion. We had to design [unintelligible 00:45:31.12] but there’s also formal review of design docs, and SoundCloud was much more agile, in a way – capital A or not. It was more like, “Okay, let’s just do it”, which came from this startup culture where you’re really a small group of people, everybody knew each other and sharing information was not an issue.
[00:45:56.01] Then you grow into a bigger organization, and this is also something where SoundCloud has matured a lot. After a few incidents that were caused by not having shared information enough, or by starting to develop something without asking that other person two floors down who would actually have known a lot about that particular problem, we have a way better sharing culture. We call it [unintelligible 00:46:23.09], not design docs, but it’s essentially a similar thing.
[00:46:28.16] That’s from the start on, and then of course operational documentation is really important. That’s also something where your monitoring system – to come back to that again – can help you. For example, we have a nice way of displaying a documentation link with an alert. You get an alert – it might be that alert of that sister team you are now on-call for because they’re only two people, but the alert already gives you a link to their runbooks, so what’s going on if that fires, what’s the background and what you can do to handle that situation. That’s also very similar to Google. Despite the completely differing structure, we had also a really strong culture of having alerting documentation for incidents.
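The idea of an alert that carries a link to its runbook can be sketched in a few lines of Python. The `Alert` record, its `runbook_url` field, and the example URL are assumptions made for this illustration; this is not the API of any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    summary: str
    runbook_url: str = ""  # hypothetical field: link to the owning team's runbook


def format_page(alert: Alert) -> str:
    """Render the text an on-call engineer would see when paged."""
    lines = [f"ALERT {alert.name}: {alert.summary}"]
    if alert.runbook_url:
        # The link travels with the alert, so someone covering for a
        # sister team does not have to hunt for the documentation.
        lines.append(f"Runbook: {alert.runbook_url}")
    else:
        lines.append("Runbook: MISSING (ask the owning team to document this alert)")
    return "\n".join(lines)


print(format_page(Alert(
    name="QueueBacklog",
    summary="queue length above threshold",
    runbook_url="https://runbooks.example.com/queue-backlog",  # placeholder URL
)))
```

Flagging a missing link in the page itself is one possible design choice: it makes undocumented alerts visible at exactly the moment someone needs the documentation.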
Robert Blumen: [00:47:17.04] You use this term “runbook”, where if I’m on-call I get an alert that the foobar is down, and here’s a link to the runbook which is a documentation site, and it would say “Check the queue length. Make sure the database is running. Look in the error log. If you see this and this, try restarting” – those types of things. Is that what a runbook is?
Björn Rabenstein: [00:47:42.13] Yes, pretty much. In the SRE book they call it “playbook”. They’re different words for essentially the same thing. I also like to have our own runbooks, because that’s kind of a litmus test if your response is scriptable. If your runbook is “Yes, that happens all the time. Just turn it off and on again, restart that service”, that’s a clear hint for “Okay, perhaps you can actually put that into software and let the machines do it, instead of waking up a human being.” This is actually a guideline for an intelligent response, where I have to use my common sense and my intelligence to find out what’s going on here, and then it’s a good runbook. You can’t blame the runbook. If you have set up something where the runbook is a script, you should turn it into a script – that’s essentially the lesson.
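To make the "turn a scriptable runbook into a script" point concrete, here is a minimal sketch of a "restart it, and only wake a human if that does not help" routine. The function name `remediate`, its callbacks, and the default of two restarts are all hypothetical choices for the example, not something described in the interview.

```python
from typing import Callable


def remediate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    page_human: Callable[[str], None],
    max_restarts: int = 2,
) -> str:
    """Automate the 'turn it off and on again' runbook step.

    Returns "healthy" if nothing needed doing, "recovered" if a restart
    fixed the service, and "paged" if a human had to be woken up.
    """
    if is_healthy():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "recovered"
    # The scripted response did not help, so this is now a case that
    # needs a person's judgment.
    page_human(f"service still unhealthy after {max_restarts} restarts")
    return "paged"
```

The human is paged only after the mechanical part of the runbook has been exhausted, which leaves the runbook itself free to describe the cases that need common sense.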
Robert Blumen: [00:48:35.15] Sure, and that very much goes along with the theme in the book which they call “removing toil”, and by another name “automate all the things.”
Björn Rabenstein: [00:48:46.03] Yes.
Robert Blumen: [00:48:47.18] Björn, we’ve talked a lot about Google’s approach and how it can or can’t be applied elsewhere. Suppose I’m a bit of a skeptic and I say “This would be great if we had all this money and all these people to develop all these operational systems”, but organizations don’t all have the same resources as Google and it may be difficult to get their head above water and focus on automating things because everything’s breaking all the time. How can that organization, that sees itself as not being like Google at all, what can they apply from the Google philosophy?
Björn Rabenstein: [00:49:29.18] SoundCloud is a pretty good case study for exactly that problem. SoundCloud has a hundred-something engineers – I don’t know the exact number right now… If you know the tech scene and you know how many engineers typical competitors of SoundCloud have, you would be surprised how small the organization is and how much they get done. So we are definitely in that situation where we are resource-wise – resource is a very general term, including humans – where we have not a lot to spend and we still have to keep the system running.
[00:50:09.07] We were pretty much in that state where the old approach – you can read that everywhere: a monolithic, Ruby on Rails application for the site. That just didn’t scale in various ways, so something had to be done. The decision was to go microservices, and that’s really bold to do under [unintelligible 00:50:28.21] so that was just a lot of work to do. At the same time, there’s constant pressure to launch new features because it’s a hard competition, so you had exactly what you were thinking about.
[00:50:44.03] So what can we even do? Our site is falling apart all the time, we have new features to launch, we have new infrastructure to implement… Still we kind of pulled it off. We are pretty reliable by now, but I think it’s difficult to just give a rule of thumb on what you can do. Of course, as an engineer I say I just push back if I’m asked to implement the next big feature and I think we have way too much tech debt. You have to push back, I guess, but sometimes there’s just a business need; if we don’t implement that feature we are out of business.
[00:51:20.23] Yes, it’s hard, but you need to make yourself heard; you weigh all the things you have to take into account and come to the right decision. But dying of tech debt is definitely something that can easily happen and I guess for people outside of an engineering organization it’s really hard to realize. They see engineering as a black box, they can’t know what’s going on, and you have to explain that quite carefully, that you need some air to get out of this 50% – when the 50% toil becomes 100%, and then 200%, and then there’s just no way back. Over the 50% you should really try hard to get the weight out of that. It might be hard, but in the SoundCloud case we have proven it’s possible.
Robert Blumen: [00:52:12.13] Yes, and I do want to mention show #224 that we did about technical debt, which addresses some of those issues.
Björn, the last question before we start wrapping up – you and I were chatting offline… You have a background in molecular science, and I’m working in the SRE or DevOps field; I had my education in physics, and one of my co-workers also had a physics background. You said it’s not uncommon to meet people in the SRE world who have a non-conventional background, meaning I suppose outside of computer science. Do you have any thoughts on why that might be?
Björn Rabenstein: [00:52:57.24] Yes, that’s something I found… I never applied myself at Google, I would not have dared. It was like, “Okay, this is what computer science graduates from Stanford do” or something. But once I was inside I realized there are a lot of physicists, chemists like myself, historians, or people who have never studied anything and it just turned out that they have the talent; of course, they had brought also some experience. Recruiters always told me it’s so hard to hire for SRE because you want this software engineering mindset combined with “Yes, I’m totally not afraid of thinking about network packets, or running a script…” Have this combined skill set of a software engineer that is also kind of a sysadmin-y type and has all those things in their DNA as well.
[00:54:01.01] I could imagine – that’s pure speculation – if you are coming from this physics or chemistry thing, that you are more likely to develop those skills. In my case, I’m a biochemist and we happened to do simulations and we needed a lot of computing power. The university didn’t have a lot of budget, so back then in the ’90s we started to build a computing cluster. That started with, “Okay, we don’t even have technicians that would help us assemble those things, so we now have to go to shops and get three offers from different shops and then build it all together.” The Linux network drivers were a bit problematic with a lot of traffic back then, so you started to do everything… Nowadays the buzzword is full stack.
[00:54:54.02] So you are actually a physicist or a chemist, but you start to buy computer hardware and hack kernel drivers, and then in the end you also write the software that runs the simulation. You have this whole view, and then you also have to present results in the end, at a conference. Perhaps that’s a nice education, that sharpens your mind for what you need in an SRE setting.
Robert Blumen: [00:55:19.11] Björn, is there anywhere on the web where listeners could go if they would like to learn more about you?
Björn Rabenstein: [00:55:25.08] Yes, just google me, I guess. I don’t exist (kind of) because I don’t have a Twitter handle, but recently with Prometheus becoming so popular, the Prometheus folks get invited to conferences a lot, and I think my name is pretty unique on the internet. There might be one or two others with that name, but I guess I’m more popular than them.
You can listen to loads of talks about Prometheus. There was an SREcon in Dublin this year; I gave a nice talk about alerting, that pretty much resonates with the topics we covered here, so you can definitely find a lot about that.
Robert Blumen: [00:56:14.09] Great. Björn, thank you so much for speaking to Software Engineering Radio.
Björn Rabenstein: [00:56:20.08] Alright, thank you, too.
Robert Blumen: [00:56:22.17] Thank you, listeners, for downloading the show. You can send us feedback multiple ways. You can direct-message us on Twitter @SERadio, on our Facebook group or LinkedIn group, or e-mail us: team@se-radio.net.
For Software Engineering Radio, this has been Robert Blumen.

* * *
Thanks for listening to SE Radio, an educational program brought to you by IEEE Software Magazine. For more information about the podcast, including other episodes, visit our website at se-radio.net.
To provide feedback, you can write comments on each episode on the website, or write a review on iTunes. Mention or message us on Twitter @seradio, or search for the Software Engineering Radio Group on LinkedIn, Google+ or Facebook. You can also e-mail us at team@se-radio.net. This and all other episodes of SE Radio are licensed under the Creative Commons 2.5 license. Thanks again for your support!

 
