Tammy Butow

SE Radio 325: Tammy Butow on Chaos Engineering

Edaena Salinas talks with Tammy Butow about chaos engineering. Topics include: the factors that caused chaos engineering to emerge, the different types of chaos that can be introduced to a system, and how to structure experiments. Some of the chaos engineering experiments discussed are DNS-related attacks, black hole attacks, and database attacks. Tammy highlighted the importance of a service level agreement and went over its components. The discussion continued with what metrics to collect for monitoring, incident management, being on call, and tracking down an issue.



Transcript

Transcript brought to you by IEEE Software
[0:00:00]

Announcer: This is Software Engineering Radio, the podcast for professional developers, on the Web at SE-radio.net. SE Radio is brought to you by the IEEE Computer Society and by IEEE Software Magazine, online at computer.org/software.

Edaena Salinas: Tammy Butow is Principal Site Reliability Engineer at Gremlin, where she works on chaos engineering. Prior to that she worked at Dropbox in site reliability engineering, and also at DigitalOcean in cloud infrastructure as a service. She is also co-founder of Girl Geek Academy, which is a global movement to teach one million women technical skills by 2025. Tammy, welcome to the show.

Tammy Butow: Thanks very much. I’m really excited to be here.

Edaena Salinas: Thank you. Today we’re going to be talking about chaos engineering. And I want to begin with—

[0:01:00]

understanding the motivation for it, because what I saw is that some of the worst outages are what led to the introduction of chaos engineering. You worked at Dropbox. Can you talk about the worst outage ever over there?

Tammy Butow: Yeah, sure. It’s definitely true that a lot of the worst outages that have happened at a number of companies have then resulted in those companies deciding to start practicing chaos engineering. So the worst outage at Dropbox is publicly available on the blog. It was in 2014 and it went for multiple days. So it was a really bad outage related to the databases. And you can read about it; the VP of engineering wrote a full review of what had happened and also explained the action items that were going to be taken to make sure that it didn’t happen again.

And one of those action items was actually doing a lot of work to inject failure,—

[0:02:00]

to make sure that everything was more reliable. Because it’s this whole new idea, which is what chaos engineering is focused on: you can inject failure, similar to the idea of a flu shot, where you inject a bit of harm but it actually makes you stronger, and you are able to withstand really bad failure in the future because you have injected it frequently. Yeah, chaos engineering to me is just something that you should be doing. And a lot of people learn it the hard way. So I am really excited to be helping people learn it before they have a very, very bad outage.

Edaena Salinas: And just to clarify, what you’re saying is chaos engineering means injecting failure?

Tammy Butow: Yes.

Edaena Salinas: Okay. And sort of the purpose of this is to learn about what potentially could go wrong, is that correct?

Tammy Butow: Yeah. So the—

[0:03:00]

idea is, say for example, we all know these days that you run on infrastructure, whether it’s bare metal or in the cloud, and you know that it’s going to fail, it’s going to break in some way. You might have hardware failure, power failure, firmware failure, some sort of kernel issue, some sort of issue with your own tooling. There can be so many different types of things that go wrong. And it’s just really common; we all know that things will go wrong.

I think sometimes, though, we wish and dream that cloud infrastructure was going to be 100-percent available all the time and work perfectly, but we really know deep down that that’s not true. And we need to build our infrastructure with that in mind, knowing that things will break, but then actually it’s much better if you control the injection of failure. So instead of saying, “Well, like I know that maybe I’m going to have some downtime and my infrastructure is going to break,” maybe you—

[0:04:00]

think, “Mine might only break once every few months.” It’s actually better to inject the failure, observe it, monitor it, learn from it, and then also make your infrastructure more reliable. Because you learn so much from actually doing chaos engineering experiments, so that’s what I really like to focus on.

If you think about injecting failure to do something simple, like shutting down an instance or shutting down a server, that makes sure you’re testing a number of things, like your automated self-healing tooling that you may have; hopefully you’ve got that in place. Or if you are, say, regularly shutting down a replica for a database, then you are testing to make sure that your clone process is really resilient, instead of only testing the cloning process when a replica happens to shut down. Because maybe that happens once every few months or maybe it happens once every few weeks, but if you actually test it—

[0:05:00]

once a week on a regular basis, I think that’s so much better. Or even more frequently than once a week; you could really test it once a day.

And a lot of this depends on how big your infrastructure is. Because say if you have tens of thousands of servers or thousands of servers or hundreds of servers, it really depends on how frequently you’re going to be injecting failure and also what kind of failure you will inject. But everybody can learn from failure injection. Yeah, it’s just so much better.
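A minimal sketch of the instance-shutdown experiment described above, assuming an AWS auto scaling group is responsible for self-healing; the instance ID, region, and host names are illustrative, not from the episode:

    #!/usr/bin/env bash
    # Sketch: terminate one instance and watch whether the self-healing
    # tooling (here, a hypothetical auto scaling group) replaces it.
    set -euo pipefail

    INSTANCE_ID="i-0123456789abcdef0"   # hypothetical instance in the group
    REGION="us-east-1"

    echo "Terminating ${INSTANCE_ID} at $(date -u +%FT%TZ)"
    aws ec2 terminate-instances --region "${REGION}" --instance-ids "${INSTANCE_ID}"

    # Things to observe afterwards: did monitoring notice, did a replacement
    # launch, and how long until it was serving traffic again?
    aws ec2 describe-instances --region "${REGION}" --instance-ids "${INSTANCE_ID}" \
      --query 'Reservations[].Instances[].State.Name' --output text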

Edaena Salinas: Yes. The first time I came into contact with the notion of testing for failure and sort of trying to prevent failure was in the testing discipline. So I’m curious how it differs from chaos engineering, or whether it differs at all.

Tammy Butow: Yeah. So from my background, I came up through studying computer science, working as a software engineer, then for me, I started to work at a bank after university and it was really interesting for me, because in a bank you—

[0:06:00]

have to do something called disaster recovery tests, they call them DRTs, and you need to do them once a quarter to hold your banking license. It’s a compliance requirement. And if you can’t prove that you can fail over your entire bank then you actually have problems with the regulators and they will say, “Look, you need to be able to prove this, because otherwise you can’t hold your banking license, you can’t service customers, so then you’ll be shut down.”

So that was where I was actually introduced to this idea of live-scale failover. You actually failover everything and you’re also attesting that your entire team can operate, even if your office wasn’t available and it was the weekend. So you actually go to a different building for two days over the weekend and failover everything and make sure that, yeah, if something went wrong at your headquarters you would be able to use the Internet, you’d all have your laptops, you’d have everything required to be able to keep the bank running during that weekend.

So that was my introduction. And then while I was—

[0:07:00]

working as an engineer there I actually was doing a lot of things. I was very much full stack before it was called full stack. And I actually just really, really wanted to go further and further back into the stack, because I was feeling like when I was building things, you know, the business might come to me and say, “Hey, Tammy, it would be really great if you could build a new way for people to be able to process mortgages. Maybe you could build a mobile app or a new type of interface.” And I just felt like whenever I built something it was always really slow or like I couldn’t rely on the infrastructure that was underneath it for it to work well. So it actually sent me further and further backwards.

I spent a lot of time working with the database. Then I went even further back looking at the hardware, because it was all bare metal at the time, working with networking, also working on security as an engineer there too. And for me I learned, yeah, that, you know, it is very helpful if you have a good understanding from—

[0:08:00]

the very top level of JavaScript, like how can your JavaScript fail? How can your CSS fail? How can that actually impact the user? Then all the way back down to the power within the data center that your cloud provider is using. So going through every single level, that’s where you can actually think about failure injection; you can inject it across the whole stack. And I think of it also like the OSI model, thinking about every layer there.

But the thing is, what I really learned over the years is that if you are doing failure injection right at, you know, the hardware level or the network level, that’s going to impact everything. So that’s an area where I think we need to do a lot of work to be able to withstand those failures. Because, you know, if your infrastructure is down, everything is down; you can’t access anything. Whereas say, for example, if your CSS is not working, like something’s gone wrong there,—

[0:09:00]

then you might have changed the way that like your interface is presented to users, or it might be completely not in a functional state at all, depending on how you’re using CSS.

So it’s really interesting when you look at every single layer, and that’s how I think about it. But I didn’t come from a testing background except for just having to perform disaster recovery tests. But I have worked with a lot of teams that have done really great work doing testing and they did a lot of automated testing. So I’ve worked really closely with those teams. And yeah, that’s been really interesting within banking when you have the automated testing teams. They’re doing a lot of work to make sure that everything is processed correctly. But yeah, there’s also a lot of—that’s a whole other field and industry, this space of testing.

Edaena Salinas: Yes. And you mentioned something just now, disaster recovery tests. By this do you mean, for example, what you mentioned at the beginning, where you said you’re shutting down a replica and—

[0:10:00]

you might want to test your self-healing tool; is that what disaster recovery test means?

Tammy Butow: So a disaster recovery test is even more large-scale than that. The idea would be that you fail over your entire company. So you shut down all of your normal business operations, you shut down the building even, you shut down your primary database, primary data center, primary services, and you fail over to what they usually call hot-cold, or sometimes they do hot-hot. So with banking you often actually fail over to the cold. And it’s that you actually have a full other set of hardware in a totally different data center that you’re able to bring up and get everything running. So every single service would fail over. Whether it’s foreign exchange, mortgage broking, Internet banking, they’ll go through every service and make sure that all of them work.

Edaena Salinas: So essentially what you’re doing here is you’re testing the backup—

[0:11:00]

services, right? The backup hardware and everything, in case the one you’re currently using is failing?

Tammy Butow: Exactly. Yeah. And the reason is you do it often for preparing for a large-scale natural disaster. That’s what a lot of people are really worried about. So for example, what if there was an earthquake or something like that. And often your backup data center is usually quite far away from where your primary one is so that it should be far enough that if there was some type of disaster situation then you would be able to have people that could go to that other location. That’s what you really—yeah.

Edaena Salinas: I see. And earlier another thing you mentioned was the infrastructure and the size of the infrastructure, how big it is. What is the role of this when deciding whether to do chaos engineering or not?

Tammy Butow: Yeah. That gets interesting. If you think of a very large-scale infrastructure,—

[0:12:00]

so let’s say you have infrastructure where there are over—let’s say over 5,000 machines. Like that’s quite big. And if you have over 5,000 machines, but say you only have like a few engineers who are managing the maintenance, the operations, all of that sort of work for these machines, keeping them up to date, keeping the fleet healthy, then maybe you only have a handful of engineers for doing that for 5,000 hosts. Maybe you have five engineers. Or maybe you have ten engineers, like that would actually be a lot. Or maybe you have all of your engineers that work at the whole company are responsible for doing that, but then you only actually have say 200 engineers and 5,000 machines.

So often what you realize is you have a lot of machines, but you don’t have that many people. So that’s why you need to do a lot of automation work and you need to really understand the failure modes that you can get yourselves in.

And the thing there too is say you have ten—

[0:13:00]

engineers who are responsible for maintaining the fleet, making sure that your systems are reliable, looking after availability and durability, and you have a team of SREs that are really responsible for that. But the thing is, you know, not all of them would have been there for many years. Maybe some of them have only been there a few months. Maybe some of them have been there one year, or a few have been there for four years. Maybe some of them just started. Maybe somebody has been there for a few years and they just left with a lot of knowledge.

So that’s always something you need to think about with this type of large-scale infrastructure. There’s a lot of knowledge that needs to be transferred throughout the team. So I also think there’s something there with large-scale infrastructure; you usually don’t have that many more engineers than you do for small-scale infrastructure, but you have a lot more that you need to look after and a lot more that you need to understand.

So that’s also another reason that chaos engineering can be useful, because you can use the failure injection as on-call—

[0:14:00]

training and it helps you to be able to go, “Okay, if we”—even if you just walk through it and run a game day, which is this idea of having 10 to 15 engineers in a room together, talking through some scenarios that you might run through where you would inject failure in your infrastructure and you could whiteboard it up and go, “All right, if we injected failure at this point in our infrastructure then we would expect this and this to happen” and then you would actually say, “We expect these to be the downstream impact, so the cascading failure could look like this.”

But then you could say, “We feel pretty confident that we could inject this type of failure and everything would be all right” and then within the game day you would actually inject that and then see how it goes. So it’s your hypothesis that you have, “I feel like we can, for example, shut down this replica for this database and the clone should kick off and everything should be fine and then within this period of time I would be back to having a primary with two replicas.” So that’s a scenario that you would run through.

[0:15:00]

But there might be others that you go, “Well, we’re not yet ready to run through that scenario and actually inject that failure because we need to do these five things. We need to fix our remediation. We need to fix other parts of our infrastructure to make it more reliable so that it can withstand that failure.” And that’s what you’re going to learn when you do these exercises.
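A game day of the kind described above can be written down before anything is injected; here is a sketch of such a run sheet, with hypothetical service names and thresholds:

    #!/usr/bin/env bash
    # Game day run sheet, expressed as a script so the hypothesis is explicit.
    #
    # Hypothesis  : shutting down one replica of the users database triggers
    #               the clone process, and we are back to one primary plus two
    #               replicas within 30 minutes.
    # Blast radius: one replica host, staging environment only.
    # Abort if    : primary error rate rises, or the clone has not started
    #               within 10 minutes.
    set -euo pipefail

    TARGET="db-replica-1.staging.example.internal"   # hypothetical host

    echo "Game day start: $(date -u +%FT%TZ), target=${TARGET}"
    ssh "${TARGET}" "sudo systemctl stop mysql"

    # From here the team watches dashboards together and records the time to
    # detection, the time the clone starts and finishes, and any unexpected
    # downstream (cascading) impact.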

Edaena Salinas: Mm-hmm. So one thing you mentioned is it’s important to have it in place depending on the nature of your employees, if they have been there a long time or not. And also what you are saying is there isn’t really a rule that says chaos engineering should just happen if you’re operating at large scale. You can also do it at small scale, right?

Tammy Butow: Yeah. It’s actually super-important to do it at small scale, because one of the really common mistakes, I think, with running small-scale infrastructure is having not enough machines and just putting everything on, you know, maybe three or four—

[0:16:00]

servers or instances and not spreading things out, not having redundancy, not having backups. That’s likely where you get into trouble. Because, you know, you don’t yet have a ton of users so you don’t need to have that to be able to handle the large amount of traffic. But the thing is if you don’t have those backup mechanisms in place and that redundancy and you don’t think about how you can failover your services, then it’s actually going to be impossible to be able to failover.

And what you’ll notice will happen with really small-scale infrastructure when they haven’t invested in building that out is that they can have like much longer downtime and they’re much more at risk of data loss because they haven’t thought through the backup scenario for their database infrastructure. For example, maybe they’ve not tested the restore process to see if they can actually restore their backups. And also they often might not have very good incident—

[0:17:00]

management programs in place to be able to know yes, like we’re currently experiencing an issue, we need to do this to fix it.

But a lot of these things, you know, it doesn’t take a ton of time to get set up and to put in place, but you need to think about it in a really different way. So a lot of the time it actually comes down to funding, because you need to pay extra money to be able to afford to purchase these instances. But that’s the exciting thing about cloud infrastructure now, that it’s becoming more and more affordable and you’re able to purchase smaller size instances, you can spread out your services across a number of instances and then think about building more reliable infrastructure and understanding that failure will happen and that’s expected, but building your infrastructure to handle that.

So that’s what I’m excited about right now. It used to be much harder, you know, when you would have to buy bare metal, you could only afford one server or maybe three servers and, you know, then when they would fail you’d have to try and get—

[0:18:00]

them repaired. Now it’s totally different with cloud infrastructure.

Edaena Salinas: Yes. And actually the first thing that I was thinking about was this additional cost. Like you were saying, being available in other countries and having more instances costs money. So that’s what I was wondering: to what extent is this a lack of knowledge about needing to have these things in place, and to what extent is it the effect of cost? But it sounds like it shouldn’t be a significant cost.

Tammy Butow: No, it’s really not a significant cost. And then also you have to look at, you know, the cost of servers these days, they’re getting cheaper and cheaper all the time. Like when you get your instances. Then the other thing is you look at the other side, if you pay, for example, this much for your infrastructure, this is your monthly bill, what would happen if you were down for three days? Which is pretty common; like you can definitely—you see it quite frequently, three-day outages or even weeklong outages.

[0:19:00]

And these happen to some of the biggest companies. And, you know, small companies as well; it happens to everybody.

And the thing there is what happens to your business if you’re down for three days? Your customers are probably going to be very unhappy and you might lose your customers. Whereas if you think, I’m going to invest in my infrastructure and I’m going to build it in a reliable way, which means I’m going to build it in this way that it might cost me a bit more for my monthly bill, but then that means I am going to have a more reliable service and my customers will be happier because they’re not going to experience that downtime or like data loss, which is even worse.

Edaena Salinas: Mm-hmm. And I guess also the nature of the product and the project matters, because for example, if it’s a photo app, maybe it’s not that critical, you’re not going to lose customers; they’re just going to be annoyed, like, “Ah, it doesn’t work today.” But if you’re dealing with a hospital system or a banking system, I think it’s even more critical, right, the type of data.

[0:20:00]

Tammy Butow: Yeah, definitely. It’s really interesting, I think—the way that I think about it too is even if it was a photo app, because people can so easily move services, if your photo app is down and somebody trusted their photos in there, and then it goes down for a day or two days and they can’t access their photos, then they might lose trust, right, in your business, because they feel like, “Can I really trust my photos in there? Do I need to have a backup service to hold another set of my photos? Then why am I paying for this service if I don’t trust it? I couldn’t even access my data.” And then they might worry that if it was down does that mean that it’s not reliable, that one day it might lose data.

So I think like these are all the things that people think about, right, when they’re looking at whether they want to use a service, and that’s—like that’s a tough spot to be in, when it’s so easy for people to create new businesses these days. And I just think it’s worth it then, you know? I actually feel like building more reliable—

[0:21:00]

services can actually help you acquire new customers and can help you make money, because you’re showing people they can trust your service.

Edaena Salinas: That’s true. And also, like you said, even the big players, the big companies have had outages for several days sometimes. But what’s good is that these companies make available their infrastructure, like AWS, where I assume they put all those learnings. And you are more equipped to recover from failure if you leverage their systems, right?

Tammy Butow: Yeah, exactly. Like it’s often—like for example, there was a big outage last year that I was on call for, which was the S3 outage. And the thing there is, you know, it was a region of S3 that was impacted, and there are several regions that you can use, or you can have a backup mechanism in place so that if that region goes down then you have some type of failover ready. And I think that’s just a really important way that we need to think these days. Yeah.

Edaena Salinas: Yeah. I want to talk in more detail—

[0:22:00]

about chaos engineering and the types of failures that we can discover by using it. Can you talk about some examples of things we can find out by using chaos engineering?

Tammy Butow: Yeah, sure. So just last week I actually taught a chaos engineering boot camp, which was at SREcon in Santa Clara. It’s a half-day boot camp; I’ve done it twice now. I also did it at O’Reilly Velocity, and it’s a really good experience for engineers to come together. What we do is I create a cluster for everyone, because often people say, “Oh, the only way to learn about chaos engineering is to do it on production.” And I always say, “No, you can actually start doing chaos engineering first on just a demo environment. Then you can also do it on staging. And then once you feel confident, you can do it on production.”

But what I do for the chaos engineering boot camp is create a demo environment. So I spin up a—

[0:23:00]

primary and two nodes, a Kubernetes cluster for everybody that’s coming along, and actually put people in groups. So I’ll have three people working on a cluster together, and then I give them access to my GitHub repo, which has a few chaos engineering experiments in there that people can start to run, which is a nice taste test of how you start to inject failure and how you start to run these different types of experiments.

And I also deploy a demo microservices app from Weaveworks, which is an e-commerce store. So it means that when you come into the workshop you’ve got this demo environment, you have the application there, you can go to the IP address, you can see the e-commerce site running, and then you can start to inject the failure. But because you’ve got three people that have access to the cluster it’s really much more interesting than having just one person on that demo environment, because it’s more similar to actually working on production systems, where you have multiple people that have access to your infrastructure—

[0:24:00]

at the same time, right?
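Roughly, that boot-camp environment can be approximated on any small test cluster; the manifest below is the one the Weaveworks/microservices-demo (“Sock Shop”) project publishes for Kubernetes, and the exact URL should be verified against that repository before use:

    # Assumes kubectl already points at a throwaway Kubernetes cluster
    # (for example a small cluster from your cloud provider, or kind/minikube).
    kubectl create namespace sock-shop

    # Deploy the demo e-commerce app; verify this manifest path in the
    # microservices-demo repository before relying on it.
    kubectl apply -n sock-shop \
      -f https://raw.githubusercontent.com/microservices-demo/microservices-demo/master/deploy/kubernetes/complete-demo.yaml

    # Find the front-end service and load its address to see the store running.
    kubectl get svc front-end -n sock-shop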

And like this actually teaches you a lot about running chaos engineering experiments in a safe way and in a secure way, because there are a lot of ways that you can do it badly. So I like to also show people that side of chaos engineering too, that it’s really important to think of how you can run these experiments in a safe way, in a controlled way instead of just random and anyone doing whatever at any time.

So the first experiment we do is a CPU attack, where we’re just consuming CPU resources. That one shows you what that looks like within your infrastructure, and I get people to check whether they can tell that it’s happening right now. I usually also put Datadog agents on there so everyone can see what that looks like. But then you can also just check from your command-line tools, using htop or something like that. And yeah, so when you open up htop you can go, “All right, I can see the attack happening right now. I can see that it’s consuming CPU—

[0:25:00]

resources,” but then you’re seeing how does Kubernetes handle this. And for that attack, that specific one, everything is pretty good and, you know, nothing strange; you’ll notice nothing really that is customer-impacting.
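For readers following along on a demo host, a rough equivalent of that first CPU attack, assuming the stress-ng tool is installed:

    # Consume four CPU workers for 60 seconds, then stop.
    stress-ng --cpu 4 --timeout 60s

    # In another terminal, watch the load climb and then recover:
    htop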

But then we run some other attacks which are much more damaging. So we run a network corruption attack, where we’re actually corrupting network packets. And for this workshop it was very interesting: usually you would corrupt a smaller percentage of packets, but I decided to go wild and my script was corrupting 50 percent of packets, because I just wanted to make it so you could really see the impact of the chaos engineering experiment. And you definitely could. Especially on conference Wi-Fi, when people started to run that it just completely hosed their instances. They’re able to see what happens when you have a massive network issue with a high amount of packet corruption: you can’t actually load the site, you can’t even load top. You know, you’re not able to actually see—

[0:26:00]

what’s happening on your host at all.
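The packet-corruption attack can be reproduced on a disposable Linux host with tc/netem; the interface name is an assumption (check it with ip link), and 50 percent is deliberately extreme, as in the workshop:

    # Corrupt 50% of packets on eth0 (start far lower on anything real).
    sudo tc qdisc add dev eth0 root netem corrupt 50%

    # ...observe: the demo site stops loading, SSH becomes painful...

    # Always remove the rule afterwards, or the host stays degraded:
    sudo tc qdisc del dev eth0 root netem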

So that’s at a point where if you had automated remediation and automated tooling then you would—because you’ve got this Kubernetes cluster, like the idea there is you would be able to just get rid of that host and then have a new host replace it. Which I like that about Kubernetes, that it’s built in a way that you should be able to just get rid of that host because something weird is going on on that host, and that is a way to actually trigger that to happen. ‘Cause you are causing some type of failure that is not expected, right? But then you need to be able to capture that, record that, and make it understand that your system needs to take an action based on that.

So yeah, that’s like two examples of the attacks. There are so many others. Another one is, you know, you can do DNS-related attacks; for example, there was that big Dyn outage that happened. Then you can also do black hole attacks, time travel, all types of different attacks. We have 11 attacks built into—

[0:27:00]

Gremlin, and we just launched in December. So that’s really exciting. And we’re constantly building more, creating more different ways to inject failure in unique ways. ‘Cause I think everyone thinks of failure as a host going down, but that’s just pretty standard failure. We are trying to think more about the weird types of failure.

Say for example if you inject network latency to a replica for a database, what happens to the primary when it’s trying to communicate with the replica? Can it handle that? Does something go wrong with the replication process? What happens if a promotion then needs to happen?

So it enables you to actually think through all these things and then actually trigger the failure to understand what happens.
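The replica-latency scenario just described can be sketched the same way; run it on the replica host, with the interface name and delay values as assumptions:

    # Add ~300ms of delay (with 50ms jitter) to all traffic leaving the replica.
    sudo tc qdisc add dev eth0 root netem delay 300ms 50ms

    # While it runs: does a replication lag alert fire? Does the primary's
    # connection handling cope? Would a promotion succeed if one were needed?

    # Clean up when done:
    sudo tc qdisc del dev eth0 root netem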

Edaena Salinas: Mm-hmm. You mentioned a couple of interesting points over here, especially the attacks. One thing that I wanted to clarify regarding the experiment and the setup is you mentioned you added a couple of Datadog agents. Can you explain what this is?

Tammy Butow: Yeah. So Datadog is a monitoring tool, and you can deploy your—

[0:28:00]

agent in different ways. You can just install it directly on the host, or you can do it in a number of other ways. And you’re able to get a free trial for Datadog, so that’s why I like to use it. It’s very quick: you install the agent, just run one simple one-line script, and then your agent will start reporting data to the Datadog UI and you’re able to pick up all of the stuff, like CPU resources, all the things that we need to know for when we’re running our attacks. And I find that to be a really handy way to show everyone very, very quickly on a demo environment what the impact of failure injection is, because it’s very visible.

But if you didn’t want to use Datadog then you could use something else, like Prometheus and Grafana, and use some open source monitoring tools like that. But yeah, the way that Gremlin works is actually similar to the idea of how you would install Datadog too. So for Gremlin to run your chaos engineering experiments—

[0:29:00]

you actually also install an agent on every host, or you can install Gremlin inside a container. And yeah, that’s how that works, too. It’s a similar idea where you install agents across your hosts and they report back up, and then there’s also a control plane with a nice UI, an API, and a command-line tool.

Commercial: Has work become synonymous with e-mail? Do you hear chat pings in your sleep? Good news. Stack Overflow for Teams is a private secure home for your team’s questions and answers. No more digging through stale wikis and lost e-mails. Give your team back the time it needs to build better products. Go to S.TK/radio to try Teams today, with your first 14 days free.

Edaena Salinas: Yes, since you just brought up Gremlin and it’s where you’re currently working, can you explain a little bit more about what it is?

Tammy Butow: Yeah. So Gremlin is the first company founded to actually provide a—

[0:30:00]

platform to do chaos engineering. And it’s like 100-percent focused on doing that.

So I was really excited to join. I joined at the end of last year and I met the founders a few years back, when I was—I had just started working at Dropbox and one of the founders, Colton, he’s the CEO, he reached out to me and asked me if I’d like to speak on a conference track that he was putting together for QCon New York. And he was putting together a track all about failure and he asked me if I wanted to speak about disaster recovery testing that I was running at Dropbox and what I had done in the past at NAB. And I said, “Yeah, that would be great.”

So that’s how we met and we stayed in touch over the years and then I reached out to him recently because I was really interested in working on Gremlin with him and with the team. And I joined as employee nine, which is very exciting to be working somewhere. It’s the smallest company that I’ve worked at so far. But I think it’s really great that we’re actually focused on building a chaos engineering platform.

[0:31:00]

And we’re trying to build it in a way that it works for everybody, no matter what infrastructure you’re using. If you’re on AWS, GCP, Azure, it will work for all of those different types of platforms: DigitalOcean, bare metal, whatever it is. And also whatever type of database you’re using: MySQL, Postgres, DynamoDB. It can be anything.

So to me that’s really exciting and a big step forward for the industry, because before a lot of what was happening in the chaos engineering space was that people were, you know, making a lot of smaller tools, but they were focused on just doing one type of thing, just for one type of provider or like just for one type of experiment. And it wasn’t so much where it was about let’s build something that actually works everywhere; that’s what I really like the idea of doing.

Edaena Salinas: Mm-hmm. And does this include a set of predefined tests that you can do?

Tammy Butow: Yeah, exactly. So we have a—

[0:32:00]

set of predefined—we call them attacks, because we like to say that you’re actually triggering an attack on your services or your infrastructure. And the idea too is, if you talk to the Netflix team, they won’t call them tests, but Colton will talk about chaos engineering as doing testing. So different people have different thoughts on that and what that is. And the reason they don’t call it a test is because they say it’s an experiment. With a test you’re trying to test something: I want to test this and see the result. With an experiment it’s more exploratory; you’re trying to learn from it. So I guess that’s one difference there.

But then we call them attacks. And I like to think of them also as there’s not only the predefined attacks, but you can also run attack combos. So you can combine several attacks together and then see what is the impact if you do say two attacks at the same time, how does your infrastructure handle that? That’s when you get to a really much more advanced state. You can also schedule the attacks to run at the same time every week on the same day, or you can schedule them to run daily or weekly.

[0:33:00]

So I think that’s a really good move forward as well. And there’s also an API, so you can actually roll the attacks and Gremlin tooling into your own tooling. That makes it very, very easy to trigger attacks based on something happening within your own infrastructure. That’s something that Twilio did recently, which was pretty cool, with one of their systems, called Ratequeue. They used the Gremlin API to trigger failover. Yeah.
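As an illustration only of wiring attacks into your own tooling: the endpoint, path, and request body below are hypothetical placeholders, not Gremlin’s documented API, so consult the real API reference before building on it:

    # Hypothetical example of triggering an attack from a script or pipeline.
    curl -s -X POST "https://api.example-chaos-platform.com/v1/attacks" \
      -H "Authorization: Bearer ${API_TOKEN}" \
      -H "Content-Type: application/json" \
      -d '{"target": {"tags": {"service": "checkout-staging"}},
           "attack": {"type": "blackhole", "length_seconds": 60}}'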

Edaena Salinas: In the workshop that you described about chaos engineering, you mentioned you did a couple of attacks, you and the people at the workshop. One of them was DNS related. There was another one called the black hole attack. Can you explain what this attack means?

Tammy Butow: Yeah. So actually in the workshop I don’t do black hole or DNS; I just do CPU and network-related attacks. So you can find those on my GitHub. If you go—

[0:34:00]

to GitHub.com/TammyButow you’ll find a repo, which is a chaos engineering boot camp repo. And then within there I have some chaos engineering scripts that you can run. So you can just pull that repo down onto your machine and run those attacks and then see an example of what happens. But then we have some other types of attacks that you can run as well, which are only available within Gremlin. Yeah, for those ones you have to actually get a Gremlin trial. You can get a free trial and you can check them out.

So with the black hole attack, this attack will drop matching network traffic that you specify. So that’s what that idea is. I guess if you think about it like you’re creating a black hole and you’re sending all of the traffic into this black hole, so it’s going to basically nowhere; you’re just black-holing traffic. Then DNS will block access to DNS servers. So this is to be able to say, “Well what happens if our DNS servers were down?” Like say if you were—

[0:35:00]

using Dyn and then Dyn had an outage, what would happen next? And that’s the idea there.

Yeah. And then there’s also time travel, where you’re actually changing the system time for the host, and that one is really interesting. If you think about it, it’s similar to Daylight Saving Time, maybe some type of issue there, or other time-related events. Everyone was really worried back in the day about the Y2K bug, what happens when the time changes. And so you’d actually be able to reproduce that.
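Rough shell-level sketches of those three attack types, for a disposable host only; the ports, interface assumptions, and one-day offset are illustrative (the packaged versions of these attacks are reversible and safer):

    # Black hole: drop matching traffic, here everything to port 5432 (Postgres).
    sudo iptables -A OUTPUT -p tcp --dport 5432 -j DROP

    # DNS attack: block access to DNS servers and see what breaks first.
    sudo iptables -A OUTPUT -p udp --dport 53 -j DROP
    sudo iptables -A OUTPUT -p tcp --dport 53 -j DROP

    # Time travel: jump the system clock forward a day (GNU date).
    sudo date -s "$(date -d '+1 day')"

    # Clean up: delete the rules and resync the clock (e.g. restart NTP/chrony).
    sudo iptables -D OUTPUT -p tcp --dport 5432 -j DROP
    sudo iptables -D OUTPUT -p udp --dport 53 -j DROP
    sudo iptables -D OUTPUT -p tcp --dport 53 -j DROP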

Edaena Salinas: For those that are not familiar with the Y2K bug, can you explain briefly what it is?

Tammy Butow: Yeah. It was before I was working in industry; I was still studying at the time. It was when the year 2000 was coming and all of the computers and machines were going to tick over. I was in high school at the time, and it was called the Y2K bug or the Millennium Bug, and everyone was really worried. It was so interesting when you read the news; everyone was worried that—

[0:36:00]

all the computers in the world were going to crash and we would lose all the data everywhere and no one would be able to access anything, because of how people had been storing dates, often with only two-digit years. And they didn’t think we would be able to handle it when we clocked over into 2000; they thought time might start again from the beginning of time for computers. But actually in the end everything was fine, so it was really a big scare; people were very worried. But in the end, yeah, it really wasn’t a big issue. It was an interesting thing, though, and it could’ve caused a lot of problems for people. And I think, you know, it’s just an interesting thing to read about.

Edaena Salinas: Yes. Another thing that I wanted to clarify is you were saying these attacks, people can find the scripts on your GitHub. Can you talk about these scripts, like what are they written in? What do they look like?

Tammy Butow: Yeah. So if you check out my GitHub you’ll find the scripts. Like they’re actually—on my GitHub they’re just bash scripts, so you can—

[0:37:00]

just run those. But the other thing is you don’t have to run them as bash scripts; for some of them you can just copy the command and run it straight in your terminal. And a lot of the time with chaos engineering you’re really using different types of things that have been built into Linux, that you’re trying to change or use to be able to introduce these failures, because they’re infrastructure-related failures, especially the networking-related ones. So I would recommend just grabbing the script and checking it out first: read the script, see how it works, and you can read the Linux manual pages to learn more about what I’m doing. But it’s actually really cool to just run it yourself and you’ll be able to see the impact of those.

Yeah, it’s a really cool thing to do on a demo machine. You can even just do it on a host with nothing else installed on it, if you’ve never run a chaos engineering experiment before. And you’ll see like—

[0:38:00]

how easy it is to run your very first experiment. And that’s something that I’m really excited about: getting everyone to be able to say, “I’ve run my first chaos engineering experiment. I injected failure. I was able to see the impact of it.” And that’s why I’ve got those up there, so everyone can get their first taste test before they move on and start to do it in staging, and then in production, and actually see the real value from doing failure injection for real on real services, actually triggering proper failover and the other types of actions that you want your systems to take.

Edaena Salinas: Mm-hmm. So this is like a “Hello world” of chaos engineering?

Tammy Butow: Yes. Yeah, that’s actually what I call it. When I do the workshop I say, “This is the ‘Hello, world’ of chaos engineering.” It’s, you know, a good place to start. It’s a really nice place to start.
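A “Hello, world”-sized experiment along those lines, for a throwaway host with nothing important on it; it uses stress-ng, and the 30-second and 75-percent figures are arbitrary choices:

    # Hold roughly 75% of RAM in one worker for 30 seconds, then release it.
    stress-ng --vm 1 --vm-bytes 75% --timeout 30s

    # Expected result: memory usage climbs in htop (or your monitoring) and
    # then returns to normal. Anything else (OOM kills, alerts, crashes) is
    # something you just learned about the host.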

Edaena Salinas: Okay. And earlier we talked about experiments and attacks; some people call them attacks, like you at Gremlin, and other people call them experiments.

[0:39:00]

But what essentially are the results and how are they collected?

Tammy Butow: Yeah. So it’s actually really interesting. If you use Gremlin then what happens is you trigger an attack. We actually log the progress of the attack and we think that it’s really important for your entire team to know what attacks are happening when. So you can actually see that within the UI, you can see the history of all the attacks that have been run, who they were run by, what time they were run, what host they were targeting, what tags those hosts had, what instances or containers.

And so I think that’s really important, because one of the big things with chaos engineering is you need to make it very visible; you need to let people know what’s happening when. Because what happens if you trigger some type of attack and it does something unexpected and somebody didn’t know that the chaos engineering was happening? Then they’re going to be really shocked, right? Because they would not be expecting this to happen. And you don’t want that. You don’t want chaos engineering to be like that.

It’s actually better if you let people know and you—

[0:40:00]

say, “Hey, we’re going to run these chaos engineering experiments. They’re controlled experiments. We’re picking these services to run them on,” or these containers, these instances. And then you actually let people know. So I also recommend creating a Slack channel, if you use Slack, a chaos engineering Slack channel, and automatically posting in there. We actually have a Slack integration for Gremlin, which is really cool. So yeah, just automatically post in there, and it would say, “The attack has started. Now the attack has been successful.”

Or sometimes something could happen where the attack might fail, so then you would also want to know that, why did the attack fail, and you can look into that. So that’s really how that works. And I think that that’s pretty important.
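One lightweight way to get that visibility, sketched with a standard Slack incoming webhook; the webhook URL, channel, and messages are placeholders you would create yourself:

    # Post attack start/finish messages to a chaos-engineering channel.
    notify() {
      curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"$1\"}" \
        "${SLACK_WEBHOOK_URL}"
    }

    notify "Chaos experiment starting: CPU attack on web-1 (staging), owner: tammy"
    # ...run the attack here...
    notify "Chaos experiment finished: CPU attack on web-1 completed successfully"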

Edaena Salinas: And how does the organization learn from the results of these experiments?

Tammy Butow: Yeah. So the big thing that I think is important to do is you need to have a few things in place first before you start running your chaos engineering experiments, so that you can make the most of them and really learn. So first off I like—

[0:41:00]

to start by saying what you shouldn’t do. So what you shouldn’t do is like just have someone decide to go in tomorrow and start taking down services or start taking down hosts and not telling anyone. Because that’s just like creating unnecessary confusion and it’s not controlled chaos; it’s not in the spirit of helping everyone learn together and work together and get value from it. It’s just like creating chaos in an unnecessary way.

And also it’s not good to scare people and say, “Hey, we’re going to take down the whole data center tomorrow. Your service better be ready.” And then the service teams go, “What? I’m not ready for that.” You can’t just do that; you’ll have some really big outages. That’s also not a good thing to do, because chaos engineering to me is definitely a journey, so you really need to get everybody on board and treat it like a journey, like a marathon that you’re all going on together. Or even more than a marathon, it’s like an epic trek, because it’s going to take a long time and you need to work together —

[0:42:00]

and make sure that you’re all there and on the same page.

The next thing you want to do is have a few things in place. So it’s really important to make sure that you feel pretty confident with your monitoring tooling, so that if you were below your SLA, your service level agreement, say that’s 99.99, you’d be able to know. And I think a good question to ask yourself is: do you have one dashboard that your entire company could look at to know whether you are currently meeting the SLA for your core products, the critical customer-facing ones?
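As a quick aside, that 99.99 figure translates into a fairly small error budget; a back-of-the-envelope calculation:

    # What does a 99.99% SLA actually allow? (Pure arithmetic.)
    awk 'BEGIN {
      sla = 99.99
      min_per_year  = 365.25 * 24 * 60
      min_per_month = min_per_year / 12
      printf "Allowed downtime per year : %.1f minutes\n", min_per_year  * (100 - sla) / 100
      printf "Allowed downtime per month: %.1f minutes\n", min_per_month * (100 - sla) / 100
    }'
    # Roughly 52.6 minutes a year, or about 4.4 minutes a month.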

Then the next question is: do you know what your five to ten critical services are that your customers are using? Are you able to list those out? Often one would be the database, another one might be the traffic services, say NGINX or Apache. And depending on your business, there would be other types of services, too.

So it’s important to be able to know what those are and then make sure your monitoring would bubble that up. And I recently wrote a post—

[0:43:00]

going through how to create a high-severity incident management program, which is on the Gremlin community. And I really recommend reading that as well, to make sure that you feel like you have a good incident management program before you start to do chaos engineering. Because what happens if you run some type of experiment and it causes an outage, but you don’t have a high-severity incident management program and you don’t have a way to quickly escalate issues, triage them, and make sure that they get resolved really fast? That’s why I think it’s important to have that in place before you do chaos engineering.

And it doesn’t take a ton of time to put something like that in place, you know; you can really start to put that in place within a week. You can do something like have volunteers who say, “Hey, I’m happy to be on call to help do this.” But I think it’s a good thing to really read through that, ’cause it’s quite detailed. I think it’s about 4,000 words. But I really try to explain how to build one of those programs out.

So yeah,—
[0:44:00]

monitoring is important before you get started, knowing what your critical services are is important. And then also making sure that you have a high-severity incident management program is important too.

Edaena Salinas: Is coming up with these things and defining them what you call the service level agreement?

Tammy Butow: Yeah.

Edaena Salinas: Okay. And what was the name of that blog post that you recommended?

Tammy Butow: Yep, it’s called How to Establish a High-Severity Incident Management Program. So a pretty long title. But if you look up my name, like Tammy Butow, Gremlin, and then you look up incident management or high-severity then you’ll be able to find it.

Edaena Salinas: Mm-hmm. When doing these experiments or attacks under the chaos engineering umbrella, do these normally happen during business hours?

Tammy Butow: Yep. Yes, definitely. Yeah. So when you start they definitely happen during business hours. And I think that’s really important, because the idea there is, you know, you’re coming together, you’re running things purposely in a controlled way. Often you might even start with everybody—
[0:45:00]

in the same room when you’re running your first experiments. And that’s definitely what I did when I was working at Dropbox. When I would run the chaos engineering experiments for the databases I would be in a room, I would run through my different experiments, and I would have a few other engineers there from the databases team and we would take turns. Like some weeks it would be my turn to run them, other weeks it would be someone else, and we were constantly thinking of new experiments that we wanted to run, too.

So the thing there is, you know, you’ll run the same experiments over and over, and then it’s good to automate those and say, “Okay, we’ve run this many, many times. We’re really confident with the actions that happen afterwards. Now we’re just going to automate that.” And then you could actually have it happen at nighttime outside of hours, when everybody is asleep. But at the start it’s good to do it with everyone in the same room. I do think that that’s really good.

And then you start to think, okay, what happens—what are some other types of experiments we could run? And as a team you can brainstorm those and you could say, “What would be really bad? If this happened it would be—
[0:46:00]

really bad.” And then you actually write those experiments out. We would just write them out on a whiteboard first and actually think, “If this happens, that could be quite dangerous for us, it would be quite bad,” and then go, “How can we make it so it wouldn’t be so dangerous, and make our infrastructure and services more resilient so they could handle that failure scenario?” and then actually run it. So that’s a very good way to do chaos engineering.

But often like we talked about at the start, people do chaos engineering because they’ve had some type of major outage or massive catastrophic failure and then they start to introduce chaos engineering. So that is a good way to start. So if you have an outage, one of the action items for the outage could be do the fixes and then run—actually reproduce that outage again to make sure that if it ever happened again you’d be totally fine and you would know that everything is okay and your infrastructure could handle it.

But then you start to get into the space where you’re actually being more proactive,—

[0:47:00]

which I think is really cool. But it takes time; it takes a few months to get there.

Edaena Salinas: For defining these attacks, is there a dedicated chaos team made up of people that build the system or that didn’t build the system?

Tammy Butow: Yeah. Like every company seems to do it differently. So it’s interesting, I think I’ve worked at places where there’s been a dedicated team who thinks of what types of situations to run. Most of the places I’ve worked, there are SREs who are embedded on the teams that think about what those experiments might be and then they’re the ones that usually guide the brainstorming of what types of experiments you will run, but then really the whole service team would have input and would try and figure out what types of experiments to run together.

But I do think that it’s really good to have service teams own the experiments that they’re going to run. And the idea there is that they know their service really well, right? And they’re constantly modifying their service,—

[0:48:00]

they’re trying to make it better, they’re adding new features. When they add new features then there might also be new failure modes that can crop up. So it’s really good for them to own that, I think that that’s really important.

But yeah, there are a few companies where they have a centralized chaos team that would own that initiative and roll it out across the company. Like Netflix has a dedicated chaos team where they would go across the company and evangelize and advocate for chaos engineering across the service teams. So that’s a different approach that you can do too.

And I think all approaches really work, but often it’s either a chaos engineer, where that’s their actual title and they’re the ones that do the chaos engineering, or it could be an SRE that does chaos engineering, or at Facebook, for example, you’re called a production engineer, so it might be production engineers that do chaos engineering. Or it could be a SWE, a software engineer, that really wants to build reliable systems, totally understands that it’s important, wants to make sure that what they build—

[0:49:00]

is reliable, so they’ll be the ones that decide to drive forward with injecting failure and using chaos engineering as a practice.

Edaena Salinas: And this can also depend on the resources of the organization and the size of the organization, right?

Tammy Butow: Yeah. Like the big thing is, you know, to start doing chaos engineering it really only takes one engineer. So you don’t have to have a ton of people that are driving that forward; you can start small. And we really recommend that at Gremlin as well. We talk about this idea of the blast radius. So it’s important to start small first, run one small, you know, controlled experiment on one service, then gradually expand out.

And like it’s okay to start in stage; you don’t have to start in production. Then gradually do more and more. And as like you start to show the value and the benefit of chaos engineering then you’ll see more and more people start to pick it up. That always happens. And the other thing too is to be able to show the value of chaos engineering it’s really important to collect metrics to—

[0:50:00]

tell the story before you start to do the chaos engineering and then after. And some examples of metrics that you would want to collect are incident counts.

So when I say incidents I mean not just high-severity incidents, but also alerts. So it’s really important to collect those; if you use something like PagerDuty, you can take all of your alerts that are in PagerDuty and go, “All right, in the last week we had x-hundred alerts that actually paged an engineer.” And then you could say, “We started to introduce chaos engineering and we actually had a reduction”—say there was a 20-percent reduction in alerts after a month, because of the chaos engineering.

And you could also track high-severity outages; that’s why you would want to have your high-severity incident management program. For example, the worst type of outage is a sev-zero, which is a catastrophic outage, and maybe you were—

[0:51:00]

having those quite frequently before you started to do chaos engineering. But then you would be able to use the chaos engineering to reduce your high-severity outages. And what will often happen, this has been in my experience, is you’ll have less very, very catastrophic outages, like the really bad outages that are sev-zero/sev-one, and you might have more outages which are sev-two/sev-three, so they don’t impact customers as badly, and that’s pretty common.

Edaena Salinas: Mm-hmm. And earlier we were talking about the service level agreement, and like you said, one of the main components of this is monitoring. And just now you were mentioning keeping track of the number of incident counts, tracking the severity of the outages. Is there anything else that you recommend to have in place under monitoring?

Tammy Butow: Yeah. Like the other thing you want to do is collect as much data as you can about your high-severity outages. Because really you shouldn’t be having too many of those,—

[0:52:00]

but if you read my post on how to set a program up you’ll see how to measure those, how to track those. But when you do start to capture the data on those big incidents, those high-severity incidents, what you want to do is capture the time to detection, so how long did it take you to actually detect the incident was occurring; the time to resolution, how long did it take you to fix that incident; and then also the time to prevention, how long did it take you from fixing the incident to then doing all of the action items to try and say like we want to make this incident never happen again. And then also running your game day and your chaos engineering experiments to confirm that your fixes have actually worked.

Because, you know, it’s one thing to like do fixes after an incident; it’s another thing to actually do chaos engineering and reproduce the incident and prove that if that happened again your infrastructure would be totally fine. And then the next metric is the time between failures, so—

[0:53:00]

what you want to be measuring there is if the incident actually did happen again, how long does it take for it to happen again. Is it a week? Is it a month? Is it six months? So it’s really good to just be tracking those metrics, too.

Edaena Salinas: In addition to measuring things related to the incidents themselves and the time between the failures, do we also need to be measuring, for example CPU usage or disk?

Tammy Butow: Yeah, 100 percent. Actually some people will say that you shouldn’t measure those things. I think you should. And a lot of the time it comes for free out of the box if you use a tool like Datadog. They’re going to help you easily measure CPU, disk, I/O, memory; that comes out of the box.

And the thing there is I would say you want to be measuring it and you also want to be looking at things like measure—say you’ve got 10,000 hosts, one of the handy things is to measure CPU usage across all of your hosts and look for spikes.

[0:54:00]

Is there one host that’s acting up and seems unusual or strange? In the past where I’ve worked, we would always alert on issues where it goes above, like, 80 percent, because that’s some sort of issue that’s happening, you know, that’s unexpected.

And if you feel like paging on those things would be too noisy for your infrastructure, I think it’s important to think about how you can make your infrastructure more reliable, because you’re really playing with fire if you’re letting things get to a point where they could just get knocked over because they were consuming too many resources and you’re not handling that properly. So yeah.
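A toy sketch of the kind of fleet-wide check described here: scan per-host CPU readings and flag anything above an 80 percent threshold. The function name, threshold constant, and sample data are assumptions for illustration; pulling the readings from your metrics system is left out.

```python
CPU_ALERT_THRESHOLD = 80.0  # percent; the threshold mentioned in the conversation

def hosts_to_page(cpu_by_host: dict, threshold: float = CPU_ALERT_THRESHOLD) -> list:
    """Return the hosts whose latest CPU utilisation exceeds the threshold."""
    return sorted(host for host, cpu in cpu_by_host.items() if cpu > threshold)

# Example: one host in the fleet is "acting out" while the rest look normal.
fleet = {"web-001": 34.2, "web-002": 31.8, "db-017": 93.5}
print(hosts_to_page(fleet))  # ['db-017']
```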

Edaena Salinas: What is the process of correlating an incident and what happened? For example, do you recommend looking at graphs of what was the CPU like and then maybe there’s some correlation there?

Tammy Butow: Yeah, I have a really—so like the way that I do incident management, like I think it’s very important—

[0:55:00]

to have two things. One is an incident manager on call, which is a rotation of people who are usually engineering leaders that have quite a lot of experience managing incidents. And so I was an incident manager on call at Dropbox while I worked there. And you’re really on call for every single service across the whole company. So if there is any type of major outage that impacts Dropbox then you’re going to get paged if you’re on call. And it’s a 24/7 on-call rotation with, yeah, a small number of people, five people or so.

And the thing there is I think it’s actually really important to have that rotation, because those people have had experience managing large-scale incidents before, they have really good skills in troubleshooting, debugging, thinking fast, and making sure they’ve got the right people there to resolve the issue. It also means that you’re separating out managing the incident from resolving the incident. And to resolve the incident it’s really good to have a technical lead on call. This would be actually the person who’s—

[0:56:00]

on call for the service that is currently having some type of issue.

But the thing that an incident manager is really good at is figuring out which services are having an issue. And there, yeah, you need to have a really good high-level dashboard that shows you: are we currently meeting SLA? In that dashboard it’s good to be able to show this is the health of Dub Dub Dub, this is the health of your API, and then see whether it’s below SLA or not. And it’s really clear and easy to then say—and then you go to the next level under after that. That’s what I like to look at.

You can then have other dashboards that show your critical services and you can quickly pinpoint which services are having an issue if you know what your critical services are. Because usually it’s one of those. If it’s going to be impacting users in a really big way it’s usually some type of a critical service that’s having trouble, and then you make sure that—’cause what can happen too is say a critical service has just deployed something new or some type of issue has occurred—

[0:57:00]

that’s impacted that critical service, but maybe they might not even catch it with their own alerting that they’ve built for their service; it might be something that slips past them. But the thing is if you have incident managers on call then they’re going to get paged for drops in SLA. So they are actually looking at it from the customer perspective, not from the service perspective. And their job is just to fix stuff. They’re just trying to get everything back up and running as soon as possible, and they make sure that they find the right service team that needs to fix it and they get the right people onboard.

So yeah, you don’t need too many things to be able to triage, but you do need to know: what is our SLA right now, are we meeting it, what are our critical services, which ones are having an issue? The majority of the time it’s going to be one of those that is actually having an issue. It’s pretty rare that you have a customer-facing impact and it’s something really small and obscure. Sometimes that can happen and it’s harder to debug, and that’s when you see the mean time to detection of what the—

[0:58:00]

actual incident is taking a really long time. And then you need to be able to have much better diagnostic tools. So yeah. But this is like after doing incident management for a really long time you just get really fast at it and it’s a skill that you develop.
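As a rough illustration of the first question an incident manager asks (are we meeting SLA, and which critical service is not?), here is a small Python sketch; the service names and availability targets are made up for the example.

```python
# Hypothetical availability targets (SLA) and current measurements, in percent.
SLA_TARGETS = {"www": 99.95, "api": 99.9, "metadata-service": 99.9}

def services_below_sla(current_availability: dict, targets: dict = SLA_TARGETS) -> dict:
    """Return {service: (current, target)} for every critical service below its SLA."""
    return {
        service: (current_availability.get(service, 0.0), target)
        for service, target in targets.items()
        if current_availability.get(service, 0.0) < target
    }

# The incident manager's top-level dashboard boils down to a check like this.
now = {"www": 99.99, "api": 98.7, "metadata-service": 99.95}
print(services_below_sla(now))  # {'api': (98.7, 99.9)}
```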

Edaena Salinas: You mentioned an important thing is to have incident management in place, which means to have at least one person on call, and this person has really good debugging skills, knows the system really well. How can you start training other employees so you don’t always rely on the more senior people on the team? Can you do that?

Tammy Butow: I think like the big thing there is you don’t want to have too many—it’s like you don’t want too many cooks in the kitchen. So the thing there is you actually want pretty small rotations for those incident management rotations. Having five people where they sync once a week, they make sure they understand which new services are being built, which new teams have been created. Because—

[0:59:00]

they need to have a good understanding of the services that exist, but also the teams and the engineers that are working on the teams. And that can be quite hard if there are several hundred engineers, and it becomes part of your role and your day-to-day job: you need to keep up with what’s happening across the company so that you’re always aware of and understand what’s going on.

But I think then it’s a really good opportunity for somebody. If they really want to learn to be good at that, to be really good at incident management, they could just put their hand up and say, “Hey, I’m interested in this being part of my career path going forward. I’d really like to become good at this. It’s a skill I’d like to develop.” And then they should really do some training.

So what you could do there is they could shadow an incident manager on call, they could get the page _____ alerts at the same time as the incident manager and then jump online as soon as that incident manager is working on the incidents, they can jump online right with them and then just shadow them by watching what they do.

And the best way to learn often with—
[1:00:00]

that is just by watching. It helps even more if they’re able to sit by them in the office if it’s a daytime page, but it could be in the middle of the night. And that’s how you really learn.

And for me, I started to do the high-severity incident management work really early in my career, and it was actually just because nobody wanted to do it. And I thought, “Well, this is pretty important. I actually really care that what I’m building, services and software, I want people to be able to use it and I want people to have a good experience using it.” So I felt like it was something that I just wanted to learn, I wanted to be really good at. And I wanted to reduce the number of incidents we were having, I wanted to reduce the amount of time it took to resolve incidents. I wanted to reduce just the impact to people that were using our software.

And I’ve learned a lot of things doing it, so I definitely recommend it; it’s a very good thing to learn. It’s a great skill to develop. And it’s also a really great skill for your life outside of work, because you’re able to manage these,—

[1:01:00]

you know, high-impact, they can be high-stress moments where sometimes they can take several hours. Like the S3 outage last year, I was on call there and working on that for, you know, five hours straight. So you have to be very, very good, very focused for sometimes several hours until it’s resolved. But yeah, I think it’s a good thing to do.

Edaena Salinas: Yes, and it can also help with other types of skills, like you said, how you’re reacting to failure, whether you’re having anxiety or things like that, right?

Tammy Butow: Exactly. Like you learn how to be really calm. And like my friends will always tell me outside of work, if like any type of issue happens, I’m so good at figuring out how to resolve things. So often my friends will say like, “Oh, I had this issue. I just lost my passport. I’m in this random country. What do I do?” And I’m really good at being able to think through like these are the types of different actions you need to take to be able to resolve this as soon as possible. You just get really good at that, and that’s a great life skill.

Edaena Salinas: And then you hide—

[1:02:00]

your friends’ passports to inject chaos in their lives or something.

Tammy Butow: Definitely. [Laughs] I hide their passport. No, I’m joking. [Laughs]

Edaena Salinas: Well, Tammy, thank you for taking the time to come on the show. It has been a pleasure talking with you about chaos engineering.

Tammy Butow: Thank you so much. It’s been really great to talk with you too.

Edaena Salinas: For Software Engineering Radio, this is Edaena Salinas.

Announcer: Thank you for listening to SE Radio, an educational program brought to you by IEEE Software Magazine. For more about the podcast, including other episodes, visit our website at SE-radio.net.

To provide feedback you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack channel at [email protected]. You can also e-mail us at [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons License 2.5. Thanks for listening.

[1:03:00]

[End of Audio]
