SE Radio 550: J.R. Storment and Mike Fuller on Cloud FinOps (Financial Operations)

J.R. Storment and Mike Fuller discuss cloud financial operations (FinOps) with host Akshay Manchale. They consider the importance of a financial operations strategy for cloud-based infrastructure. J.R. and Mike discuss the differences between operating your own data center and running in the cloud, as well as the problems that doing so creates in understanding and forecasting cloud spend. Mike details the Cloud FinOps lifecycle by first attributing organizational cloud spend through showbacks and chargebacks to individual teams and products. JR describes the two levers available for optimization once an organization understands where they’re spending their cloud budget. They discuss complexities that arise from virtualized infrastructure and techniques to attribute cloud usage to the correct owners, and close with some recommendations for engineering leaders who are getting started on cloud FinOps strategy.


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Akshay Manchale 00:00:18 Welcome to Software Engineering Radio. I’m your host, Akshay Manchale. Today’s topic is Cloud FinOps, and I have two guests with me, J.R. Storment and Mike Fuller. J.R. is the executive director of the FinOps Foundation. He was formerly the co-founder of Cloudability, which was later acquired by Apptio. He continued to work as VP of product and engineering for a year post acquisition and decided to pursue his passion of advancing the FinOps field as a full-time employee of the non-profit FinOps Foundation. He has worked closely with the largest cloud consumers in the world, helping them design strategies to optimize and analyze their cloud spend through technology, culture, and process. Mike is a principal engineer and has been working on cloud and FinOps at Atlassian for over 10 years. Mike’s team of data engineers, analysts, and FinOps practitioners helps Atlassian get the most value out of the money it spends on cloud. He holds nine AWS certifications and has presented at multiple AWS re:Invent and AWS Summit events on topics that include security and Cloud FinOps. Mike has served as a member of the FinOps Foundation Technical Advisory Council and is currently a member of its governing board. J.R. and Mike are both co-authors of the O’Reilly book, Cloud FinOps. J.R., Mike, welcome to the show.

Mike Fuller 00:01:38 Thanks for having us.

J.R. Storment 00:01:38 Thank you. Great to be here.

Akshay Manchale 00:01:39 J.R., maybe we’ll start with you to set the context for our episode. Can you describe what is Cloud FinOps, why it’s important?

J.R. Storment 00:01:48 Yeah, definitely. So FinOps is the practice of managing cloud spend, and specifically we’re talking about public cloud spend (big cloud providers, commonly AWS, Azure, Google Cloud). There’s the highly variable nature of cloud spend, which is that it comes up, it comes down based on usage, right? You pay for what you use. And there’s the fact that distributed decisions are now happening about cloud spend, where engineering teams can procure the resources they need instantly, they can scale up new resources, they can choose different services. Those things coming together really created a need for FinOps as a discipline, which is the practice of really understanding, allocating, and maximizing cloud spend for companies using it. And when we started talking about this a few years ago, FinOps was really limited to just a small set of smaller tech companies who were practicing it using cloud at scale, but now FinOps is practiced by kind of every major organization in the world as cloud spend has become really ubiquitous and grown in the last few years.

Akshay Manchale 00:02:52 Great. In terms of differences with respect to companies that own their infrastructure, what’s different in terms of running your business on the cloud? How do your mental or business models for finance change because of the cloud versus a traditional company that has its own infrastructure, maybe Mike?

Mike Fuller 00:03:13 Yeah, so I guess within the traditional infrastructure space, you’re kind of doing these large upfront purchases that you’re depreciating over sort of a three or a five-year period, and the variation of the spend in the data center, you’re not paying a different amount for your equipment each time that you run.

Akshay Manchale 00:03:30 So what’s different?

Mike Fuller 00:03:32 So yeah, with your traditional data center, you’re buying equipment upfront with a sort of a large upfront expenditure, and then you’re not paying different amounts for your infrastructure month by month. You’re sort of depreciating that equipment over a sort of a three or a five-year period, usually as sort of a fixed depreciation schedule. Within cloud, though, you’re buying servers at a per-second or even per-millisecond basis, which means that the amount that you’re paying for your infrastructure varies all the time. And so, the amount of compute that you’re using this second versus the next second really does mean that the variation of spend is what drives a lot of the complexity. And so, trying to apply a traditional financial model to cloud spend means that you’re sort of waiting these long periods between looking at the dollars, and a lot of variation happens in between those cycles. And so, what FinOps is trying to do is really move you into that more real-time attention to how spend is happening within your organization and getting you away from those sort of slow cadence, consistent spend financial models that have traditionally been used in the data center.
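To make the contrast Mike describes concrete, here is a minimal sketch comparing a fixed straight-line depreciation charge with a usage-based cloud charge. The function names, the $36,000 purchase price, the per-second rate, and the usage figures are all illustrative assumptions, not numbers from the episode.

```python
def straight_line_monthly_cost(purchase_price: float, years: int) -> float:
    """Fixed monthly charge when hardware is depreciated evenly over its life."""
    return purchase_price / (years * 12)


def cloud_monthly_costs(seconds_used_per_month: list[int], rate_per_second: float) -> list[float]:
    """Monthly charge under per-second billing: seconds used times the rate."""
    return [seconds * rate_per_second for seconds in seconds_used_per_month]


# Hypothetical data center server: same charge every month for three years.
fixed = straight_line_monthly_cost(purchase_price=36_000, years=3)

# Hypothetical cloud usage: the bill moves with the seconds actually consumed.
variable = cloud_monthly_costs([1_000_000, 2_500_000, 400_000], rate_per_second=0.0001)

print("fixed monthly charge:", fixed)
print("variable monthly charges:", variable)
```

The fixed figure never changes between reviews, while the variable series can swing several-fold from month to month, which is why a slow reporting cadence misses most of the movement.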

J.R. Storment 00:04:35 And what Mike hit on there in terms of real-time, I should have introduced a bit in the first part, which is: the really key thing and differentiator with cloud and FinOps as compared to other previous disciplines of managing technology is that maybe not technically real-time, but very, very hopefully close to it, a near real-time approach of getting consistent data in terms of cloud spend back to those who are responsible for using the cloud spend so that they can use that near real-time feedback to change their behavior. And that’s really a key difference is that in cloud, when you start using something you start paying for it. When you stop using something, you stop paying for it. And that, as a fundamental model, is very different than traditional approaches.

Akshay Manchale 00:05:17 I think the elasticity is really nice with the cloud where you can get things on-demand. And as an engineer, what that enables me is to experiment more than I could previously with fixed infrastructure. So, sometimes when I hear FinOps, maybe the discipline around financial spending seems like maybe there is more red tape that you have to navigate as an engineer to try something out. So, can you talk about how it can enable or how it sustains innovation while not completely running loose with respect to spending and having some sort of a framework for that? Is that possible? Can you still innovate while also having a disciplined financial operating plan for the cloud?

Mike Fuller 00:05:57 Yeah, I definitely think so. I think the whole point of FinOps is to ensure that your company maintains that freedom to innovate. If you look at it, we’ve got a tightening sort of economic outlook at the moment. Companies are going to be looking for ways to reduce spend. And by having a good FinOps culture, you’re able to work out where you are getting good value for the spend that’s being made within an organization. And so, by being able to sort of track the benefits that you’re getting from the dollars spent, you’re able to then use that metric to sort of encourage further spending in areas for the business. And also, it is really about giving confidence back to the business that the dollars being spent on innovation are returning business value.

Mike Fuller 00:06:42 And so what we see happen without FinOps being in the picture is that the freedom to innovate kind of maybe gets a little bit too wild, and wastage and unoptimized infrastructure get developed. Eventually it gets out of sync with what the business expects to be spending on cloud. And then you end up with these like halt moments or pullback from cloud moments, and that really stifles innovation. So, what we’re trying to say is a little bit of FinOps consistently over time goes a long way to enable innovation to stay within the cloud space without letting it sort of go wild and the business to get out of sync or out of touch with what the cloud spend should be. So, it’s really just finding that balance between the company feeling very confident with the spend it has going into the cloud space, enabling that innovation, but also making sure that the business feels confident that it’s being spent well.

J.R. Storment 00:07:33 And in a lot of ways, engineers are used to dealing with constraints, and this is just a new constraint that’s introduced, and it can help accelerate innovation if you get to that place where Mike’s referring to where it’s an efficiency metric, a constraint that’s introduced at the beginning. So, it can be part of design considerations. Where the innovation gets hindered, I think, is if it’s introduced too late in the process or you’re asking folks to re-engineer something to be more cost-efficient when it wasn’t a consideration early on. And so the big shift, one of the shifts we’ve seen in the last few years is the idea of cost being introduced earlier in the process so that essentially it’s enabling more cloud to happen and people thinking about cloud and cost more on the lines of what additional services, higher-level innovation can you do in Cloud that you can’t do in traditional models on-prem data centers, et cetera. Because you can use different types of services, you can procure services more quickly. And I think the big shift we’re seeing very much in that aspect is engineers now have to consider cost as a new efficiency metric and therefore operate in a new model.

Akshay Manchale 00:08:37 There is definitely a cultural and an engineering mindset change I think that is required to get this right. I think that maybe the responsibility is shared among various different functions and businesses. So, who’s involved in Cloud FinOps? Who are the people? What kind of roles do different people have to play in a sound FinOps strategy?

J.R. Storment 00:08:59 Yeah, there’s really a big mix of people. I’m fond of saying that everyone is responsible for cloud cost. Everyone does FinOps, right? And so yes, absolutely, we’re starting with the engineers who are writing code and deploying resources. You also have their now-partners in finance teams who are struggling to understand how to allocate the cloud costs. After they come in, they’re struggling to understand how to forecast them. I was on a call this morning with some folks doing FinOps for government agencies, and their finance teams are asking them to do five-year forecasts on their cloud spend in a world where they’re in the middle of migrations, they’ve got 2000 engineers working across different services. It’s almost impossible to do, right? So, finance teams are thinking about this entirely differently and have to be educated and learn and shift.

J.R. Storment 00:09:48 Traditionally, you had procurement teams who had to go buy hardware for software engineers and operations people to use. Now they’re no longer the gatekeepers of the hardware. They are trying to do deals with cloud providers, make commitments to cloud providers, buy commitment vehicles like reserved instances or savings plans or committed use discounts. So, it’s a complete change in how procurement people and sourcing people need to operate. You get into product teams who have to start thinking about their cloud costs so they can understand the profitability of their individual services, and ultimately executives. This used to be just a problem for those people off to the side of the organization using a lot of cloud. Now cloud spend is raised up to the level of the CFO, often because cloud spend is the largest variable cost for many organizations.

J.R. Storment 00:10:36 And for a lot of the organizations in the FinOps Foundation (we’re talking about like nine out of 10 of the Fortune 10 large organizations), cloud is becoming one of the biggest expenses in the technology world, after labor, right? After their people. And so really it has become this thing where it’s really everybody’s responsibility in the organization. All that being said, across all those different groups, there does tend to be a centralized enablement team, a FinOps enablement team, that is working to help all those other groups do that. So, Mike, what are you seeing?

Mike Fuller 00:11:07 Yeah, I think that in addition to those, we’re starting to see other personalities that are coming out of like your TBMO and your ITSM teams, your SAM teams. Cloud has sort of enabled engineers to quickly secure licenses straight from the cloud service provider and sort of skip the standard sort of SAM teams that were there. And FinOps is able to bring that conversation back to those traditional sort of teams with frameworks. And extending out from that, we’re seeing sustainability teams now integrating with their data and collaborating with FinOps teams to try to drive this, like, green use of cloud and trying to bring the picture of not just cost efficiency, but sustainable use of cloud. So, I think that we’re having a lot of touch points from that central FinOps team where they can enable a lot of different areas of the business with the cloud spend and cloud billing data.

Akshay Manchale 00:11:55 If I start from a company that has some small footprint in the cloud, maybe they have a roadmap to have more — in your book you talk about the lifecycle of FinOps, so maybe starting from that smallish company standpoint. Can you give a broad overview of what this lifecycle of FinOps journey looks like?

Mike Fuller 00:12:15 Lifecycle of FinOps, which is around the phases — and I think we’ll dig into those in a moment — when it comes to the growth of the FinOps team within an organization from a small business up to large, I think that really, we call that like the adoption curve of FinOps. So, start from what we see happening, especially in the smaller cloud spend companies, is what we call the virtual FinOps team. It’s like maybe a few key people or maybe one or two people in the organization that see part of their day job being thinking about the FinOps related tasks. As the cloud spend footprint gets bigger or the complexity of the amount of teams using cloud within your organization grows, then that stops becoming sort of a side gig for someone in the organization, starts becoming very important to have someone there constantly thinking about and driving what FinOps looks like within your organization.

Mike Fuller 00:13:03 So you start to see that FinOps practitioner role fully evolve into a dedicated role within the org. And then as you’re really getting to the large scale, especially when there are globally distributed teams, you end up with these full teams of FinOps members that are sort of distributed, with different sort of capability sets. So things that we see often are things like data engineers and analysts, and the FinOps practitioners that are just really sort of focused in on the individual billing elements of cloud, collaborating with those finance partners and engineering partners. J.R.?

J.R. Storment 00:13:35 Yeah, and I think we see it come from two waves. There’s a bottoms-up approach, which is very much, I know Mike, that’s how you started. So, I was just talking to one of our members who’s a software engineer or SRE specifically who started in that ‘I see there’s a problem here. I need to go solve this. I want to help the organization get better.’ And typically, that lifecycle starts there, and those teams work on putting in basic FinOps practices around visibility and allocation, but it doesn’t really all come together until you get the combination of also a tops-down executive support mandate. That can come from the technology leader and ideally should, because the CTO or CIO has the teams who are responsible for the cloud spend. But it also typically needs to be a partnership with the finance leaders we talked about, who need to help drive the importance of the overall company numbers and the budgets and the margins and those areas.

J.R. Storment 00:14:16 But ultimately, I mean, where it heads to over time is really enabling more of that data-driven decision making by all these teams that we talked about so they can make better decisions every day. And a big part of that really is getting buy-in across the organization that cost is important. And I think that’s kind of the journey we see folks go on. There’s all the capabilities that lead up to good FinOps, right? There’s visibility, there’s allocation, there’s usage optimization, rate optimization, but it really means that everyone needs to start thinking about it in a new way.

Akshay Manchale 00:14:55 So let’s start with the initial part of the FinOps journey where you really want to understand what’s happening in your organization, where are you spending the money? So, how do you get started with that? What models do you have to understand the Cloud spend and to analyze where you’re spending money, where you shouldn’t be spending money? How do you get started on that journey?

Mike Fuller 00:15:17 That then connects us to the FinOps lifecycle, in which effectively we have sort of three phases. There’s the inform phase, which is, I like to sort of think about this as putting the thumb tack on the map about where you are today. And then you have the optimize phase, which is really sort of figuring out what are those paths on the map that we could go down? Like, where could we optimize to get to a better position as far as our cloud efficiency goes? What are those paths that are available to us? And then let’s set some goals on which paths we want to take ourselves down. And then the operate phase, that’s the actual driving down the journey, taking that pathway. We’re going to put things into action: look at automation, implement tools, AI/ML tools, or automation tools, or cloud vendors’ tooling in order for us to start to move towards that more cloud-efficient world.

Mike Fuller 00:16:07 And then we loop back around and go back to inform, to really check where we are. Are we actually progressing down the pathway we expected to be? Has anything sort of thrown us off course, like anomalous spend coming up, or unexpected extra items that have sort of thrown a spanner in the works and sort of taken us off path? And then we just continue this process of looping around for each of our individual FinOps capabilities in order to sort of measure where we are, set where we want to be, and then implement change to sort of head towards that pathway.
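The "check where we are" step Mike describes can be sketched as a simple anomaly check on daily spend. This is only one possible approach, assuming a trailing-average baseline; the function name, the 7-day window, and the 1.5x threshold are hypothetical choices, not something named in the episode.

```python
from statistics import mean


def flag_anomalous_days(daily_spend: list[float], window: int = 7, threshold: float = 1.5) -> list[int]:
    """Return indexes of days whose spend exceeds threshold x the trailing-window average."""
    flagged = []
    for day in range(window, len(daily_spend)):
        # Baseline is the average of the `window` days immediately before this one.
        baseline = mean(daily_spend[day - window:day])
        if baseline > 0 and daily_spend[day] > threshold * baseline:
            flagged.append(day)
    return flagged


# Hypothetical daily totals in dollars: steady spend with one spike on day index 8.
spend = [100.0] * 8 + [200.0, 100.0]
print(flag_anomalous_days(spend))
```

A flagged day is a prompt to go back to the owning team with a question, not an automatic action; the threshold would need tuning for workloads with legitimate peaks.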

J.R. Storment 00:16:36 The first thing in there, honestly, in that whole process is just starting to look at the cloud spend data regularly, and one of the first things that companies see when they start to look at it is that it’s not always clear what the money is going to, and it’s not always clearly aligned against how the business is structured and how the business is doing reporting elsewhere. Hey, we’ve got this set of Amazon accounts or this set of Azure subscriptions, and they were set up by an engineering team maybe when something was in a staging or development environment and now they’re supporting some set of production spend too. And how do we start to tear apart bits of the infrastructure and make sure that we can show it back to the people who are using it to drive accountability, to drive that decision making?

J.R. Storment 00:17:22 And so, one of the first important steps of actually, I think, doing FinOps is starting to get into a tagging strategy, an allocation strategy, an account strategy. Something that says, let’s agree that this is how we’re going to split out our cloud spend so that we can then start to build on top of that to get the rest of the visibility we need before we even start to think about higher level functions like optimizing spend, before we get way down the road to unit economics and those areas. And I think it comes up, I feel like a broken record, but we can’t say enough, like, this whole practice is really not about spending less, or saving money, or optimizing spend. I mentioned earlier, I was on a call with this government group and they were pushing back saying, well, we’re not really looking at right-sizing or savings or these things.

J.R. Storment 00:18:09 We really just need to charge back and showback. I mean, that’s really that first stage that majority of people find themselves at, which is how do we understand the spend, how do we get visibility into it? And then as you grow, and you may have, as Mike was saying, millisecond billing or per-second billing across thousands or millions of resources. Then how do we keep a strong policy and governance and strategy in place so that that spending, as it scales, is still — not controlled, right? Because we don’t want to control and lockdown spending to the earlier point of not hindering innovation, but that there’s a strategy to keep it allocated and keep visibility in place as it gets to be really big.
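The allocation step J.R. describes can be illustrated with a minimal sketch that rolls billing line items up by an owner tag. The record shape, the `team` tag key, and the `unallocated` bucket are assumptions for illustration; real provider billing exports have their own schemas.

```python
from collections import defaultdict


def allocate_by_tag(line_items: list[dict], tag_key: str = "team") -> dict[str, float]:
    """Sum cost per value of `tag_key`; untagged items land in an 'unallocated' bucket."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "unallocated")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical line items, loosely modeled on a detailed billing export.
items = [
    {"resource": "vm-1", "cost": 10.0, "tags": {"team": "checkout"}},
    {"resource": "vm-2", "cost": 5.0, "tags": {"team": "search"}},
    {"resource": "bucket-1", "cost": 2.5, "tags": {}},
    {"resource": "vm-3", "cost": 4.0, "tags": {"team": "checkout"}},
]
print(allocate_by_tag(items))
```

Keeping an explicit unallocated bucket makes gaps in the tagging strategy visible, and the same per-team totals can feed either a showback report or a chargeback to team budgets.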

Akshay Manchale 00:18:45 You mentioned chargeback and showbacks to understand what’s happening with your cloud bill. So, do you just get a large PDF with your spending maybe right now, but you want to, like, improve on that process? So, can you describe how you improve on that process from just having one large account maybe and one bill to getting this chargeback or showback sort of a model to understand your cloud spend?

J.R. Storment 00:19:09 Yeah, so the data that you’ll get (and Mike, I’ll ask you here in a minute to get your take on how you started with it), you say PDF, and yes, that may be your invoice, which is the PDF of billing at the end, which is going to be rolled up to a service level and maybe some number of dozens of pages. But the underlying data, the granular detail of the individual resources (in Google, it’s the BigQuery export, and in AWS, it’s the Cost and Usage Report), that data can be hundreds of millions or billions of individual charges for a large company at scale in cloud. So, the challenge steps back a bit to not even be can you start to understand this, but can you even open and use the data? Like, a lot of companies need to use the cloud in order to open the cloud bill in order to get insights out of it. So, how did you all approach it, Mike, as you started getting into that more detail?

Mike Fuller 00:19:58 Yeah, when I started, the cloud bill for us was a couple of million lines, and today we’re in the billions of lines. And so, the complexity of that is we’ve been asking for more detail as practitioners over the years because we need more detail in order to get the granularity to understand the cost and also to allocate those costs out to teams. That’s sort of a double-edged sword there where we ask for the detail, you get the detail, the detail then adds to those billing lines and really expands out the breadth and depth of the data. But the benefit of having that level of detail is you are able to individually identify nearly every cloud resource that’s costing you money every hour and allocate it through to teams.

Mike Fuller 00:20:41 Now, there’s a decision between chargeback and showback, and this seems to be really driven by the choice of the company. Some companies like to have a central budget of IT, and so they really do charge it to a central location. And the benefit there would be for showback to be able to allow the teams that are causing that spend to be able to still see the impact of their decisions without having to worry about distributing budgets out to teams that previously hadn’t had that experience. For companies that have already got distributed budgets across the organization, then it becomes really important that they get the costs from this central bill in the PDF, as you put it, out to those teams’ budgets so that they’re actually reflected at the right places on the P&L. So, really, the decision between chargeback and showback is an org-level, usually finance team-level, decision.

Mike Fuller 00:21:29 But the value you get from either is the same thing, which is that the teams that are actually driving the spend are able to actually see that and put that back to those decisions they’re making. Is that cost increase from the change we made last week actually worth that much money or should we be thinking about rolling it back, or does that actually land where we thought it was going to land in costs? So, there’s a whole pile of next-layer challenges you get with things like shared services and that, but showback and chargeback is very key to surfacing the cost to the right teams at the right time.

Akshay Manchale 00:22:04 Do you have an example where either chargeback or showback has resulted in better utilization that you can share?

Mike Fuller 00:22:11 Yeah, so I think we talk about the Prius effect. So, effectively the idea here is that when you surface the costs to a team, they naturally want to optimize that anyway. So, the Prius effect sort of talks about the transition from the 1970s car that’s guzzling gas down the freeway; in order for you to figure out the efficiency, you kind of wait until you run out of petrol and then you can back calculate the miles that you got on the tank. You go to a more modern electric vehicle that tells you that immediately on the dashboard, sort of the exact amount of kilowatts that are coming out of the battery. And what usually tends to happen is people just drive more efficiently just because they’re being made aware of the impact of their driving. And it’s easier for them to make those choices as they’re making it.

Mike Fuller 00:22:55 Do I really want to put the foot down because I’m running late, or am I okay to take a second longer and be more efficient? And so, the same sort of thing happens with engineering teams: if they’re aware of the cost impacts of those decisions, then naturally they will adjust the decisions they’re making and correct the decisions they’ve made based on that feedback loop. And so, the tighter you can get that between the decision they’ve made and the time they’re able to be informed on it, the quicker they can identify the driver of the cost changes and adjust as needed. And so, yeah, we definitely feel like this is one of the real key points for that near-real-timeness, and why having a chargeback or showback model combined with that fast feedback loop really helps with driving efficiency for engineering.

Akshay Manchale 00:23:38 Do you think that sometimes if you react too quickly to seeing what your spend is and maybe you reduce your instance size, or maybe you make some modifications, does that lead to problems with respect to peak periods where you have massive loads? How do teams deal with that? Or when you start showing teams what they’re spending, how do you prevent them from taking actions that might be detrimental to the business itself?

Mike Fuller 00:24:04 I think this comes down to cloud experience. Like, your engineering teams will need to understand their workloads in cloud: not just the point in time, what the workload looks like right this hour, but what it looks like over time. And I feel like this is just a standard SRE reliability problem. Even if you take cost out of the picture, they’re going to look at ways that they scale up. Are scale-up times fast enough? Do they have enough capacity to be able to scale up to the right sizes for the peak workloads? Really, I think this is that sort of handhold where it is on the engineers to help us pick where they can be more efficient, but also it’s on them to sort of balance between good and fast and cheap. So, we call this the iron triangle, and it’s effectively like you can spend more and make a really good, really fast, really reliable service or you can spend less and make some compromises in those elements.

Mike Fuller 00:24:55 So you might be having only two AZs, not three, or you might be only running in one region and not two. And it’s really on each of those services that you’re running for your business to balance which of those you’re going to invest heavier in and which ones are actually okay. Does it need to be fault-tolerant or can it just be highly available because it’s an internal service or a lower-tier service? So that balance really is up to engineering teams, to learn this experience and make sure that they’re thinking about that balance between good, fast, and cheap.

Akshay Manchale 00:25:23 There’s a nice intersection with SRE performance and all of that. Like you said, I want to talk about the organization structure and what your business actually sells, right? So, you might have a product that is running, and you have to have infrastructure to run that. Maybe you’re a company that has multiple product lines that have shared services. How does the chargeback / showback model work in that sort of a model where you have different products that have their own infrastructure requirements and you also have these shared services that different products end up using, right? Maybe there’s a log storage thing or something like that. So how do those companies account for those shared costs, and how does the chargeback or showback model work in that sort of a situation?

Mike Fuller 00:26:05 Yeah, so I guess with the difference between a single-product company and a multi-product company, I think that’s a nice banner to put over top of things. But the reality is, as more and more engineering moves towards microservices, even the single-product teams have many different services running internally. Some of those will be to support the internal reporting, like you say, logging, observability infrastructure. Some of them will be there to run the actual production service you’re offering. And then in the multi-product companies, you have multiple products that are aligned and sharing services between them, or you might have actually separate arms of the business, one running the distribution center and one running delivery centers, and stuff like that. So, the idea that there are some companies out there with just the simple one thing in the cloud, it’s only ever short-lived because as you do more and more of your business in Cloud, you end up with this sort of complex mix of things.

Mike Fuller 00:26:54 And then, so that then gets us to the whole shared resources or shared services. And so, we kind of, I feel, put them in two separate buckets. There’s like a single cloud resource that’s being shared and a group of cloud resources being shared. And more often than not, you end up in the group. So, if you think of something like a Kubernetes cluster, you end up with more than one instance, many storage volumes, load balancers, you have all these sort of cloud resources. So, the first part of shared-service cost reporting is to first identify what is the thing you’re sharing as a set of cloud resources. And so that’s where your tagging and account strategies are going to come in, making sure you’re putting all the things in a project or in a subscription in order for you to identify the thing you’re sharing amongst your teams.

Mike Fuller 00:27:38 And then, the thing the cloud service provider can’t give you in the billing data is what is on those shared resources. So, within a Kubernetes cluster, who are the teams that are running particular workloads? And so, that’s then when you come into what we call the proportion data. And so, this is something you collaborate on with the teams that are running those shared resources or shared services, to try and get that extra layer of data that you need from them to understand which teams are using how much of that shared resource over time. And then you put the two pieces together: the cloud bill, with the collection of cloud resources identifying what you’re trying to share, and that proportion data, saying which teams to give how much of the infrastructure cost to. And by doing that, you’re able to then take a large cloud spend that is being shared across potentially hundreds of teams and actually break that out and show those teams, or even charge back to those teams, the individual cloud costs for those resources.
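
The combination Mike describes, a shared cost from the bill plus proportion data from the team running the shared service, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the team names, and the numbers are hypothetical, and CPU hours is just one possible usage measure.

```python
def allocate_shared_cost(total_cost, usage_by_team):
    """Split a shared cost across teams in proportion to their measured usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("usage must sum to a positive number")
    return {
        team: total_cost * usage / total_usage
        for team, usage in usage_by_team.items()
    }


# Hypothetical inputs: the cluster cost comes from the cloud bill, while the
# per-team CPU hours are the "proportion data" supplied by the cluster owners.
cluster_cost = 1000.0
cpu_hours = {"checkout": 600.0, "search": 300.0, "reporting": 100.0}

shares = allocate_shared_cost(cluster_cost, cpu_hours)
for team, cost in shares.items():
    print(f"{team}: {cost:.2f}")
```

The same split can feed either a showback report or an actual chargeback; only what happens with the resulting numbers differs.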

J.R. Storment 00:28:32 And this shared service issue is, it’s one of the main challenges that we see out there in terms of getting to the full allocation, getting to forecasting and getting accountability out there. Mike mentioned Kubernetes as an example. It’s not just shared services, it’s layers of virtualization, right? That are being split out, and cloud is one layer of virtualization. Kubernetes is another on top of that. And so, we do see that as being a big hole in early-stage practices, which is not properly accounting for those. And then ultimately something that a lot of effort is put into, I think you talked about the proportional split of the spend. I mean, ultimately where that allocation of spend is heading toward is also then measuring the proportional output that each one is getting, right? Like, what activity is it actually generating ultimately, maybe even what revenue is it driving?

J.R. Storment 00:29:18 And that’s kind of the whole point of this whole practice, right? It’s, okay, so let’s not just look at what we’re spending in cloud, but let’s look at the value that’s coming back out on the other end from that shared service, so that we can start to make some trade-off decisions. Mike mentioned the iron triangle: spending more isn’t necessarily a bad thing if you are trying to get better performance, if you’re trying to drive more customers, if you’re trying to deliver more features and more innovation. So, getting that shared services piece in place, or the reporting of it, is really key to getting the larger bit in place for allocating and showing back to get to that unit economic view of things.

Akshay Manchale 00:29:54 You both mentioned tagging and accounts. So can you maybe dig into that a little bit to say what options are available to be able to group your utilization into your business unit, into your products, into your teams, et cetera? How do you enforce that? How do you set that cultural expectation, maybe, to be able to look at a later point and attribute your resource utilization?

Mike Fuller 00:30:16 Yeah, so I guess the most granular weapon in the process there is the account-level hierarchy (so, your accounts, your subscriptions, your projects, that sort of larger cloud resource, like a cloud account that you’re putting your resources within), and so, you’re wanting to sort of define a strategy around those that allows you to, sort of, at a granular level understand the infrastructure that goes inside. So, the obvious ones are dev, stage, and prod, right? You’re not sort of mixing the three together in one project or whatever. Then within that, you’ll probably end up with multiple of something. So, two different services, or four or five shards of a single service. You want to be able to identify which of those cloud resources are part of which piece. And so, that’s where your tags and labels come in to sort of start to separate the two.

Mike Fuller 00:31:05 With tags and labels, you’ve got a lot more granularity. You can put a lot of those on cloud resources, obviously with that balance of trying to get them done properly by your teams, which we’ll get to in a second. That tag and label really is the more granular piece. So, you’re taking that large, real coarse-grained account level and then getting down to those granular level tags and accounts, and then the combination of the two. Lastly, there’s what we call a synthetic tag. So, when we get the cloud bill in, we’re using detail in that cloud bill with some external lookup, some extra piece of data. And so, this is where we’re getting a lot of the CMDB-type lookups happening, where we’re bringing in extra information we know, so we’re using one key piece of information, maybe like the service name of a cloud resource, to pull in extra data after the fact into the cloud bill.
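
A synthetic tag of the kind Mike describes can be pictured as a join between bill rows and an external lookup. The sketch below assumes a hypothetical bill layout and a hypothetical CMDB keyed by service name; none of the field names come from a real billing export.

```python
def add_synthetic_tags(bill_rows, cmdb, key="service_name"):
    """Enrich bill rows with extra fields looked up from an external source."""
    enriched = []
    for row in bill_rows:
        extra = cmdb.get(row.get(key), {})
        # Values already present on the bill row win over looked-up values.
        enriched.append({**extra, **row})
    return enriched


# Hypothetical CMDB extract, keyed by the service name found on the bill.
cmdb = {"payments-api": {"team": "payments", "cost_center": "CC-100"}}

bill_rows = [
    {"resource_id": "i-001", "service_name": "payments-api", "cost": 12.5},
    {"resource_id": "vol-002", "service_name": "unknown-svc", "cost": 3.0},
]

enriched = add_synthetic_tags(bill_rows, cmdb)
for row in enriched:
    print(row)
```

Rows with no match in the lookup pass through unchanged, so they can still be reported as unallocated later.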

Mike Fuller 00:31:49 And that gives you even more granularity to the details about individual cloud resources. The trick there, obviously, is to make sure that your teams are building their resources in the right cloud accounts and are tagging and labeling, which you mentioned. And so, I guess there are three approaches that we see to that. There’s the control of what can be done, and we’re starting to see the CSPs, or the cloud service providers, now offer more features around preventing the creation of resources without the right tags and tag values. Now these are not perfect; it’s not a silver bullet today. There are only certain ways you can describe particular characteristics of that, and potentially that would impact existing running deployments, et cetera, in your environment. So, it’s not an easy switch to turn on, but it is a great area for you to start to put some controls around tagging and labeling.

Mike Fuller 00:32:39 After that, you have the remediation-type tools that come in, and they might stop cloud resources if they’re not tagged correctly, or we see that going towards deletion of cloud resources, and that’s really a hard one for some companies to take on, especially if it potentially could delete real production stuff. And so, there’s an adoption curve of that sort of control-after-the-fact. And then lastly, there’s the softer approach, which is just to at least report back to teams where they are tagged correctly and where they’re not tagged correctly, and to have allocation methods that handle untagged resources, so that you know, when it’s not tagged, this is how it’s handled, and that someone will ultimately end up responsible for it. When you end up with no sort of remediation action at the end, you end up with a lot of costs that just fall in a hole where no one’s paying attention to them. And that’s worse than just having a fairly crude allocation method that would give it to somebody to care about. Usually once you tell somebody that they’re getting cloud costs that are not theirs, they want to figure out whose they are and get them allocated properly. So, you’re sort of distributing that pain point of unallocated costs.
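
The reporting approach, with a crude fallback so that untagged cost still lands on somebody, might look like the following sketch. The fallback rule used here (charge untagged resources to the owner of the cloud account) is an assumption chosen for illustration, as are the resource records and names.

```python
from collections import defaultdict


def allocate_by_tag(resources, account_owner, tag_key="team"):
    """Total cost per owner, falling back to the account owner when untagged."""
    totals = defaultdict(float)
    untagged = []
    for res in resources:
        owner = res.get("tags", {}).get(tag_key)
        if owner is None:
            # Crude fallback: nothing is left without a responsible owner.
            owner = account_owner.get(res["account"], "unallocated")
            untagged.append(res["id"])
        totals[owner] += res["cost"]
    return dict(totals), untagged


# Hypothetical resources and account ownership.
resources = [
    {"id": "i-1", "account": "prod-shop", "cost": 40.0, "tags": {"team": "checkout"}},
    {"id": "i-2", "account": "prod-shop", "cost": 10.0, "tags": {}},
    {"id": "i-3", "account": "sandbox", "cost": 5.0},
]
account_owner = {"prod-shop": "platform"}

totals, untagged = allocate_by_tag(resources, account_owner)
print(totals)
print("untagged resources to report back:", untagged)
```

The list of untagged resource IDs is what would go back to teams in a tagging-compliance report.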

Akshay Manchale 00:33:47 I guess the clarity around your spending improves over time to a point where you have very fine processes in place to always attribute correctly and catch attributions that are not present for whatever reason. I want to just dig into one other point you mentioned, which was the challenges with tagging and labeling with respect to virtualization, when you have another layer of virtualization like Kubernetes that’s orchestrating your containers, running on different machines that are already virtualized. Can you talk a little bit about what the challenge is there, and what do you recommend that teams and companies can do in those situations?

Mike Fuller 00:34:22 The funny thing with Kubernetes is it’s kind of a cloud on top of a cloud, and you end up applying all the same FinOps practices again to the layer above. And so, you start to then think about namespaces just like you thought about cloud accounts. Your projects and subscriptions, that’s your namespace. And within the namespace you can then label the pods, and they’re kind of your parallel to the tags that you had within the cloud resources above. And so, you can kind of apply yet again the same coarse- and fine-grained allocation strategies that you had at the raw cloud account into the Kubernetes environment. I guess a well-written tagging and labeling strategy or policy within your organization can almost be just completely translated straight one-for-one into Kubernetes land, and then you can apply very similar sorts of things. You can prevent pods from starting unless they’ve got the right labels, and you can stop or terminate pods that don’t have the correct labeling. So, you can almost apply all the same strategies that you had at the cloud layer again under Kubernetes. When you go away from something like Kubernetes into other things like EMR jobs or shared database layers, potentially similar sorts of things, but you might be able to tag and label or to namespace particular things.

Akshay Manchale 00:35:33 The metering of your usage could be done in one way on the cloud provider, whereas Kubernetes might meter and tell you how much you’re using for a particular pod, container, whatever, in a different way. So, does that translate well when you apply the same tags and labels across both sides?

Mike Fuller 00:35:52 Yeah. So, I guess the tags and labels really just identify the workload itself, whether it’s raw on the cloud account or within something like a Kubernetes pod, to figure out which is which. The metering, which is important for your proportioning, is really driven by the workloads themselves. Some might be very CPU-intensive, and so you’re really wanting to use that measure as far as what is the impact. So, if you are finding that you’ve got very low CPU pods, but they’re using a lot of memory, that’s probably going to be your driver. The more of those you need, the more underlying cloud resources need to be provisioned. So, it’s really trying to identify what is the driving element of the workload and then using that to measure. We’ve seen workloads that are driven mostly by network activity: large amounts of network activity, very little CPU. And so, for those, you’re starting to look at how much of the network capacity it is using as a proportion of the cluster. And so, yeah, I think you can’t have one sort of silver bullet that does them all. It’s really looking, workload by workload, at what is the driving element.

J.R. Storment 00:36:53 The principles are the same within a container world. In some ways it’s like another little FinOps cycle that happens within containers, right? You asked about the allocation strategies. Instead of allocating to tags, instead of allocating to projects in Google and those things, you are allocating to namespaces. And you’re looking at labeling within the container system, but then you also have the same issue (we talked about shared costs) of allocating out not only the use of the containers themselves, but also the unused portions of it, right? And you add another layer of right-sizing that has to happen within the containers themselves, within the cluster, and in what hardware that cluster is sitting on, and there are sort of recursive layers of this, and that becomes an initial organizational challenge because again, you’re distributing duties between different people, right?

J.R. Storment 00:37:42 There might be one group managing the containers and the cluster, another managing infrastructure, in some cases another who’s managing the commitments to the cloud providers that are running the resources that are running all these things. And so, you’ve got to loop through all of those, right? And sort of start at, again, where is the spend going within the containers? And then, are we using the right amount of it? And then, can we get a better rate for that which we are using? And then be constantly communicating that out in that real-time feedback loop so that you’re not doing these big, ‘hey, we see spend is really off and we need to cut down and make changes.’ You’re doing constant iterations, to Mike’s earlier point, continuing the cycle constantly and regularly.

Akshay Manchale 00:38:20 I want to switch gears to what comes after being in this informed phase where you have a mature information-reporting strategy where you can chargeback, showback to various of your business units, engineering teams, what they’re using. So, the next real thing is optimization. And in the book, you talk about two different levers that are generally available for optimizing. Can you talk about what they are? And then we’ll dig into them.

J.R. Storment 00:38:46 Yeah, so the two levers are essentially using less, which is your usage-optimization lever, and then paying less, which is your rate-optimization lever. And there’s no magic here. Kind of like everything else, there’s a usage quantity and a rate quantity, but what’s different in FinOps world is really I think who is responsible for those and the fact that it is so distributed. Historically, folks, when they’re starting to jump into FinOps, they think pretty much immediately, you had mentioned it earlier, right-sizing, right? I want to right-size the thing to do the job that is needing to be done. That’s a really important lever to pull in terms of, say, a compute resource and getting the right size instance against that.

J.R. Storment 00:39:33 But you kind of want to get back before that to say, ‘hey, how do I just turn off things that aren’t being used at all,’ right? There’s shutting those down. That type of work, that usage optimization work, really is best done by the engineering teams and the operations teams who are responsible for the infrastructure and understand how those changes may impact performance down the road. So very much in FinOps, we want to push the data out to those teams to think about where they could potentially use less. On the other side of that, there’s how do we pay less for what we have used, or are using? And that rate-optimization piece tends to live, in the more advanced practices, within a centralized FinOps team that is looking across the entire cloud estate, the entire cloud infrastructure, to figure out how to make broad commitments to a cloud provider.

J.R. Storment 00:40:20 Hey, we think we’re going to use about this much in the coming year, so we’re going to work with a cloud provider around a commitment there, or, in the microcosm, looking to say, hey, this team is using this much of this particular resource, let’s make a one- to three-year commitment via a reserved instance or a savings plan. And what’s interesting there about that second lever is that we do see it centralized very often, because the teams who are deploying resources and writing code are trying to right-size or optimize the usage; they’re not often thinking about those financial commitments to the cloud provider. And frankly, sometimes they’re afraid to say, yeah, I’m going to use (I’m doing air quotes) “this resource for the next year or three years” because they’re looking at cloud as something where they may want to use the latest resource type, and they may want to use a new service that’s come out.

J.R. Storment 00:41:04 Whereas a central team who’s responsible for the entire organization’s spend can say, ‘yeah, generally as an organization we’re using this many thousands or millions of dollars a month within this cloud provider across lots of different teams that may be changing what they’re doing constantly.’ So, we’re comfortable as a larger organization committing to this amount of this type of resource or this amount of spend. And so, those two things obviously are very closely interconnected. As you get into things like forecasting of spend, it gets even more complicated because you have to start to think about scenario modeling around not only what you’re using, but if this team makes this optimization to the usage, how does that affect our commitments to the cloud provider? And those commitments affect the rates, which obviously affect how much you’re going to spend over time. So, it’s a fine dance between the two.

J.R. Storment 00:41:52 One of the things that we often get asked as well is which of those levers you should pull first, right? Should you optimize the usage, or should you optimize the rates? And people commonly say, well of course you want to optimize the usage. I want to turn off things I’m not using, I want to size down things that are too large. Why would I make any commitments to the cloud provider before I do that? And unfortunately, the stark reality of the case often is that those changes to the infrastructure can take a lot of time and a lot of effort by engineering and operations teams. And often the right approach is really to start making commitments while you’re optimizing usage, right? To commit to reserves, savings plans, those things, so that in parallel you can be optimizing your usage efficiency while you’re getting better rates for what you’re already using.

Akshay Manchale 00:42:40 In terms of right-sizing, there are these developments with serverless offerings where you’re constantly only paying for what you use rather than predicting what you might need on a box and then using it mostly at some fixed capacity most of the time. So, how does that impact your general forecasting or understanding of cloud spend? Does it make it easier? Does it complicate things for the FinOps journey?

Mike Fuller 00:43:07 Some people see serverless as some, like, solution to not needing to right-size. And the reality of it is that you’re still sizing your serverless whether it’s a function as a service, you’re picking a memory or CPU commitment that you’re getting for that function, or if it’s a serverless-based container or orchestration like EKS or any of the Kubernetes or ECS services, you’re sizing those pods. And so, you’re making some form of size commitment even with the serverless, in most cases. And so, you didn’t really avoid the right-sizing, you’ve just changed sort of how it’s sized from being a size of an EC2 instance or VM to really the size of the serverless commitment you’re making. I think once you — often the sort of right-sizing for serverless is put on the lower priority because for the vast majority of cloud spenders, it’s VM instances, managed databases, and object storage that are like your three big items.

Mike Fuller 00:44:03 And so, serverless usually is way down the list as far as cost goes on your bill. But I think that as we see more and more companies lean heavier into serverless, it will start to become more of a cost item on their bill, and right-sizing serverless workloads will start to become way more popular because it becomes more and more predominant in their cloud bill. And then, as far as predicting though, I think the good thing about serverless is you run a lot of it. Like, your big cloud spenders would be running tens if not hundreds of thousands of EC2 instances. When you run enough of them, you end up with this sort of very stable base load amount; with serverless, because it’s such a small individual element, even small cloud consumers using serverless use a lot of serverless.

Mike Fuller 00:44:46 And so, you end up with that base load very quickly anyway. And so, predicting the workloads on serverless I think will become easier because you’ll end up with lots of teams doing lots of different things with serverless, but in aggregate a fairly stable base load. There are going to be huge spikes at periods, but you kind of end up with this: if you think about a sine wave and then a second sine wave that is phase shifted, while one team is using a lot, another team is using less, and then while they’re using a lot, the other one’s using less. And so, you sort of end up with this noise all canceling out, and this sort of hum, if you will, of the cloud spend ticking along. But yeah, I think it will be interesting, as more and more teams become more cloud-native and adopt more and more of these serverless things, to see how that story fleshes out over the next couple of years.
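
Mike’s phase-shifted sine wave picture can be checked numerically. The sketch below is an idealized case with made-up numbers: two teams whose hourly spend follows the same daily sine curve, exactly half a cycle apart. Real workloads will not cancel this cleanly, so treat it as an illustration of the tendency rather than a forecast model.

```python
import math


def team_spend(hour, phase, base=100.0, swing=50.0):
    """Hypothetical hourly spend following a 24-hour sine curve."""
    return base + swing * math.sin(2 * math.pi * hour / 24 + phase)


hours = range(24)
team_a = [team_spend(h, 0.0) for h in hours]
team_b = [team_spend(h, math.pi) for h in hours]  # half a cycle out of phase
combined = [a + b for a, b in zip(team_a, team_b)]

# Peak-to-trough spread: large for one team, near zero for the aggregate.
spread_a = max(team_a) - min(team_a)
spread_combined = max(combined) - min(combined)
print(f"team A spread:   {spread_a:.2f}")
print(f"combined spread: {spread_combined:.2f}")
```

Each team on its own swings widely over the day, while the sum stays essentially flat at the combined base level.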

Akshay Manchale 00:45:34 Yeah, that makes sense about how they might normalize out depending on time of day and different services. We talked earlier about different organizations having shared services, or rather, everyone’s looking at microservices and how lots of resources are in fact shared. In terms of rate optimization, how do you go about getting the number of reserved instances, or getting a committed usage discount from a cloud provider, and then applying it back into the actual products or the actual needs of the infrastructure? How do you map that into individual products and resources?

Mike Fuller 00:46:10 Yeah, this is an interesting challenge.

J.R. Storment 00:46:12 It’s different at various stages of maturity, right? So early on, the challenge when you’re starting a FinOps practice is you’re beginning to make commitments to the cloud provider for reserved instance capacity or savings capacity. And it’s a small percentage of your total spend. And what companies end up with is a big deviation between the on-demand spend and what you’re paying for that, and the covered or reserved spend and what you’re paying there. And one of the challenges across most of the cloud providers is you generally can’t really dictate where those savings are applied. They’re applied to a certain type of resource in a certain region, but you can’t say ‘this particular resource,’ and they’re not really meant to be tied to an individual resource. They’re meant to be tied to a type of spend. And so early on there can be big differences in how an individual team sees their spending, because they’re either getting that discount or not.

J.R. Storment 00:47:07 Longer term though, as you get to a very large scale of good coverage in those areas (let’s say 90% of your coverable resources are covered), the challenge changes a bit, which is that instead of having to figure out who’s going to get it and who’s not, you’re trying to figure out whether you can cover that last remaining bit to get to the next level of discounting. And so yeah, I mean it’s a hard problem. You’ve been through years of it, Mike. How have you sort of seen it evolve?

Mike Fuller 00:47:34 Yeah, for the most part I think that, as you say, the higher-level coverage gives us a fairly stable rate that teams can then sort of rely upon. But there are times where particular workloads really kind of miss out on being the one that gets covered. And that’s often the case with things like savings plans. They often want to discount the highest-return items first. So, you’ve got teams that are kind of running those smaller-size instances, potentially in the more common regions. And so, they end up kind of being the team that always misses out. And so, this conversation of reallocating the savings does come up from time to time. I know that there’s tooling in this space from third-party vendors that can help you reallocate some of the savings. One of the approaches I’ve taken previously is we actually scrape back all of the savings from a particular product line and then reallocate the dollars on budget lines as we want to apply it, as a sort of more fair system, if you will.

Mike Fuller 00:48:28 And then I have seen in-house tooling being built by some of our most advanced practitioners that will reshuffle and allocate these RIs to different teams as they’re needed. So, it really just comes down to how important it is that the RI savings hit the right team (so, the materiality of that) and then the amount of effort you really want to put into reallocating those savings for that particular area of the business. For the most part, with the higher-level coverage you end up with just small pockets not getting the right benefit, and you might be able to just adjust that with some added budget in that business unit, that sort of thing.

J.R. Storment 00:49:03 Increasingly, we’re seeing a trend as well where some of the more advanced FinOps practices are actually not even paying attention to where the discount is being applied and are setting, essentially, their internal rates: we’re managing that coverage of commitment centrally, and if you use this type of resource, we’re going to give you this type of rate for it. And then, as Mike said, sometimes they’re shuffling them behind the scenes, or they’re doing kind of a second set of books. There’s the actual cloud billing data, and then there’s how that’s allocated internally. And whether that’s the right approach or wrong, it presents a different set of challenges.

Akshay Manchale 00:49:37 I presume the second set of books kind of accounts for normalizing the cost across different teams. For example, if I’m a team that happens to get, like, reserved instances or discounted instances all the time, my total spend looks lower than it should be, right? So, going back to the informed phase of understanding where you’re spending and how much you’re spending, with the second set of books that you just mentioned, is the reason that you can normalize the actual cost of the underlying resource across the different rates applied to the exact same instance type or service that you’re buying?

J.R. Storment 00:50:10 If you think about that feedback loop and the Prius effect that Mike mentioned, it’s much more effective when you see spend that you can influence and you know that your action is going to change that, right? And so, we’re seeing a lot of these companies that are doing this second set of books say, end consumer of cloud, engineering team, don’t worry about what rate you potentially are going to get. Just focus on using the right amount (that first lever) and then we’ll get you the best possible rates for that. And it removes this scenario that happens a lot, unfortunately, in those earlier stages of adoption, where one day a team gets a reserve that’s assigned and their prices are low, and then they shut down a resource, and while they have it shut down, another team picks up that discount; it goes to the other team; they bring a resource back online and suddenly their rates have gone up, and they say, ‘well, I haven’t changed anything. I may even be using smaller resources or fewer resources, and my bill has gone up.’ And so, that’s why we’re seeing that move towards separating those out.
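
One way to picture the second set of books is an internal rate that is the same for every team, however the provider happened to apply discounts. The sketch below uses a simple blended rate (total billed cost divided by total usage); that choice is an assumption made for illustration, since the speakers do not say how internal rates are set, and all names and numbers are hypothetical.

```python
def blended_rate(total_billed, usage_hours):
    """One internal rate per hour: total billed cost over total usage."""
    return total_billed / sum(usage_hours.values())


def internal_charges(usage_hours, rate):
    """Charge every team the same internal rate for the hours it used."""
    return {team: hours * rate for team, hours in usage_hours.items()}


# Hypothetical month: both teams run the same instance type, but the
# provider applied the discount to only some of the hours on the actual bill.
usage_hours = {"team_a": 100.0, "team_b": 300.0}
total_billed = 32.0

rate = blended_rate(total_billed, usage_hours)
charges = internal_charges(usage_hours, rate)
print(f"internal rate: {rate:.3f} per hour")
print(charges)
```

Under this scheme a team’s internal charge moves only when its own usage changes, which is the stable feedback loop J.R. describes.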

Mike Fuller 00:51:04 This becomes really important for things like MSPs that want to have a standard, stable rate for the services they’re charging out to their customer base. And you see tools like AWS Billing Conductor really supporting this model of standardizing the rates presented within the billing data and sort of, in italics, hiding the savings in order for you to allocate them yourself where you need.

Akshay Manchale 00:51:27 It’s interesting that everyone is working towards that sort of common understanding of where your spend is going. It’s not just from your end, but it’s also from the cloud providers that are assisting and evolving their billing data to suit this sort of like allocation. Once you’re in this journey where you have your information necessary, you have the optimizations, you understand what you can do, you talk about just the operational phase – like, where you operate with this framework. Can you talk about what sort of advanced strategies and techniques exist once you are past the initial set of low-hanging fruits to move towards a more sound financial operating strategy for the cloud?

Mike Fuller 00:52:07 The ultimate goal that we see for FinOps is to get into that data-driven decision-making. And so, there’s a whole pile of different capabilities within FinOps that can kind of deliver you towards that ultimate goal of decision-making. And we’ve covered a lot of the bread-and-butter capabilities today. We’ve mentioned forecasting a couple of times. It’s also another really important one, which often connects with budgeting, but integration with other frameworks is also really important for FinOps to fully settle into the business and sort of mesh well with the existing practices that are there. But ultimately, yeah, if FinOps is done well, engineers don’t see FinOps as an added task. It’s not oil and water, it’s just mixed into the way they think about cloud, and the way that they’re making their decisions and getting that feedback loop happening is almost a natural behavior for engineering. So, there’s a whole pile of capabilities that we cover in the FinOps framework on the FinOps.org website that really cover each of those areas and what they look like at a starting phase for each of those capabilities, right through to what we’re seeing from an advanced practitioner out in the field. And so, the capabilities aren’t really a checklist. They’re more like a menu of items that you can pick from to build the right practice for the right organization.

J.R. Storment 00:53:27 And I think the early days of FinOps really lived in a world where FinOps was in a bubble. It was billing data, spend data, that people in the organizations had to go find, and they had to go look for it, and they had to think about it: ‘oh, our costs are up. I need to go check this out.’ And where it’s moved for more advanced organizations is that the data of FinOps is in the path of the engineers. It’s in the path of the executives. It’s in the path of everyone who needs it, so it’s not outside their workflow. And this manifests in a tangible way in taking FinOps and cost-related data and putting it into engineering tools: into Grafana, where they’re looking, into Jira, where they’re collecting and planning sprints, into wherever they’re working.

J.R. Storment 00:54:11 And the other side of that also is starting to integrate in other data, and Mike mentioned the intersection with other frameworks: integrating in other types of data, revenue data, business output data, all of those things, because cost data on its own is really hard for folks to, I think, understand. When you’re dealing with an organization that may be spending millions, hundreds of millions, billions of dollars in technology spend, what is the right amount of spend? How much is too much or too little? And so, you really want to get into a place where you’re combining that with other business data to give context to that information. And then making sure, as we talked about, it’s early on in the process. Oddly, with the most advanced organizations, what they’re doing is nothing magical, but they’re considering FinOps data at the beginning of the architectural design process rather than the end, right? And that’s letting them actually make and effect change throughout that process. And they’re not asking engineers to go out of their workflows, and they’re aligned in their outputs to say, yeah, we’re going to make decisions about costs that ultimately are going to result in better outcomes at all stages of the process, rather than trying to wedge in, last-minute, a change that’s going to reduce costs and may have a negative impact on business outcomes.

Akshay Manchale 00:55:14 I want to just wrap up with one thing, but I think you touched upon this already in your previous response, but what are the right cultural expectations that you can set as a business leader? What not to do in terms of controlling, understanding, and optimizing your Cloud spend? Any last closing thoughts on that for leaders to set that expectation, set that cultural workflow in terms of having a sound FinOps strategy?

J.R. Storment 00:55:41 I’ve got one I can start with, Mike, and then I’ll move on to you. But for me, one of the big differences we saw from the first version of the Cloud FinOps book to the second one, and after a couple years of looking at it, was just how often people who are new to the practice and in the early days of this, they would go to engineering teams and sort of be blaming them. It would be an us-versus-them approach of, ‘you need to cut costs. You are being wasteful. Here’s all these recommendations to save money; you’re not doing your job.’ And that immediately puts an engineer or anybody on their back foot and makes them want to push back and disagree and have an argument. And the really advanced and successful folks are just using basic human motivation skills to partner in the organization to come in with data, to have a conversation to say, hey, how can I help look at this with you and come to the best outcome on their own? And I think that’s really the most important learning from seeing a lot of folks fail at it.

Mike Fuller 00:56:35 Yeah. Maybe to put a bit of a spin on it, thinking about sort of economic outlooks and potential downturns and stuff like that, it’d be easy for CTO leaders to set that tone for engineering teams to go and solve the cost problem or go reduce spend, and you end up with a lot of people moving a lot of bricks all at the same time, and you don’t know if you’re building a better cloud or you’re actually making it worse. And so, really I feel like just understanding that this is, like J.R. says, a partnership, it’s a collaboration between teams. So, everybody is building towards the same awesome outcome for the organization. And you don’t end up with everybody trying to help without talking to each other. I think that if there’s just a tendency for, not so much like a role separation, but more everyone understands the role they play, and they all want to play together in the same sandpit, that really helps drive good outcomes for the business.

Akshay Manchale 00:57:28 Great. Thank you so much J.R. and Mike for coming on the show and talking about Cloud FinOps. This was really nice and informative to understand how and what not to do with respect to cloud spending. This is Akshay Manchale for Software Engineering Radio. Thank you for listening. [End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
