|
In this episode we discuss with Randy Shoup, Distinguished Architect at eBay, about architectural pinciples and patterns used for building the highly scalable eBay infrastructure. The discussion is structured into four main ideas: partition everything, use asynchrony everywhere, automate everything, and design the system keeping in mind that everything fails at some point in a large distributed system. TranscriptSo welcome listeners to a new episode of Software Engineering Radio. This is another episode we are recording at a JAOO derived conference. This is QCON, and today we are talking to Randy Shoup from eBay. So welcome Randy. Thank you. So Randy, why don't you introduce yourself to the audience? Sure, well, I am Randy Shoup. I am a Distinguished Architect at eBay. My particular area of specialization at eBay is the Search Engine. So I am responsible for sort of all the vertical functionality from the web pages down to the depths of the search engine that eBay has written itself. I have been in eBay for about three and a half years. Before eBay, I was the Chief Architect at a small security company called Tumbleweed Communications and I worked there for six years. Before that, I worked at various other places including Oracle for a long time. Okay, so the topic of today's show is architectural principles in, obviously, eBay. So why don't you give us a general broad introduction to what the challenge is; what the magnitude of the problem is at eBay, and then we can look at how you solve that? Sure, well, I always like to start by talking a little bit about the magnitude of the problem because that's sort of sets the context. So at the moment, eBay manages about 250 million registered users worldwide. We have about two billion photos. That puts us in the realm of even large photo sharing sites. eBay users do just over $2,000 in trades every second worldwide. So they trade $2,000 worth of goods every second. We do well over a billion page views a day -- a hundred million items in 50,000 categories worldwide. We store about two petabytes of data, which is quite large. [Laughing] How many zeros is that? That's quite a lot of zeros. 10 to power of... 12? 12, yeah, it's a quadrillion. So well -- yeah and then we do about 5.5 billion API calls every month. So it's a decent-sized problem and it gives us -- as you'll see as we talk forward -- it gives us some particular challenges which we have to solve. The other aspect of eBay's problem, or eBay's issue, is the dynamism of it. We change the site constantly. So every two weeks we're rolling about 100,000 new lines of code, and we do a sort of iteration process where every two weeks, new code appears on the site. In total, we do about 300 new features every quarter -- some of them small, some of them large. And it's a worldwide site, of course, so we are in 38 countries worldwide and what that ends up being is -- for the database savvy -- it's about 48 billion SQL statements that we actually execute every day. So that's not one of those metrics where more is better, but it does give a sense of the scale of the problem. Yeah, and from the things you just mentioned, I take it that obviously some of the forces and challenges in the architecture are probably scalability and maybe an ability to evolve the system continuously without shutting it down. So why don't you elaborate a little bit on the core architectural forces that you're going to address with the patterns and practices we are going to talk about in the main section of the interview? So there are four or five architectural forces that we think of and you mentioned the first one: quite obviously, scalability is our primary challenge. The second is availability. So any public website is going to appear to be available to its users and that's one of the core aspects of it. Latency -- both latency for users in terms of how quickly the system responds to them, but then also data or execution latency -- how rapidly does an event driven by the user change data that is in our persistent storage, and how quickly does it actually cause follow-on effects? Then of course, for a large scale system you need to worry about manageability. So we need to make sure that our operations team [inaudible] -- it's simple, it's maintainable, and we can diagnose problems. And then the final [one], which is similar to scalability, but I think it's worth talking about separately, is simply cost. And we measure cost in terms of human efforts -- so development effort -- how quickly can we evolve the site; how expensive is a feature or less expensive; and then also the operational costs -- so the hardware, the software, the people who run the site. Energy, probably, is also -- Actually, that's right, yes, and you talked about exactly that with Werner Vogels and it's -- yes, so power consumption in our -- so anybody who runs large data centers knows this, but power consumption -- simply powering and cooling the hardware is actually the biggest limiting factor for the growth of our hardware. So it's not physical space, it's not being able to buy the machines; it's simply being able to power the machines and to cool the machines. And what we are finding in eBay -- but everyone else -- Google, Yahoo!, Amazon, et cetera are all finding the same thing -- is that the size of an individual data center is now exceeding what a residential power grid in most of the United States can power. So now you have to start thinking about more and more and more smaller data centers, and then architecting around that. So that is actually one of our main challenges in growth is exactly that. Okay, so why don't you give us an overview about the -- let's say, most important strategies that you use to solve those -- or address these issues and then we can look at each of them in detail? Sure, well, I like to think about our approach in terms of four different strategies. The first approach is partition everything, split everything into manageable chunks. The second strategy that we apply is be asynchronous everywhere. So try as much as possible to move processing to an asynchronous flow rather than a synchronous flow. The third strategy is automation, automate everything that's possible to automate. And that's sort of expressed [inaudible] by how individual components are configured. They should automatically adjust or adapt to their environment. And then also, the larger scale automation of how does a system improve itself over time and can we involve or machine-learn our way into doing better? And then the final strategy is remember that everything fails. So how do you design a system that can remain available, given the reality that everything -- all the hundreds of thousands of pieces of infrastructure will fail at any given moment? Okay. So I think that the rest of the interview or discussion is going to be looking at those in detail and looking at specific examples of how these principles and strategies are applied. So let's start with, obviously, the first one: Partition Everything. Sure. So again the first strategy is about partitioning everything into smaller chunks. And the way I like to think about it -- or the way our head of research like to talk about it is -- if you can't split it, you can't scale it. So simply scaling the site to more than a billion pages and so on is none -- no single machine or single system is going to be able to handle that load. So vertical scalability only takes you so far. So the way that keeps the system simple and maintainable, but yet available and growable is to split the system into individual chunks, and we will talk about individual examples of how we split that in different parts of our infrastructure. So it's pretty obvious, but why would somebody split their architecture? The main motivation is scalability. It's the ability to hold costs constant -- to have a constant sized unit of processing, let's say, whether it's an application server, a database, or whatever, but some particular unit that's of reasonable price-performance, and then be able to scale horizontally by adding more and more of those units. But the other aspects, maybe less obvious, are availability and manageability. If you think about the individual components as individual components, you can be maintaining or dealing with a particular component -- you could in the worst case take a particular slice of the infrastructure offline and manage it separately without affecting the availability of the rest of the infrastructure. Or maybe updating half of the nodes to a new version and doing these things in steps. Exactly right, and I can talk more about that as we go along, but definitely one of main ideas is that if you think of the infrastructure as partitioned into smaller slices, you can do rolling updates in parts of infrastructure where the other parts are staying stable. So the first aspect of partitioning that is maybe the hardest, and therefore I want to start with first, is partitioning the data tier -- partitioning databases. And that's something that's becoming a lot more well-known within the industry, but it was relatively new when we started doing it -- really when everybody like us started doing it -- sort of all without knowing about one another. So I guess let's talk a little bit about the patterns that we employ in splitting. We have two patterns and we'll see them instantiated throughout the infrastructure, but the first is Functional Segmentation -- so dividing things into different functional areas. So in eBay's example, we will have the selling systems, distinct from the buying systems, distinct from the search systems, distinct from various back-end systems, and so on. We actually have quite a few different divisions there. That's functional segmentation, and then the other pattern is something that I call Horizontal Split. And that's simply being able to chop a single functional unit into smaller and smaller pieces to be able to manage and grow them separately. But that means that you need some kind of dispatcher if you access the persistent store because the actual data you are looking for might be on this, this, or that host machine database. Exactly right, and we will -- I will tell you about that because that's one of the main challenges of horizontally splitting the data tier. Okay, but first -- maybe the slightly simpler approach, but still interesting, is functional segmentation of the data tier. So again, as I mentioned, we segment our databases into particular functional areas. So there is a set of user databases, there is a set of item databases that are about the items that we deal with, there are a set of transactional databases, products, accounts, feedback, and so on. Transactional? Transactional, meaning not in the sense of database transaction, but in the sense -- Right, transient data like the shopping card or something. Yes, transaction meaning a purchase. So we call them transactions, but you can think about them as the purchase flow or check-out -- that sort of thing. Just want to make sure the term "transaction" in the context of database is not misunderstood. That's right, and thank you for pointing that out. Yeah, so for us, the databases that are around transactions are about, essentially, the relationship of users to items if you like. This user has purchased this item, and items have multiple quantities, and can be purchased multiple times, and so on -- so there's a many-to-many thing. Anyway, so once you've decided that you need to divide your databases into different functional segments, how do you it? Well, what's interesting, at least to me, is that you can do that -- the same principles that you used to divide information into tables in a single database is that exact same approach -- that same data modeling approach you take at one level higher. So the user databases include not a single table of users, but a whole suite of related tables all of which are sort of partitional by the user ID, if you like. And similar for items. So there is sort of a central item table and then there is a whole suite of other tables that are related to that. But those are "by item" as distinct from "by user" or "by transaction" or whatever. So we use the same data modeling techniques that you would ordinarily use. And so what that ends up -- at least in the eBay infrastructure -- is that we end up having a thousand different logical databases and we spread them over 400 different physical hosts. We don't make a one-to-one mapping between a logical database and physical database. We have another -- a sort of set of indirection there where there's a logical-to-physical mapping. That's functional segmentation. The next pattern applied to databases is Horizontal Split. So essentially that's take the overall user data -- the dataset, the overall dataset about items, and how do you divide it up? Well, it turns out that -- of course, there are multiple different ways of dividing the data up. You can do something very simple where you hash or mod on a key -- so item ID or user ID -- if it ends in 1 it goes in this database, if it ends in 2 it goes in that database, et cetera. And there are also -- you can imagine doing range things: users that are one to one million go in this, users one million to two million go in another, and so on. And then there are more complicated schemes, like a lookup scheme sort of like a -- imagine a routing table or something like that. And then you can combine those schemes in different ways. This, as I hinted before, is something that's becoming increasingly known within the industry. The industry standard term -- we don't use this internally, but the industry standard term is sharding. Right, I heard about that when, I think, Google published [Hibernate Shards] and -- Exactly right, yeah, so when we started doing this -- and we have actually built something internally, which is equivalent to Hibernate Shards. We call it our own data access layer, and it does the standard OR mapping, but it also does the sort of shard management that you hinted at before, where there's a bit of -- most of the operations are routing operations, right? I need to update this particular item, so I need to go to the appropriate database that has that. Some of them -- we try to do as this little as possible, but some of them are kind of scatter/gather operations. I need to do -- Right, I was going to talk about these. These are expensive then. Those are very expensive, and I will talk -- and let me talk about that in a second, but we try to make sure that scatter/gather operations are -- we try to minimize the times that we do scatter/gather operations, and actually to try to do them in different types of infrastructure. But that's what this layer is for, and what Hibernate Shards would do as well -- so a sort of routing or aggregation layer, and that lives in our application server tier. But back to the split idea: so given a particular item ID or user ID, we know which particular host that it belongs in, and this data access layer allows us to route to the particular host. And, of course, just as Hibernate Shards would, it abstracts the developers from knowing about how an item is associated with the particular split and so on. But yeah, so a bit on the sharding term: That's something that's come out of Google. That's the term that they use internally, and then also Yahoo! tends to use that term as well, so that's the standard way we should be talking about this -- we have shards. The corollary -- and this is maybe the most controversial thing that -- when I give this talk -- that people react to is that eBay doesn't use database transactions. To be more precise, eBay doesn't use any distributed transactions. So we do use -- Right, any single write is, of course, a transaction. A single write is its own transaction, and then we do bunch together -- I will talk about this a second -- we do bunch together multiple inserts or updates into a single write, which is transactional for that, but we don't do any client side transactions at all. We have a J2EE infrastructure; we don't do any client-side transactions. We do no two-phase commit, nothing like that. So whenever we need to do these sort of -- have two things happen transactionally together like updating a master-detail table, we do that with what we call an anonymous PL/SQL block. That's an Oracle-specific term, but it's essentially a little bit of stored procedure passed along in the call. It is like a closure in programming languages, where you take a piece of behavior and feed it to something else, which then executes it. Sure, or if you think about it, it's a little mini-script. Yeah. It's a mini-script. So okay -- go ahead -- The reason why you are doing this is because you expect that this typically doesn't fail because it happens down in the database. So it's a kind of isolation that's different from transactions -- or, why are you doing this? Yeah, let me back up a little bit and say why we don't use transactions because I didn't motivate that very well. The why is simply that because we functionally segmented the architecture into user data is over here and item data is over here and transaction data is over here, and then further because we've partitioned or split or sharded the user data into 10 or 20 different shards and the item data into 10 or 20 or 100 different shards, there is no single data -- any operation that's interesting -- the user buying something -- ends up having effects in different places of the infrastructure. So let's pretend that -- well, it's a real example: I am bidding on something. There is the actual bid event or the bid update that happens in the item database on that particular item, but then there are also other systems that we need to update that are -- we like, for better or worse, for it to happen at roughly the same time as that bid update -- counts and history for user activity that they can see in their web pages and so on. So all those things -- it's just impossible, the way that we've -- there is no single database that holds all of them, so there's no way that we could do it in a non-distributed transaction.Once you decide then that you have an operation that occurs over multiple databases, you have a choice of trying to do a distributed transaction or trying to do something else. Well, as Werner Vogels says very well -- and other people as well -- distributed transactions are sort of the bane of scalability. It sort of grinds the entire system to halt, and so in terms of scaling, that's really not an option for us. There is a whole trade-off of consistency, availability, and partitioning, which people may have heard of -- the CAP -- the so-called CAP Theorem. So we need to have partitioning for scalability, we need to have availability, and so what that means is that we give up on consistency. That's the choice that's forced on us by the theorem, essentially. So we choose A and P; we lose C. But we'd like to have C. C is nice thing to have, right? We'd like to have consistency. So what we actually end up doing is -- in other words, how do we guarantee consistency without doing these distributed transactions or without doing two-phase commit or whatever? The simple approach is that you can often get consistent behavior or semi-consistent behavior by simply ordering the database operations in the appropriate way. You update when you are doing an insert, for example -- it's the reverse of the way people often think about it, but If you do an insert of detail records before you insert the master record, you never end up with a dangling situation. Because you can't see the details as long as the master isn't in there. Exactly right, so that's one example, and there are plenty of other examples. If I want to have A-update and B-update happening together, I just simply do A-update and then I do B-update, and once I know that A-update has succeeded, I can do B-update. Well, now the transactional issue is, what if B fails? I need to rollback A and so on. So the meta-question, which I will get back to in a moment is, do we really need ACID consistency for every operation? So I will come back to that one in a second, but particularly for this type of -- but, assuming that we do need some type of consistency there, what we do is we typically follow along that update with an asynchronous event, which can go run and clean things back up. So you can do -- so instead of thinking of it as, "The transaction succeeded or failed," think of it as, "I did these operations, and now I am in this interim state, and I need to recover from that state." So the recovery can typically happen in an asynchronous way. So we would either have sort of a batch job that goes through and cleans things up or we queue up an asynchronous event that has guaranteed delivery. Doesn't happen immediately, but it happens eventually. So that means that, potentially, a user can encounter some kind of inconsistent state because the "non-transaction" -- the distributed thing over several databases didn't work and the clean-up thing hasn't been executed yet. Yes, so let me -- that's a great point, and so yes, it's theoretically possible for something like that to happen, of course. I will say several things. One is that there are many -- in fact, most -- in fact, almost all of such situations where it doesn't matter and actually can't be noticed. So that's the first thing. Things that we really care about like ordering of bids at the end of an auction -- that stuff is transactionally handled. It is all in a single database. So that's not a big worry. It's things like, "Well, I am selling my item and I am also updating my seller preferences at the same time." We make a business decision that it is more important for us to make sure that the item is listed to be sold than it is to go and at the same time, transactionally with that update, go and change the seller preferences. So again the meta-point, which I'll expand on a little bit right now is consistency is not an all or nothing proposition. In the ACID database world, it is: the transaction succeeded or it failed, it's on or off, binary. The reality of large-scale systems is that the degree -- it's a spectrum, and the degree of trading of consistency can be traded off with those other metrics. And that's what the CAP Theorem tells us, and Eric Brewer discovered this 10 years ago, in 1998, and it has taken us a while for it to kind of percolate through the community and for everybody to sort of understand it. But the point is that there are some operations that need to be very highly consistent and the 100% is the absolutely appropriate number, and nothing less will do. But there are other cases where -- in fact, again, the majority of cases where it doesn't need to be transactional certainly at that moment, and something where an asynchronous event that goes and recovers is perfectly appropriate. And then there are plenty of situations -- a significant percentage of those things where consistency is just never -- it's just not all that important. It's okay to lose some information because we got the stuff that was really important, and if somebody has to redo an operation, that's unfortunate, and we wish it didn't happen, but that's the cost and the price that we pay for having an available system and a scalable system. Yeah, I think a term that's used in this context sometimes is also compensating transactions. You explicitly undo something if the overall thing didn't work. Yeah, so if I could layout the -- there is, again, this spectrum of consistency and backing away from the moat: So there's an ACID transaction, 100% consistent. Then backing away from that will be something like compensating transactions. Backing away from that would be a sort of best-effort, asynchronous event, which would go and clean things up. And then backing away from that would be sort of less and less effort that we spend on keeping it consistent. And again, just a point that I'll make is, architecturally, the degree of consistency is one of those other knobs and dials that you have to play -- once you realize -- once you kind of get the Zen of being inconsistent, it frees you quite a lot to build systems that are scalable and available, and it becomes yet another tool in your toolbox: how you dial consistency up or down along that spectrum. And it also fits together nicely with the asynchronous thing that we will talk about later. Exactly right, it feeds right into the asynchronous thing. But first, let's look at partitioning other aspects of the system. We talked about the database, basically. The application tier is the next one, I guess. Great, yeah, so I started with the hard one, which is partitioning the data. The much easier problem is partitioning the application tier. So we could have implemented it in other ways, but eBay's application tier is a Java-based tier. We run on servlets and so it's very -- it's a minimal amount of J2EE, but if you want to think about -- No EJBs -- No EJBs, nothing like that, but we've sort of written our own internal -- infrastructure is too strong a word, but a set of POJOs, essentially, that do that kind of functionality for us. Anyway, but for us, it's easy to scale the application tier because we make it entirely stateless. And I'll come back to the sort of corollary implication of that in a moment, but we first functionally segment the application tier in the same way as we do -- or similar to how we do the database. So there is a set of application servers that deal with the selling part of the application, a set of application servers that deal with the finding or search part, the buying, the check-out… We end up having more than 200 of those individual pools, some large, some small, partitioned by functionality. And that's partly for very good architectural reasons, that's mainly why we do it, but there is also the reality that a single EAR file can't hold the entire eBay application. So it's sort of out of necessity that we've had to learn to do that, in any event. So that's segmenting it functionally. Again, the next pattern is the Horizontal Split pattern. So again, within the search pool or set of search pools, within the selling pool, et cetera -- all the applications servers, all the hundreds or thousands of applications servers that are in that pool are all entirely equal, and each can serve the load -- it's load-balanced in a very standard way. So there is nothing particularly tricky about that, except that it has implications for application design. So the implication is that we have absolutely no session state at all. So -- On the server. On the server. So that's how you can keep the application servers entirely stateless and completely horizontally scalable. So it means that we can take -- at any moment, we can and do shut down, say, half of the application servers in a pool, upgrade them to a new version of the code, which we are doing every two weeks, and then do a flip or do rolling updates where we do 10%, 50%, 100%, and so on. So all that kind of stuff gives us the ability to manage that quite flexibly, and gives the operations team the ability to do that. Well, that has strong implications for the kind of applications our application developers can built. So we don't store any session state there, and the reason for that is a bit obvious when you take a step back. Since the eBay -- application, if you like -- is implemented on multiple distinct pools, it means that when a user goes through the website, that sort of session flow, it means they're going to pass from a selling application to a search application to a view item application, et cetera. So there is no one place in the infrastructure that they are going to keep coming back to, where you could even store the session state. And that's actually a good discipline for us because it forces us to design the application in an entirely stateless way. So how do we -- but, "Randy, you say there is session state, there is transient state. How do you deal with it? How do you do multi-page flows and so on?" Well, there aren't any -- there is nothing new under the sun or there is not really any magic, it's simply using -- for most of the session state we use either URL rewriting or cookies. Cookies are limited -- as everybody probably knows or will soon discover -- cookies are limited by the RFC to 4k. So we only get 4000 bytes worth of cookie space, which seems like a lot to start, but actually, when that covers everything that eBay.com does, it starts to become pretty limiting. So just to clarify that maybe for users, using cookies is nothing special, but what you're getting at is that you don't use a cookie to store some kind of ID, which you then use on the server to look up session information data, which can be anything. You literally put all the state information into the cookie. Yeah, we do -- yes, with a few exceptions. So the few exceptions at the moment are for multi-page flows where -- so imagine somebody is selling an item. That's not something -- you don't enter all the information in a single page, but we need to retain that transient state going along. So how we do that is exactly how you hint, which is there is essentially an ID that's in the cookie that refers to something in a scratch database. In a database, right. Yeah, now, so it's not in memory or even on disk in the application server tier. It's physically on a database. That database is limited to one of your logical database things. So it's something that's kind of local to one of your applications. Yes, that's exactly right, and we are experimenting with other schemes where we have a session storage that's not in a database, but is separate from the application server tier, and we are kind of working the issues through on that. But we sort of think of those as server-side cookies, and just like cookies from the application developers' perspective, the developer has to assume that that cookie can be deleted and any time. So really important transient state like as for a multi-page flow, as I hinted, that goes in a database, as all our important information does. That doesn't live permanently in that database. That's why I call it a scratch database. But important data lives in databases and other transient stuff just gets passed back and forth. So that's our approach. And then the final example of partitioning is the search infrastructure, and that's the part that's near and dear to my heart because that's my particular area of expertise. So our search infrastructure -- so the functional segmentation is simply that the search infrastructure is read-only, as you might imagine a search infrastructure would be. It's functionally segmented from the rest of the transactional databases of the site, and I will talk a little bit about how the one gets transferred to the other -- Its needs to be updated from time to time. That's right, that's asynchrony right there. So we will talk about that as we go forward, but the search function, the search engine is a functionally different part of the site as the transactional database. That's something I have heard also in these data warehousing and data cube applications where you prepare the data in a way that is not suitable for transactions because it might contain duplication and stuff. It's not necessarily always normalized, but it is optimized for queries. Yeah, that's a very insightful analogy to draw. The eBay search engine is to the eBay transactional databases as the eBay data warehouse, which also exists, is to those transactional databases. There is -- in data warehousing terminology -- there's a kind of ETL process which I will talk about, Extract Transform Load, which takes the data from the primary databases and moves it over into the search engines. It doesn't work like most data warehouse ETL processes do, but that's a detail. But the general idea is, yes, it's a kind of read-only -- it's a different data structure, right? A search engine is an inverted index. It's an entirely different data structure, and should be, from third normal form-style relational databases. It is even a relational database or is it really different? It is a very different. It's a search engine like Google or Yahoo! developed by the same people that developed the AltaVista search engine, and as with many search engines, it's developed on similar principles, which is that it's an inverted index. There's a set of documents with IDs, keywords are indexed into those documents, and query operations happen by intersecting lists or vectors of those keywords, very simply, and there's a lot more detail about how that works. The challenge for -- just as an aside, the challenge for an eBay-style search engine is that our users expect the search engine to be updated in essentially near real-time. When somebody bids on an item that changes the price, and price is a filter that people are very interested in querying on. So it actually means that the style -- the sort of classic web search engine style of "you build the index in a kind of batch mode and then upload it to the search engine" is something that doesn't really work for us. It needs to be a lot more real-time. So I will talk a little bit about how that real-time system works in my asynchrony section, but anyhow, to finish the thought on scalability for search, the idea is that the search engine can be horizontally split. So there is this overall search index of whatever size it is. We divide it up into chunks of ten or twenty or sixty or hundred, and divide the infrastructure that way. And then we have an aggregator piece, which now does do scatter/gather over all those different parts of the index. So somebody queries for "iPod" or "Mickey Mouse" or "Wii" and the aggregator sends the query to each one of the different splits or shards and gets the results back and aggregates them and sends them back to user. Okay, so that was the discussion on partitioning. The next of your basic architectural strategies is -- or was or whatever -- is to be asynchronous everywhere. This is something we all know, but as the systems grow larger and larger, it becomes more and more important. The more processing that you can do asynchronously, the better. There are obvious reasons for that. Scalability is one, where if component A and component B are linked asynchronously, you can scale A and B separate from one another, right? Availability is maybe another key point, which is that A and B will not necessarily always be up at the same time. So you want to have the ability to have A be available and B not be available, and some kind of asynchronous queue or topic or some kind of way of asynchronously linking those two things allows you to do that. So A can be up and can perform its job without B being up at the moment. The slightly more subtle point is that not only the up/down state of B and A are decoupled, but also what we call availability characteristics. So when we build out a particular pool or piece of infrastructure, one of the very first things we ask is does it need to be so-called always available? Does it need to be highly available, which is [less than] always? Does it need to be sort of best-effort available? Et cetera. Again, like consistency or anything else, it's a sort of spectrum of availability and it's more expensive to do the more available. So you can decouple the availability characteristics of the two things. Maybe A does need to be always available, but B doesn't need to be, as long as they are linked asynchronously. And then, of course, latency. You can improve user experience latency by getting back to user very quickly, doing some of the more expensive processing later at the very explicit cost, though, of the data or execution latency happening later because that's an asynchronous process. And then the other maybe more subtle point, but I think should be appreciated more than it is, is that when there's asynchronous integration between components, the cost metric is a little bit different. You don't have to build out for the peak load if there's a queue there. You don't need to build for the peak load, you simply need to build for the average load or maybe the above average load, but in any event -- because as you hit your peak, the queue allows you to spread the processing out over time. The queue queues it up. Dampens it. Exactly, it's exactly dampening as in [electrical engineering]. It takes high amplitude fluctuations that peaks and valleys are, and turns them flatter. So that helps quite a lot. So patterns that we apply asynchronously are sort of well-known patterns in the industry. One is sort of message dispatch -- an event-driven or message-driven model. And the other something that I call periodic batch, which should be familiar to everybody -- the eldest of us will be familiar with that as well. [Laughing] I am the older one of the two of us, that's for sure. But in any event, people who've been around for a while would be familiar with a batch. Sure, yes. So some particular examples of those patterns in the eBay infrastructure are something that we call event streams. So in a very sort of JMS-style approach, when things happen on eBay, there are lots of different systems that are interested in that thing having happened. So somebody has listed an item, somebody has purchased an item, somebody has bid on an item -- all those things are interesting events that other people are interested in knowing about and doing stuff with. And so we've developed, as is our want, our own infrastructure for doing that, but its very much along the JMS-style lines. So typically what happens is the primary use case that lists an item or bids an item or whatever also, as part of its insert or update of that record, also is inserting an item into an event table, and that event gets consumed by these multiple consumers than run asynchronously. So that's again where we use this trick of anonymous PL/SQL blocks and making that transactional. So it's a single database transaction to update the item's current price with the bid and insert and queue the event -- the item bid event for other people to process. And where do those queue tables live? I mean, they are part of one of those systems. Otherwise, you couldn't do it with your anonymous SQL block. Yes. But that means that some other system needs to watch this table of the other systems so you have these dependencies between those systems in the sense that they monitor each other's queue tables or event tables. That's exactly right. So in order -- very insightful, as you correctly point out, in order for us to do it in a single database transaction by its very nature, the queue table -- or really the segment of the queue table or the shard of the queue table that is associated with this particular set of items or this particular set of users or whatever needs to live in the same -- be co-located in the same database as the item table or the user table or whatever. Absolutely true. But that actually has advantages, not disadvantages, if you think about it for a second because it means that the primary data and the event data are co-located in a single split or single shard. So yes, when a consumer is interested in consuming events out of that particular shard, they do have a dependency on that database. But they already did in some sense because they have a dependency to read the queue, but they also have a dependency -- I mean, typically they have a tendency to read the data as well. So it works quite well, actually. It's sort of a very natural way of essentially partitioning events along the same lines as partitioning the data. So the asynchrony there is again a very standard JMS model. There are multiple consumers each of which are consuming those events, and the key sort of design thing that we've been able to take advantage of is the design principle for the event streams we built are simply that -- we don't care about exactly-once delivery and we don't care about order. This has implications for our application developers, but again, just like trading off consistency, it frees up the ability to do a much more performant system. So we talked about using asynchronous message dispatch to communicate between the different separated applications and shards of those applications. How does the async metaphor or principle play with the search infrastructure and the updating of the index we talked about before? So we use exactly the same pattern, although as you will see, the implementations of that pattern is slightly different, but fundamentally, the idea is that, again, interesting things happen in the primary databases. People list items, they bid on items, items are purchased, and so on. And we want those changes to be reflected in search as quickly as possible. So how we do that is we have a sort of "asynchronous feeder infrastructure" we call it, simply feeding updates that happen on the primary databases into the search infrastructure. So how that works is we have something -- for each of the shards of the item database, let's say, we have a search feeder application, which is a sort of daemon which is periodically polling for updates that happen on the primary databases. What happens is that that feeder application notices the updates, persists those updates locally -- for a reason I will talk about in a second -- then messages those updates out over a multicast message bus, and the individual nodes of search infrastructure are all listening -- each listening to an appropriate set of those messages to update themselves. So the reason why there is this intermediate persistence is that the multicast buses -- multicast messaging is, of course, unreliable. But we would like it to be reliable or semi-reliable, so the way we make it reliable is that when nodes notice that they've missed a message, they send a message back to the search feeder application saying, 'Hey, I missed message 4 or missed message 5. Could you please resend that to me?' And how does it know how to resent it? It reads it from the persistent storage. Right, okay. So that's the asynchrony part. And then those individual nodes are, again, listening to those messages and keep their slice of the search index in memory, and they update that search index in memory, and then it's immediately available to be queried. Okay, you already talked about the batch stuff before. That sounds really old style, but it's a good way of, obviously, batching things that happen only at certain time intervals, so do you want to talk about that a little bit? Sure, I will talk briefly about that, and what you say is exactly correct that the batch is a sort of old style 1970's transaction processing approach, but it's still a very good one. Not everything is appropriate for messaging. So it's typically appropriate for those things that are scheduled or infrequent or, frankly, particularly expensive. Such things that we do once in hour, once a day, once a week -- things that we don't need to do very often. Those are types of things that are perfectly appropriate for batches. So that allows us to do large summations over large amounts of data, but actually also, we end up combining the two patterns as well. So we will have a batch that reads a bunch of primary data and then generates a bunch of downstream processing through message dispatch. So we will have a batch that collects a bunch of data and then it generates a whole series of other events that can be consumed later on. So the two patterns are quite complementary to one another. Right. Anything else on async? That's about all I have to say about async. Okay, so strategy number three: Automate Everything. Automate Everything. So again, it's pretty obvious, but as the number of systems and the complexity of systems grows, it becomes more and more difficult to manage it manually. And that's true of hardware; that's also true of software. So if we as software engineers can build systems that can be more automatable and adapt themselves to their environment, that makes it much easier to manage and also has a bunch of other nice effects as well. So clearly there is a scalability effect for the organization. You scale the number of machines without linearly scaling the number of humans. Machines are less expensive, so there's a cost advantage. But again, maybe the more subtle and less appreciated one is you can do different kinds of functionality -- more sophisticated kinds of functionality in an adaptive or automated system than typically you can with manual tuning. If I am tuning the size of a thread pool from 9 to 10 to 12, as a human, I only have so much attention that I am going to put to that, and I need to sleep and time to think. But something that could be configured in an adaptive way and the machine could tune itself frees the human to do something else -- even something very simple like that. It's also something Bob Hanmer talked about in the Fault Tolerance episode that you should make humans aware of what's going on, but you should take them out of the critical path of actually adapting or recovering a system. That's a very nice way to say it, yeah. So basically, you have a bunch of metrics that you measure all the time and then you have a set of rules and they tune the system itself. So that's the -- That's the general idea -- for the cases where we've done this, that's the general idea. So one specific example is, again, with the event streaming system. The individual consumers -- we set what you call an SLA -- a Service Level Agreement for a particular consumer, saying something like, "You, consumer, need to have processed 99% of the events within one minute or 10 seconds or whatever." And then, given that SLA, the consumer then tunes its various parameters to meet that SLA with minimum resources. So if load is high, it might use more consumer instances, poll more often, more threads, et cetera. When the load is lower, it might dial that back. But that's been very, very effective and helpful for us in that way. So the other more maybe sophisticated approach is less about adaptively configuring and more about machine learning. So this is something that I have been working on for the last while and which is what you might call the adaptive finding experience at eBay. So how can we adapt the search experience, which is one of the main experiences that people have at eBay -- how can we adapt it to the particular context that the user is in? How can we learn from what that user or that type of user has done in the past? What can we learn about what the people who have searched for iPod or Mickey Mouse or whatever have done in the past? How can we dynamically adapt the experience in that way? So as with any machine learning setup, you need a feedback loop. So that feedback loop for us is a couple of obvious steps: we collect data about what we've shown and what's actually happened afterward. We sort of aggregate that data -- correlate it together in this sort of -- in the background. Then we make some inferences from that, we deploy those inferences or those rules or those metrics back up to the site, and then next time, when we are going to serve a page for iPod to this particular type of user, we'll know what to do. So it's generally the idea, and this is not [inaudible] to eBay, this is a general idea across the industry. We're closing that loop -- learning from what users have done and directly making the site adapt to what the users have done in the past and what worked for them better in the past. Okay, strategy number four, I guess the most important aspect of distributed system design, at least: everything can fail independent of other things. The whole system goes down or not; parts can fail. So why don't you talk about this a little bit? Sure, yeah, the final and very important part of distributed systems is that they fail. And the more systems you have, the more likely -- if a given percentage of time a unit is down -- Yeah, the more units -- You're going to have -- more units are going to be down. That's obvious. So, essentially, the idea is to build systems to be tolerant of failure and to try to continue moving as far and as fast as they can, even when other parts of the system or other components that they depend on are down. So we try to -- we assume that every operation will fail, every resource will be unavailable. We make all those assumptions as part of the programming model. We, of course, want to detect failure as quickly as possible. We want to recover from failure as quickly as possible, and while that failure is going along, we want to do as much as we possibly can. Degrade gracefully. Degrade gracefully, exactly right. So the patterns that we have here are a sort of failure detection pattern, which I'll talk about. Something that I call rollback, which is essentially undoing operations. And then graceful degradation of the experience. So in terms of failure detection, one of things we learned early on at eBay is the need to have a central application logging system that allows our operations team -- and development team for that matter -- to monitor the health of the infrastructure. So every application server and piece of infrastructure is constantly sending events over a multicast message bus, and those -- we have a whole set of listeners, which are listening to various slices of those events and aggregating them in various ways, sending alerts, and so on. So it's the only way a large-scale system can really work is by having some kind of monitoring that's constantly happening and is available for the people who can do something about it to know. In terms of rollback, two of the things that have been really useful -- two simple approaches that we've taken have been very useful for us. One is in the area of code deployment and the other is in the area of feature deployment, and the fact that they are separate from one another, which I'll talk about in a second. So the general rollback idea is you never make any changes to the site that you can't undo. So it's not about rolling back data in transactions or the "non-transactions" you have; it's actually about rolling back updates to the system. That's right. Yeah, so rollback as applied to large-scale updates the system, as distinct from a transactional rollback. Yes, thanks for clarifying that, that's very good. So again, the philosophy is that there's no single change that we can make to the system that can't be undone. And that has very severe implications for application development, particularly for schema development because like you hinted about, if we just think about in terms of the database, what that means is that I can't make any changes to the database schema that I can't undo. I can't make any changes to the -- if you like, interface to the data that can't be undone. There always needs to be backward-compatibility for data as well as for interface. So that actually gives us a lot of challenges. But because we roll the site and we make updates to the site every two weeks, we are essentially forced to do this kind of thing. And you are always operating different versions of the same things in parallel? Yes, at any given moment, there are almost surely multiple versions of at least one component in flight because it takes us -- because for thousands of applications servers, it takes us serious time to be able to go through and update all of those things. And, of course, as with anything else, things fail. So you need to be able to roll out and roll back. So when we do code deployment for a particular two-week iteration, which we call a train for historical reasons -- for a given train, we have what's called a rollout plan. And each feature that's on that train has its individual rollout plan. I need to make modifications to this part of the infrastructure, this database. I need to make changes to this pool of application servers and also to this one. So we have individual rollout plans for features, and we roll them all up to a grand one -- take what's called the transitive closure of all the individual dependencies of those individual rollout plans, and that's the master rollout plan for the site. And then have automated tools that execute that rollout plan in -- I was going to say dependency order, but it's actually reverse dependency order -- Starting with the -- Starting with the leaves of the dependency tree, and moving back up -- automated tools that do the roll out, and then if there are issues that we detect, we can automatically roll back and put things back in the same place in an automated way. And you do that in the dependency order so you can have the leaves still running, but the stuff that depends on the leaves that caused the problem -- you just remove that. Exactly right. So we can move forward, and we can traverse up or down. If you think of a dependency tree, you're traversing up or down. So that's for code deployment, and then for feature deployment, every significant feature that we roll out has an on or off state. So what that means is that for any kind of reason -- operational reasons, business reasons, whatever, we can turn on and off a feature with sometimes a flip of a button. And that's much faster than doing code rolls as you might imagine because it's driven by central configuration, which sort of everybody subscribes to. So that actually allows us to do several things. One is that it decouples the code deployment from the feature deployment. It means we can roll out code that's essentially off, which we do all the time. And then we can turn it on at some particular moment, either because we've announced that we are going to make some pricing changes at X date and we can turn them on at that date, or because the feature isn't working well business-wise, whatever-wise, and we can turn it off. It also allows us to unroll some of those code dependencies, which can become circular. So changes on this particular train -- this change to the selling pool requires this change to the finding pool first, but there's this change to the finding pool that requires this other change to the selling pool first. You can unroll those dependencies by rolling all the code out with various flavors of on/off of the features, and that allows you to deal with that. It reminds me of another very old thing. I know it by the name of runlevels. So if you want to start up the system and you have dependencies going around, you start it up to some level where it's maybe still passive, but everything is there, and then you start up the next level. It's kind of similar -- Yeah, I never thought about it that way, but that's absolutely right. Yeah, I guess that's a great analogy. I think this takes us a little bit one step further because instead of -- there aren't states of the system as a whole, but there are states of individual parts of the system. And what we've ended up with is a situation where application developers sort of check for the availability of resources as you would imagine they would, or deal with the unavailability of resources. But they can also check for the availability of features in that same way, and that's sort of a similar metaphor for both of those things. And in some sense from the perspective of the code, it almost doesn't matter: if this feature is down or this resource is down, I can do X, and I don't care about which it is, essentially. Yeah, exactly, that's actually interesting. Yeah. It is a little bit like reflection. So the system knows about its own configuration or state in terms of update and version, and then you can react on that explicitly. Yeah, exactly right. It's one of these things where it shows that trying to make things transparent to application developers really doesn't work in big systems, right? Also a very insightful comment, yes. There are lots of ways to say this. Joel Spolsky talks about it as Leaky Abstractions. I think that's a wonderful way of saying it, but yes, there are fundamental -- what's a way to say it crisply -- the availability state of the system is a leaky abstraction, and there's no way around it. You cannot abstract the developer from the fact that resources cannot be there or features cannot be there. That's a fundamental part of the system, and the more you try to hide that from the developers, actually, the more complex it makes it. Internally, yeah -- Because somewhere someone's got to be aware of that availability state, and if it's not a developer, then who? [Laughing] Yeah, the magic framework guys. Well -- that's the scary part: the magic framework, which would need outrageous numbers of rules and that's just an -- so in some sense to keep it simple, we make it obvious and transparent. Okay. I think this was the main thing. Randy, thank you very much for being on the show. I think this was very interesting and, well, thanks. Thank you very much. I enjoyed it. |