|
In this episode we talk to Matthew Wall (Guardian News and Media) and Erik Doernenburg (Thoughtworks) about their work on the new guardian.co.uk website. We discuss the challenge of scalability and interactivity, their use of Domain Driven Design, some of the technical building blocks as well as the approaches they use for performance measuring and scalability tuning.

Transcript

So welcome listeners to another episode of Software Engineering Radio. This is another one we are recording at OOP 2008. And in this episode we're going to talk to Matt Wall and Erik Doernenburg about the new guardian.co.uk website. So this is an episode that doesn't introduce cool new concepts or technologies, it's more like a case study or a real world example of how to use some of the things we've talked about in the podcast before. So welcome Matthew and - well we only have one mike so welcome Erik in a moment. Why don't we start by Matthew, Matt, I guess, introducing yourself.

Matthew Wall: Hi, how are you out there. My name is Matthew Wall. I am the technical architect working at the Guardian for the Guardian's website guardian.co.uk. I have been working at the Guardian for coming up to four years now and I specialize in Java and Agile software development.

Okay.

Erik Doernenburg: Hi, my name is Erik Doernenburg. I work as a principal consultant at ThoughtWorks and my role in the project at the Guardian was that of a tech lead from the consultant side, and I collaborated with Matt on the initial architectural vision and the implementation of the first site that would then later on use the same technology to be rolled out across the entire network of the Guardian sites.

So why don't we start by having a brief look at what the Guardian is, I guess. Well, obviously, some of the audience knows about it, some don't; so why don't you give us an introduction about the domain?

Matthew Wall: Well, the Guardian has a long history.
It is a high quality British newspaper, there aren't many of those, but we are one of them and it has a very long history. It is around 150 years old. And around ten years ago the Guardian noticed really that there's a lot going on in this new thing called the Internet. This was a new way to share and exchange ideas, to publish information, and we decided that we wanted to take our newspaper business online and thus became what we call Guardian Unlimited or guardian.co.uk, which is the website version of the Guardian. Initially, as you can imagine, in the late 90s a lot of the work on the web site was kind of experimental. We used various products and various techniques to get our web presence going and we were really, really kind of learning how to do it as we went along. And what we ended up with -- partially because of the technical work that was done on the original site and also probably mainly because of the fantastic editorial content that the Guardian produces -- was something that was much more successful than certainly any other newspaper in Europe; I think we are the most read online newspaper in Europe. And we actually have a very high readership in the States; and the product grew, and grew and grew. I mean I think in total we have something in the order of 2.5 million pages of content on our web site. So it is quite a big offering. But the original site was very, very much built kind of after the newspaper mentality. We were basing it on scripting languages from a technical point of view; database scripting languages like Tcl and Perl were very, very common here at the Guardian in the early days of the site. And we were essentially assembling web pages together rather like they were assembled in a newspaper -- you know, we're thinking about a page which has an article on it and an advert and another piece of content in this kind of grid type arrangement and assembling the pages like that. And thus we went for quite a number of years.
I mean the site is very, very successful and it has grown and grown and grown. But now there is this thing coming along called Web 2.0. I am not exactly sure what Web 2.0 is.

I guess nobody is... It is a buzzword.

Matthew Wall: I haven't read the RFC and I'm not sure exactly what it is. I think from the Guardian's point of view Web 2.0 means a few things. It means that JavaScript now works; it means that people generally have a broadband Internet connection; it means that people generally have a modern browser, you know, like Firefox or IE6+ or Safari or something like that. And it also means that they have an expectation of how to interact with the Internet quite dynamically. So we realized that our old legacy site wasn't really scalable in the Web 2.0 world and we needed to look at a rebuild of the site, which is kind of where Erik and ThoughtWorks came in to help us along with that initial process. And kind of what we needed to do was first of all get a much stronger model of our business and the content; that is very, very important to us. The previous site, as I say, it was kind of, I hate to say it, but lashed together and there are lots of scripts and lots of -- there is not very much of a model in there.

Historically grown hack --

Matthew Wall: That's a good way of saying it. For example our site we had, I mean there is, a podcast that we are doing now. And we actually had -- I think we still have -- the world record for the most downloaded podcast, which is a Ricky Gervais comedy podcast, that we hosted. And if you actually look in the old code there's almost no evidence that the site can actually support podcasts; there is no strong model. So we needed a strong model and we chose the Domain Driven Design technique.

Let us talk about this a little bit later and let's first maybe focus on the technology that you've used and why you have used it and then maybe visit some of the, let's say, design philosophy and maybe some of the process.

Matthew Wall: Sure.

So before we look at the technology stack, why don't you give us a little bit of an overview, like how many people were involved and how long the project was spanning? So people have an understanding of what the context is, how big the project was and that kind of thing.

Matthew Wall: Sure. I mean the initial -- one project was developed by a development team of around ten people over about nine or ten years, along with a systems team of a similar sort of size to support it.

That is the old system.

Matthew Wall: That is the old system. The new system now -- on our development team we have in the order of fifty Java developers working in four teams. We have a business analysis function, a QA function and everything like that. It is about 90 people and we are going to be working for -- we have been working for a year-and-a-half and we'll be working for another, well, at least another year I think.

So Erik, I guess, why don't you give us a little overview about the technologies that you have decided to use.

Erik Doernenburg: When we started looking at this project as an implementation project of a brand new website, in a way, I mean, keeping all the good things, the award winning design and the award winning editorial content, adding some of the new requirements that we had heard about that probably play a little bit into the Web 2.0 space of what people do with tags and relating articles to each other in a much more intuitive and much more powerful way, we realized very quickly that we needed to come up with a technology that was capable of serving the amount of traffic, the amount of volume that was hitting the site, but at the same time needed to be very, very dynamic. The approach that's been done so far was to actually press the generated HTML content on to a hard drive.

So that means you could really pretty much serve as much traffic as Apache would allow you to do. Because there was no real dynamic creation of pages; that was kind of statically generated.

Erik Doernenburg: Yes.

By the way, that is exactly the same way I do my personal website.

Erik Doernenburg: I think it was a very common approach and it worked very well. The key reason, and we'll come back to that later actually, why we couldn't do this anymore was that there was now so much dynamic content on the page that we couldn't figure out how to de-cache those pages. There was just no way that we could figure out, if we changed that tiny bit of content, which pages needed to be regenerated. So we knew we had to come up with a stack that would allow us to serve most of this dynamically. Of course, we started talking about this with the Guardian two years ago. Ruby on Rails was on the horizon. We didn't feel comfortable that it would be able to serve that amount of content.

Do you feel more comfortable today or would you still say it's a little bit iffy, probably for a somewhat smaller amount of pages?

Erik Doernenburg: Today I would probably do a spike; I would not sign away my life right away and say we can definitely do that in Rails.

My feeling too.

Erik Doernenburg: And so at the time we decided on Java. I mean one of the big requirements was that this could be deployed on Linux or a UNIX operating system, which ruled out some of the other platforms. And we did want what we call an enterprise solution, and enterprise in that context probably just means big. We wanted something that could serve it. And at the same time I'm a strong believer in open source software and I think the way that software is developed really mimics what people need. There are hardly any features you don't need and most of the features that you require are there and work very well.

So what are --

Erik Doernenburg: So the stack we chose was an entire Java open source stack, pretty standard I would say. At present it is all upgraded to Spring 2, Hibernate 3, and Velocity 1.5. So again we've even shied away from using JSPs and used some open source technology.

Matthew Wall: Velocity is used to actually dynamically generate the web pages.

Erik Doernenburg: Yes exactly, it is the template engine. And I said we are using Spring 2 and Hibernate 3, and there is a good point in that we actually do make use of the new features. We didn't just upgrade the frameworks; we were one of these projects, and Matt mentioned it early on, that is very Agile. So we are not afraid, we do have the tests and we can actually look forward to getting new versions of those frameworks, which was another benefit of using open source: we have constant progress and, as I said, we actually looked forward to it. The moment Spring 2 came out, we used some of its advanced features.

Such as?

Erik Doernenburg: The custom namespaces, for example, and some of the request scoped beans. These are two examples of that. They actually helped us.

Matthew Wall: We were not worried.

Erik Doernenburg: And other things that we used, pretty much standard things: JSON, the Yahoo! UI toolkit for the editorial tools, that is the administrative web site. Exactly, which the editors use, because they were quite spoilt, I guess, from the old website and they wanted a really powerful system, and again there is more content now, and this is one of the things that really sets the Guardian apart in my opinion: the quality of the editorial content.

So we shouldn't spoil the editors.

Erik Doernenburg: We should, we should spoil the editors to write really, really good stories which we'll bring back. But that was that, I mean, we really tried to focus on a few key frameworks and keep the stack very simple but at the same time very highly performant. Deployment is in Resin, so again no big massive application server, just something small that we knew worked. Development is done in Jetty, which means that developers can run the entire stack within seconds on the developers' workstations.

So how do you do scalability? Is it all stateless? Do you do replication?
So you have to -- like basically a load balancer that --

Erik Doernenburg: Yes it is absolutely stateless. There is no conversation with the client and a server. Sometimes there's a little bit but that's held in the JavaScript and it does not hit any web server, it is just really IP.

And you have the big massive back-end database of which you assume that it never dies and so on --

Matthew Wall: Yes, interesting about the database. I mean obviously the database is where all our content is stored and we use Oracle as a backend database. We also deploy to two physical co-locations. So we have one in London and one in Manchester with a WAN in-between. The Guardian has always tried to get the most out of the least, so this is part of our using open source. We try and get the most out of our software and the most out of our hardware. There are lots of very, very expensive solutions for the reliability of the database. And when you look at those solutions deployed over a WAN they become very, very expensive.

Right, so the problem is how to keep them in sync.

Matthew Wall: Exactly, so what we actually do at the moment is we have two Oracle nodes in each co-lo and we replicate between them, but we have them in an active-passive configuration. So that means we actually have potentially an unreliable database. So obviously we have to deal with that and we also have to deal with undeployment. I mean one of the key things that we need to be able to do is change, and change our understanding and change our model, change our database schema. And we release to production every two weeks, sometimes more than that; we will often evolve the database schema or upgrade. So even without the database unreliability problem there is still a maintenance window or database outages. So we have to be able to deal with it. And we actually have evolved quite a nice system for dealing with that, which is essentially that we can fall back and turn the database off and serve the site out of our ehcache.
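The fallback idea described above can be illustrated with a minimal sketch in plain Java. This is not the Guardian's code: the class name `FallbackPageService`, the `dynamicRenderer` function, and the in-memory map (a stand-in for a real cache such as ehcache) are all assumptions made for illustration. The point is only the shape of the technique: render dynamically when the database is available, remember the last good result, and serve that copy when rendering fails.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Serves a page dynamically and falls back to the last good copy if rendering fails. */
final class FallbackPageService {
    // Last successfully rendered HTML per path (hypothetical stand-in for a real cache).
    private final Map<String, String> lastGood = new ConcurrentHashMap<>();
    // Hypothetical renderer that reads from the database and throws when it is unavailable.
    private final Function<String, String> dynamicRenderer;

    FallbackPageService(Function<String, String> dynamicRenderer) {
        this.dynamicRenderer = dynamicRenderer;
    }

    /** Returns fresh HTML when possible, otherwise the last good copy, otherwise empty. */
    Optional<String> serve(String path) {
        try {
            String html = dynamicRenderer.apply(path);
            lastGood.put(path, html); // keep the fallback copy up to date
            return Optional.of(html);
        } catch (RuntimeException databaseUnavailable) {
            // Degrade gracefully: no dynamic update, but nothing already served is lost.
            return Optional.ofNullable(lastGood.get(path));
        }
    }
}
```

In this sketch a page that was never rendered before the outage cannot be served at all, which is the same limitation noted for a cache that only holds popular items.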

So I guess we should talk about that at some point.

Matthew Wall: I mean it's basically the standard Hibernate second level cache that we have installed in the Hibernate layer. However the problem with that is, you are only going to get popular items actually in ehcache, so you can't really guarantee what is in there. So we've now sort of augmented that with another approach: we have a feed running in the background which, rather like the legacy system, actually presses the HTML of the pages to disk at a point in time and does its best to keep them up to date. So what we actually do in the event of a database failover or actually a piece of scheduled maintenance is fall back to those pressed copies of the pages.

Which means you cannot do dynamic updates at that time and maybe some of the interactive features or --?

Matthew Wall: It is actually possible to -- it depends what state we have gone into. But, yes, if the database is gone then we can't do dynamic updates. But we cannot lose what we have got.

Right, which is probably good enough for people to not really notice that something bad had happened.

Matthew Wall: Exactly, I mean it is all part of sort of graceful degradation and we have actually found that that as an approach works very well. Ideally when we move to a newer network layout, which will happen maybe next year, we will be able to look at sort of high availability Oracle and maybe we'll reduce our dependency on this.

So at this point in time you don't do basically a two-phase commit; if you change something over both databases you commit in one and then you basically have an asynchronous replication to the other one.

Matthew Wall: Absolutely. I believe underlying it all is Oracle Flex I believe and others. I am not that familiar with Oracle.

You mentioned before DDD, Domain Driven Design, so can whoever wants to say something about, let's say, the design philosophy or whatever you call it.
I mean it is not the process, it's a way of looking at the system.

Matthew Wall: One of the things that we really wanted to have with our new version of our website was a shared philosophy across both the development team and the editorial team.

A common language.

Matthew Wall: A common language, exactly. I think the web site is becoming much more important to the newspaper and will continue to do so over the next few years and I think it is very, very important that the editors don't regard the technical work that happens on the website as a black art, as a mystery -- I think they should be involved in it. And so when we first looked to uploading R2, as well as choosing what technology stack you are going to build on, which I think is what a lot of people focus on, I think another massive contributor to success is choosing how you are going to build it, with what social processes you are going to build it. Because I am a big believer that particularly on a large project that involves lots of different disparate people - you have got Java developers, you have got front end CSS and JavaScript developers, you've got editors who write content, you've got commercial people who are trying to sell ads, and in the middle of all those you have got this big piece of software. And if you don't figure out a way of organizing those people socially around that software your project won't be successful. So we decided right from the very beginning that we wanted to have a shared understanding of our domain across all these people. So we all scurried away and read Eric Evans's marvelous book and then came back and decided that Domain Driven Design seemed the way to go. One of the beautiful things I think of the team structure that we have got at the moment is that we have a very strong domain model which defines how content relates to each other, how content relates to tags, and all of this has been designed by the editors.

And how do you actually represent that domain model, because I thought that's always one of the, I wouldn't say weak points of DDD, but it's one that's rather unspecified, because having it only in the code is not necessarily an approach for the editors. So is there any way -- how do you make that tangible?

Matthew Wall: This is a very good question. I mean it was actually quite easy at the start of the project because again our focus is on simplicity and what is the simplest way of doing it. And it is basically to get the editors and the techies into a room together with a very, very large selection of sheets of paper and file cards, and Blu Tack and felt tip pens, and get the editors to draw along with the techies what they think they want and what is going on, and it's nice and malleable. You can cross things out, and rip things up and draw things again. And that process sustained us for the first few months really of the development of the project while the core of the domain was crystallized.

And it also had probably fewer people involved.

Matthew Wall: Yes exactly. You know the domain model is quite sophisticated, and there is a very, very large number of people on the development team. And it does become more difficult. I used to try and keep these paper copies of that domain model up to date but it's like -- I don't know if you know the English phrase painting the Forth Bridge -- but it is just a task that never ends. And by the time you have got to the end of it you have to start again. So we've now started looking at -- we can extract the domain model from the code with tools like MagicDraw and stuff like that and at least get a visual representation of it so that people can draw on it --

So the master is still the code but you extract basically a graphical rendering so non programmers can relate to it.

Matthew Wall: Yes exactly.

Because I was trying to get at whether, and if so how, you use any kind of modeling, but I guess you probably don't because --

Matthew Wall: I mean the whole system was produced socially by sitting down with pieces of paper and drawing a diagram and then feeding that into the development process. But often once it is crystallized the model kind of evaporates.

Erik Doernenburg: And to add to the idea of modeling, I mean the model isn't necessarily a graphical image.

No, sure.

Erik Doernenburg: The model is probably also a mental model that people have in their heads. And one of the development approaches that was used on the Guardian project was Agile development, and we have really small story cards that specify individual features that the system will exhibit later on. And those story cards, of course, use the same language. That is one of the big features of Domain Driven Design, this ubiquitous language. So that means in the discussions between the technical people and the business analysts we used the same language and by that we are fairly sure that we have the same mental model of the domain in our heads when something is called an article or a tag etc. That the same model is in this, so there is definitely a lot of modeling. The modeling though happens probably in the conversation between the different people. And it is solidified mostly in the story cards. And then the graphic representation later on is just kind of a feedback loop. This shows you what you have done. And I think one thing that helped us scale as well was the realization -- if you speak to Evans he's also coming to the same conclusion -- that sometimes you just break up the domain model into multiple distinct pieces. There are a couple of advantages to doing so - you get better transactionality, you get better caching, and so on.
So the idea of having one big massive domain model that has lots of relationships, so in the end it would be like one directed graph, is not really proving very practical, which is --

Of course, I mean that's the whole story that the Business Object Model has died.

Erik Doernenburg: Exactly.

And you tried to make data structures more or less owned by the, let's say, service that publishes them, all that stuff. So, yeah, that's also important to scale because, I mean, if you have to agree on a common structure in a 500 developer enterprise you are probably lost.

Erik Doernenburg: Yeah, I mean one thing that really helped us with that was the idea of focusing really on the aggregates and aggregate roots and making sure that nobody actually has any permanent references to anything but aggregate roots, which allowed us to make this more modular, and made it feasible for people to have a better grasp of the domain model without knowing all the detail of each of the aggregates.

See, if I had done that I would probably have defined a domain specific language -- for example a textual DSL that would know about those concepts, so you could actually write programs using those common language terms as first class citizens, but I don't want to get there now... That's why I was asking. So since you already have the mike: you have another talk here at OOP. It actually was a keynote about simplicity. And I also heard from what you talked about that simplicity was a key driver in building this architecture. So why don't you spend a couple of minutes and elaborate a little bit about what is simple, what is too simple, what is simplistic, how simple should you become?

Erik Doernenburg: One thing that we knew from the start in collaborating on this with the Guardian, on setting up or creating that initial web site, was that this would be, even more so than the old web site probably, technology that would live for ten years.
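The aggregate-root rule mentioned above (outside code keeps references only to aggregate roots) can be sketched roughly as follows. This is an illustrative example, not the Guardian's domain model: the `Article` and `Tag` classes, their fields, and the idea of referring to another aggregate by its id are assumptions chosen to show the pattern in a few lines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Aggregate root: the only object of this aggregate that outside code may hold on to. */
final class Article {
    private final String id;
    // Internal detail of the aggregate; never handed out as a live reference.
    private final List<String> pictureCaptions = new ArrayList<>();
    // Other aggregates are referenced by id only, so each can be loaded and cached separately.
    private final Set<String> tagIds = new LinkedHashSet<>();

    Article(String id) { this.id = id; }

    String id() { return id; }

    // All changes to the internals go through the root.
    void addPicture(String caption) { pictureCaptions.add(caption); }

    void tagWith(Tag tag) { tagIds.add(tag.id()); }

    // Callers receive read-only snapshots rather than the internal collections.
    List<String> pictureCaptions() {
        return Collections.unmodifiableList(new ArrayList<>(pictureCaptions));
    }

    Set<String> tagIds() {
        return Collections.unmodifiableSet(new LinkedHashSet<>(tagIds));
    }
}

/** A second aggregate root; articles refer to it only through its id. */
final class Tag {
    private final String id;
    private final String name;

    Tag(String id, String name) { this.id = id; this.name = name; }

    String id() { return id; }

    String name() { return name; }
}
```

Because an article holds tag ids rather than tag objects, someone working on articles does not need to know the internals of tags, which is the modularity benefit described in the conversation.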
And we knew already that if we started with something that was complex or very complicated it would not scale in the sense of being extensible for the next ten years. So with every piece and every new jar file we introduced, every new concept we introduced, we asked ourselves: do we really need this? Very often as architects Matt and I did sometimes experiment with new ideas. But then rather than just implementing them we would have two approaches, the existing one and another approach, and would then go to developers and sit down with them and discuss which one they found more intuitive and easier to use, which one they found simpler, and then always go with that. So we were really trying to actively avoid being clever. And that comes back to what you asked about before; it actually works quite well in English. I mean there's a difference between something that is simple, which I think is something you strive for. It is the solution that does make sense and it expresses the intent very clearly; the intent of what you want to do with that code. That is very, very different from being simplistic, and simplistic of course is something that is simple but just isn't fit for purpose. So there are some solutions in there and we can't really explain -- it's quite hard to explain without a diagram, but just to give you an idea, we did have to do something with caching, and there a solution that is on the surface probably quite complicated ends up being very simple. The idea here was that when the editorial tools need to update something in the database we need to figure out which caches we need to flush in the front ends, in the ehcache. And, of course, the simple solution on the surface, the simplest solution, is to just have the editorial tools, whenever they update an object, actually send messages and make those updates.
That didn't work all too well because people had to think about two things -- well, what they wanted to update in the database and how to send those caching messages. So we used some of the, like I said early on, some of the Spring 2 features with request scoped beans to inject special objects into those tools, so that they could offload the responsibility for decaching into specific classes. On the surface, of course, you have to understand how that exactly works in Spring 2, you have to understand what a request scoped bean is. There are a couple of really, I wouldn't want to say nasty, but non-obvious pieces of XML in some Spring context files, but again the solution now is simple in the sense that it expresses the intent very well.

And it is also simple for the person who has to use it over and over again, and you do the complex stuff once and in some kind of frameworky thing.

Erik Doernenburg: Yes. Sometimes complex is not necessarily bad. I mean if you have a complex problem it is okay to have complex code; trying to make that complex code simple will often end up being very simplistic.

That's the distinction between essential complexity and accidental complexity. You want to avoid the latter.

Erik Doernenburg: Exactly. I think, yeah, in a nutshell what we were trying to do is we were trying to drive out all accidental complexity from this. And despite what some people say you can do this in Java. There is still a little bit of accidental complexity; you have to deal to a tiny extent with some of the remnants of J2EE like web.xml, but we really tried even to push that out.

Yeah.

Erik Doernenburg: We only had -- I think the web.xml file gets touched, I don't know, once every five months or so. So there is hardly anything in there that is worth noting. So we've actually moved everything into different methodologies and made that use very simple.

Matthew Wall: I mean the whole focus on simplicity has been a key driver in the project and continues to be.
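The decaching approach described above, where editorial code offloads cache flushing to an injected per-request object, might look roughly like this. It is a plain-Java sketch under stated assumptions and does not use Spring: `DecacheCollector`, `ArticleEditor`, the key format, and the `flushMessageSender` callback are all hypothetical names. In a Spring application the collector would be the request-scoped bean, and something like a filter or interceptor would call the flush method at the end of the request.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Collects cache keys touched during one request and flushes them once at the end. */
final class DecacheCollector {
    private final Set<String> touchedKeys = new LinkedHashSet<>();
    // Hypothetical channel that tells the front ends which cache entries to drop.
    private final Consumer<String> flushMessageSender;

    DecacheCollector(Consumer<String> flushMessageSender) {
        this.flushMessageSender = flushMessageSender;
    }

    /** Records that a cached entry is now stale; duplicates within a request are ignored. */
    void touched(String cacheKey) { touchedKeys.add(cacheKey); }

    /** Sends one flush message per distinct key, then resets for the next request. */
    void flushAtEndOfRequest() {
        for (String key : touchedKeys) {
            flushMessageSender.accept(key);
        }
        touchedKeys.clear();
    }
}

/** Editorial code only updates content; it never builds decache messages itself. */
final class ArticleEditor {
    // Assumed to be injected once per request (the role a request-scoped bean would play).
    private final DecacheCollector decache;

    ArticleEditor(DecacheCollector decache) { this.decache = decache; }

    void updateHeadline(String articleId, String headline) {
        // Persisting the new headline is omitted in this sketch.
        decache.touched("article:" + articleId);
    }
}
```

The design choice this illustrates is that the person writing `updateHeadline` thinks about one thing, the update, while the messaging detail lives in a single class.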
It is very, very important to us, but we don't always get it right and it is a really, really difficult balancing act. You shouldn't be afraid, as Erik was suggesting earlier, to experiment and try two or three different approaches and then collapse back on the one that seems to work the best. But there are examples of things where we have avoided complexity that actually in retrospect we probably should have embraced a lot earlier. So the site has to be very, very performant because we are now serving every page dynamically, we have to really, really have a very, very fast site and it is actually quite complex to compute related content.

So how many hits, or whatever the metric is, do you have?

Matthew Wall: We say it is about 170 million hits a month from 17 million unique users.

Do you know how many per second that is, just because I'm curious?

Matthew Wall: I think the peak sort of thing that we've had on a very, very, very heavy news day, like a 9/11 or Tube bombing type of event, we do something like 70 pages per second per server in the production environment.

How many servers do you have?

Matthew Wall: We have eight, so it can be quite quick. Although the average load is a lot lower than that.

Sure, of course.

Matthew Wall: The traffic is very spiky. An example of where we avoided some complexity that maybe we should have embraced was in calculating related content. I've got this article, it's about these things, it's been given these tags. What other content can the system tell me about that is related to this, that I might be interested in? And the simple solution we chose was to actually do that in the database. I mean we had to do some interesting stuff; we used Oracle materialized views to get those queries to perform. And it does work but in our performance testing it always causes a problem; the more complex solution would be to do that in our search engine.
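To make the related-content question above concrete, here is a small in-memory sketch that ranks other articles by how many tags they share with a given article. It is purely illustrative: it is neither the Oracle query nor the search-engine approach discussed in the interview, and the class `RelatedContentIndex`, its methods, and the ranking rule (shared tag count, ties broken by id) are assumptions. It also assumes each article is indexed only once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory index from tag to articles, used to rank articles by shared tags. */
final class RelatedContentIndex {
    private final Map<String, Set<String>> articlesByTag = new HashMap<>();
    private final Map<String, Set<String>> tagsByArticle = new HashMap<>();

    /** Registers an article and its tags (this sketch assumes one call per article). */
    void index(String articleId, Set<String> tags) {
        tagsByArticle.put(articleId, new HashSet<>(tags));
        for (String tag : tags) {
            articlesByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(articleId);
        }
    }

    /** Returns up to limit other article ids, most shared tags first, ties broken by id. */
    List<String> relatedTo(String articleId, int limit) {
        Map<String, Integer> sharedTagCount = new HashMap<>();
        for (String tag : tagsByArticle.getOrDefault(articleId, Collections.emptySet())) {
            for (String other : articlesByTag.get(tag)) {
                if (!other.equals(articleId)) {
                    sharedTagCount.merge(other, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(sharedTagCount.keySet());
        ranked.sort((a, b) -> {
            int byCount = sharedTagCount.get(b) - sharedTagCount.get(a);
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

Even in this toy form the cost grows with the number of articles per tag, which hints at why such a computation is expensive to run on every dynamically served page.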
We have an enterprise class search engine and now we are starting to realize that actually that probably would have been the better choice; the cost of performance tuning these queries each release is actually probably much greater than the cost of actually biting the bullet and writing the more complex code to get it off into the search engine. You can't do everything the simplest possible way.

So let's briefly look at maybe the development process. I guess you did Test Driven Development because ThoughtWorks was involved.

Matthew Wall: Yes absolutely. It was quite a big shock for us when ThoughtWorks came into the building I must say, it was like a whirlwind going through the place -- everyone looked stunned for a couple of weeks. If any of you have got a development team out there that you want to convert to Agile it's well worth going, I must say, because it's a bit of a shock but they do sort you out. Now we are at the point where really none of the developers on our team would even consider writing a piece of code without writing a test for it; it's just second nature to everybody and everybody can see the value of that. The amount of change we can do is phenomenal.

It is a safety net effect.

Matthew Wall: It is incredible.

So apart from Test Driven Design or Development or whatever the second D means today, what other Agile practices or processes? Do you use any of the big name processes? Do you do XP? Or what is it that you're actually doing?

Matthew Wall: We don't do XP or Scrum or anything like that, although whenever I read a book on XP or Scrum or talk to anybody that's doing it, it sounds like we are using very similar processes. I kind of worry about giving the process a name because my key -- if I wanted to sum up Agile in a couple of sentences -- it would be "what do you want us to do for you by Friday." It is short iterations, just get it done, and it is also "let's make sure that we have got a process that we can change next week."
And I think when you start to name your processes and say we are doing Scrum or XP, people can read a book about what Scrum or XP is or how it should be done. And in actual fact your processes should evolve and change over time. I mean one of the key things that we do is we have short iterations, release to production every two weeks, Test Driven Development, and all of our development is done by developers in pairs. That again makes a huge difference with regards to learning and also the quality of code that is produced.

I interviewed Jens Coldewey yesterday about the introduction of Agile development practices into teams, 10 years of experience, and he said the same thing that you said, that the big name processes are not the important thing, and he said the most important thing about introducing an Agile process is doing retrospectives and evolving the process over time.

Matthew Wall: Yeah I couldn't agree more. We now have four Agile teams on the project, which is an interesting thing in its own right.

You've split down the whole 90 people into several sub-teams so you can better handle the Agile approach.

Matthew Wall: Yes.

I guess that's another best practice.

Matthew Wall: As a single team it would have just been a nightmare. But each team has a retrospective at the end of each iteration, with recommendations going into the next iteration.

One thing I'm interested in with these highly scalable web applications is that performance and scalability are obviously important. And you have to do performance and scalability testing and that also means you have to measure that somehow. So do you do metrics on performance? Do you do profiling? How do you profile a web application? Can you tell us a little bit about the tooling and the approach involved there?

Matthew Wall: Yeah, absolutely; performance is really key to us, particularly moving from a rather static architecture like the previous sites were, to a dynamic one.
We have an application server, some Java involved in every page. But the first key to Performance Driven Development is to keep it simple, as we've kind of said, but you still need to validate that you have got that. So one of the first things that you need to understand is what level of performance you need to achieve, because, obviously, meeting ever-increasing performance goals is expensive. To get the final 10% is often quite an expensive process. You need to make sure that the performance goals that you've set are realistic. So the first thing that we have to do is understand how our existing website performs and scales and what the traffic expectations are. So we have a number of monitoring tools on the web sites, things like Gomez, things like Hitbox, which are telling us the amount of traffic that's coming in to the site and the amount that was coming to the site on peak days. We understand our traffic pattern very well. And Monday lunchtime is often the biggest period on the site -- I suppose everyone is coming back to work and not really wanting to do very much and reading the news on the Internet, with Friday afternoons also being quite busy; I think everyone is quite tired by then and also doesn't really want to do any work. Yeah, looking for the weather on the weekend. Matthew Wall: Exactly, but we also have something that maybe many web sites don't have: we have very unpredictable kinds of performance explosions. If you get a horrible news event or a very major news event like 9/11 then -- well, during both 9/11 and the London bombings our web site actually stayed up. I mean it was certainly a lot slower than it normally is, but it stayed up. So we have to prepare for both of these eventualities. So how we do it is: before every release we put the release candidate that's come out of our build pipeline through a very, very rigorous performance test suite.
Initially we started thinking very simply about this and we actually used simple open source tools. So we have monitoring tools on all of our application servers. We have Zabbix that can tell us how much CPU load we are doing, how much IO we are doing, things like that. We are using the standard jstack and jstat tools in the Java distribution to tell us what is going on in the JVM. Because we have Oracle underneath everything we have Oracle Enterprise Manager, which can tell us what's going on in Oracle. And then we have a suite of tests, and essentially we have two types of tests. One type of test is for a particular type of page or a particular type of request: give me a feeling for the performance profile. This type of test is more to weed out developers' mistakes, simple mistakes. If a developer has written a silly Oracle query or a piece of Java that's not very well written, this will spot that very quickly. But that doesn't tell you whether the system as a whole is going to be able to sustain the load of production. So then what we have to do is run a realistic performance test of production load under various levels of load, your average load, your Monday lunchtime load, and then your 9/11 load, and see how the system performs. It's very easy, with our website and our caching strategy and the stateless nature of the web site, to write a performance test that the application passes without it really being indicative of what's going on in production. So what we do is each time we run a performance test we extract the previous week's logs from production and we convert that into requests against our current website. And that's okay for most things because quite often we will be migrating functionality from the old site into the new site. So we know how many hits these pages got in the old site and we can translate that to hits in the new site. But we are also introducing new functionality that the old site didn't have.
So there we have to really make an educated guess as to how often we think that functionality will get hit in production, and add that into the performance test. For the performance test, initially we started basically with a suite of JMeter tests, and that was okay when the performance test data was quite small. A performance test is essentially a list of URLs that we need to hit, that's basically all it is. But now we want to get some more complex metrics, so we actually have got a simple homegrown Python suite that can just extract these URLs and run a performance test. The interesting thing is everyone focuses on pages per second for a website; how many pages per second does it do? Yeah, I was asking that too. Matthew Wall: Depending on what mode it is in, it will do about 70 or 80 per application server, up to around 400 per application server if it is in emergency mode. That's not the end of the story; you also need to know, when it's running at that load profile, how does it use your database, how much more load could it take, what's the application doing. Things like JProfiler are fantastic tools for seeing inside that sort of stuff. The test suite should also have other metrics. So what is your ninety-ninth percentile response time? It doesn't matter how many pages a second you do if it takes five seconds to generate a page. We say that a page must come out of the application server in under 0.5 of a second, and it is important to check these things. You mentioned the build pipeline. I guess that's another cornerstone of Agile development. Do you want to maybe say something about that? I guess also ThoughtWorks is who created CruiseControl, so you have some background there. Erik DoernenBurg: I think you were right in saying that that is actually a cornerstone, especially with a development team of that size.
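Editor's sketch: the log-replay idea Matthew describes above (turn last week's production requests into a URL list, add a guess for new functionality, then look at pages per second and the 99th percentile) could look roughly like this in Python. It is a minimal illustration under stated assumptions, not the Guardian's homegrown suite: the access-log shape, the function names `urls_from_log`, `add_estimated_traffic` and `replay`, and the pluggable `fetch` callable are all hypothetical. Only the overall approach and the 0.5 second budget come from the interview.

```python
import math
import re
import time
from typing import Callable, Dict, Iterable, List

# Assumption: lines look like a common access log, e.g.
#   1.2.3.4 - - [date] "GET /path HTTP/1.1" 200 1234
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def urls_from_log(lines: Iterable[str]) -> List[str]:
    """Extract the requested path from every log line that contains a GET/HEAD."""
    urls = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            urls.append(match.group(1))
    return urls


def add_estimated_traffic(urls: List[str], estimates: Dict[str, int]) -> List[str]:
    """Append URLs for new functionality, repeated by an educated-guess hit count."""
    extra = [url for url, hits in estimates.items() for _ in range(hits)]
    return urls + extra


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def replay(urls: List[str], fetch: Callable[[str], object],
           budget_seconds: float = 0.5) -> dict:
    """Request every URL in order via `fetch` and summarise the timings."""
    timings = []
    started = time.perf_counter()
    for url in urls:
        t0 = time.perf_counter()
        fetch(url)  # hypothetical: e.g. an HTTP GET against a test server
        timings.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started
    p99 = percentile(timings, 99)
    return {
        "requests": len(urls),
        "pages_per_second": len(urls) / elapsed if elapsed > 0 else float("inf"),
        "p99_seconds": p99,
        "within_budget": p99 <= budget_seconds,
    }
```

The point of reporting `p99_seconds` next to `pages_per_second` is the one made in the interview: a high throughput figure can hide a slow tail, so the sketch checks the tail against the budget rather than the average.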
I mean, two of the key ideas that Matt alluded to are pair programming and collective code ownership, which really, really help not only to educate other people on the development team but also to get very, very good consistency on the macro level. So you don't have people working in silos and coming up with different solutions to similar problems. But at the same time that means you have one code base, you have so many people working on exactly the same code base, and everybody has the liberty, ultimately, to change every piece of code. That means, of course, if you are implementing multiple stories at the same time and you're checking in quite often, it can happen every now and then that people start making changes that are incompatible. Many of them you can actually weed out by running the test suite on your local developer workstation before you commit: obvious merge problems and so on. But there are more subtle issues that would only surface once that code is actually merged and run in a more server-like environment. And that is exactly where continuous integration comes into play, and it is something that ThoughtWorks certainly has done for about ten years now, and, again in collaboration with another large client of ThoughtWorks, we created something called CruiseControl which is -- I don't know, we call it the granddaddy of Continuous Integration servers. There's a lot of movement in that space at the moment, there are a lot of commercial entrants to that space, which is interesting. So we are using a Continuous Integration server, and in this case we actually do use CruiseControl. One thing that we have learned, though, from the past, was that especially with a large team like that you do need to have very, very quick feedback. So like Matt said, if you have so many people working on the team you will get commits almost every ten minutes.
The problem is, if your test suite runs every thirty minutes or takes thirty minutes to run, that means people are wasting a lot of time waiting to see whether the build is okay. One of the mantras of course is, if the build is broken you must not check in, but somebody needs to fix the build first. So that thing is a big issue on large Agile projects. What we've done in the past, and again did on the Guardian project, was to split that out into what we now call a build pipeline. So rather than having one single Continuous Integration server you have multiple of them. So the first Continuous Integration server only runs the unit tests, which are generally very fast. They should be in the order of five minutes or less. Because 99% of the problems you actually do find in the unit tests. And only if those pass does the build artifact get copied somewhere else, and then the secondary CruiseControl server picks that up and runs the automated regression test suite on that, which could be something like Selenium to test some of the more fancy AJAX and JavaScript bits, or HtmlUnit to test whether the pages have actually come out, as more of an integration test. And those tests can take longer, but it's okay because they don't often find problems. The build pipeline in the case of the Guardian project, of course, continues with the performance tests. There are some performance tests that need to be run all the time. There is of course the old adage about premature optimization, but one thing that is actually quite helpful is if you come in the next morning and there's an automated page that tells you performance in that area has degraded, because then you know to ask: what did I change yesterday? What could have affected that? The Eclipse guys also do this and they have nice graphical diagrams that show which area. So I think it is not necessarily important to get a system to a certain performance. It's probably more useful to make sure you don't degrade it over time.
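Editor's sketch: the staged pipeline Erik describes above (fast unit tests first, then the slower regression suite, then performance tests, each stage only running if the previous one passed) can be illustrated with a few lines of Python. This is a toy model under stated assumptions, not how CruiseControl is configured: the `Artifact` class, the `run_pipeline` function and the stage names are hypothetical, and each stage is just a callable returning pass or fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Artifact:
    """A build output that is created once and only annotated afterwards."""
    name: str
    stamps: List[str] = field(default_factory=list)


# A stage is a (name, check) pair; check returns True when the stage passes.
Stage = Tuple[str, Callable[[Artifact], bool]]


def run_pipeline(artifact: Artifact, stages: List[Stage]) -> bool:
    """Run stages in order, stopping at the first failure.

    Cheap stages go first so that most problems are reported within minutes;
    slower stages only ever see artifacts that already passed the earlier ones.
    """
    for stage_name, check in stages:
        if not check(artifact):
            return False
        artifact.stamps.append(stage_name)  # record what the artifact has passed
    return True
```

A call such as `run_pipeline(Artifact("site.war"), [("unit", run_unit_tests), ("regression", run_regression), ("performance", run_perf)])` would use three hypothetical check functions; the ordering, not the functions, is what the sketch is meant to show.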
Erik DoernenBurg: Exactly, it's like Matt said, some of the mistakes -- especially with all the different paradigms that we're mixing in a project like this, it is easy to make mistakes, and to just give you one example of what occurred on that project: of course, we're doing Java, we're doing Domain Driven Design, we have a domain model in Java and people are doing Object Oriented Programming -- so what happened was there was one piece of functionality required for the administrative application, to say which tags are unused. It was a garbage collector for tags. Erik DoernenBurg: Yes, kind of. Well, just to display them, and, of course, if you think about it from the Domain Driven Design and object-oriented model point of view, what you do is you write a loop, you iterate over all the tags and for each tag you say, "tag.count = 0, do you have any articles?" Of course, that translates into object-relational mapping and effectively pulling the entire database into memory. Big inhale. Erik DoernenBurg: And these things you can spot very, very easily with those tests. That's an obvious example but there are many similar examples. So that was, in that case, what we call the build pipeline. It is almost like stamps. So you create the first part of the pipeline that creates the deliverable, a .war file in this case. There were no EJBs so we just have a .war file. That is created and is never changed later on. So that .war file first gets the stamp of approval saying unit tested. Then it gets moved into the second stage of the pipeline, which runs the integration tests, and gets the next stamp. Then it gets to the performance -- But in some way it's more a testing pipeline than actually a build pipeline, because building happens only in the first stage; then you do different levels of heaviness on the various tests during the different intervals. Erik DoernenBurg: Exactly. All the building is done initially and it's actually a part of the whole idea that nothing gets changed later.
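Editor's sketch: the unused-tags example from this exchange can be made concrete. The Python below is an illustration under stated assumptions, not the Guardian's code: SQLite stands in for the real database, the `tag` and `article_tag` tables and both function names are hypothetical, and the loop version only issues one count query per tag, which is a milder form of the problem than an object-relational mapper loading every tag's articles into memory. Each function returns the unused tag names together with the number of queries it issued.

```python
import sqlite3
from typing import List, Tuple


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory schema: three tags, one of them unused."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE article_tag (
            article_id INTEGER NOT NULL,
            tag_id INTEGER NOT NULL REFERENCES tag(id)
        );
        INSERT INTO tag (id, name) VALUES (1, 'politics'), (2, 'sport'), (3, 'obsolete');
        INSERT INTO article_tag (article_id, tag_id) VALUES (10, 1), (11, 1), (12, 2);
    """)
    return conn


def unused_tags_loop(conn: sqlite3.Connection) -> Tuple[List[str], int]:
    """Object-style approach: iterate over all tags and ask each one for its count."""
    queries = 1
    tags = conn.execute("SELECT id, name FROM tag ORDER BY name").fetchall()
    unused = []
    for tag_id, name in tags:
        queries += 1  # one extra round trip per tag
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM article_tag WHERE tag_id = ?", (tag_id,)
        ).fetchone()
        if count == 0:
            unused.append(name)
    return unused, queries


def unused_tags_single_query(conn: sqlite3.Connection) -> Tuple[List[str], int]:
    """Set-based approach: let the database find tags with no articles."""
    rows = conn.execute("""
        SELECT t.name FROM tag t
        WHERE NOT EXISTS (SELECT 1 FROM article_tag a WHERE a.tag_id = t.id)
        ORDER BY t.name
    """).fetchall()
    return [name for (name,) in rows], 1
```

Both functions give the same answer on this data, but the loop's query count grows with the number of tags while the set-based version stays at one, which is the kind of difference an automated performance test in the pipeline can surface early.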
Everything that is configurable is loaded using the property post-processor in Spring from plain text files. Okay, I guess we have covered the main points. Is there anything else either of you wants to mention? Matthew Wall: I think one of the key things as well -- we have described a few of the processes that we are using at the Guardian at the moment. Not all of these processes were processes that we were using on day one of the project. A lot of these have evolved as we have made mistakes, as we have learnt from them over the course of the project, and I think if you were to interview us in another year or another two years you would find that we would still be using most of what we have got now, but I'm sure there would be loads more that we could tell you. Hopefully something's changed. Matthew Wall: And this is like a key thing, I think, for anyone working on a large software project. Don't think, once you've learned a process, that it's static; be prepared to change it, to evolve it internally, and to evolve it with your team's experience and knowledge. Okay then, thank you very much for the interview, and I guess have fun with the talk you're going to do about the same topic in a couple of minutes. Matthew Wall: Thanks very much, thanks for having us. Erik DoernenBurg: Yeah, thanks and bye.