Tag: fault tolerance
This is the second part of the discussion on fault tolerance with Bob Hanmer (if you didn’t listen to Episode 77, which contains part one, please go back and listen now; this episode builds on that previous one!)
We start by discussing a set of error detection patterns. Among are the well-known approaches such as checksums and voting. We then look at error recovery patterns, including restart, rollback or roll forward. The next section looks
at error mitigation patterns, which include shedding load and doing fresh work before stale. The last patterns section then looks at fault treatment patterns.
We conclude the episode with a small discussion about how to design systems using (these and other) patterns, and with some thoughts on why actually wrote the book.
In this Episode we discuss fault tolerance based on the new book by Bob Hanmer. This is the actually the first part of the discussion, the remainder will be published in the next episode of SE Radio.
We start by discussing some of the context for fault tolerant systems and the imperfect world assumption. We then discuss a number of terms we will need when discussing the fault tolerance patterns. We then discuss the fault tolerance mindset and connect fault tolerance to a number of related subject areas, such as software quality. We then discuss the shared context for the patterns that follow, among them the important observation that fault tolerance does not come for free!
Finally we provide an overview over the different sections covered in the book and start the detailed discussion of the patterns by looking at the Architectural Patterns section.
The next episode will discuss the remaining patterns in the book.