Tag: fault tolerance

SE-Radio Episode 246: John Wilkes on Borg and Kubernetes

Filed in Episodes by on January 7, 2016 3 Comments
SE-Radio Episode 246: John Wilkes on Borg and Kubernetes

John Wilkes from Google talks with Charles Anderson about managing large clusters of machines. The discussion starts with Borg, Google’s internal cluster management program. John discusses what Borg does and what it provides to programmers and system administrators. He also describes Kubernetes, an open-source cluster management system recently developed by Google using lessons learned from […]

Continue Reading »

Episode 222: Nathan Marz on Real-Time Processing with Apache Storm

Filed in Episodes by on March 6, 2015 2 Comments
Episode 222: Nathan Marz on Real-Time Processing with Apache Storm

Nathan Marz is the creator of Apache Storm, a real-time streaming application. Storm does for stream processing what Hadoop does for batch processing. The project began when Nathan was working on aggregating Twitter data using a queue-and-worker system he had designed. Many companies use Storm, including Spotify, Yelp, WebMD, and many others. Jeff and Nathan […]

Continue Reading »

Episode 203: Leslie Lamport on Distributed Systems

Filed in Episodes by on April 29, 2014 2 Comments
Episode 203: Leslie Lamport on Distributed Systems

Leslie Lamport won a Turing Award in 2013 for his work in distributed and concurrent systems. He also designed the document preparation tool LaTex. Leslie is employed by Microsoft Research, and has recently been working with TLA+, a language that is useful for specifying concurrent systems from a high level. The interview begins with a […]

Continue Reading »

Episode 134: Release It with Michael Nygard

Filed in Episodes by on May 6, 2009 5 Comments
Episode 134: Release It with Michael Nygard

This episode is a discussion with Michael Nygard about his book “Release It” which covers aspects of software architecture you often don’t think of initially when starting to build a system. Some of the points we discussed were capacity planning, recovery as well as making the system suitable for operation in a data center.

Continue Reading »

Episode 78: Fault Tolerance with Bob Hanmer Pt. 2

Filed in Episodes by on November 23, 2007 3 Comments
Episode 78: Fault Tolerance with Bob Hanmer Pt. 2

This is the second part of the discussion on fault tolerance with Bob Hanmer (if you didn’t listen to Episode 77, which contains part one, please go back and listen now; this episode builds on that previous one!)

We start by discussing a set of error detection patterns. Among are the well-known approaches such as checksums and voting. We then look at error recovery patterns, including restart, rollback or roll forward. The next section looks
at error mitigation patterns, which include shedding load and doing fresh work before stale. The last patterns section then looks at fault treatment patterns.

We conclude the episode with a small discussion about how to design systems using (these and other) patterns, and with some thoughts on why actually wrote the book.

Continue Reading »