Tag: fault tolerance
John Wilkes from Google talks with Charles Anderson about managing large clusters of machines. The discussion starts with Borg, Google’s internal cluster management program. John discusses what Borg does and what it provides to programmers and system administrators. He also describes Kubernetes, an open-source cluster management system recently developed by Google using lessons learned from […]
Leslie Lamport won a Turing Award in 2013 for his work in distributed and concurrent systems. He also designed the document preparation tool LaTex. Leslie is employed by Microsoft Research, and has recently been working with TLA+, a language that is useful for specifying concurrent systems from a high level. The interview begins with a […]
This episode is a discussion with Michael Nygard about his book “Release It” which covers aspects of software architecture you often don’t think of initially when starting to build a system. Some of the points we discussed were capacity planning, recovery as well as making the system suitable for operation in a data center.