SE-Radio Episode 282: Donny Nadolny on Debugging Distributed Systems

Filed in Episodes on February 14, 2017

Donny Nadolny of PagerDuty joins Robert Blumen to tell the story of debugging an issue that PagerDuty encountered when they set up a ZooKeeper cluster that spanned two geographically separated datacenters in different regions. The debugging process took them through multiple levels of the stack: their application, the implementation of the ZooKeeper cluster, the Linux kernel, and the TCP stack. Donny explains how they identified problems at each layer, and how they finally gained a complete understanding of the issue as the interaction between multiple bugs, incorrect assumptions, and lesser-known behaviors of TCP. Robert and Donny spend the final part of the show reflecting on lessons learned from this bug, including the need to question what your tools tell you, the importance of persistence in debugging, and how to implement more useful monitoring.

Venue: Internet

