Riskgaming

Why engineers are using chaos to make computers more resilient

The CrowdStrike meltdown on July 19th shut down the world with one faulty patch, proving once again the interconnected fragility of global IT systems. On Tuesday this week, the company released its Root Cause Analysis as both an explanation and a mea culpa, but the wider question remains: with so much of our lives dependent on silicon and electrons, how can engineers design resilience into their code from the bottom up? And more importantly, how can we effectively test how resilient our systems actually are?

Kolton Andrus is one of the experts on this subject. For years at Amazon and Netflix, he worked on designing fault-tolerant systems, building upon the nascent ideas of the field of chaos engineering, an approach that iteratively and stochastically challenges systems to test for resilience. Now, as CTO and founder of Gremlin, he’s democratizing access to chaos engineering and reliability testing for everyone.

Kolton joins host Danny Crichton and Lux’s scientist-in-residence and complexity specialist Sam Arbesman. Together, we talk about why resilience must start at the beginning of product design, how resilience is aligning with security as a core value of developer culture, how computer engineering is maturing as a field, and finally, why we need more technological humility about the interconnections of our global compute infrastructure.

Transcript

This is a human-generated transcript; however, it has not been verified for accuracy.