What we learned from reading 100+ Kubernetes Post-Mortems

Plain Schwarz

Barki, Noaa

Formal Metadata

Title

Title of Series

Berlin Buzzwords 2022

Number of Parts

Author

Barki, Noaa

License

CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/67159 (DOI)

Publisher

Plain Schwarz

Release Date

2022

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

When building our Kubernetes-native product, we wanted to find the most common sources of failures, anti-patterns and root causes for Kubernetes outages, so we got to work. We rolled up our sleeves and read 100+ Kubernetes post-mortems. This is what we discovered. A smart person learns from their own mistakes, but a truly wise person learns from the mistakes of others. When launching our product, we wanted to learn as much as possible about typical pains in our ecosystem, and did so by reviewing many post-mortems (100+!) to discover the recurring patterns, anti-patterns, and root causes of typical outages in Kubernetes-based systems. In this talk we have aggregated for you the insights we gathered, and in particular will review the most obvious DON’Ts and some less obvious ones, that may help you prevent your next production outage by learning from others’ real world (horror) stories.