Coinbase MSK outage post-mortem
A post-mortem from Coinbase following a significant outage partially caused by MSK, AWS' managed version of Kafka.
Root cause: a thermal event (cooling system failure) inside a subset of racks within a single building in AWS us-east-1. We run a primary replica of our exchange infrastructure in a single zone, consistent with industry standards to reduce latency. To prepare for failures like this, we maintain a distributed standby, but during this incident, failures in the primary zone that were designed to be isolated were not [...]
Our primary managed Kafka partitions process many terabytes of data daily and are designed with resiliency guarantees for uninterrupted operation during a datacenter failure just like this. In this case, those guarantees failed and required manual recovery. [...]
There is a hint here that MSK failed to have multi-AZ resiliency despite multiple replicas configured at the application level. It will be interesting to see what the full root-cause analysis looks like....
Tags: kafka resiliency coinbase multi-az az aws us-east-1 post-mortems postmortems