Coinbase MSK outage post-mortem

  • Coinbase MSK outage post-mortem

    A post-mortem from Coinbase following a significant outage partially caused by Amazon MSK (Managed Streaming for Apache Kafka), AWS's managed Kafka service.

    Root cause: a thermal event (cooling system failure) inside a subset of racks within a single building in AWS us-east-1. We run a primary replica of our exchange infrastructure in a single zone, consistent with industry standards to reduce latency. To prepare for failures like this, we maintain a distributed standby, but during this incident, failures in the primary zone that were designed to be isolated were not [...]

    Our primary managed Kafka partitions process many terabytes of data daily and are designed with resiliency guarantees for uninterrupted operation during a datacenter failure just like this. In this case, those guarantees failed and required manual recovery. [...]

    This hints that MSK did not deliver multi-AZ resiliency, even though multiple replicas were configured at the application level. It will be interesting to see what the full root-cause analysis says.

    Tags: kafka resiliency coinbase multi-az az aws us-east-1 post-mortems postmortems
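
To make the multi-AZ point in the commentary above concrete, here is a minimal sketch of one way replica count alone can fail to give AZ tolerance: replica placement. It does not describe what happened at Coinbase, whose root cause is not yet published. The topic names, broker IDs, AZ labels, and the `az_failure_report` helper are all hypothetical; the only real Kafka concepts assumed are `min.insync.replicas`, producers using `acks=all`, and a broker-to-rack mapping of the kind `broker.rack` expresses.

```python
def surviving_replicas(replica_brokers, broker_az, failed_az):
    """Return the replicas of one partition that are not in the failed AZ."""
    return [b for b in replica_brokers if broker_az[b] != failed_az]


def az_failure_report(assignment, broker_az, min_insync_replicas):
    """For each AZ, list the partitions that could no longer accept
    acks=all writes if that AZ disappeared.

    Simplifying assumption: every replica was in sync before the failure.
    """
    report = {}
    for az in sorted(set(broker_az.values())):
        at_risk = []
        for partition, replicas in sorted(assignment.items()):
            alive = surviving_replicas(replicas, broker_az, az)
            if len(alive) < min_insync_replicas:
                at_risk.append(partition)
        report[az] = at_risk
    return report


# Hypothetical cluster: four brokers, two of them in the same AZ.
broker_az = {1: "az-a", 2: "az-b", 3: "az-c", 4: "az-a"}

# Hypothetical assignment: both partitions have three replicas,
# but orders-1 has two of them in az-a.
assignment = {
    "orders-0": [1, 2, 3],
    "orders-1": [1, 4, 2],
}

report = az_failure_report(assignment, broker_az, min_insync_replicas=2)
print(report)
```

Under these assumptions, both partitions have the same replication factor, yet only `orders-0` keeps two replicas after losing `az-a`; `orders-1` drops to one and falls below `min.insync.replicas`. Whether anything like this applies to the incident is unknown until the full analysis appears.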