More on the Coinbase 07-05-2026 outage
More on the Coinbase 07-05-2026 outage, caused by a "thermal event" in AWS us-east-1 and its impact on the suppposedly multi-AZ Managed Kafka product:
AWS's managed Kafka service failed silently. A significant portion of our event-streaming infrastructure runs on MSK, AWS's managed Kafka offering. The architectural promise of a managed Kafka service is that when individual brokers go down, the service automatically reelects partition leaders and continues to serve traffic out of the remaining brokers. The loss of an entire zone should result in reduced capacity, not unavailability.
That is not what happened and this extended the outage.
A defect in the AWS MSK control plane prevented automatic partition-leader reelection. Two of our MSK clusters became stuck in a "healing" state with producers unable to write. The cascading effect blocked our fee service, which blocked quoting, which is why most customers experienced this incident as broken trades and quotes rather than as a Kafka outage. Adjacent systems, including portions of our ledger pipeline, payments, and several data pipelines, were affected the same way. Additionally, one of our Kafka clusters was set up in a 2-AZ configuration that increased the blast radius and recovery time, but the MSK control plane defect impacted 2-AZ and 3-AZ Kafka clusters similarly.
We worked the recovery in real time with AWS engineering, ultimately performing manual partition reassignments at 3:00 AM ET to migrate topics off the impaired brokers. Priority-zero and priority-one topics were back to full availability by 9:30 AM ET. The remainder cleared by 2:00 PM ET.
In fairness, they also had a single-AZ point of failure in their architecture which they also describe there, but still, not great performance from MSK. Disappointing.
Tags: msk reliability multi-az aws services kafka resiliency outages post-mortems postmortems coinbase