

Adrian Cockcroft’s take on the AWS outage

  • Adrian Cockcroft's take on the AWS outage

    "In my opinion the root cause of the recent AWS outage is their architectural decision to have everything depend on the same instance of DynamoDB, including operation of DynamoDB itself. This is a circular dependency, and the ability to observe and fix the failure as it happened also failed. The ability of customers to file service reports failed. So the engineers trying to figure out what was happening were completely blind. It took them an hour to figure out what had broken and another hour to fix it, then the pent up demand rushing in broke other key services for another 12 hours or so.

    If DNS had been misconfigured on a different non-critical service, I think it would have been obvious to detect and quick and easy to fix. However, anything going wrong that also takes out the ability to see it going wrong and fix it, is a liability.

    To break the circular dependency, I think there needs to be a separate, internal only, set of services and data stores that the most critical AWS services use, and which are designed to come up without dependencies on public interfaces. Maybe an internal region, inside each public region, but with a simpler implementation that has few carefully managed dependencies. Otherwise, it’s just a matter of time until this happens again."

    Tags: adrian-cockroft outages post-mortems aws amazon us-east-1 dynamodb circular-dependencies depe
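
The circular dependency described in the quote can be made concrete with a small dependency-graph check. The following is only a rough sketch: the service names and both graphs are hypothetical illustrations, not AWS's actual architecture, and `find_cycle` is a plain depth-first search written for this example. The first graph has a data store depending on a control plane that depends on it in turn; the second moves the shared dependency to a separate internal-only store, roughly as the quote proposes.

```python
from typing import Dict, List, Optional

# Hypothetical graphs for illustration only; each service maps to the
# services it needs in order to operate.
CIRCULAR: Dict[str, List[str]] = {
    "dynamodb": ["dynamodb-control-plane"],
    "dynamodb-control-plane": ["dynamodb"],  # depends on the thing it manages
    "monitoring": ["dynamodb"],
    "support-cases": ["dynamodb"],
}

SEPARATED: Dict[str, List[str]] = {
    "dynamodb": ["internal-store"],
    "dynamodb-control-plane": ["internal-store"],
    "monitoring": ["internal-store"],
    "support-cases": ["internal-store"],
    "internal-store": [],  # assumed to start with no public dependencies
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we found a loop.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle is not None:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle is not None:
                return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(CIRCULAR))
    print(find_cycle(SEPARATED))
```

A check like this only catches dependencies that are declared somewhere; the harder part in practice is knowing what the real dependency edges are, including those of the monitoring and support tooling.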