DynamoDB’s metastable cache load workaround
Marc Brooker on the latest DynamoDB USENIX paper — good paper and commentary. He picks out this very interesting tidbit:
‘When a router received a request for a table it had not seen before, it downloaded the routing information for the entire table and cached it locally. Since the configuration information about partition replicas rarely changes, the cache hit rate was approximately 99.75 percent.’ What’s not to love about a 99.75% cache hit rate? The failure modes! ‘The downside is that caching introduces bimodal behavior. In the case of a cold start where request routers have empty caches, every DynamoDB request would result in a metadata lookup, and so the service had to scale to serve requests at the same rate as DynamoDB’ So this metadata table needs to scale from handling 0.25% of requests, to handling 100% of requests. A 400x potential increase in traffic! Designing and maintaining something that can handle rare 400x increases in traffic is super hard. To address this, the DynamoDB team introduced a distributed cache called MemDS. ‘A new partition map cache was deployed on each request router host to avoid the bi-modality of the original request router caches.’ Which leads to more background work, but less amplification in the failure cases. The constant traffic to the MemDS fleet increases the load on the metadata fleet compared to the conventional caches where the traffic to the backend is determined by cache hit ratio, but prevents cascading failures to other parts of the system when the caches become ineffective.
(tags: aws dynamodb metastability caching caches production failure outages load memds marc-brooker papers usenix)