Why Percentiles Don’t Work the Way You Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as an HN commenter noted, quantile digests are probably the better fix.
(tags: performance percentiles quantiles statistics metrics monitoring baron-schwartz vividcortex)
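A quick illustration of the aggregation problem, as a sketch rather than anything taken from the article: averaging per-host p99 values does not give the p99 of the combined traffic. The two hosts and their latency figures are made up for the example, and `percentile` is a plain nearest-rank helper, not a library call.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds (invented for this example).
host_a = [10] * 990 + [500] * 10   # busy host: 1% of requests are slow
host_b = [10] * 90 + [500] * 10    # quiet host: 10% of requests are slow

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

averaged_p99 = (p99_a + p99_b) / 2            # what a naive rollup reports
pooled_p99 = percentile(host_a + host_b, 99)  # p99 over all raw samples

print(p99_a, p99_b, averaged_p99, pooled_p99)
```

The averaged figure matches neither host's p99 nor the pooled one, which is why the raw samples, or a mergeable summary such as a histogram or quantile digest, are needed to aggregate correctly.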
-
Spotify wrote their own metrics store on Elasticsearch and Cassandra. Sounds very similar to Prometheus.
(tags: cassandra elasticsearch spotify monitoring metrics heroic)
ELS: latency based load balancer, part 1
ELS measures the following things: Success latency and success rate of each machine; Number of outstanding requests between the load balancer and each machine. These are the requests that have been sent out but we haven’t yet received a reply; Fast failures are better than slow failures, so we also measure failure latency for each machine. Since users care a lot about latency, we prefer machines that are expected to answer quicker. ELS therefore converts all the measured metrics into expected latency from the client’s perspective. […] In short, the formula ensures that slower machines get less traffic and failing machines get much less traffic. Slower and failing machines still get some traffic, because we need to be able to detect when they come back up again.
(tags: latency spotify proxies load-balancing els algorithms c3 round-robin load-balancers routing)
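The quote elides the actual formula, so the following is only a rough sketch of the general idea and not Spotify's implementation. It assumes a failed request is retried until it succeeds, that queued requests scale the wait linearly, and that traffic is split in inverse proportion to expected latency. `MachineStats`, `expected_latency_ms`, `traffic_weights` and all the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MachineStats:
    success_latency_ms: float  # mean latency of successful replies (must be > 0)
    failure_latency_ms: float  # mean latency of failed replies
    success_rate: float        # fraction of requests that succeed, 0..1
    outstanding: int           # requests sent but not yet answered

def expected_latency_ms(m, min_success_rate=0.01):
    """Assumed model: retry until success, and wait behind outstanding requests."""
    # Clamp the rate so a fully failing machine keeps a finite cost.
    s = max(m.success_rate, min_success_rate)
    expected_failures = 1 / s - 1  # mean failures before the first success
    per_request = m.success_latency_ms + expected_failures * m.failure_latency_ms
    return per_request * (m.outstanding + 1)

def traffic_weights(machines):
    """Share of traffic per machine, inversely proportional to expected latency."""
    inverse = {name: 1 / expected_latency_ms(m) for name, m in machines.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}

# Made-up measurements for three machines.
machines = {
    "healthy": MachineStats(10.0, 5.0, 1.0, 0),
    "slow": MachineStats(40.0, 5.0, 1.0, 0),
    "failing": MachineStats(10.0, 50.0, 0.1, 0),
}
weights = traffic_weights(machines)
print(weights)
```

Under these assumptions the slow machine gets less traffic than the healthy one and the failing machine gets much less, but every weight stays above zero, so a machine that recovers can still be noticed.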
Low-latency journalling file write latency on Linux
Great research from LMAX: xfs/ext4 are the best choices, and they explain why in detail, referring to the code.
(tags: linux xfs ext3 ext4 filesystems lmax performance latency journalling ops)