Simple testing can prevent most critical failures
Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:
Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.
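To make those three patterns concrete, here is a rough Java sketch of what each tends to look like, followed by one possible narrower alternative. Every name in it (ErrorHandlerPatterns, flush, swallow, todoHandler, overCatch, flushReportingFailure) is invented for illustration and is not taken from the study; the "drastic action" is stood in for by a return value so the example stays runnable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ErrorHandlerPatterns {

    // Hypothetical operation that signals a non-fatal error.
    static void flush(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("disk full");
        }
    }

    // Pattern 1: the handler ignores the error (or would only log it).
    static boolean swallow(boolean fail) {
        try {
            flush(fail);
        } catch (IOException e) {
            // Nothing happens here, so the caller is told the flush worked.
        }
        return true;
    }

    // Pattern 2: the handler is a placeholder nobody came back to.
    static boolean todoHandler(boolean fail) {
        try {
            flush(fail);
        } catch (IOException e) {
            // TODO: handle this properly
        }
        return true;
    }

    // Pattern 3: catch an abstract type, then take drastic action.
    // Real code might call System.exit(1) here; this sketch returns a
    // marker string instead (an assumption, to keep it testable).
    static String overCatch(boolean fail) {
        try {
            flush(fail);
            return "ok";
        } catch (Throwable t) {
            return "abort";
        }
    }

    // One possible alternative: catch the specific type and report the
    // failure to the caller instead of hiding it or killing the process.
    static boolean flushReportingFailure(boolean fail, List<String> errors) {
        try {
            flush(fail);
            return true;
        } catch (IOException e) {
            errors.add("flush failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        System.out.println("swallow reports success: " + swallow(true));
        System.out.println("todoHandler reports success: " + todoHandler(true));
        System.out.println("overCatch outcome: " + overCatch(true));
        System.out.println("narrow handler reports success: "
                + flushReportingFailure(true, errors));
        System.out.println("recorded errors: " + errors);
    }
}
```

The first two handlers tell the caller everything went fine even though the flush failed, and the third treats a recoverable IOException the same as any other Throwable. All three are easy to spot in review by reading the catch blocks alone.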
(Interestingly, catching an abstract exception and then aborting was a particular favourite approach of some misplaced “fail fast”/“crash-only software design” dogma at Amazon. I wasn’t a fan.)
(tags: fail-fast crash-only-software coding design bugs code-review review outages papers logging errors exceptions)