On 2012-06-07 at 15:52:41 UTC, the Heroku routing mesh experienced a major outage that impacted all apps running on the Heroku platform. Customer impact was as follows:
- Approximately 30 minutes of complete HTTP routing outage.
- Afterward, approximately 1.5 hours of intermittent HTTP errors and degraded HTTP route times for 10-15% of all HTTP traffic on the platform.
- For most of the outage, API maintenance mode was enabled as a control rod to contain damage.
- status.heroku.com was largely inaccessible for the early part of the outage, and intermittently unreliable in later parts.
The routing outage was the result of three root causes.
The first root cause is related to the streaming data API which connects the dyno manifold to the routing mesh. On the dyno management side, an engineer was performing a manual garbage collection process which created an unusual record in the data stream. On the routing side, the subprocess of the router which handles the incoming stream could not parse this record.
This streaming API is similar in nature to the replication protocols used by CouchDB or Redis: unexpected records cannot simply be discarded, because the integrity of the entire dataset depends on receiving every record in sequence. In this model, the correct failure mode for the routing subprocess that consumes the stream from the dyno manager is to stop processing when it encounters an unexpected record, flag an error in the monitoring system, and wait for a human to investigate. That puts the routing node into a degraded, read-only mode, in which it can safely continue serving traffic for the few minutes it takes engineers to investigate.
The second root cause was that when the router subprocess encountered the record, instead of going into this degraded mode of operation, it crashed completely. Each time it was restarted by the supervisor process, it tried to handle the record and crashed again.
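The intended failure mode can be sketched as follows. This is a minimal, hypothetical consumer written for illustration only; the record format, class names, and parsing logic are assumptions, not Heroku's actual router code.

```python
import logging

class RoutingTableConsumer:
    """Consumes an ordered stream of routing records from the dyno manager.

    Because each record depends on the ones before it (as in a replication
    log), an unparseable record cannot be skipped: the consumer must halt,
    alert a human, and keep serving its last known-good routing table.
    """

    def __init__(self, stream):
        self.stream = stream
        self.routing_table = {}   # last known-good state
        self.read_only = False

    def parse(self, raw):
        # Placeholder parser; assume records look like "app_id:dyno_address".
        app_id, _, address = raw.partition(":")
        if not app_id or not address:
            raise ValueError(f"unparseable record: {raw!r}")
        return app_id, address

    def run(self):
        for raw in self.stream:
            try:
                app_id, address = self.parse(raw)
            except ValueError:
                # Correct failure mode: stop consuming, alert a human, and
                # continue serving the last known-good table -- do NOT crash.
                logging.critical("bad record %r; entering read-only mode", raw)
                self.read_only = True
                return
            self.routing_table[app_id] = address
```

A consumer built this way keeps routing with whatever state it had before the bad record, instead of taking its process down with it:

```python
c = RoutingTableConsumer(iter(["web.1:10.0.0.1", "???", "web.2:10.0.0.2"]))
c.run()
# c.read_only is now True; c.routing_table still holds web.1's address.
```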
The third root cause was that the supervisor process had a cooldown for subprocess restarts, similar to the respawn limits found in Upstart and other init-style process managers. Due to a design flaw in the router's process tree, when this cooldown was reached, the supervisor process itself crashed.
Additionally, there is a warm-up time when new routing nodes are brought online. As our engineers booted a large amount of extra capacity at once, the resulting load on our internal systems increased the boot time for these new nodes. Had we been able to bring up this new capacity faster, the residual effects of this incident would have been shorter.
Just over two weeks ago, we launched a totally rewritten version of our status site. The improved status site allows users to subscribe to notifications when an incident is opened. As a result, our status site experienced unprecedented spikes of load during this incident. This high load crushed the site, leaving us unable to communicate effectively with customers during the course of the incident.
Customer communication during an incident is extremely important. When we lost our primary channel for communicating with our customers, we made a very bad incident even worse.
HTTP Routing Failure
Since the outage, we've rearchitected the routing subprocesses to be more resilient to unexpected input. Rather than crashing, they will fail gracefully, alerting a human to the bad input and continuing to operate in read-only mode. We are also updating the router to run cleanly in read-only mode in order to provide service even when unanticipated control plane failures occur.
We're also working to decrease the time it takes us to bring additional routing nodes online, especially when many of them are launched at the same time. Reducing this cycle will enable us to shorten the duration of these kinds of residual effects.
Additionally, we intend to make much greater use of fine-grained control rods in the future to disable specific functionality in order to prevent incidents from spiraling out of control. Had we utilized this functionality earlier, it would have shortened the system's recovery time.
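A fine-grained control rod amounts to a set of kill switches checked on each request path. The sketch below is hypothetical; the flag names and API are invented for illustration and are not Heroku's implementation.

```python
class ControlRods:
    """Fine-grained kill switches for disabling specific functionality
    during an incident, without taking the whole service down."""

    def __init__(self):
        self._disabled = set()

    def disable(self, feature):
        self._disabled.add(feature)

    def enable(self, feature):
        self._disabled.discard(feature)

    def allows(self, feature):
        return feature not in self._disabled


rods = ControlRods()
rods.disable("api_writes")   # e.g. API maintenance mode during an incident

def handle_api_write(request):
    # Check the control rod before doing any work on this code path.
    if not rods.allows("api_writes"):
        return "503 Service Unavailable (maintenance mode)"
    return "200 OK"
```

The value of this pattern during an incident is that operators can shed a single risky feature (API writes, a background job class, a specific code path) in seconds, rather than choosing between full availability and full shutdown.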
The status site is not hosted on Heroku, so it does not benefit from the platform's increased elasticity. In order to cope with the increased demand generated by the status notifications, we have replaced the server with a larger instance. We're also improving the site's performance with improved caching and other optimizations. Finally, an inadvertent dependency on fonts & images hosted on the Heroku platform has been removed.
We are continually striving to achieve the uptime our customers demand. Beyond platform uptime, however, how we respond to incidents when they do happen matters just as much. Our internal incident readiness team has been working to improve our procedures for rapid response and customer communication during outages, and we saw some of the results of that work during Thursday's incident:
- Engineers were on the case within 2 minutes of our monitoring alerts, and had diagnosed the actual issue within 7 minutes.
- Improved communication procedures resulted in a public status incident being opened 3 minutes after the first alert was sent. Previously, a lack of procedure meant that it could sometimes take much longer to confirm an issue. (Unfortunately, the status site outage caused our timely action here to be largely moot.)
- Utilization of control rods (locking down the API via maintenance mode) prevented the issue from being prolonged by secondary effects.
However, this incident has demonstrated exactly where we should focus our efforts to further improve our incident response. We intend to use the lessons learned to rapidly iterate on the enhanced incident response procedures we've been developing. In the end, we want to deliver a reliable platform and keep our customers as informed as possible when we're having issues.
Our engineers have fully resolved the HTTP routing issue and application functionality has returned to normal.
We will follow up with a full postmortem.
API functionality has been fully restored. Error rates are now within normal levels and our engineers are continuing to closely monitor the situation.
The number of applications seeing H99 errors is continuing to decrease as we continue to work toward a full resolution of the HTTP routing issues. The API is back online now as well.
Our engineers are continuing to work toward a full resolution of the HTTP routing issues. The API is currently in maintenance mode intentionally as we restore application operations.
Most applications are back online at this time. Our engineers are working on getting the remaining apps back online.
Our routing engineers have pushed out a patch to our routing tier. The platform is recovering and applications are coming back online. Our engineers are continuing to fully restore service.
We have identified an issue with our routers that is causing errors on HTTP requests to applications. Engineers are working to resolve the issue.
We have confirmed widespread errors on the platform. Our engineers are continuing to investigate.
Our automated systems have detected potential platform errors. We are investigating.