Last Saturday the Heroku runtime platform experienced four hours of intermittent, platform-wide errors. These began slowly, affecting a small number of apps, then ramped up to roughly 90 minutes of downtime for most apps.
With this incident, we failed to deliver the uptime that you, our users and customers, have entrusted us to achieve. Everyone at Heroku feels passionately that this downtime and our handling of the incident were far below the level of quality that we hold ourselves to.
Our entire technical team has devoted the past week to digging deep into exactly what happened. There are several root causes and a number of exacerbating ancillary issues, described in the next two sections, followed by how we are going to address these issues and improve for the future.
On Tuesday (Feb 21), we deployed a code change to the dyno management agent that runs on the machines making up our dyno manifold. This change introduced a latent bug that could create extra ("phantom") dynos, which only became visible under the unusual load conditions that occurred the following weekend.
On Friday (Feb 24), we deployed another change to the dyno agent which introduced several fsync() system calls into one of the LXC container build-up steps, in order to ensure correctness in edge cases where filesystem caching and the simultaneous startup of multiple dynos could result in a race condition.
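The role of an fsync() call in a step like this can be sketched as follows (Python for illustration; the agent's actual code, file names, and contents are not public, so everything here is an assumption):

```python
import os
import tempfile

def write_durably(path, data):
    """Write a file and force it to disk before returning.

    Without the fsync(), the data may sit only in the OS page
    cache; a concurrently starting container could then observe
    a missing or partially written file -- the kind of race the
    agent change was guarding against.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain the userspace buffer
        os.fsync(f.fileno())  # block until the kernel persists it

# Hypothetical container config file, for illustration only
path = os.path.join(tempfile.mkdtemp(), "container.conf")
write_durably(path, b"memory_limit=512M\n")
```

The trade-off is that fsync() blocks until the disk acknowledges the write. That is normally near-instant, but it degrades badly when disk I/O is saturated, as the following morning demonstrated.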
That evening, our engineers noticed a deviation in one of our runtime metrics: mean dyno boot time had increased by about 2 seconds. This was an early indicator of a problem that became dramatically worse the following morning. Because the increase in dyno boot times was not substantially outside the variation caused by normal changes in platform usage and load, we did not take immediate action.
Dyno Boot Time
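To see why a 2-second shift can slip under an alerting threshold, here is a toy sketch of a simple deviation check. Every number below is invented for illustration; the point is that when normal load-driven variance is on the order of a second or two, a 2-second increase scores below a conservative alert threshold:

```python
from statistics import mean, stdev

# Hypothetical recent boot-time samples in seconds (invented data)
baseline = [3.0, 5.5, 4.0, 6.2, 3.5, 5.8, 4.4, 6.0, 3.2, 5.4]
current = 6.7  # mean boot time after the Friday deploy (~2 s higher)

# A simple rule: alert only if the new value is more than three
# standard deviations above the recent baseline.
mu, sigma = mean(baseline), stdev(baseline)
z = (current - mu) / sigma
print(f"z-score: {z:.2f}, alert: {z > 3.0}")
```

With this baseline the shift scores well under three standard deviations, so a check like this stays quiet, which is the gap that "higher sensitivity monitoring" (discussed later) aims to close.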
Starting on Saturday morning, indicators around load on the runtime machines started to increase, particularly on disk I/O. As daytime usage began to ramp up on the platform, the fsync() call -- normally nearly instantaneous -- began to slow down.
Disk Write Latency Spike
This slowed dyno boot times, but then began to escalate into an elevated number of processes being stuck in the "starting" state (as visible in heroku ps for a given app) and failing to boot at all. Our ops staff saw these metrics and recognized them as leading indicators of a serious problem. The incident commander was called in at 17:10 UTC.
The incident commander rang an all-hands-on-deck alarm and the entire ops, runtime, and routing engineering teams came online to begin digging into the problem. User-facing issues began to appear for app management capabilities such as deploying new code, app restarts, add-on management, and unidling of free (single-dyno) apps. Production (multi-dyno) apps were largely unaffected at this point, aside from intermittent HTTP errors and elevated page load times.
The situation grew worse over the next hour. A number of very slow dyno launches registered as timeout errors, causing the dyno manager to attempt corresponding dyno launch retries. These extra dyno launches further loaded the runtime hosts. This runtime load in turn slowed subsequent dyno boot times, further exacerbating the problem. In this way the runtime and dyno manager entered into a feedback loop of elevated dyno launch rates and dyno boot times. The increased load on the platform from the elevated dyno launch rate and the underlying fsync bug also caused application performance to further degrade, resulting in elevated app response times, connection errors, and routing queue depths.
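This retry amplification can be modeled with a toy simulation. All constants below are invented; the point is only the qualitative behavior: with timeout-driven retries enabled, load and boot times run away, while with retries disabled they grow only with demand.

```python
def peak_boot_time(ticks, retries_enabled):
    """Toy model of the runtime / dyno-manager feedback loop.

    More in-flight launches slow boots; launches slower than the
    timeout are retried (if enabled), adding yet more launches.
    All constants are illustrative, not measured values.
    """
    demand = 50.0        # new launch requests arriving per tick
    in_flight = demand   # launches currently in progress
    peak = 0.0
    for _ in range(ticks):
        demand *= 1.05                       # weekend traffic ramping up
        boot_time = 2.0 + in_flight / 10.0   # load slows dyno boots
        peak = max(peak, boot_time)
        timed_out = in_flight if retries_enabled and boot_time > 30.0 else 0.0
        in_flight = demand + timed_out       # retries stack on new demand
    return peak

print("retries on: ", round(peak_boot_time(60, True)))
print("retries off:", round(peak_boot_time(60, False)))
```

In this toy model, disabling the retry path, analogous to deactivating the feedback mechanism described below, is enough to break the runaway growth even while demand keeps rising.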
At 18:44 UTC, the number of slow or unresponsive dyno backends pushed the number of open HTTP connections against the Varnish cache past a breaking point, and Varnish went offline. This caused all applications using the Aspen/Bamboo routing stack (*.heroku.com) to stop serving GET requests. (POST, PUT, and DELETE requests were still working, since these bypass Varnish completely, as were apps using the Cedar routing stack, since it does not use Varnish.)
A final compounding issue was a bug introduced five days earlier (mentioned at the top of this document), but only now uncovered by the overloaded dyno manifold. As load on the dyno manifold increased dramatically, a number of dynos were incorrectly accounted for as having been terminated. When the dyno manager launched replacement dynos for these processes, a "phantom dyno" remained. These phantom dynos further exacerbated the feedback loop, as the total number of dynos on the platform grew and increased platform load even further.
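The accounting failure behind a phantom dyno can be sketched as follows (a hypothetical data model with invented dyno IDs; the dyno manager's real bookkeeping is not public):

```python
# Dynos the manager is accounting for vs. dynos actually running
tracked = {"dyno-123"}
running = {"dyno-123"}

# Under heavy load, a status update is lost or times out, and the
# manager wrongly records dyno-123 as terminated:
tracked.discard("dyno-123")

# It launches a replacement to keep the app at its requested scale:
tracked.add("dyno-456")
running.add("dyno-456")

# But dyno-123 was never actually stopped -- it is now a phantom,
# running and consuming resources but invisible to the manager:
phantoms = running - tracked
print(phantoms)
```

Because the phantom exists only on the host and not in the manager's records, it is invisible to tooling built on those records, which is why, as noted below, phantoms did not appear in heroku ps or affect billing.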
At this point the load on the dyno manifold was extremely high. As a result, and with the associated Varnish failure, the platform experienced a nearly complete outage lasting approximately 90 minutes. Runtime and ops engineers worked both to understand the problems and to bring load under control with manual measures. Once the feedback loop between the runtime and dyno manager was identified, our runtime engineers deactivated the feedback mechanism in the dyno manager. Load began to fall within ten minutes. We then restarted Varnish and reactivated the dyno manager's normal functions, bringing the dyno manifold back under control and all apps back online.
Dyno Manifold Load
One residual issue that persisted for several hours past the outage was an outcome of the phantom dyno bug. Most apps are unaffected by having extra dynos (for example, five web dynos serving requests when only four were requested via heroku scale web=4). A small number of apps that rely on singleton dynos, such as a clock process, experienced problems as a result. One example of this was the Heroku Scheduler add-on, which is implemented as a standard user-space Heroku app with a clock process. We corrected this issue later that day, at 05:40 UTC on Feb 26.
Our runtime engineers worked late into the night on first-level analysis, to understand what had happened and make sure that a recurrence was not possible. Both the fsync and the phantom dyno bugs were identified and corresponding fixes rolled out by Saturday evening.
Impact to Apps
Four different user-facing problems affected various sets of apps, enumerated below. All times are in UTC on February 25, 2012.
- Application management failures. Starting around 17:00 UTC, a small fraction of apps experienced failures in deploys, scaling, restarting, and un-idling (for free apps). This quickly ramped to the majority of apps by 17:30 UTC with intermittent successes. Resolved by 19:10 UTC.
- Increased HTTP latency and errors. Started at 16:30 UTC, escalated until 18:00 UTC, and remained choppy for the next few hours. Approximately 20% of HTTP requests not affected by the Varnish outage encountered errors during this time, peaking at 19:30 UTC. Individual apps may have performed better or worse depending on their dyno configuration and HTTP traffic patterns. Resolved by 20:40 UTC.
- Varnish failure. Starting at 18:44 UTC, Varnish crashed globally, causing errors for all GET requests routed through the Aspen/Bamboo routing stack. POST and PUT requests don't pass through Varnish, nor do any requests on the *.herokuapp.com (Cedar) stack, so those continued to be served, albeit with the same intermittent errors and elevated latencies described above. Varnish came back up at 19:31 UTC, ran for 22 minutes, then crashed again at 19:53 UTC. Resolved by 20:39 UTC.
- Phantom dynos. Starting at 17:30 UTC, some apps had extra dynos running. For stateless processes (e.g. web and worker processes) this had minimal impact, but for singleton processes such as the clock process type it could create double-scheduling and other bad behavior. Note that phantom dynos were not tracked by the system and therefore would not impact billing or be visible in heroku ps. Resolved by Feb 26 05:40 UTC.
Failure is an opportunity to learn, and we intend to maximize the learning we gain from this experience. Here are some of the areas we've identified that need improvement:
- Higher sensitivity and more aggressive monitoring on a variety of metrics. The boot time increase on Friday evening was an early warning indicator, but it was just below the threshold of concern based on our past experiences.
- Better containment to prevent cascading problems from affecting an entire stack or the entire platform. This includes vertical containment between components, such as Varnish being hardened against a large number of unresponsive backends; and horizontal containment, such as partitioning of dyno groups.
- Improved flow controls, both manual and automatic. More granular controls will give us the ability to respond more quickly in the case of feedback loops, runaway load, and other types of cascading failures. There will always be bugs; we need the ability to respond more quickly by locking down or deactivating parts of the system when certain red-alert thresholds are reached. Good flow controls can mean the difference between a brief service interruption and a multi-hour catastrophe.
- App management lockdowns. Some users with working multi-dyno apps inadvertently took their apps down when they tried to deploy code or run a restart during the outage. Locking down app management capabilities temporarily during a serious incident can preserve an app's working state.
- Improved early warning systems. Fifteen minutes passed between the first noticed problems and the incident commander coming on deck, and another ten minutes after that before we sounded the all-hands-on-deck alarm for engineering. We can do better.
- Expanding our simulations of unusual load conditions in our staging environment. In the past six months we've seen a much wider variety of apps and usage patterns come onto our platform. We're excited to have these new apps and patterns taking advantage of Heroku's capabilities, but they also create new challenges. The quantity and variety of load these apps bring with them requires consideration of new dimensions in our capacity testing, fuzz testing, and other approaches to ensuring platform robustness.
- Improved customer communication during outages. Our status site is below par, and the messages we post are vague, unhelpful, and often untimely. As an example of the latter, we didn't post information about the Varnish failure until long after we had identified it internally. We've been experimenting with new procedures for communication between our eng/ops and support teams during incidents, and we are working on a redesign and rethinking of the status site and related procedures.
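As one concrete, generic shape for the flow controls mentioned above, a circuit breaker stops issuing an operation, such as dyno launch retries, after repeated failures, instead of piling on more load. This is a minimal sketch under our own assumptions, not Heroku's implementation:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive failures, refuse the
    operation for a cool-down period instead of retrying."""

    def __init__(self, max_failures=5, cooldown=30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None   # cool-down over: try again
            self.failures = 0
            return True
        return False                # open: shed load

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker(max_failures=3, cooldown=60.0)
for _ in range(3):
    breaker.record(ok=False)  # three launch timeouts in a row...
print(breaker.allow())        # ...so further attempts are refused
```

A control like this turns a timeout-retry feedback loop into a bounded burst: once the breaker opens, the failing operation stops contributing load until the cool-down expires.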
Last Saturday we failed to deliver the uptime that you, our users and customers, have entrusted us to achieve. This outage has given us a great deal of inspiration and concrete ideas for how we can improve. Our team will be attacking the many areas described above with the utmost priority, and continuing to identify new ones as we strive to earn your trust.
We continue to work on the analysis. It is our responsibility to you to deliver a complete and accurate postmortem. We will deliver the postmortem once we have fully completed our investigation.
All applications are now operational. We are actively monitoring the situation to ensure continued operation. We will post a complete postmortem early next week.
Process management is continuing to improve rapidly.
Process management is improving rapidly. We also had a failure in the Aspen and Bamboo HTTP stacks, which caused us to misidentify an issue there. We have recovered the Aspen and Bamboo HTTP stacks and are continuing to work on the process management issue.
Process management is recovering. Engineers are closely monitoring the situation.
Process management error rates remain high. Engineers are continuing to investigate and work towards a resolution.
Process management error rates remain high. Engineers are continuing to investigate and resolve the issue.
A bug in process management took many apps offline 2 hours ago.
We have identified the bug and engineering is actively working on triaging the situation. At this time many multi-dyno applications are fully operational. Engineering is working through the remaining applications and automatically bringing them back up. We expect to have all applications operational again shortly.
Process management is recovering but error rates are still higher than expected. Engineers are continuing to work towards resolution.
We are continuing to experience high error rates on the platform. Engineers are actively working to resolve the issue.
We are continuing to experience high error rates on the platform. Engineers are actively investigating the issue.
Engineers are continuing to work towards resolving the elevated error rates.
Engineers have identified an issue with process management on the platform and are working to deploy a fix.
Engineers are still working to determine the exact cause of the issues. We believe it is centered on processes failing to start and stop properly, which currently affects starting and stopping applications. We'll continue to update this status as information becomes available.
We are continuing to investigate the high rate of errors. Both app and API operations are affected.
Engineers are looking into elevated error rates. We'll provide an update once more information becomes available.