On Friday June 29th, Heroku customers experienced a disruption in service which affected running applications and Heroku Postgres databases. We deeply regret the effects of this incident on our customers, and accept full responsibility for the downtime they experienced. We would like to share some additional technical detail about what happened, the steps we are taking in response, and actions that customers can take to protect themselves.
Scope of impact
For all applications
- Heroku API access was limited during recovery operations
- The API was in maintenance mode or read-only mode for 3 hours 41 minutes
- Application deployments were disallowed for an additional 1 hour 30 minutes
- Customers experienced degraded performance due to lost capacity
For applications using the Cedar stack
- About 30% of applications lost one or more dynos for 2 hours 27 minutes
- Dynos were gradually restored over the following 2 hours 50 minutes
For applications using the Bamboo stack
- Total outage for all Bamboo applications for 3 hours 50 minutes
- Intermittent failures and degraded performance for an additional 51 minutes
- Degraded performance due to lack of HTTP caching for an additional 9 hours 30 minutes
For Heroku Postgres databases
- 20% of production databases experienced up to 7 hours of downtime
- A further 8% experienced an additional 10 hours of downtime (up to 17 hours total)
- All production databases were successfully recovered from continuous backups
- Some Beta and shared databases were offline for a further 6 hours (up to 23 hours total)
- Approximately 0.02% of shared/development databases were restored from a daily backup where continuous backups were not available
Third party services
Many customers who relied on third party services and addons built on EBS were also significantly impacted.
The disruption was precipitated by an Amazon Web Services outage which affected the US East region, beginning at 8:04PM PDT. AWS has published a summary of the incident and their response to it, which goes into more detail about the underlying causes of the events discussed here.
The direct impact of this outage on Heroku was twofold:
1. Lost EC2 instances
Approximately 30% of our EC2 instances, which were responsible for running applications, databases and supporting infrastructure (including some components specific to the Bamboo stack), went offline.
We intend to insulate our customers from the loss of any EC2 instances, and had prepared for such a scenario by developing fault tolerant infrastructure, automated failover systems, and emergency response procedures to enable us to quickly restore service. Unfortunately, our recovery efforts were severely hampered by subsequent infrastructure failures, and the result was an extended outage. Complications included:
- The management API for the AWS US East region became unavailable
  - In order to restore sufficient capacity quickly, we needed to bring additional instances online in the same region. Without the API, we couldn’t start any new instances.
  - We regularly rely on this API to collect information about the state of our infrastructure and to diagnose problems. This lack of visibility slowed the recovery process.
- Elastic Load Balancer instances and Elastic IP addresses failed to respond promptly to configuration changes
  - We use ELBs and EIPs to redirect traffic away from failed instances to redundant and secondary systems, so when these mechanisms responded slowly or malfunctioned, traffic continued to be routed to systems which were down.
As a matter of course, we disabled and limited access to the Heroku API while the platform was stabilized.
2. Lost Elastic Block Storage volumes
A large number of EBS volumes, which stored data for Heroku Postgres services, went offline and their data was potentially corrupted. As a result, customer databases remained down until they could be recovered, unless a follower database was activated (see below for more about followers). As a precaution against this type of incident, we continuously archive customer data for recovery purposes, and many databases needed to be restored from these archives.
In past EBS incidents, volumes were rarely damaged even when they went offline, so affected databases could be recovered very quickly. We were therefore not fully prepared for a recovery effort of this magnitude, and it took several hours to automatically restore the affected databases.
In the end, all production databases were successfully recovered and normal operations were restored.
What we are doing
We are planning a comprehensive remediation effort to address the factors which led to the downtime.
We will continue to work to increase our resilience to infrastructure outages and to improve our ability to recover from a wide range of failure modes, including the loss of instances and EBS volumes. Our #1 goal is to provide a trustworthy platform regardless of the underlying infrastructure.
As a result of this outage, we have produced new tools which enable us to relocate database services away from a failed availability zone more quickly. We will further invest in these tools and make them a part of our daily operations, to ensure that identifying and relocating affected resources goes smoothly in the future.
Some of our failover mechanisms did not perform adequately during this outage, because of their reliance on certain AWS services, including the US East region API, Elastic Load Balancers, and Elastic IPs, which were affected by the outage. We will investigate ways to make these mechanisms more resilient to such failures, and consider alternative solutions where they are available.
We do not plan to invest further in improving robustness for applications using the older Bamboo stack. Instead, we will focus on improving the current Cedar stack, and we encourage customers to migrate to Cedar, where we are investing all of our application availability efforts.
What you can do
While we strive to make the Heroku platform as a whole as resilient as possible, the design and configuration of applications can significantly improve their resiliency. Here are some recommended steps you can take to improve the availability of your app.
Heroku Postgres offers a simple way to maintain a continuously updated replica of your database, called a follower. Followers can answer read-only queries, and if your primary database fails, you can easily “promote” the follower to be a new, writable primary by running two commands.
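As a sketch of this workflow (the plan name and the HEROKU_POSTGRESQL color names below are illustrative; use the names that `heroku pg:info` reports for your own app):

```shell
# Create a follower of the primary database. Provisioning is
# asynchronous; wait for the follower to catch up before relying on it.
heroku addons:add heroku-postgresql:ron --follow HEROKU_POSTGRESQL_RED

# If the primary fails, promote the follower with two commands:
# 1. Stop replication, turning the follower into a writable,
#    standalone database
heroku pg:unfollow HEROKU_POSTGRESQL_WHITE
# 2. Point DATABASE_URL at it so the app treats it as the new primary
heroku pg:promote HEROKU_POSTGRESQL_WHITE
```

Note that once promoted, the old follower no longer replicates from the failed primary, so you may want to create a fresh follower of the new primary afterwards.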
We have made many infrastructure improvements in the past year which only benefit customers using the current Cedar stack. Customers using Cedar experienced, on average, only 1/10th the downtime compared to those using Bamboo.
Applications with multiple running dynos will be more resilient against failure. If some dynos are lost, the application can continue to process requests while the missing dynos are replaced. Typically, lost dynos are replaced immediately, but in the case of a catastrophic failure like this one, it can take some time.
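As an illustrative sketch (the app name here is hypothetical), scaling out is a single CLI command:

```shell
# Run two web dynos so the app can keep serving requests even if
# one dyno (or the underlying instance hosting it) is lost
heroku ps:scale web=2 --app example-app

# Verify the current process formation
heroku ps --app example-app
```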
We take our availability very seriously. We are taking steps both to address the specific issues raised by this outage, and to continue to improve the overall resiliency of our platform to infrastructure instability. The improvements we have made over the past year materially reduced the downtime experienced during this incident, but we recognize we have more work to do. Our entire team is focused on making Heroku the most trustworthy platform available for our customers.
Status updates (most recent first)
App operations are restored at this time. We are monitoring the situation, and tracking additional issues that arise in separate status incidents.
Dedicated production databases should be operational. We're continuing recovery of shared development databases and Crane and Kappa beta databases. Varnish remains temporarily disabled on the Bamboo HTTP stack.
HTTP routing is improving on Bamboo although the HTTP caching layer has been temporarily disabled on this stack.
Database recovery efforts are ongoing.
We are experiencing increased error rates on the Bamboo HTTP stack.
HTTP error levels on the Bamboo stack have returned to normal.
Some applications continue to be affected by unavailable database services. We are still working on restoring these.
There is an increase in HTTP errors on applications that run on the Bamboo stack. We are working on resolving this.
HTTP error rates have returned to normal levels. Deployments via git push have been re-enabled, but we are seeing elevated error rates for this service. We continue to work to resolve this issue.
While the majority of applications are online and operating normally, a number of applications continue to be affected by unavailable database services. Additionally, some addon providers have been affected by tonight's electrical storm. Although these services are beyond our control, those providers are working diligently to restore service.
API operations have been restored, but git pushes remain disabled. We continue to see elevated error rates across our HTTP stack (such as H99s).
We're seeing elevated HTTP error rates across the platform. Our engineers are currently investigating.
We're seeing elevated error rates to our API, which manifest as non-responsive API calls; for example, heroku commands from the command line may not return. Our engineers are working to address this issue.
The majority of applications are online. Our infrastructure provider has informed us that the electrical storm which interrupted power may have caused disk corruption on affected databases. We are recovering all at-risk databases from our continuous-protection archives.
The majority of applications are online. A number of applications that rely on unavailable databases are still offline. We are working to reduce error rates and restore database services. Full API access is now available.
We continue to work to stabilize the platform and restore database services. The API is now available in read-only mode.
We've restored availability to many applications, but continue to see fluctuations in error rates. A number of applications have processes which are unavailable or are relying on database services which are still offline. We are working on stabilizing the platform and restoring database services.
We've restored the majority of internal services and are seeing a reduction in error rates, but many applications and databases remain offline. We are continuing to work to restore processes and databases.
Our engineers continue to work to restore affected services and bring new capacity online. Our data team is also working to restore unavailable databases.
Our engineers are continuing to move services away from affected infrastructure. We'll provide further updates as we progress.
Our engineers continue to work to restore affected systems. Some production applications are unaffected but many applications are offline. API access is disabled while we restore service.
We have lost connectivity to some of our infrastructure. Our engineers are working to restore affected systems. We've disabled API access while we work through the issues.
We're currently experiencing a widespread application outage. We've disabled API access while engineers work on resolving the issues.
Our automated systems have detected potential platform errors. We are investigating.