Heroku Status

Current Status and Incident Report

Widespread Application Outage


Follow-up

Starting last Thursday, Heroku suffered the worst outage in the nearly four years we've been operating. Large production apps using our dedicated database service may have experienced up to 16 hours of operational downtime. Some smaller apps using shared databases may have experienced up to 60 hours of operational downtime. Code deploys were unavailable across some parts of the platform for almost 76 hours - over three days. In short: this was an absolute disaster.

On Specifics

It's no secret that a huge Amazon EC2 outage began at exactly the same time as our downtime, so one can easily surmise that it was the root cause of Heroku's downtime as well. This post will reference the AWS services we use behind the scenes so that we can be very specific. Note that although we will be discussing various AWS service failures, we in no way blame them for what our customers experienced. Heroku takes 100% of the responsibility for the downtime that affected our customers last week.

What Happened: First 12 Hours

On April 21, 2011 at 8:15 UTC (or 1AM in our timezone), alerts began coming in from our monitoring systems. We opened an incident on our status page (follow @herokustatus to get these updates via Twitter). We saw what appeared to be widespread network errors that were resulting in timeouts in our web, caching, and routing services. We began investigating and immediately opened a support ticket with AWS at the highest priority.

For the first several hours of the outage, we tried shutting down misbehaving instances and replacing them with new ones. This is our standard handling of EC2 issues of this nature. Our platform is designed this way and it typically works very well, producing very minimal disruption to our customers. In this case, however, we found that things were getting worse.

The biggest problem was our use of EBS drives, AWS's persistent block storage solution. We use this on instances which require state, mainly databases, but a few other types of nodes as well. Our EBS drives were becoming more and more unpredictable in their behavior, in some cases becoming completely unresponsive, even after detaching from their current instance and re-attaching to a new one.
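
For the curious, the standard remedy described above - launch a replacement instance, move the EBS volume over, and retire the unhealthy node - looks roughly like the sketch below. This is illustrative only, not Heroku's internal tooling: it uses the present-day boto3 SDK, and the AMI, instance and volume IDs, and device name are hypothetical placeholders.

```python
# Illustrative sketch of the standard "replace the instance, move the volume"
# remedy. Not Heroku's tooling; all IDs and the device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def replace_instance_and_move_volume(bad_instance_id: str, volume_id: str,
                                     ami_id: str, instance_type: str) -> str:
    # Launch the replacement before touching the unhealthy node.
    new_id = ec2.run_instances(
        ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
    )["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

    # Detach the volume from the misbehaving instance (Force only as a last
    # resort), wait until it is available, then attach it to the replacement.
    ec2.detach_volume(VolumeId=volume_id, InstanceId=bad_instance_id, Force=True)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=new_id, Device="/dev/sdf")

    # Retire the unhealthy instance once its state has moved.
    ec2.terminate_instances(InstanceIds=[bad_instance_id])
    return new_id
```

In a normal EC2 incident this swap is quick because the EBS volume carries all of the state; in this incident the volumes themselves were the failing component, so attaching them to healthy instances didn't help.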

We were in direct contact with our technical account manager at AWS the entire time, who provided us potential workarounds. Unfortunately, these workarounds were not helping, and the failures grew even more widespread.

Historically, the best move for us in these incidents is to do our best to keep things running (killing unhealthy instances, etc.) and wait for AWS to resolve things. Rarely has that taken more than an hour or two.

In this case, the EC2 outage lasted a total of about 12 hours. In the afternoon on Thursday, we were able to begin starting new instances en masse and we believed we'd be just an hour or two away from recovery. The majority of applications were back up on Thursday afternoon, but it took us much longer to recover the remaining ones.

What Happened: The Long Haul

Unfortunately, while EC2 was more or less fully operational again, the EBS system was not. As you can see from the AWS status page, the EBS outage lasted a total of 80+ hours. Heroku was able to get back online more quickly than that thanks to help from our contacts at AWS and hard work from our engineers.

While most applications were back online within 16 hours, many applications on the affected shared database servers were still down. We were also having problems with some of the servers that we use to process git pushes for deployment, which meant that the applications hosted on those servers could not have new code deployed, even if they were otherwise online.

The next 48 hours were spent with our engineers working closely with AWS to restore service as quickly as possible. We saw slow but steady progress for 36 hours, with servers continually returning to service as their underlying EBS disks started responding again.

We also worked with customers that had applications that were online but weren’t deployable because of the ongoing problems with some of our git servers. We were able to, on a case-by-case basis, create new repositories for these customers to push to, which allowed them to deploy while we worked to bring the original servers back online.
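
Because git is distributed, a customer's local clone already contains the app's full history, so an empty replacement repository on a healthy host is enough to restore deploys. Heroku hasn't described the exact mechanics, but the general pattern looks roughly like this hedged sketch; the repository root, host name, and app name are hypothetical.

```python
# Illustrative sketch only: provision a fresh bare repository for an app on a
# healthy git server. Paths and names are hypothetical, not Heroku's layout.
import subprocess

def provision_replacement_repo(app_name: str, repo_root: str = "/var/repos") -> str:
    """Create an empty bare repository that the customer can push to."""
    repo_path = f"{repo_root}/{app_name}.git"
    subprocess.run(["git", "init", "--bare", repo_path], check=True)
    return repo_path

# Customer side, once the replacement repo exists (URL is hypothetical):
#   git remote set-url heroku git@replacement-git-host.example.com:myapp.git
#   git push heroku master
```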

Our Response

Our monitoring systems picked up the problems right away. The on-call engineer quickly determined the magnitude of the problem and woke up the on-call Incident Commander. The IC contacted AWS, and began waking Heroku engineers to work on the problem.

Once it became clear that this was going to be a lengthy outage, the Ops team instituted an emergency incident commander rotation of 8 hours per shift, keeping a fresh mind in charge of the situation at all times. Our support, data, and other engineering teams also worked around the clock.

We prioritized getting top-paying customers back online over our larger base of free users, which is why those customers (particularly ones with dedicated databases) were back online much more quickly than free apps. While we think this prioritization makes sense, we do strive to provide a high level of service to everyone. Even though the outage was much shorter for our top customers (less than 16 hours in most cases) than for our free users (as much as 3 days in some cases), we measure our downtime as the time it took to get 100% of apps back online.

We updated our status page throughout the incident. Some folks have complained that our updates lacked detail or (in many cases) simply repeated previous updates. This is something we'll strive to improve, but it's actually a lot harder than it sounds. There are large swaths of time where it's simply a matter of continuing to restore databases from backups and otherwise bringing replacement systems online - one hour doesn't look much different from the next. What matters is that we had a full crew working hard at bringing everything back online, and the status updates were there to let everyone know we were still hard at work.

Remediation

It hardly needs stating, but we never want to put our customers, our users, or our engineering team through this again.

Failures at the IaaS layer will happen. It's Heroku's responsibility to shield our customers from them; part of our value proposition is abstracting these concerns away. We failed at this in a big way this weekend, and even as we speak our engineers are hard at work on architectural changes that will allow us to handle infrastructure outages of this magnitude with little or no disruption to our customers in the future.

There are three major lessons about IaaS we've learned from this experience:

1) Spreading across multiple availability zones in a single region does not provide as much partitioning as we thought. Therefore, we'll be taking a hard look at spreading to multiple regions. We've explored this option many times in the past - not for availability reasons, but for customers wishing to have their infrastructure physically closer to them for latency or legal reasons - and we've always chosen to prioritize it below other ways we could spend our time. It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and we'll need some way to propagate region information between apps and those services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region's impact on availability, we'll be treating it as a much higher priority.

2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can. Block storage has a physical locality that can't easily be transferred, and that is exactly what makes it unfriendly to the cloud. With this in mind, we'll be taking a hard look at how to reduce our dependence on EBS.

3) Continuous database backups for all. One reason we were able to recover the dedicated databases more quickly has to do with the way we back them up. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we took advantage of this to quickly recover the dedicated databases that had been taken down by EBS problems (a minimal sketch of the general technique follows below).

We have been porting this continuous backup system to our shared database servers for some time and were finishing up testing at the time of the outage. We’ve previously relied on point backups of individual databases in the event of a failure rather than the continuous full server backups that the new system makes use of. We are in the process of rolling out this updated backup system to all of our shared database servers; it’s already running on some of them and we are aiming to have it deployed to the remainder of our fleet in the next two weeks.
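
Heroku hasn't spelled out the mechanism here, but the general technique behind this kind of continuous backup is PostgreSQL's built-in continuous archiving: every completed WAL (write-ahead log) segment is shipped to durable storage, so any server can be rebuilt from a periodic base backup plus the archived WAL. The sketch below is a minimal, hypothetical archive_command helper that uploads segments to S3; the bucket name, script path, and configuration values are assumptions, not Heroku's actual setup.

```python
# Hypothetical archive_command helper: ship each completed WAL segment to S3.
# Assumed postgresql.conf settings (illustrative, not Heroku's configuration):
#   wal_level = replica          # "archive" on the PostgreSQL versions of 2011
#   archive_mode = on
#   archive_command = 'python3 /usr/local/bin/archive_wal.py %p %f'
import sys
import boto3

BUCKET = "example-wal-archive"  # placeholder bucket name

def archive_wal(wal_path: str, wal_name: str) -> int:
    """Upload one WAL segment; any non-zero exit code makes PostgreSQL retry."""
    s3 = boto3.client("s3")
    try:
        s3.upload_file(wal_path, BUCKET, f"wal/{wal_name}")
        return 0
    except Exception:
        return 1

if __name__ == "__main__":
    sys.exit(archive_wal(sys.argv[1], sys.argv[2]))
```

Recovery is then a matter of restoring the most recent base backup onto a fresh instance and replaying the archived WAL, which is what makes whole-server recovery largely automatic rather than a per-database restore from point backups.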

Conclusion

This outage was not acceptable. We don’t want to ever put our customers through something like this again and we’re working as hard as we can on making sure that we won’t ever have to. The level of support and patience we received from customers who had every right to be frustrated was amazing. We appreciate your trust in us and we’re going to live up to it.

On the bright side, we couldn’t be more proud of the work of our Ops, Database, and Support teams, and all of our engineers during this incident. Whether or not AWS suffers an outage of this magnitude ever again, we're glad to have the extra impetus to build Heroku into an ever-more resilient platform.

Resolved

We have fully restored git services for all applications and App Operations are functioning as expected. We will provide a postmortem for this incident once we've completed our internal review.

Update

At this time all shared databases and all dedicated databases are restored. App Operations are functioning as expected.

We're still working to restore git push functionality to a very small percentage of applications and expect to have this functionality restored soon.

Update

At this time most applications are online and we're making progress recovering the remaining affected shared databases. We do not have an ETA but the pace of recovery has increased.

We're still working to restore deploy functionality for all applications.

Update

At this time most applications are online. Deploys have been restored for a majority of applications too. We're still hard at work recovering the remaining affected shared databases and deploy functionality. We do not at this time have an ETA but are dedicated to fully restoring all functionality as quickly as possible.

Update

We are continuing to work on recovering shared databases and restore git push functionality to any affected applications with all available resources. We do not at this time have an ETA.

Update

We have successfully recovered a number of databases. Additional applications running on a few shared database servers are still being recovered. We are continuing to work on recovering these shared databases with all available resources. We do not at this time have an ETA for the full recovery of these shared databases.

Update

We are aware of the incredible difficulties this downtime is causing many customers.

Current Operational Status:
All dedicated database applications are fully online.
The majority of shared databases are online.
The majority of applications can deploy via git.
New app creation is fully working.

Next steps:
We are working with our service provider to restore both deployments and operation to affected shared databases as quickly as possible. In parallel, we are working on alternative recovery options.

Update

We have restored all dedicated databases and are continuing to work on the affected shared databases and deploy tools.

Update

We are still working through restoring the affected databases and restoring full deploy capabilities.

Update

We have successfully brought up our core services and begun restoring service to applications. Many applications are now fully operational. The remaining affected apps' databases are being brought online now. We will continue to work to bring the remainder of applications up as quickly as possible.

API services are now fully restored, and all gem commands are now working. Deploys are working for some applications. We will continue to work on restoring deploys for the remaining applications.

Update

We have successfully brought up our core services and begun restoring service to applications. Many applications are now fully operational. The remaining affected apps' databases are being brought online now. We will continue to work to bring the remainder of applications up as quickly as possible.

API services are now fully restored, and all gem commands are now working. We are working on restoring deploys.

Update

We are continuing to restore service to applications. In some cases the process of bringing many applications online simultaneously has created intermittent availability and elevated error rates. We continue to work to fully restore availability as quickly as possible.

Update

We have successfully brought up our core services and begun restoring service to applications. Many applications are now fully operational. We will continue to work to bring the remainder of applications up as quickly as possible and then restore API and git services.

Update

We have been able to successfully boot new servers and are in the process of restoring our core services. Once our core services come online we will be able to start to bring app operations back online. We will post further updates as soon as we have additional information.

Update

We continue to experience widespread connectivity issues that are preventing us from booting servers. We are working with our service provider to resolve this as soon as possible. We do not currently have an estimate for when this will be resolved. We will post further updates as soon as we have additional information.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

We are continuing to work with our service provider to resolve the outstanding connectivity issues. We will continue to update every half hour or as new information becomes available.

Update

There is nothing new to report. We're continuing to work to get full connectivity restored and will continue updating every half hour.

Update

There is nothing new to report. We're continuing to work to get full connectivity restored and will continue updating every half hour.

Update

There is nothing new to report. We're continuing to work to get full connectivity restored and will continue updating every half hour.

Update

There is nothing new to report. We're continuing to work to get full connectivity restored and will continue updating every half hour.

Update

We're still seeing elevated error rates due to connectivity issues and are working with our service provider to fully restore service.

Update

We're still seeing elevated error rates due to connectivity issues and are working with our service provider to fully restore service.

Update

We're still seeing elevated error rates due to connectivity issues and are working with our service provider to fully restore service.

Update

We're still seeing elevated error rates due to connectivity issues and are working with our service provider to fully restore service.

Update

We're still seeing elevated error rates due to connectivity issues and are working with our service provider to fully restore service.

Update

Connectivity issues are causing applications and tools to work intermittently. We're working with our network service provider to fully restore service. There is nothing new to report at this time.

Update

Connectivity issues are causing applications and tools to work intermittently. We're working with our network service provider to fully restore service at this time.

Update

We do not have anything new to report at this time. We're still working with our network service provider to fully restore connectivity. Applications and tools are working intermittently at this time. We'll continue to update on a 30 minute interval unless we have something new to report.

Update

We're continuing to work with our network service provider to fully restore connectivity. Applications and tools are working intermittently at this time.

Update

The elevated error rates are due to connectivity issues. We're continuing to work with our network service provider to fully restore connectivity. Applications and tools are working intermittently at this time.

Update

Error rates appear to have stabilized. Both applications and tools are functioning as expected at this time. We're continuing to keep a close eye on the situation as we investigate the root cause.

Issue

We are investigating high error rates. We'll post an update when we know more.
