Heroku Status

Current Status and Incident Report

ssl:endpoint unavailability

Production 18h Development 2h

Follow-up

On Monday, December 24th 2012, some Heroku customers experienced an extended application outage caused by a failure of our ssl:endpoint add-on. This was a particularly severe event, affecting a critical service that our customers rely on to secure their web traffic. Heroku failed a number of our customers with the duration and severity of the outage, and for this we offer our deepest apologies.

This outage was caused by a disruption of the Elastic Load Balancer (ELB) service provided by Amazon Web Services, which in turn disabled approximately 30% of Heroku applications using our ssl:endpoint add-on. Affected applications were unable to serve any web traffic during the outage.

Details of the event

On December 25th at 01:00 UTC our platform monitors detected elevated error rates. Heroku's on-call engineers immediately investigated, and quickly determined that the problem was affecting our API service and a subset of ssl:endpoint applications, and that the problem was caused by a large number of malfunctioning ELBs. Specifically, requests to some ELB hostnames were not being forwarded to the backing servers, despite healthy EC2 instances and dynos being ready to serve these requests.

The responding engineers immediately opened an urgent support request with AWS reporting the problem, and generated a list of the ELBs that were showing problems.

Next, they attempted to provision new ELBs to replace key unhealthy ones. However, the ELB API was not responsive, and soon AWS disabled it entirely in order to contain the problem.

At this point, it became clear that the only complete solution to the problem was for ELB service to be restored. As a security precaution, Heroku does not retain the private keys used to secure customer SSL traffic. These keys are stored in the ELB service only, and therefore we were unable to intervene on behalf of our customers to restore service.

Heroku engineering and support staff then focused on working with customers and AWS to assist with their response.

On December 25th at 06:45 UTC, AWS began to restore service to ELBs, first to Heroku's core operational systems and then to our premium support customers. By 18:45 UTC, all ELBs were recovered and Heroku was fully operational.

Remediation

We continue to work to protect our customers from infrastructure-level failures, so that they can focus on building great apps. As we encounter new failure modes like this one, we work to limit the impact of similar failures in the future.

Before this outage, Heroku had already begun developing a new and simplified DNS system. Under this system, Heroku applications using SSL will no longer require a different DNS configuration - your domain's CNAME will always refer to yourapp.herokuapp.com. This not only simplifies custom domain and SSL setup, but it will also allow us to replace errant ELBs with no customer intervention. If we detect a failure of an application's ELB, we can provision a new one and update our internal DNS records to point to it.

Combined with maintaining a slack pool of ELBs, this technology will give us the ability to route around partial failures of the ELB service.
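
The replacement flow described above can be illustrated with a minimal sketch. It is not Heroku's implementation: the class name, the hostnames, and the caller-supplied health probe are all hypothetical, the internal DNS records are modeled as a plain dictionary, and the sketch leaves out how a replacement ELB would obtain the customer's certificate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ElbFailover:
    """Hypothetical model: internal DNS records plus a slack pool of spare ELBs."""
    dns: Dict[str, str]                # app hostname -> ELB hostname (internal record)
    slack_pool: List[str]              # pre-provisioned, unused ELB hostnames
    is_healthy: Callable[[str], bool]  # health probe supplied by the caller (assumption)
    replaced: Dict[str, str] = field(default_factory=dict)  # app -> retired ELB

    def _take_healthy_spare(self) -> Optional[str]:
        # Spares can be caught in the same partial failure, so probe each one
        # before handing it out; unhealthy spares are discarded.
        while self.slack_pool:
            spare = self.slack_pool.pop(0)
            if self.is_healthy(spare):
                return spare
        return None

    def reconcile(self) -> List[str]:
        """Repoint apps whose ELB fails the probe; return apps left unrepaired."""
        unrepaired = []
        for app, elb in list(self.dns.items()):
            if self.is_healthy(elb):
                continue
            spare = self._take_healthy_spare()
            if spare is None:
                unrepaired.append(app)  # pool exhausted, nothing to route to
                continue
            self.dns[app] = spare       # the customer's CNAME stays unchanged
            self.replaced[app] = elb
        return unrepaired
```

Because the customer-facing CNAME always refers to yourapp.herokuapp.com, only the internal mapping changes in this model. The size of the slack pool bounds how many failed ELBs can be replaced while new ones cannot be provisioned.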

In addition, we are researching ways to mitigate correlated service failures. We have prototyped two promising new technologies. The first partitions the Heroku service - limiting many outage scenarios to a subset of our cloud. The second provides geographic redundancy. Both of these are in the earliest stages of development, but we expect that they will yield a more robust and reliable Heroku service in the coming year.

Resolved

ssl:endpoint services are fully operational

Update

Less than 1% of ssl:endpoints continue to experience problems

Update

About 2% of ssl:endpoints remain to be recovered to normal operations

Update

About 6% of ssl:endpoints are still affected

Update

About 16% of ssl:endpoint deployments are currently impacted. Recovery efforts continue.

Update

About 25% of ssl:endpoints are currently affected by this ongoing incident

Update

Endpoints are continuing to recover, but a significant portion of endpoints continues to be affected. We are working to restore service as quickly as possible, and apologize for the continued impact.

Update

We are seeing an increase in the recovery rate of Endpoints. A significant proportion of ssl endpoints remain unavailable, and recovery work is ongoing.

Update

Endpoints are being brought back online as AWS restores service. A significant proportion of ssl endpoints remain unavailable, and recovery work is ongoing.

Update

At this point we believe the impact to be limited to a subset of ssl:endpoint deployments. We are working to recover all affected endpoints.

Update

The Heroku API is back online and available. We continue to work with AWS to restore availability to affected applications.

Update

At least some apps using ssl:endpoint are affected. We are continuing to investigate and assess the full impact of the issue.

Update

At least some apps using ssl:endpoint are affected. We are continuing to investigate and assess the full impact of the issue. We are currently investigating some issues with Heroku API availability.

Issue

At least some apps using ssl:endpoint are affected. We are continuing to investigate and assess the full impact of the issue.

Investigating

An Elastic Load Balancer issue in Amazon's us-east-1 region is affecting some Heroku customers
