We encountered a large number of failures of the block storage devices that help power our Heroku Postgres databases in one availability zone. Our team immediately began working to restore availability to affected databases, an effort that continued for several hours. In our post-incident review we identified several areas where improvements could reduce the time required to restore availability should future failures occur; these improvements are currently being implemented.
Additionally, the Heroku status site is typically updated when a large number of databases are affected, and the incident is closed when availability returns to acceptable levels; affected customers are also notified via a ticket with the status of their database. During this incident, many of these direct notifications were not opened correctly, creating uncertainty for customers with affected databases. We are putting additional safeguards in place to ensure this does not recur.
This issue is now resolved.
Our database engineers are restarting remaining affected database instances. We are continuing to monitor the situation.
Our database engineers have identified all affected databases and are in the process of recovering them. We are monitoring the situation.
We saw a number of customer databases go offline, as well as an increase in ELB latency as some ELB nodes became unhealthy. We did not see any other impact to production applications.
Unhealthy ELBs have now recovered, and our database engineers are working on recovering the remaining customer databases.
Our engineers confirmed issues with EBS-backed instances in a single availability zone. We are currently working with our infrastructure provider to resolve this.
We are seeing several databases in a single availability zone down at the moment. Our engineers are continuing to investigate the impact on production applications.
Our automated systems have detected potential platform errors. We are investigating.