Incident Report: restarting the platform after a complete electrical shutdown of the infrastructure

TL;DR

Starting at 12:56 UTC on Thursday the 24th of January, an electrical defect in our datacenter triggered a chain reaction that led to a complete shutdown of the platform. By 18:40 UTC the same day, all apps and databases had been successfully restarted without any data loss. This post details the course of events, analyses the reaction of the team, and lists the actions that will be taken to improve the situation in the future.

Timeline of the incident

[12:56 UTC] First alerts: all our websites are unreachable and operators are paged. The whole team, all working remotely, is all hands on deck. We quickly notice that we have no network access to the platform: VPN access is down and all customer applications are unavailable. We contact our infrastructure provider.
[13:10 UTC] Our provider’s team is already working on the incident and assessing the situation.
[13:25 UTC] The first feedback we get is that a BGP router seems to be down and might be the cause of the issue. Later on, we understand that the whole infrastructure has actually suffered an electrical failure. A detailed explanation from our provider is given in footnote ¹.
[14:30 UTC] Electricity is back, and the network is up as well as our servers. The whole infrastructure has been rebooted. The first priority is to restart our internal services and make sure they run correctly (database clusters, orchestrator, etc.). The second objective is to restore, as soon as possible, all customer apps as well as their database addons.
[15:13 UTC] All the required internal services are up and running; we start restoring customer databases.
[15:45 UTC] 95%+ of databases are up and running, as well as all our internal services. We start recovering customer applications and manually look at the databases which haven’t restarted correctly.
[18:40 UTC] Apps have been progressively redeployed on the infrastructure up to this point, when the last ones are back online. This means apps were unavailable for between 2h49 and 5h44.
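
The two bounds appear to correspond to the time elapsed between the first alerts (12:56 UTC) and, respectively, the start of application recovery (15:45 UTC) and its completion (18:40 UTC). That mapping is our reading of the timeline rather than something stated explicitly. A minimal Python sketch of the arithmetic, with a helper name (`elapsed`) chosen purely for illustration:

```python
from datetime import datetime


def elapsed(start: str, end: str) -> str:
    """Return the time between two same-day HH:MM timestamps as 'XhMM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    hours, remainder = divmod(int(delta.total_seconds()), 3600)
    return f"{hours}h{remainder // 60:02d}"


first_alert = "12:56"
print(elapsed(first_alert, "15:45"))  # shortest downtime: 2h49
print(elapsed(first_alert, "18:40"))  # longest downtime: 5h44
```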

Analysis

This incident was the longest since the creation of Scalingo. It started with a major hardware failure in our datacenter, followed by a recovery period to get all the databases and apps up and running again.

During the first part of the incident, the whole team was online but waiting for the hardware infrastructure to become available. For quite a long time, our infrastructure provider told us that it was a networking problem. We didn’t expect a complete reboot of our whole infrastructure.

Once power was back in our infrastructure, the recovery started. We first realized that some of our internal services had failed to restart properly. We rely on etcd for service discovery, but our etcd cluster was partially down because of a wrong configuration on one of the hosts due to the reboot. We couldn’t execute our recovery scripts without this cluster, so we first had to find a way to start it before proceeding to the next step.

Many of our microservices are actually typical Scalingo apps, the same kind you host on Scalingo. Eating our own dog food is a great thing, as we are our first customers and testers. However, since not all the internal services were up and running, we had to manually bootstrap some basic services. It took about 45 minutes to restore the most important internal services.

To recover all the databases, we had great tooling, and within 30 minutes the vast majority of them were up and running. Some didn’t come back up because the power loss had put them in a recovery state which had to be handled manually by an operator. We eventually succeeded in recovering all databases without any data loss.

Once the databases were available, we focused on recovering the applications. The component responsible for ordering servers to start applications is the scheduler. It is designed to always choose the most available nodes. This strategy works great in a normal operational context, but we quickly noticed that in a recovery situation it was really slowing us down. The algorithm responsible for choosing the best node sorts the list of available nodes based on various monitoring metrics which are refreshed every ~30 seconds. Hence, when the scheduler was asked to start many applications within one 30-second window, only the few nodes considered the most available were ordered to start containers. Consequently, all applications were scheduled on a handful of servers, leading to timeout errors. We had to slow down the app recovery process to prevent this problem from happening. The recovery got slower, but applications kept being restarted. This part of the recovery had already been tested on our staging infrastructure, but the issue only appeared at the scale of the production environment. We also observed a lack of tooling to follow this recovery process.
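
To illustrate the effect described above, here is a minimal Python sketch. It is not our scheduler's actual code: the `Node` class, the node names, the load figures, and both selection functions are hypothetical. It only shows why ranking nodes on metrics that do not change between two refreshes sends every placement to the same node, and how counting the placements made since the last refresh spreads them out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    reported_load: int  # load seen at the last metrics refresh (may be stale)
    pending: int = 0    # placements made since that refresh


def pick_stale(nodes):
    """Rank nodes on the last reported metrics only."""
    return min(nodes, key=lambda n: n.reported_load)


def pick_with_pending(nodes):
    """Hypothetical alternative: also count placements made since the refresh."""
    return min(nodes, key=lambda n: n.reported_load + n.pending)


def schedule(nodes, app_count, pick):
    """Place app_count apps within a single refresh window."""
    for _ in range(app_count):
        pick(nodes).pending += 1
    return {n.name: n.pending for n in nodes}


def fresh_nodes():
    return [Node("node-a", 2), Node("node-b", 3), Node("node-c", 3)]


# With stale metrics, node-a stays "most available" for the whole window.
print(schedule(fresh_nodes(), 9, pick_stale))
# Counting pending placements spreads the same 9 apps across all nodes.
print(schedule(fresh_nodes(), 9, pick_with_pending))
```

In this toy run the first strategy places all nine apps on `node-a`, while the second ends up with every node carrying a similar total load. A real recovery mode would need more than this (capacity limits, image locality, concurrency), so treat it as an illustration of the failure mode only.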

After the attempt which resulted in server overloads and timeouts, we didn’t have the means of observation to know what was operational and what was not. Because of this lack of overview, some applications had to be restarted multiple times. The corollary was that we couldn’t estimate precisely how much was up and how much was still to be done. The estimates we gave publicly were based on the RAM consumption of our infrastructure: we knew how much RAM we were using before the incident, so by comparing it with the current RAM usage, we derived an estimate with a simple computation.
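
The RAM-based estimate can be sketched as a simple ratio. The Python snippet below is an illustration only: the function name and the figures are invented, and it assumes that RAM usage grows roughly in proportion to the number of restarted containers, which is a coarse approximation.

```python
def recovery_progress(current_ram_gb: float, baseline_ram_gb: float) -> float:
    """Estimate the recovered fraction from RAM usage versus the usual level.

    baseline_ram_gb is the RAM used before the incident; both values here
    are hypothetical and only illustrate the computation.
    """
    if baseline_ram_gb <= 0:
        raise ValueError("baseline_ram_gb must be positive")
    # Cap at 100%: usage can temporarily exceed the pre-incident level.
    return min(current_ram_gb / baseline_ram_gb, 1.0)


# Example with made-up figures: 1200 GB in use out of a usual 4800 GB.
print(f"{recovery_progress(1200, 4800):.0%}")  # 25%
```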

So ultimately, no operation went really wrong during the recovery; it was just not fast enough, for all the reasons stated in the previous paragraphs. We should be much better prepared and more efficient in carrying out this process. That is why a number of actions will be planned and executed.

Actions taken and future plan

As the incident involved two different actors, some actions have to be taken by our infrastructure provider to ensure that such an electrical fault can’t have this kind of impact on the infrastructure (routers / servers / SANs). They immediately rearranged their power supply equipment in order to better balance electricity consumption. Additionally, more power capacity will be provisioned this week.

The second part of the incident was on our side, and the recovery could have been faster. Our plan for the future has two parts, organisational and technical, because we determined that both can be improved.

From an organisational perspective, the following actions will be carried out:

Crash recovery training: the team had trained to recover a completely down infrastructure on the staging environment. However, this was the first time such a major event happened to the company on the production environment, at this scale. We plan to organize regular training in our staging environment to ensure our processes are sound and that we’re able to recover efficiently from an infrastructure disaster. We will put more stress on this environment to close the gap between the two environments.
Process definitions and tests: we maintain internal documentation containing processes for support queries as well as incident recovery, but it is still incomplete. Processes to apply in major incidents like this one are not all present, because of how rare such incidents are. Resources will be invested to improve the current operational documentation. Of course, those processes have no value if they are not tested; that’s the purpose of the training sessions defined above.
Internal communication improvement: during this incident, all the members of the team were working remotely. The communication went well but could be improved. We mostly used Slack to communicate internally. It worked great, but at some point we lost time finding operational information (who was doing precisely what, etc.). These elements had been explicitly written down, but were then buried under other messages. Next time, a channel dedicated to the incident will be created, as well as a real-time co-authored document where each operator can write and update the tasks they are working on.

Based on the technical issues we identified during the post-incident analysis, several technical measures will be taken:

A low-level diagnostic tool will be developed to help operators check the state of all the basic components and their dependencies (etcd, DNS, message queue, etc.) automatically, without having to check each component manually.
A recovery mode will be added to our scheduler. It will distribute apps differently from the normal workflow, in order to restart applications faster and to avoid overloading the servers running them. This recovery mode should also disable the garbage collection of the application image cache: applications restart faster if their images are already present on the servers and don’t have to be downloaded.
A high-level diagnostic tool will be developed in order to track the recovery of the platform. It was difficult for the operators to keep track of the progress of the recovery process. We need such information to communicate efficiently with our customers.
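
As an illustration of what the low-level diagnostic tool could look like, the Python sketch below runs health checks in dependency order and reports which components are merely blocked by a failed dependency. Everything in it is an assumption made for the example: the dependency map, the component names beyond those listed above, and the stubbed checks. The real tool does not exist yet and may be designed differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: component -> components it needs.
DEPENDENCIES = {
    "dns": set(),
    "etcd": {"dns"},
    "message-queue": {"dns"},
    "scheduler": {"etcd", "message-queue"},
}


def diagnose(checks, dependencies=DEPENDENCIES):
    """Run each check in dependency order.

    A component is only checked when all its dependencies are "ok";
    otherwise it is reported as blocked, which points at the root cause.
    """
    report = {}
    for component in TopologicalSorter(dependencies).static_order():
        broken = sorted(d for d in dependencies[component] if report[d] != "ok")
        if broken:
            report[component] = "blocked by " + ", ".join(broken)
        else:
            report[component] = "ok" if checks[component]() else "failed"
    return report


# Stubbed checks for the example; real ones would probe each service.
checks = {
    "dns": lambda: True,
    "etcd": lambda: False,  # simulate a partially down etcd cluster
    "message-queue": lambda: True,
    "scheduler": lambda: True,
}

for component, status in diagnose(checks).items():
    print(f"{component}: {status}")
```

With the simulated etcd failure, the report marks `etcd` as failed and `scheduler` as blocked by it, instead of listing two unrelated failures for the operator to untangle.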

On a broader scale, we’ve been working for a while on a multi-datacenter presence. The first aim is to be present in more than one data center. Our longer-term goal, however, is to provide data center resiliency in case of a major incident like the one that happened last week.

A few words to conclude

We sincerely want to thank our customers, who have been very understanding and encouraging during the incident. You chose to use Scalingo because you don’t want to handle the infrastructure part in your company. You want more time for what really matters: your code and your applications, not system administration tasks (nor disaster recovery).

Here the recovery time was long, and we’ll work on reducing it as much as possible. We will use Thursday’s incident to improve our processes and our technical stack in order to be more efficient. The necessary resources will be dedicated to developing better internal tooling and, eventually, to speeding up the recovery process.

Finally, while the downtime caused by this incident is still within the range of our Terms of Service, we fully understand that it was one long, uninterrupted period of downtime. Therefore, financial compensation will follow for customers in our first tier (99.9% availability guaranteed with at least 2 web containers running). It will be deducted automatically from the next invoice.

Footnotes

1: On the 24th of January 2019 at 13:56 (Paris Time UTC+1), one appliance deployed in one of our data centers encountered an electrical issue impacting one of its two power supplies. This electrical fault tripped the circuit breaker of the related power supply circuit. The whole rack where this equipment is deployed should still have been powered through the second power circuit, but unfortunately its circuit breaker also tripped, due to what we could identify as an overload generated by all rack devices switching their redundant power supply to this single circuit.

The issue required an on-site intervention in the data center, which consisted of several verifications before restoring the power supply and then sequentially powering on the devices to identify the one that had caused the initial circuit breaker to trip. All devices were fully available at 15:30 (Paris Time UTC+1).
