Starting at 12:56 UTC on Thursday, the 24th of January, an electrical defect in our datacenter created a chain reaction leading to a complete shutdown of the platform. At 18:40 UTC the same day, all apps and databases were successfully restarted without any data loss. This post details the course of events, analyses the team's reaction, and outlines the actions that will be taken to improve the situation in the future.
This incident was the longest since the genesis of Scalingo. It started with a major hardware failure in our datacenter, followed by a recovery period to get all the databases and apps up and running again.
During the first part of the incident, the whole team was online but waiting for the hardware infrastructure to be available. For quite a long time, we were told by our infrastructure provider that it was a networking problem. We did not expect a complete reboot of all our infrastructure.
Once power was back in our infrastructure, the recovery started. We first realized that some of our internal services had failed to restart properly. We rely on etcd for service discovery, but our etcd cluster was partially down because of a wrong configuration on one of the hosts after the reboot. We couldn't execute our recovery scripts without this cluster and first had to find a way to start it before proceeding to the next step.
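For illustration, here is a minimal sketch of the kind of per-member health probe that helps spot such a situation quickly, using the official etcd v3 Go client. The endpoint names are hypothetical and this is not our actual tooling.

```go
// Minimal per-member etcd health probe (sketch, illustrative endpoints).
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"https://etcd-1:2379", "https://etcd-2:2379", "https://etcd-3:2379"}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Println("cannot create etcd client:", err)
		return
	}
	defer cli.Close()

	// Probe each member individually: a cluster can still have quorum
	// while one misconfigured node refuses to rejoin after a reboot.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		status, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: UNHEALTHY (%v)\n", ep, err)
			continue
		}
		fmt.Printf("%s: healthy, leader=%x, version=%s\n", ep, status.Leader, status.Version)
	}
}
```

Running such a check before any recovery script would have told us immediately which member needed to be fixed by hand.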
Many of our microservices are actually typical Scalingo apps, the same kind you host on Scalingo. Eating our own dog food is a great thing, as we are our own first customers and testers. But with these internal services down, we had to manually bootstrap some basic services. It took about 45 minutes to restore the most important internal services.
To recover all the databases, we had good tooling: within 30 minutes, the vast majority of them were up and running. Some did not come back up, as the power shutdown had put them in a recovery state which had to be handled manually by an operator. We eventually succeeded in recovering all databases without any data loss.
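As a rough sketch of what such triage looks like, the loop below tries a trivial connection against each instance and flags the ones that need an operator. The connection strings are placeholders and PostgreSQL is only used as an example driver; this is not our actual recovery tooling.

```go
// Post-restart database triage sketch: ping each instance and flag
// the ones needing manual attention. DSNs below are placeholders.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/lib/pq" // PostgreSQL driver, used as an example
)

func main() {
	instances := map[string]string{
		"db-1": "postgres://user:secret@10.0.0.11:5432/app1?sslmode=disable",
		"db-2": "postgres://user:secret@10.0.0.12:5432/app2?sslmode=disable",
	}

	for name, dsn := range instances {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			fmt.Printf("%s: invalid DSN, manual check required (%v)\n", name, err)
			continue
		}
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		err = db.PingContext(ctx)
		cancel()
		db.Close()
		if err != nil {
			// Typically still replaying its journal or refusing to start:
			// hand the instance over to an operator.
			fmt.Printf("%s: NOT READY, manual recovery needed (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s: up and answering\n", name)
	}
}
```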
Once the databases were available, we focused on recovering the applications. The component responsible for ordering servers to start applications is the scheduler. It is designed to always choose the most available nodes. This strategy works great in a normal operational context, but we quickly noticed that in a recovery situation it was really slowing us down. The algorithm responsible for choosing the best node sorts the list of available nodes based on different monitoring metrics, which are refreshed every ~30 seconds. Hence, when the scheduler was asked to start applications within this 30-second window, only the small subset of nodes considered the most available was ordered to start containers. Consequently, all applications were scheduled on a handful of servers, leading to timeout errors. We had to slow down the app recovery process to prevent this problem from happening. The recovery got slower at this point, but applications kept being restarted. This part of the recovery had already been tested on our staging infrastructure, but the issue only appeared at the scale of the production environment. We also observed a lack of tooling to follow this recovery process.
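The failure mode can be illustrated with a simplified model (this is not our actual scheduler code): as long as the cached metrics do not change, a sort-by-most-available strategy keeps electing the same node for every container started within the refresh window. Adjusting the cached metric locally after each placement is one possible way to spread the load.

```go
// Simplified model of the scheduling pitfall: with metrics refreshed
// only every ~30s, "most available" keeps pointing at the same node.
package main

import (
	"fmt"
	"sort"
)

type node struct {
	name    string
	freeRAM int // MB, as reported by the last metrics refresh
}

// pickNode returns the node with the most free RAM according to cached metrics.
func pickNode(nodes []node) *node {
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].freeRAM > nodes[j].freeRAM })
	return &nodes[0]
}

func main() {
	nodes := []node{{"node-a", 4096}, {"node-b", 3584}, {"node-c", 3072}}

	// Naive behaviour: every container started within one metrics window
	// lands on node-a, because its cached freeRAM never changes.
	for i := 0; i < 10; i++ {
		n := pickNode(nodes)
		fmt.Printf("container %2d -> %s\n", i, n.name)
	}

	// Mitigation sketch: decrement the cached metric after each placement
	// so the next decision accounts for containers already in flight.
	for i := 0; i < 10; i++ {
		n := pickNode(nodes)
		n.freeRAM -= 512
		fmt.Printf("container %2d -> %s (cached free: %d MB)\n", i, n.name, n.freeRAM)
	}
}
```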
After the attempt which resulted in server overloads and timeouts, we did not have the observability needed to know what was operational and what was not. Because of this lack of overview, some applications had to be restarted multiple times. The corollary problem was that we could not estimate efficiently and precisely how much was up and how much was still to be done. The estimations we gave publicly were based on the RAM consumption of our infrastructure: we knew how much RAM the platform was using before the incident, so a simple computation on the current RAM usage gave us an approximate recovery percentage.
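The computation behind those public estimates is straightforward; the numbers below are purely illustrative, not the real figures from the incident.

```go
// Rough recovery-progress estimate from aggregate RAM usage
// (illustrative numbers only).
package main

import "fmt"

func main() {
	const baselineRAMGB = 2048.0 // total RAM used across the platform before the incident
	const currentRAMGB = 1536.0  // RAM currently used while apps are being restarted

	progress := currentRAMGB / baselineRAMGB * 100
	fmt.Printf("estimated recovery progress: %.0f%%\n", progress) // ~75%
}
```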
So ultimately, no operation went really wrong during the recovery; it was just not fast enough, for all the reasons stated in the previous paragraphs. We should be much better prepared and more efficient at this process. That is why a series of actions will be planned and executed.
As the incident involved two different actors, actions have to be taken by our infrastructure provider to ensure that such an electrical fault can no longer have this kind of impact on the infrastructure (routers / servers / SANs). They immediately rearranged their power supply equipment in order to improve the balance of the electrical load. Additionally, more power capacity will be provisioned this week.
The second part of the incident was on our side, and it can be improved for a faster recovery. Our plan for the future is composed of two parts, organisational and technical, because we identified that both can be improved.
From an organisational perspective, the following actions will be taken:
From all the technical issues identified during the post-incident analysis, multiple technical measures will be taken:
On a broader scale, we have been working for a while on a multi-datacenter presence. First, it aims at being present in more than one data center. In the longer term, the goal is to provide data center resiliency in case of a major incident like the one that happened last week.
We sincerely want to thank our customers, who have been very understanding and encouraging during the incident. You chose to use Scalingo because you don't want to handle the infrastructure part in your company. You have more time for what really matters: your code and your applications, not system administration tasks (nor disaster recovery).
This time the recovery has been long, and we will work on reducing it as much as possible. We will use Thursday's incident to improve our processes and our technical stack to be more efficient. The needed resources will be dedicated to developing better internal tooling and, ultimately, to speeding up the recovery process.
Finally, while this incident's downtime is still within the range of our Terms of Service, we fully understand that it was one long, uninterrupted downtime period. Therefore a financial compensation will follow for customers in our first tier (99.9% availability guaranteed with at least 2 web containers running). It will be deducted automatically from the next invoice.
1: On the 24th of January 2019 at 13:56 (Paris time, UTC+1), an appliance deployed in one of our data centers encountered an electrical issue impacting one of its two power supplies. This electrical fault tripped the circuit breaker of the related power supply circuit. The whole rack where this equipment is deployed should still have been powered through the second power circuit, but unfortunately its circuit breaker also tripped, due to what we identified as an overload generated by all rack devices switching their redundant power supplies to this single circuit.
The issue required an on-site intervention in the data center, which consisted of several verifications before restoring the power supply and then sequentially powering on the devices to identify the one that caused the initial circuit breaker trip. All devices were fully available at 15:30 (Paris time, UTC+1).