Follow-up on the 30th Sept 2016 downtime

September 30, 2016 - 3:05AM

TL;DR

From 00:05 CEST on 30th September 2016, our infrastructure was cut off from the Internet: all requests reaching your apps timed out. The problem was solved at 01:40 CEST. It was an external networking issue on our provider's side, not a problem related to our infrastructure.

Detailed timeline of the events

At 00:05 CEST, the phone of Leo (our CTO) rang. Not the usual tune when someone is calling, but a PagerDuty alarm telling us that something was wrong with our platform. After a short analysis, it was clear that the problem was network related, not related to our software stack, and that we had to get hold of those responsible for the level below us: our infrastructure provider.

It is no secret that we’re not working with a public cloud like Amazon Web Services or Google Cloud (which are not exempt from downtime either). Instead, we rely on a smaller French company which operates 2 datacenters, in Paris and in Strasbourg. We made this choice to have better control over the price/performance ratio and because we have known those guys for a long time (yes, we keep telling them that their website should be in English too).

So at 00:10 CEST, we reached their support, actually engineers who had also been woken up because something was wrong. They acknowledged our problem, telling us that it was impacting several of their customers and that their team was on it. Ok, all right.

Then the ‘dark’ period started, the period where you feel blind: there was nothing we could do except wait for our infrastructure to get juice (i.e. network) again. As we had no problem with our own infrastructure, applications and databases kept running as expected. Harassing their team would not have helped, as the problem had been acknowledged and everyone's stress level was, obviously, very high.

We got a first message at 00:30 CEST saying that the cause might be a BGP router blocking the traffic, which would explain why our infrastructure was in the dark. Still, there was no solution, and the stress was increasing minute after minute. Once again, all we could do was wait for them to solve this.

At 01:00 CEST, they called us back to say they had found the real problem, which was not related to the BGP router, but to the firewall/load balancer positioned just after it. They were looking for a workaround to restore the service as quickly as possible, before solving the underlying problem.

It was only around 01:40 CEST that our servers got the internet back and that all applications and the platform itself came back online. All the information we have at the moment from our provider is that the problem came from their load balancers (which are positioned just in front of our servers). These were bypassed to give internet access back to our infrastructure. Once the real problem was solved later, we went back to the initial networking configuration.

What we have learned

First, our emergency notification system is working all right: a few minutes after the problem appeared, someone was awake and ready to take things in hand. That is a good point. Even though we test it, it is critical that such an alarm fires correctly in real situations.

Then, in the hosting world, we have positioned ourselves in the Platform as a Service area. It means that we ‘use’ an infrastructure provider, install our software stack, and provide a service to customers in order to make their lives easier, which includes choosing this infrastructure provider for them. To the question “are we working with the good guys?”, we answer yes. The contract we’ve signed with them states the following:

  • There won’t be any downtime lasting more than 4h
  • There won’t be more than 12h of downtime per year

So far, both of these conditions have been respected and we’ve never come close to breaking them. No infrastructure provider is exempt from trouble (Google Cloud and Heroku had downtime of around 1h in the last few weeks), but it is important to us to explain this failure openly and quickly.
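
For readers who like to check the arithmetic, here is a minimal sketch comparing this incident with the two contract limits quoted above. The function and constant names are purely illustrative (they are not part of our tooling), and the check only covers this single incident, not every outage of the year.

```python
from datetime import datetime, timedelta

# Limits taken from the two contract terms listed above.
MAX_SINGLE_OUTAGE = timedelta(hours=4)
MAX_YEARLY_DOWNTIME = timedelta(hours=12)


def within_contract(outages):
    """Return True if a list of outage durations respects both limits.

    `outages` is a list of timedelta values for one year (hypothetical input).
    """
    longest = max(outages, default=timedelta(0))
    total = sum(outages, timedelta(0))
    return longest <= MAX_SINGLE_OUTAGE and total <= MAX_YEARLY_DOWNTIME


# This incident: 00:05 CEST to 01:40 CEST on 30th September 2016.
incident = datetime(2016, 9, 30, 1, 40) - datetime(2016, 9, 30, 0, 5)

print(incident)                     # 1:35:00
print(within_contract([incident]))  # True
```

At 1h35, this outage stays well under the 4h limit for a single incident, and it uses up roughly an eighth of the 12h yearly allowance.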