Incident Report: Mitigating a massive DDoS impacting our infrastructure

TL;DR

On Thursday, July 16th, 2020, our infrastructure was impacted by what was eventually diagnosed as a Distributed Denial of Service (DDoS) attack, causing partial or total unavailability of our osc-fr1 and osc-secnum-fr1 regions. The attack impacted both the Scalingo platform itself and the applications hosted for our customers. This post details the course of events, analyses the team's reaction, and describes the actions that will be taken to improve the situation in the future.

Timeline of the incident

All timestamps are in Central European Summer Time (CEST).

[10:58] First alerts: our external probes detect some components of our infrastructure as unreachable and our team is alerted. A post is created on our status page announcing networking issues when accessing the infrastructure. Diagnosis begins.
[11:00] Our infrastructure is unreachable. Our public IPs in both the osc-fr1 and osc-secnum-fr1 regions do not respond to any type of request made by our operators. Our infrastructure provider, Outscale, is contacted.
[11:18] The situation returns to normal. Interruption duration: 20 minutes. We receive acknowledgement that a network issue was impacting access to the whole network backbone.
[12:30] Alerts are triggered again. The behaviour is the same as in the previous wave, but only on osc-fr1: network access is unavailable for the second time.
[13:00] Outscale informs us that the source of the problem seems to be a DDoS attack.
[13:40] Outscale decides to drop the traffic to the targeted IP of our infrastructure (i.e. to blackhole it) to limit the impact of the attack on their infrastructure. HTTP/HTTPS traffic to the applications is now routed via our failover IP, which leaves us with a single IP to handle customers' requests to their applications. Some of our customers may have experienced connection timeouts during this failover.
[13:57] Networking seems to be operational again. Interruption duration: 1h27.
[16:45] New wave of attack, this time on our failover IP, lasting until 16:51: 6 minutes of unavailability.
[17:41] Alerts are ringing again. We are in close communication with Outscale about what can be done to mitigate the attack for good. Fortunately, they had been evaluating a DDoS mitigation platform for a few weeks. They propose to set it up for us so that malicious traffic is dropped before it reaches the data center.
[18:05] It is decided to stop routing any traffic through the two IPs targeted by the attack and to start using a new public IP.
[18:17] A new public IP is provisioned and used by default for all apps using *.osc-fr1.scalingo.io or domains configured with a CNAME DNS record. Users with an A record are encouraged to contact us on the support chat to get the new IP to use.
[18:30] The attack is still ongoing and has an impact on the whole region. Outscale is still setting up the anti-DDoS solution mentioned above.
[19:15] The mitigation solution is set up and our initial public IPs are restored. The attack is still ongoing, but it no longer has any impact on Scalingo's infrastructure or on Outscale's. This last wave led to a partial unavailability of 1h34.

Cumulatively, this incident caused 1h53 of complete interruption and 1h34 of partial interruption.
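The cumulative figures can be re-derived from the timeline. The following is a minimal sketch that sums the waves listed above; the helper names `minutes_between` and `as_hours_minutes` are illustrative and not part of any Scalingo tooling, and the start and end times are simply the timestamps quoted in the timeline.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

def as_hours_minutes(minutes: int) -> str:
    """Format a number of minutes as e.g. '1h53'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h{rest:02d}"

# (start, end) of each wave, copied from the timeline (CEST).
total_outages = [("10:58", "11:18"), ("12:30", "13:57"), ("16:45", "16:51")]
partial_outages = [("17:41", "19:15")]

total_minutes = sum(minutes_between(s, e) for s, e in total_outages)
partial_minutes = sum(minutes_between(s, e) for s, e in partial_outages)

print(as_hours_minutes(total_minutes))    # 1h53
print(as_hours_minutes(partial_minutes))  # 1h34
```

The three complete interruptions (20, 87 and 6 minutes) add up to 113 minutes, which matches the 1h53 stated above.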

Analysis

This incident was the first massive DDoS attack endured by Scalingo. We had been attacked in the past, but not to this extent. The effect of such a DDoS attack is to completely fill the network pipes with forged traffic, preventing legitimate traffic from reaching its destination.

Thousands of IPs from all around the world targeted our infrastructure with a flood of requests: the attacker generated 8 Gbps of traffic and attempted approximately 1,000,000 connections per second. The attack targeted port 443 (HTTPS) and generated HTTPS requests with extremely large headers to amplify the volume of data sent to our platform. We assume it came from a rented botnet, i.e. a large set of infected devices controlled remotely. According to Cloudflare statistics, 92% of DDoS attacks in 2019 were under 10 Gbps, so this attack was in the upper range of common DDoS attacks.
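To give a sense of scale, the two figures above can be combined into a rough average volume per attempted connection. This is only a back-of-the-envelope sketch: it assumes that both figures describe the same time window, that "Gbps" means 10^9 bits per second, and it says nothing about how the traffic was actually distributed across connections. The function name is illustrative.

```python
BITS_PER_BYTE = 8

traffic_bits_per_second = 8e9        # 8 Gbps, figure quoted in this report
connections_per_second = 1_000_000   # approximate figure quoted in this report

def average_bytes_per_connection(bits_per_second: float, connections: float) -> float:
    """Average number of bytes sent per attempted connection."""
    return bits_per_second / BITS_PER_BYTE / connections

avg = average_bytes_per_connection(traffic_bits_per_second, connections_per_second)
print(f"{avg:.0f} bytes per attempted connection on average")
```

Under these assumptions, the average comes out at about 1,000 bytes per attempted connection.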

The target of the attack was a specific website hosted on our platform. It has not been determined why this website was the target of such an attack. It does not store any sensitive information, nor anything valuable that would usually motivate this type of attack.

Having multiple IP addresses to serve our infrastructure did not help at first, since the whole Internet bandwidth of the data center was filled with attacker requests, blocking access to the entire network regardless of the target IP. Our services were impacted, as were other entities sharing the same infrastructure. Later on, however, having multiple IP addresses helped us mitigate the attack, because the way each of them was routed could be modified individually.

Impact

As listed in the timeline of the incident, there were 4 waves of attack. During these waves, the following symptoms appeared:

Timeouts when accessing Scalingo services (Dashboard, APIs, Deployment, One-off containers, etc.)
Auto-deployments and Review apps from SCM integrations were failing. We might have missed operations, since webhooks from the different platforms were not reaching our services either.
Timeouts or Connection Refused errors when reaching the applications deployed on Scalingo.

Communication

Our status page https://scalingostatus.com was updated regularly throughout the day.

We answered all messages coming through Intercom, either via the in-app chat or through our support email support@scalingo.com.

Our Twitter account @ScalingoHQ posted about the major parts of the incident.

Specific information was sent directly to some customers and to people who asked.

Actions Taken and Future Plan

During the incident, several actions were attempted to mitigate the impact of the attack on Outscale's infrastructure and on Scalingo customers.

The first action taken was to identify the attacked IP and propagate a "Blackhole" route instruction to neighbors. This action was taken by Outscale to prevent the traffic from jamming their whole network backbone. We subsequently used only the second IP to route customers' requests, until the attack also impacted it.
During normal operation, Scalingo has 2 main IPs through which hosted applications are reached. We quickly configured a third, failover IP in case Outscale needed to blackhole our two usual IPs. We kept this third IP secret until we really needed it. We started using it at 18:17, as shown in the timeline.
Outscale doesn't have an official anti-DDoS solution as a public product yet. However, they were in the process of evaluating such a solution. In view of the seriousness of the incident, it was decided at the end of the afternoon to use it to mitigate the attack. They plan to extend its usage to prevent future attacks automatically.
HTTP request management: a queue system already exists. If an application is not able to cope with the intake of HTTP requests, then after 50 requests have been queued per container, subsequent requests are automatically dropped, which prevents our HTTP routers from being overloaded. This process worked correctly and is not going to be changed.
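The queueing behaviour described in the last item can be illustrated with a bounded queue that sheds load once it is full. This is a minimal sketch and not Scalingo's actual router code: the `ContainerQueue` class, its method names and the `dropped` counter are hypothetical, and only the limit of 50 queued requests per container comes from this report.

```python
from collections import deque

class ContainerQueue:
    """Bounded FIFO queue for one container; rejects requests once full."""

    def __init__(self, max_queued: int = 50):
        self.max_queued = max_queued   # limit taken from the report
        self._pending = deque()
        self.dropped = 0               # hypothetical counter, for illustration

    def offer(self, request) -> bool:
        """Queue the request, or drop it if the queue is already full."""
        if len(self._pending) >= self.max_queued:
            self.dropped += 1
            return False
        self._pending.append(request)
        return True

    def take(self):
        """Return the oldest queued request, or None if nothing is waiting."""
        return self._pending.popleft() if self._pending else None

    def __len__(self) -> int:
        return len(self._pending)

# Simulate 60 requests arriving while the container processes none of them.
queue = ContainerQueue(max_queued=50)
accepted = sum(queue.offer(f"request-{i}") for i in range(60))
print(accepted, queue.dropped)  # 50 10
```

Dropping requests at a fixed queue depth keeps memory use per container bounded, which is the property that protects the routers when an application cannot keep up.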

A Few Words to Conclude

First, we want to thank our customers, who have been very understanding and encouraging during the incident.

We're fully aware that such an incident has a significant impact on all of you. That's why our team runs a 24/7 on-call rotation to manage the infrastructure so you don't have to.

Incidents happen, and it's our role to handle them as well as possible and to be prepared for them.

DDoS attacks are always difficult to deal with due to their nature: they are unsophisticated attacks that simply fill whatever capacity is available. We had written procedures to handle them, but the scale of this attack was huge.

We had great contact with the Outscale team. We feel confident in our choice of partner here: they helped us get back on our feet thanks to their network of partnerships.

We're closely monitoring the implementation of the anti-DDoS feature proposed by Outscale to protect your applications from this kind of attack in the future.

Financial compensation

As stated in our Terms of Service, we offer a 99.9% SLA to our Business customers (99.96% for databases using a Business plan).

We're fully aware that the downtime which occurred on July 16th has heavily impacted this commitment.

Therefore, all Business customers will automatically receive financial compensation of 10% on their invoice for the month of July (5% per hour of downtime).
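The arithmetic behind these figures can be sketched as follows. Two points are assumptions rather than statements from our Terms of Service: the SLA is evaluated here over the 31 days of July, and every started hour of downtime is counted as a full hour, which is one rule that yields 10% for the 1h53 of total interruption. The function names are illustrative.

```python
import math

def allowed_downtime_minutes(sla_percent: float, days_in_month: int) -> float:
    """Downtime budget implied by an availability target over one month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - sla_percent) / 100

def compensation_percent(downtime_minutes: int, percent_per_hour: int = 5) -> int:
    """Assumed rule: every started hour of downtime counts as a full hour."""
    return math.ceil(downtime_minutes / 60) * percent_per_hour

downtime = 113  # 1h53 of total interruption, from the timeline
budget = allowed_downtime_minutes(99.9, days_in_month=31)

print(f"{budget:.2f}")                 # 44.64
print(downtime > budget)               # True
print(compensation_percent(downtime))  # 10
```

Under these assumptions, a 99.9% target leaves about 44.64 minutes of downtime for the month, so the 113 minutes of total interruption exceed it.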

To qualify as a Business user you must own at least one application with a database using a Business plan and at least 2 containers serving web or TCP traffic to your app.