Network incident, 13 October 2021
13 October - 6pm
For weeks now, we have been facing an increase in DDoS attacks, which we counter every day.
To improve our defence mechanisms, we continually strengthen our configurations and so raise the level of protection offered to our customers.
A change was prepared and validated by our Change Advisory Board (CAB), in line with the applicable Method of Procedure (MOP) and peer review (announced on 2021-10-12 at 16:28 CET).
2021-10-13 09:05 CET - The planned change is initiated as scheduled: (http://travaux.ovh.net/?do=details&id=53785)
2021-10-13 09:18 CET - Change actions are carried out as expected (BGP isolation, configuration updates).
2021-10-13 09:20 CET - While the network configuration was being modified, a problem occurred: the router did not interpret a command correctly. The aim of the command was to regulate the redistribution of BGP into OSPF. All IPv6 traffic remained accessible.
2021-10-13 09:21 CET - The team detected a problem with the router's performance and started the escalation process immediately.
2021-10-13 09:25 CET - The crisis management process is launched, in accordance with our procedures (the delay before launching it is due to the team waiting for the expected convergence time of the change).
2021-10-13 09:30 CET - The rollback procedure didn't work, so we took the decision to physically isolate the associated equipment and triggered on-site physical support.
2021-10-13 09:45 CET - The datacentre team is dispatched to the telecom room to launch the second bypass plan.
2021-10-13 10:00 CET - The datacentre technician begins operations in the telecom room (3:00 am local time USA)
2021-10-13 10:02 CET - The first request was for the disconnection of the optical equipment to isolate connectivity and restore service as quickly as possible.
2021-10-13 10:10 CET - We make the decision to power off the faulty router.
2021-10-13 10:18 CET - The defective device is switched off (it takes two minutes for network convergence).
2021-10-13 10:20 CET - First services are restored.
2021-10-13 10:30 CET - The stabilisation of connectivity to restore all remaining services begins.
2021-10-13 10:57 CET - This is the end of the crisis from a technical point of view.
2021-10-13 10:30 CET - Actions continue to verify the stability of our network and to complete the restoration of the remaining adjacent, non-blocking services (these tasks will be handled as follow-up actions).
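The intervals between the milestones above can be worked out directly from the timestamps. The following is a minimal Python sketch; the dictionary keys and the function name are illustrative labels chosen for this example, not OVHcloud terminology.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all CET, 2021-10-13).
# The keys are illustrative labels, not official milestone names.
EVENTS = {
    "change_started": "09:05",
    "problem_occurred": "09:20",
    "router_powered_off": "10:18",
    "first_services_restored": "10:20",
    "technical_end_of_crisis": "10:57",
}


def minutes_between(start_key, end_key, events=EVENTS):
    """Return the whole number of minutes between two timeline events."""
    fmt = "%H:%M"
    start = datetime.strptime(events[start_key], fmt)
    end = datetime.strptime(events[end_key], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print(minutes_between("problem_occurred", "first_services_restored"))
    print(minutes_between("problem_occurred", "technical_end_of_crisis"))
```

With the timestamps as published, this gives 60 minutes from the problem occurring to the first services being restored, and 97 minutes to the technical end of the crisis.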
OVHcloud operates a global network that spans all continents. To ensure the best possible access for its customers, this network is fully meshed.
By nature, a mesh means that all the routers participating in the network are connected to each other, directly or indirectly, and constantly exchange routing information.
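As a rough illustration of what "connected to each other, directly or indirectly" means, the sketch below treats routers as nodes of a graph and checks that each one can be reached from any other. The router names and the `is_connected` function are hypothetical and do not describe OVHcloud's actual topology or tooling.

```python
from collections import deque


def is_connected(links):
    """Return True if every router can reach every other router,
    directly or through intermediate routers.

    `links` is an iterable of (router_a, router_b) pairs, each pair
    standing for one bidirectional link.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    if not adjacency:
        return True

    # Breadth-first search from an arbitrary router.
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


if __name__ == "__main__":
    # Hypothetical routers: A reaches C only indirectly, through B.
    print(is_connected([("A", "B"), ("B", "C")]))
```

The same property that makes such a network resilient also means that routing information injected at one router can propagate to all the others.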
During the outage, the entire internet routing table was announced in the OVHcloud IGP. The massive influx of routing information on the IGP led some routers to behave in an unstable manner. With the OSPF table full, their RAM and CPU became overloaded. Only IPv4 routing was affected; all IPv6 traffic remained accessible.
A convergence loop between BGP and OSPF occurred, rendering IPv4 routing inoperable. This made it impossible to process IPv4 traffic correctly on all of our websites.
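To make the failure mode concrete, the sketch below models redistribution as a simple selection step: an allow-list decides which BGP prefixes may enter the IGP, and a ceiling acts as a second safeguard if the filter is not in effect. This is only an illustration under stated assumptions. The function name, the allow-list, the ceiling and the prefixes (taken from the documentation ranges) are all hypothetical, and it is neither router configuration syntax nor OVHcloud's actual setup.

```python
def select_for_igp(bgp_prefixes, allow_list, max_prefixes):
    """Pick the BGP prefixes that would be redistributed into the IGP.

    allow_list   -- set of prefixes permitted to enter the IGP, or None
                    to model a filter that is not in effect.
    max_prefixes -- hypothetical ceiling on redistributed prefixes.
    """
    if allow_list is None:
        # No filter in effect: every BGP prefix becomes a candidate.
        candidates = list(bgp_prefixes)
    else:
        candidates = [p for p in bgp_prefixes if p in allow_list]

    if len(candidates) > max_prefixes:
        raise RuntimeError(
            f"refusing to redistribute {len(candidates)} prefixes "
            f"(limit is {max_prefixes})"
        )
    return candidates


if __name__ == "__main__":
    # Documentation prefixes standing in for a BGP table.
    table = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]

    # Filter in effect: only the permitted prefix enters the IGP.
    print(select_for_igp(table, {"192.0.2.0/24"}, max_prefixes=2))

    # Filter not in effect: the ceiling stops the whole table going in.
    try:
        select_for_igp(table, None, max_prefixes=2)
    except RuntimeError as exc:
        print(exc)
```

In this toy model, the ceiling is what prevents an unfiltered table from reaching the IGP; whether an equivalent safeguard is available depends on the equipment in question.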
We were able to regain control of the situation very quickly, by accessing the faulty equipment and isolating it from the network.
(After D2 was taken offline, network convergence began, emptying the OSPF tables on devices and routing traffic to the nominal gateways.)
Our immediate plan of action is to re-evaluate our change procedure for this type of equipment (which natively applies the command line) and to strengthen the related change process accordingly.
Since this incident has had an impact on our customers’ use of IPv4, our teams around the world have been monitoring the situation as closely as possible to help customers restore their services and keep them informed.
You can view all information on our operations on our dedicated platform: http://travaux.ovh.net/
We apologise for any inconvenience this may have caused you.
13 October - 11:30am
On 13 October at 9:12am (CET/Paris time), we carried out interventions on a router in our Vint Hill datacentre in the United States, which caused disruptions to our entire network. These interventions were aimed at reinforcing our anti-DDoS protections, as attacks have been particularly intense in recent weeks.
The OVHcloud teams quickly intervened to isolate the equipment at 10:15am. Services have been restored since this intervention.
We are currently contacting our customers to confirm that all their services have been restored.
We sincerely apologise to all our customers who were affected and will be as transparent as possible about the causes and consequences of this incident.