BGP configuration error caused a 27-minute Cloudflare outage

Cloudflare, which provides a content delivery network for 27 million Internet resources and serves traffic for 13% of the 1000 largest sites, has published details of an incident that disrupted many segments of its network for 27 minutes, including those delivering traffic to London, Chicago, Los Angeles, Washington, Amsterdam, Paris, Moscow and St. Petersburg. The problem was caused by an incorrect configuration change on a router in Atlanta. During the incident, which lasted from 21:12 to 21:39 UTC on July 17, total traffic on the Cloudflare network dropped by approximately 50%.

While performing maintenance intended to shift some traffic away from one of the backbones, the engineers deactivated a single line in the policy block that handles backbone routes filtered against a specified prefix list. The correct change would have been to deactivate the entire block, but by mistake only the line referencing the prefix list was deactivated.

{master}[edit]
atl01# show | compare
[edit policy-options policy-statement 6-BBONE-OUT term 6-SITE-LOCAL from]
!       inactive: prefix-list 6-SITE-LOCAL { … }

Block content:

from {
    prefix-list 6-SITE-LOCAL;
}
then {
    local-preference 200;
    community add SITE-LOCAL-ROUTE;
    community add ATL01;
    community add NORTH-AMERICA;
    accept;
}

With the prefix-list condition no longer in effect, the rest of the block applied to all prefixes, and the router began sending all of its BGP routes to the routers of other backbones. As it happened, these routes carried a higher priority (local-preference 200) than the priority (100) assigned to other routes by the automatic traffic optimization system. So instead of traffic being removed from the backbone, higher-priority BGP routes were leaked, traffic destined for other backbones was drawn to Atlanta, and the resulting router overload brought down part of the network.
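
The effect of a term that loses its match condition can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Cloudflare's configuration or a model of Junos itself: the Route class, the apply_term, exported and best_path helpers, and the 2001:db8 documentation prefixes are all hypothetical, and prefix matching is reduced to exact string comparison.

```python
from dataclasses import dataclass, field
from typing import Optional

ATL_COMMUNITIES = {"SITE-LOCAL-ROUTE", "ATL01", "NORTH-AMERICA"}


@dataclass
class Route:
    prefix: str
    local_pref: int = 100  # priority used for other routes in the article
    communities: set = field(default_factory=set)


def apply_term(route: Route, prefix_list: Optional[set]) -> Optional[Route]:
    """Apply a simplified version of the 6-SITE-LOCAL term.

    prefix_list=None models a term with no active `from` condition,
    which therefore matches every route (assumption: exact-match only).
    """
    if prefix_list is not None and route.prefix not in prefix_list:
        return None  # no match: the route is not handled by this term
    # The `then` actions: raise the priority, tag the route, accept it.
    return Route(route.prefix, 200, route.communities | ATL_COMMUNITIES)


def exported(routes: list, prefix_list: Optional[set]) -> list:
    """Routes that the term accepts and sends to the other backbones."""
    accepted = (apply_term(r, prefix_list) for r in routes)
    return [r for r in accepted if r is not None]


def best_path(candidates: list) -> Route:
    """Pick the candidate with the highest local preference."""
    return max(candidates, key=lambda r: r.local_pref)


if __name__ == "__main__":
    # Hypothetical routing table on the Atlanta router.
    table = [Route("2001:db8:a::/48"), Route("2001:db8:b::/48"),
             Route("2001:db8:c::/48")]
    site_local = {"2001:db8:a::/48"}

    print(len(exported(table, site_local)))  # filter active
    print(len(exported(table, None)))        # filter deactivated

    # A remote router compares its usual route with the leaked one.
    usual = Route("2001:db8:b::/48", 100, {"OTHER-SITE"})
    leaked = exported(table, None)[1]
    print(best_path([usual, leaked]).communities)
```

With the filter in place only the site-local prefix is exported; without it every route is exported with local preference 200, so a remote router comparing it against a route at 100 picks the path through Atlanta.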

To prevent similar incidents, several changes to the configuration of Cloudflare's backbones are planned for Monday. A limit on the maximum number of prefixes (maximum-prefix) will be added to BGP sessions, which will shut off a problematic backbone if too many prefixes are sent through it. Had this limit been in place earlier, the error would have disconnected the Atlanta backbone but would not have affected the rest of the network, since the Cloudflare network is designed to tolerate the failure of individual backbones. Among the changes already made is a revision of the priorities (local-preference) for local routes, which prevents a single router from influencing traffic in other parts of the network.
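
The idea behind a maximum-prefix limit can be sketched as follows. This is a simplified illustration, not router code: the BgpSession class, the session name and the limit of 3 are hypothetical, and real implementations may offer other reactions (for example, only logging a warning) depending on configuration.

```python
class BgpSession:
    """Toy model of a BGP session guarded by a maximum-prefix limit."""

    def __init__(self, name: str, max_prefixes: int):
        self.name = name
        self.max_prefixes = max_prefixes
        self.received = set()
        self.established = True

    def receive(self, prefix: str) -> bool:
        """Accept one announcement; tear the session down if the limit is exceeded."""
        if not self.established:
            return False  # session is already down, nothing is accepted
        self.received.add(prefix)
        if len(self.received) > self.max_prefixes:
            # Too many prefixes: drop the session and everything learned on it,
            # so the leaked routes never influence traffic elsewhere.
            self.established = False
            self.received.clear()
            return False
        return True


if __name__ == "__main__":
    session = BgpSession("atl01-backbone", max_prefixes=3)  # hypothetical values
    for i in range(5):
        session.receive(f"2001:db8:{i}::/48")
    print(session.established, len(session.received))
```

The session stays up while the peer announces the expected handful of prefixes and is torn down as soon as the count exceeds the limit, which isolates the misbehaving backbone instead of letting its routes spread.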

Source: opennet.ru
