Habr postmortem report: fell on a newspaper

The end of June and the beginning of July 2019 turned out to be a difficult stretch, marked by several major outages of global IT services. Among the notable ones: two serious incidents in the CloudFlare infrastructure (the first caused by sloppy, negligent BGP handling on the part of some US ISPs; the second by a botched deployment on CloudFlare's own side, which affected everyone behind CF, and that is a lot of well-known services) and unstable operation of Facebook's CDN infrastructure (which affected all FB products, including Instagram and WhatsApp). We got our share of trouble as well, although our outage was far less noticeable against the global backdrop. Since some people have already started dragging in black helicopters and "sovereign" conspiracies, we are publishing a public post mortem of our incident.


03.07.2019, 16:05
We began to register problems with our resources that looked like a loss of internal network connectivity. Without having checked everything thoroughly, we started to suspect the external channel toward DataLine, until it became clear that the problem was with internal access to the Internet (NAT), to the point that we brought down the BGP session toward DataLine.
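A minimal triage sketch of the kind of check that helps tell these cases apart (an illustration, not the tooling actually used during the incident): probe a well-known external endpoint over TCP from a host behind NAT and from a host with direct routing; if only the NAT'ed host fails, the translation layer is the prime suspect. The probe targets below are public resolvers chosen purely for illustration.

```python
# Illustration only: TCP reachability probe to distinguish an internal NAT
# failure from an external channel outage. Run from a host behind NAT and
# from a directly routed host, then compare the results.
import socket

PROBES = [("1.1.1.1", 53), ("8.8.8.8", 53)]  # public resolvers, TCP/53

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in PROBES:
        print(f"{host}:{port} -> {'ok' if reachable(host, port) else 'FAIL'}")
```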

03.07.2019, 16:35
It became obvious that the equipment performing network address translation and providing access from the site's local network to the Internet (NAT) had failed. Attempts to reboot the equipment led nowhere, so we started looking for alternative ways to restore connectivity without waiting for a reply from technical support, since, in our experience, it most likely would not have helped.

The problem was somewhat aggravated by the fact that this same equipment also terminated the inbound client VPN connections of our employees, which made remote recovery work more difficult.

03.07.2019, 16:40
We tried to revive a pre-existing spare NAT scheme that had worked in the past. But it became clear that a number of network re-equipments had rendered this scheme almost completely unusable: restoring it could, at best, do nothing and, at worst, break what was still working.

We started working through a couple of ideas for moving traffic onto the set of new routers serving the backbone, but they looked unworkable because of the way routes are distributed in the core network.

03.07.2019, 17:05
At the same time, a problem was discovered with name resolution on the name servers, which led to errors resolving endpoints in applications, so we started quickly filling hosts files with records for critical services.
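Roughly, that workaround amounts to pinning critical names to known addresses in hosts files. The sketch below is a hedged illustration of that step; the host names and addresses are placeholders, not real Habr infrastructure.

```python
# Hedged illustration of the hosts-file workaround: append pinned entries for
# critical services so applications can keep resolving them while DNS is broken.
HOSTS_FILE = "/etc/hosts"

PINNED = {  # placeholder names and addresses, not real infrastructure
    "db-master.internal.example": "10.0.0.10",
    "cache.internal.example": "10.0.0.20",
}

def pin_hosts(path: str = HOSTS_FILE) -> None:
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n# temporary pins added during the DNS outage\n")
        for name, ip in PINNED.items():
            f.write(f"{ip}\t{name}\n")

if __name__ == "__main__":
    pin_hosts()
```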

03.07.2019, 17:27
Habr was restored to limited operation.

03.07.2019, 17:43
In the end, a relatively safe way was found to route traffic through exactly one of the border routers, and it was quickly put in place. Internet connectivity was restored.

Over the next few minutes, the monitoring systems received a flood of notifications about monitoring agents coming back, but some services remained down because name resolution on the name servers (DNS) was still broken.


03.07.2019, 17:52
The name servers were restarted and their caches flushed. Resolution was restored.
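A quick way to confirm that resolution really came back is to walk the list of critical endpoints and resolve each one. The snippet below is an assumed sketch of such a check, not taken from the original report; the names are placeholders.

```python
# Assumed post-restart check: resolve each critical endpoint and report failures.
import socket

CRITICAL_NAMES = ["habr.com", "db-master.internal.example"]  # placeholders

def check(names):
    for name in names:
        try:
            addrs = {ai[4][0] for ai in socket.getaddrinfo(name, None)}
            print(f"{name}: {', '.join(sorted(addrs))}")
        except socket.gaierror as err:
            print(f"{name}: RESOLUTION FAILED ({err})")

if __name__ == "__main__":
    check(CRITICAL_NAMES)
```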

03.07.2019, 17:55
All services were back up except MK, Freelance and Toaster.

03.07.2019, 18:02
MK and Freelance came back up.

03.07.2019, 18:07
We brought back the BGP session with DataLine, which had turned out to be entirely blameless.

03.07.2019, 18:25
We started noticing flapping on some resources; it was caused by the change of the NAT pool's external address and its absence from the ACLs of a number of services, which were quickly corrected. Toaster came back up right after that.
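The underlying issue is easy to check for once you know to look: discover the current egress (post-NAT) address and compare it against the allowlists of dependent services. The sketch below assumes a public what-is-my-IP endpoint and made-up allowlists; it is an illustration, not the actual procedure used.

```python
# Hedged sketch: compare the current external (post-NAT) address against
# per-service allowlists. The allowlist contents are placeholders.
import urllib.request

def current_egress_ip() -> str:
    # api.ipify.org returns the caller's public IP as plain text
    with urllib.request.urlopen("https://api.ipify.org", timeout=5) as resp:
        return resp.read().decode().strip()

SERVICE_ACLS = {
    "payment-gateway": {"198.51.100.10", "198.51.100.11"},
    "partner-api": {"198.51.100.10"},
}

if __name__ == "__main__":
    ip = current_egress_ip()
    for service, acl in SERVICE_ACLS.items():
        print(f"{service}: egress {ip} is {'allowed' if ip in acl else 'MISSING from ACL'}")
```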

03.07.2019, 20:30
We noticed errors from the Telegram bots. It turned out that the new external address had not been added to a couple of ACLs (on the proxy servers); this was quickly fixed.
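Once the proxy ACLs are updated, a check like the one below can confirm that the Bot API is reachable again through the proxy. This is an assumed sketch, not the team's actual script; BOT_TOKEN and PROXY_URL are placeholders.

```python
# Assumed check that the Telegram Bot API is reachable through the proxy.
import os
import requests

BOT_TOKEN = os.environ.get("BOT_TOKEN", "123456:PLACEHOLDER")
PROXY_URL = os.environ.get("PROXY_URL", "http://proxy.internal.example:3128")

def bot_reachable() -> bool:
    resp = requests.get(
        f"https://api.telegram.org/bot{BOT_TOKEN}/getMe",
        proxies={"https": PROXY_URL},
        timeout=10,
    )
    return resp.ok

if __name__ == "__main__":
    print("Telegram Bot API reachable:", bot_reachable())
```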


Conclusions

  • The equipment that had long raised doubts about its fitness finally failed. There had been plans to take it out of service, since it hindered the development of the network and had compatibility problems, but at the same time it performed a critical function, so any replacement without interrupting services was technically non-trivial. Now we can move forward with that.
  • DNS problems can be avoided by moving the DNS servers closer to the new backbone network, outside the NAT'ed network, while keeping full connectivity to the gray (private) network without translation (this had been planned before the incident).
  • Domain names should not be used when assembling RDBMS clusters: the convenience of transparently changing an IP address is of little value, since such manipulations require rebuilding the cluster anyway. This decision was dictated by historical reasons and, above all, by endpoints being recognizable by name in the RDBMS configurations. In general, a classic trap (see the sketch after this list).
  • In effect, we have just carried out an exercise comparable to the "sovereignization of the Runet"; there is something to think about in terms of strengthening our capacity for autonomous survival.
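On the RDBMS point above, a hedged sketch of the suggested approach: resolve cluster member names once, at deploy time, and commit the pinned addresses into the configuration, so that a DNS outage cannot take the cluster down. The node names below are placeholders.

```python
# Sketch under assumed names: resolve cluster members once and pin the IPs
# into configuration instead of relying on DNS at runtime.
import socket

CLUSTER_MEMBERS = ["pg-node-1.internal.example", "pg-node-2.internal.example"]

def pin_members(names):
    pinned = {}
    for name in names:
        # first A record; in practice this value would be reviewed and
        # committed into the cluster config, not resolved on every start
        pinned[name] = socket.gethostbyname(name)
    return pinned

if __name__ == "__main__":
    for name, ip in pin_members(CLUSTER_MEMBERS).items():
        print(f"{name} -> use static endpoint {ip}")
```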

Source: habr.com
