The topic of major accidents in modern data centers raised questions that the first article left unanswered, so we decided to expand on it.
According to Uptime Institute statistics, the majority of data center incidents are related to power failures: they account for 39% of cases. Next comes the human factor, responsible for another 24% of accidents. The third most common cause (15%) is failure of the air conditioning system, and in fourth place (12%) are natural disasters. All other troubles together account for only 10%. Without questioning the data of this respected organization, let us look for what different accidents have in common and try to understand whether they could have been avoided. Spoiler: in most cases, they could.
Contact Science
Simply put, there are only two problems with power supply: either there is no contact where there should be one, or there is contact where there should not be. One can talk at length about the reliability of modern uninterruptible power supply systems, but they do not always save the day. Take the notorious case of the British Airways data center owned by parent company International Airlines Group. There are two such facilities near Heathrow Airport: Boadicea House and Comet House. On May 27, 2017, the first of them suffered an accidental power outage that led to an overload and failure of the UPS system. As a result, part of the IT equipment was physically damaged, and it took three days to eliminate the last of the failures.
The airline had to cancel or reschedule more than a thousand flights, and about 75,000 passengers could not fly on time; compensation payments alone cost $128 million, not counting what it took to restore the data centers to working order. The story behind the blackout is not entirely clear. According to the results of an internal investigation announced by International Airlines Group CEO Willie Walsh, it was caused by an engineers' error. However, the uninterruptible power supply system should have withstood such a shutdown: that is exactly what it was installed for. The data center was managed by specialists from the outsourcing company CBRE Managed Services, so British Airways tried to recover the damages through a London court.
Power failures tend to follow similar scenarios: first the power goes out through the fault of the utility provider, bad weather, or internal problems (including human error), and then the uninterruptible power supply system fails to cope with the load, or a brief interruption of the sine wave knocks out a host of services that take a lot of time and money to bring back. Can such accidents be avoided? Certainly, if the system is designed correctly; however, even the builders of large data centers are not immune to mistakes.
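Correct design here largely comes down to redundancy arithmetic: with N+1 sizing, the load must still fit after any single UPS unit fails. A minimal sketch of that check, with purely illustrative unit ratings and loads (none of these numbers come from the BA incident report):

```python
# Toy N+1 UPS sizing check: with any single unit offline,
# the remaining units must still carry the full IT load.
# Unit ratings and loads below are illustrative assumptions.

def survives_single_failure(unit_kw: float, units: int, load_kw: float) -> bool:
    """True if the load still fits after losing any one UPS unit."""
    return (units - 1) * unit_kw >= load_kw

# Four 300 kW units feeding a 900 kW load: losing one still leaves 900 kW.
print(survives_single_failure(300, 4, 900))  # True
# Three units would leave only 600 kW after a failure -- an overload.
print(survives_single_failure(300, 3, 900))  # False
```

The point is that the margin must hold in the failure case, not the normal case; a system sized only for nominal load fails exactly when it is needed most.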
Human factor
When wrong actions by data center personnel become the direct cause of an incident, the problems most often (but not always) affect the software part of the IT infrastructure. Such accidents happen even in large corporations. In February 2017, an incorrectly typed command by a member of the technical operations team at one of the data centers took down part of the Amazon Web Services servers. The error occurred while debugging the billing process for Amazon Simple Storage Service (S3) cloud storage customers. The employee intended to remove a number of virtual servers used by the billing system but hit a larger cluster.
As a result of the engineer's error, servers running important modules of Amazon's cloud storage software were deleted. The first to suffer was the indexing subsystem, which holds the metadata and location information for all S3 objects in the US-EAST-1 region. The incident also affected the subsystem used to host data and manage available storage space. After the virtual machines were removed, both subsystems required a complete restart, and then a surprise awaited Amazon's engineers: for a long time, the public cloud storage could not serve customer requests.
The impact was massive, since many large resources use Amazon S3. The outage affected Trello, Coursera, IFTTT and, worst of all, the services of major Amazon partners from the S&P 500 list. The damage in such cases is not easy to calculate, but it was on the order of hundreds of millions of US dollars. As you can see, one wrong command is enough to take down the service of the largest cloud platform. Nor is this an isolated case: on May 16, 2019, during maintenance work, the Yandex.Cloud service suffered a similar incident caused by operator error.
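In its public postmortem, Amazon said it changed its tooling so that capacity is removed more slowly and cannot be taken below a safe minimum. A toy sketch of that kind of guardrail follows; the class, function names, and thresholds are hypothetical, not the real AWS tooling:

```python
# Hypothetical safeguard in the spirit of the fix Amazon described:
# a removal tool that refuses to take a fleet below its minimum capacity.
# All names and numbers are illustrative assumptions.

class CapacityError(Exception):
    pass

def remove_servers(active: int, to_remove: int, minimum: int) -> int:
    """Return the new server count, refusing removals that go below minimum."""
    if to_remove < 0:
        raise ValueError("to_remove must be non-negative")
    if active - to_remove < minimum:
        raise CapacityError(
            f"removing {to_remove} of {active} would drop below minimum {minimum}"
        )
    return active - to_remove

print(remove_servers(100, 5, 60))   # 95: a routine removal goes through
try:
    remove_servers(100, 50, 60)     # a typo-sized removal is rejected
except CapacityError as err:
    print("blocked:", err)
```

The idea is simple: the tool, not the operator's attention span, enforces the invariant, so a mistyped argument produces an error message instead of an outage.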
Frozen cooling
In January 2017, there was a major accident in the Dmitrovsky data center of the Megafon company. At the time, the temperature in the Moscow region had dropped to −35 °C, which led to the failure of the facility's cooling system. The operator's press service said little about the causes of the incident: Russian companies are extremely reluctant to talk about accidents at their facilities, and in terms of openness we lag far behind the West. A version circulated on social networks about coolant freezing in pipes laid outdoors and a leak of ethylene glycol. According to this version, the operations service was unable to obtain 30 tons of coolant quickly because of the long holidays and improvised instead, organizing makeshift free cooling in violation of the system's operating rules. The severe cold exacerbated the problem: in January, winter suddenly arrived in Russia, although no one expected it. As a result, the staff had to de-energize part of the server racks, which left some of the operator's services unavailable for two days.
One might call this a weather anomaly, but such frosts are not unusual for the capital region. Winter temperatures in the Moscow area can drop even lower, which is why data centers there are designed for stable operation at −42 °C. Most often, cooling systems fail in the cold because of an insufficient concentration of glycols and too much water in the coolant solution. There are also problems with pipe installation, or miscalculations in the design and testing of the system, mostly driven by the desire to save money. The result is a serious accident out of the blue, one that could well have been avoided.
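The glycol concentration question is easy to sanity-check on paper. The sketch below uses freezing points for ethylene glycol/water mixtures rounded from commonly published engineering tables; treat the figures as approximate illustrations, not design data:

```python
# Approximate freezing points of ethylene glycol/water mixtures
# (volume % glycol -> freezing point in Celsius), rounded from
# commonly published tables. Illustrative values only.
FREEZE_POINT_C = {0: 0, 20: -8, 30: -14, 40: -23, 50: -37, 60: -52}

def min_glycol_for(design_temp_c: float) -> int:
    """Smallest tabulated glycol fraction that stays liquid at design_temp_c."""
    for pct in sorted(FREEZE_POINT_C):
        if FREEZE_POINT_C[pct] < design_temp_c:
            return pct
    raise ValueError("design temperature below the tabulated range")

# For a -42 C design point, 50% glycol (freezing near -37 C) is not enough:
print(min_glycol_for(-42))  # 60
print(min_glycol_for(-30))  # 50
```

A mixture sized for a "typical" −30 °C winter freezes solid in a −35 °C cold snap, which is exactly the failure mode described above.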
Natural disasters
Most often it is thunderstorms and/or hurricanes that disrupt the engineering infrastructure of a data center, leading to service outages and/or physical damage to equipment. Incidents provoked by bad weather occur quite often. In 2012, Hurricane Sandy swept across the US East Coast with heavy rain. The Peer 1 data center, located in a high-rise building in Lower Manhattan, lost its main power supply and had to switch to diesel generators.
The fuel pump also failed, so for several days the staff carried diesel up to the generators by hand. The team's heroism saved the data center from a serious accident, but was it really necessary? We live on a planet with a nitrogen-oxygen atmosphere and plenty of water, where thunderstorms and hurricanes are common (especially in coastal areas). Designers would do well to take these risks into account and build an appropriate uninterruptible power supply system, or at least choose a more suitable place for a data center than a high-rise on an island.
Everything else
In this category, the Uptime Institute collects a wide variety of incidents, among which it is hard to pick a typical one. Thefts of copper cables; cars crashing into data centers, power transmission poles, and transformer substations; fires; excavators cutting fiber optics; rodents (rats, rabbits, and even wombats, which are actually marsupials); and people who like to practice shooting at wires: the menu is extensive. Power failures can even cause
Source: habr.com