The main cause of accidents in data centers is the gasket between the computer and the chair

The topic of major accidents in modern data centers raised questions that the first article left unanswered, so we decided to expand on it.

According to Uptime Institute statistics, the largest share of data center incidents is related to power failures: they account for 39% of incidents. The human factor follows with another 24%. The third most common cause (15%) is failure of the air conditioning system, and natural disasters come fourth (12%). All other troubles together account for only 10%. Without questioning the data of a respected organization, let's single out what different accidents have in common and try to understand whether they could have been avoided. Spoiler alert: in most cases, they could.

Contact Science

Simply put, there are only two kinds of power supply problems: either there is no contact where there should be one, or there is contact where there should be none. One can talk at length about the reliability of modern uninterruptible power supply systems, but they do not always save the day. Take the notorious case of the British Airways data center, owned by parent company International Airlines Group. There are two such facilities not far from Heathrow Airport: Boadicea House and Comet House. On May 27, 2017, the first one suffered an accidental power outage that led to an overload and failure of the UPS system. As a result, part of the IT equipment was physically damaged, and it took three days to clear up the last of the failures.

The airline had to cancel or reschedule more than a thousand flights, and about 75 thousand passengers could not fly on time. Compensation alone cost $128 million, not counting the cost of restoring the data centers to working order. The reasons for the blackout remain unclear. According to an internal investigation, whose results were announced by International Airlines Group CEO Willie Walsh, it was caused by an error by engineers. However, the uninterruptible power supply system should have withstood such a shutdown: that is what it was installed for. The data center was managed by specialists from the outsourcing company CBRE Managed Services, so British Airways tried to recover the damages through a London court.

Power failures follow similar scenarios. First the power goes out, through the fault of the electricity supplier, sometimes because of bad weather or internal problems (including human error). Then either the uninterruptible power supply system cannot cope with the load, or a brief interruption of the sine wave brings down many services, and restoring them takes a lot of time and money. Can such accidents be avoided? Undoubtedly, if the system is designed correctly. However, even the creators of large data centers are not immune to errors.
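
One way to picture "designed correctly" is a redundancy check: the UPS modules that survive a failure must still cover the load. The sketch below is only an illustration of that idea. The function name, the module ratings and the load figures are hypothetical, and real sizing also involves battery runtime, power factor and transfer behavior, which are not modeled here.

```python
from typing import Sequence


def carries_load_with_failures(module_kw: Sequence[float],
                               load_kw: float,
                               failures: int = 1) -> bool:
    """Return True if the load is still covered after losing modules.

    Pessimistic assumption: the modules that fail are the largest ones.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    keep = max(len(module_kw) - failures, 0)
    # Sort ascending and keep only the smallest `keep` modules.
    surviving = sorted(module_kw)[:keep]
    return sum(surviving) >= load_kw


# Hypothetical example: three 200 kW modules feeding the IT load.
modules = [200.0, 200.0, 200.0]
ok_at_350 = carries_load_with_failures(modules, load_kw=350.0)  # 400 kW left
ok_at_450 = carries_load_with_failures(modules, load_kw=450.0)  # 400 kW left
```

With these made-up numbers, a 350 kW load survives the loss of one module, while a 450 kW load does not, even though the full set of modules could carry it.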

Human factor

When the wrong actions of data center personnel are the direct cause of an incident, the problems most often (but not always) affect the software part of the IT infrastructure. Such accidents happen even in large corporations. In February 2017, a mistyped command from a member of the technical operations team at one of the data centers took part of the Amazon Web Services servers offline. The error occurred while debugging the billing process for customers of the Amazon Simple Storage Service (S3) cloud storage. The employee tried to remove a number of virtual servers used by the billing system, but hit a larger cluster.

As a result of the engineer's error, servers running important software modules of Amazon's cloud storage were removed. The first to suffer was the indexing subsystem, which holds the metadata and location information for all S3 objects in the US-EAST-1 region. The incident also affected the subsystem used to place data and manage available storage space. After the virtual machines were removed, both subsystems required a complete restart, and then a surprise awaited Amazon's engineers: for a long time, the public cloud storage could not serve customer requests.

The effect was massive, since many large resources use Amazon S3. The outages affected Trello, Coursera, IFTTT and, worst of all, the services of large Amazon partners from the S&P 500 list. The damage in such cases is not easy to calculate, but it was on the order of hundreds of millions of US dollars. As you can see, one wrong command is enough to take down a service of the largest cloud platform. This is not an isolated case: on May 16, 2019, during maintenance work, the Yandex.Cloud service removed the virtual machines of users in the ru-central1-c zone that had been in the SUSPENDED status at least once. This time client data suffered, and some of it was irretrievably lost. Of course, people are not perfect, but modern information security systems have long been able to check the actions of privileged users before executing the commands they enter. If such solutions are implemented at Yandex or Amazon, such incidents can be avoided.
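
The idea of checking a privileged command before it runs can be sketched as a small guard that limits how much of a server pool a single command may remove. This is a minimal illustration, not a description of the tooling used at Amazon or Yandex. The names `RemovalPolicy`, `check_removal` and `RemovalRejected`, as well as the limits themselves, are assumptions made up for the example.

```python
from dataclasses import dataclass


class RemovalRejected(Exception):
    """Raised when a removal request fails the pre-execution check."""


@dataclass(frozen=True)
class RemovalPolicy:
    # Hypothetical limits chosen for illustration only.
    max_percent_per_request: int = 5  # remove at most 5% of a pool per command
    min_remaining: int = 3            # never shrink a pool below this size


def check_removal(pool_size: int, requested: int,
                  policy: RemovalPolicy = RemovalPolicy()) -> int:
    """Validate a request to remove `requested` servers from a pool.

    Returns the approved count, or raises RemovalRejected.
    """
    if requested <= 0:
        raise RemovalRejected("nothing to remove")
    if requested > pool_size:
        raise RemovalRejected("request exceeds pool size")
    if pool_size - requested < policy.min_remaining:
        raise RemovalRejected("pool would drop below the minimum size")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    if requested * 100 > pool_size * policy.max_percent_per_request:
        raise RemovalRejected("request exceeds the per-command share limit")
    return requested
```

Under these assumed limits, `check_removal(100, 5)` approves the request, while a typo that turns 5 into 50 raises `RemovalRejected` before anything is deleted.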

Frozen cooling

In January 2017, there was a major accident at the Dmitrovsky data center of Megafon. The temperature in the Moscow region dropped to -35 °C at the time, which led to the failure of the facility's cooling system. The operator's press service did not say much about the causes of the incident: Russian companies are extremely reluctant to talk about accidents at their facilities, and in terms of openness we are far behind the West. A version circulated on social networks about coolant freezing in pipes laid outdoors and an ethylene glycol leak. According to this version, the operations team could not quickly obtain 30 tons of coolant because of the long holidays and made do with improvised means, organizing impromptu free cooling in violation of the system's operating rules. Severe cold made the problem worse: in January, winter suddenly happened in Russia, although no one expected it. As a result, the staff had to power down some of the server racks, which left some of the operator's services unavailable for two days.

One could perhaps call this a weather anomaly, but such frosts are not unusual for the capital region. Winter temperatures in the Moscow region can drop even lower, so data centers are designed for stable operation at -42 °C. Most often, cooling systems fail in the cold because the glycol concentration in the coolant solution is too low and the water content too high. There are also problems with pipe installation and miscalculations in the design and testing of the system, mainly caused by the desire to save money. As a result, a serious accident occurs out of the blue, although it could well have been avoided.
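
The link between glycol concentration and the design temperature can be sketched as a simple check. The freezing points below are rounded, approximate values for ethylene glycol and water mixtures by volume, included here as illustrative assumptions and not as design data. The function names and the 5 °C safety margin are also hypothetical.

```python
# Approximate freezing points (percent glycol by volume, degrees Celsius).
# Rounded illustrative values; use the coolant vendor's data in practice.
APPROX_FREEZE_POINTS_C = [
    (0, 0.0), (10, -3.0), (20, -8.0), (30, -14.0),
    (40, -24.0), (50, -37.0), (60, -53.0),
]


def approx_freeze_point_c(glycol_percent: float) -> float:
    """Linearly interpolate the freezing point from the table above."""
    points = APPROX_FREEZE_POINTS_C
    if not points[0][0] <= glycol_percent <= points[-1][0]:
        raise ValueError("concentration outside the tabulated range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= glycol_percent <= x1:
            return y0 + (y1 - y0) * (glycol_percent - x0) / (x1 - x0)
    raise AssertionError("unreachable: range was checked above")


def coolant_is_safe(glycol_percent: float, design_temp_c: float,
                    margin_c: float = 5.0) -> bool:
    """True if the mixture freezes at least `margin_c` below the design temperature."""
    return approx_freeze_point_c(glycol_percent) <= design_temp_c - margin_c
```

With these assumed figures, a 50% mixture does not pass the check for a -42 °C design temperature, while a 60% mixture does, which shows how a solution diluted with extra water can quietly eat up the margin.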

Natural disasters

Most often, thunderstorms and/or hurricanes disrupt the engineering infrastructure of a data center, which leads to service shutdowns and/or physical damage to equipment. Incidents provoked by bad weather occur quite often. In 2012, Hurricane Sandy swept across the US East Coast with heavy rain. The Peer 1 data center, located in a high-rise building in Lower Manhattan, lost external power after salty seawater flooded the basements. The facility's emergency generators were on the 18th floor, and their fuel supply was limited: rules introduced in New York after the 9/11 attacks prohibit storing large amounts of fuel on upper floors.

The fuel pump also failed, so for several days the staff carried diesel fuel up to the generators by hand. The team's heroism saved the data center from a serious accident, but was it really necessary? We live on a planet with a nitrogen-oxygen atmosphere and a lot of water. Thunderstorms and hurricanes are common here, especially in coastal areas. Designers should take these risks into account and build an appropriate uninterruptible power supply system, or at least choose a more suitable place for a data center than a high-rise on an island.

Everything else

In this category, Uptime Institute groups a variety of incidents, among which it is difficult to pick a typical one. Copper cable thefts, cars crashing into data centers, power line poles and transformer substations, fires, excavators damaging fiber optics, rodents (rats, rabbits and even wombats, which are actually marsupials), as well as people who like to practice shooting at wires: the menu is extensive. Even the theft of electricity for an illegal marijuana plantation can cause a power failure. In most cases, specific people are responsible for the incident, so we are again dealing with the human factor, when the problem has a first and last name. Even if, at first glance, an accident is associated with a technical malfunction or a natural disaster, it can be avoided if the facility is properly designed and correctly operated. The only exceptions are critical damage to the data center infrastructure or the destruction of buildings and structures by a natural disaster. These are genuine force majeure circumstances, and all other problems are caused by the gasket between the computer and the chair. Perhaps this is the most unreliable part of any complex system.

Source: habr.com
