Major accidents in data centers: causes and consequences

Modern data centers are reliable, but any equipment breaks down from time to time. In this short note, we have collected the most significant incidents of 2018.

The impact of digital technologies on the economy is growing, the volume of processed information is increasing, and new facilities are being built. All of this is good as long as everything works. Unfortunately, the economic impact of data center failures has also been growing since companies began placing business-critical IT infrastructure in them. This is an inevitable consequence of digitalization. Below is a small selection of the most notable accidents that occurred in different countries over the past year.

USA

The United States is a recognized leader in data center construction. It has the largest number of major commercial and corporate data centers serving global services, so incidents there have the most far-reaching consequences. In early March, a powerful cyclone caused power outages at four Equinix facilities. The sites housed Amazon Web Services (AWS) equipment, and the accident made many popular services unavailable: GitHub, MongoDB, NewVoiceMedia, Slack, Zillow, Atlassian, Twilio, and Capital One were affected, as was the Amazon Alexa virtual assistant.

In September, abnormal weather hit Microsoft's data centers in Texas. A thunderstorm disrupted the power supply of the entire region, and cooling shut down in a data center that had switched to diesel generator power. It took several days to deal with the consequences of the accident. Although load balancing kept the failure from becoming critical, users around the world noticed some slowdown in Microsoft cloud services.

Russia

The most serious accident occurred on August 20 in one of Rostelecom's data centers. It stopped the servers of the Unified State Register of Real Estate for 66 hours, and they had to be moved to a backup site. Rosreestr was able to restore the processing of applications received through all channels only on September 3. The state organization is now trying to recover a large sum from Rostelecom for violating the service level agreement.

On February 16, problems in the Lenenergo grid triggered the backup power supply system in the data center of the Xelnet company (St. Petersburg). A brief interruption of the AC sine wave disrupted the operation of many services. The large cloud provider 1cloud was among those affected, but the most noticeable problem for the Russian Internet audience was the inability to access the VKontakte social network. Most notably, it took about 12 hours to fully eliminate the consequences of this short-term power failure.

EU

Several serious incidents were recorded in the EU in 2018. In March, the data center of the air carrier KLM failed: the power supply was cut for 10 minutes, and the diesel generator sets did not have enough capacity to run the equipment. Some of the servers went down, and the airline had to cancel or reschedule several dozen flights.

This was not the only incident related to air transportation. In April, the power supply system of the Eurocontrol data center failed. The organization manages aircraft traffic in the European Union, and while specialists spent 5 hours eliminating the consequences of the accident, passengers again had to endure flight delays and rescheduling.

Accidents in data centers serving the financial sector cause very serious problems. Interrupted transactions are usually costly here, so facilities are built to a correspondingly high level of reliability, but this does not rule out incidents. On April 18, the Nordic NASDAQ stock exchange (Helsinki, Finland) was unable to trade throughout Northern Europe for the day because a gas fire extinguishing system discharged without authorization in a DigiPlex commercial data center, which lost power.

On June 7, data center outages forced the London Stock Exchange (LSE) to postpone the start of trading by an hour. Also in June, a data center failure took the services of the VISA international payment system offline in Europe for an entire day. The details of that incident were not disclosed.

Japan

In the summer of 2018, a fire broke out in the underground levels of an Amazon data center under construction in the suburbs of Tokyo, killing 5 workers and injuring at least 50. The fire damaged about 5,000 m² of the facility. The investigation showed that the cause was human error: careless handling of acetylene torches ignited the insulation.

Reasons for failures

The list of incidents above is far from complete. Data center accidents affect customers of banks and telecom operators, take cloud providers offline, and even disrupt emergency services. Even a brief service outage can result in significant losses. According to the Uptime Institute, the largest share of failures (39%) is related to the power supply system. Human error is in second place (24%), and the air conditioning system is in third (15%). Only 12% of data center accidents can be attributed to natural phenomena, and only 10% occur for reasons other than those listed.
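As a quick check on these figures, the short Python sketch below adds up the quoted shares. The percentages are the ones given in the text; the dictionary keys and the helper name `combined_share` are illustrative labels introduced here, not anything taken from the Uptime Institute.

```python
# Failure-cause shares as quoted in the text (percent of incidents).
# The category labels are illustrative names chosen for this sketch.
CAUSE_SHARES = {
    "power supply": 39,
    "human error": 24,
    "air conditioning": 15,
    "natural phenomena": 12,
    "other": 10,
}


def combined_share(*causes):
    """Return the summed percentage for the named causes."""
    return sum(CAUSE_SHARES[cause] for cause in causes)


# The five categories should cover every incident.
total = sum(CAUSE_SHARES.values())

# Share of incidents caused by power failures or human error together.
top_two = combined_share("power supply", "human error")

print(f"all causes: {total}%")
print(f"power supply + human error: {top_two}%")
```

The five shares add up to 100%, and the two leading causes together account for 63% of the incidents.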

Despite strict reliability and safety standards, no facility is immune to incidents. Most of them are caused by power failures or human error. Owners of data centers and server rooms should pay attention to these two factors first, and customers should understand that even market leaders cannot guarantee absolute reliability. If equipment or a cloud service supports business-critical processes, it is worth planning for a backup site.

Photo source: telecombloger.ru

Source: habr.com
