Should you “put out” the servers if the data center's smoke test “catches fire”?

How would you feel if, one fine summer day, the data center with your equipment looked like this?

[Photo: the data center on the day of the fire]

Hi all! My name is Dmitry Samsonov, and I am a lead system administrator at Odnoklassniki. The photo shows one of the four data centers where the equipment serving our project is installed. Behind these walls there are about 4 thousand pieces of equipment: servers, storage systems, network equipment and so on, almost ⅓ of everything we have.
Most of the servers run Linux. There are also several dozen Windows servers (MS SQL), a legacy that we have been systematically phasing out for many years.
So, on June 5, 2019 at 14:35, engineers at one of our data centers reported a fire alarm.

Denial

14:45. Minor smoke incidents in data centers happen more often than you might think. The readings inside the halls were normal, so our first reaction was relatively calm: we banned all work with production, that is, any configuration changes, rollouts of new versions and so on, except for work related to fixing something.

Anger

Have you ever tried to find out from the firefighters exactly where on the roof the fire broke out, or to get onto the burning roof yourself to assess the situation? How much would you trust information relayed through five people?

14:50. Word arrives that the fire is approaching the cooling system. But will it get there? The on-duty system administrator drains external traffic away from the fronts of this data center.

At the moment the fronts of all our services are duplicated across three data centers, with balancing at the DNS level: this lets us pull one data center's addresses out of DNS and thereby shield users from potential problems with access to the services. If a data center has already developed problems, it leaves the rotation automatically. You can read more here: Load balancing and fault tolerance in Odnoklassniki.
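The specifics of that balancing are covered in the linked article; as a rough illustration only, here is a minimal sketch of the idea, with hypothetical data center names and IP addresses:

```python
# Minimal sketch of DNS-level balancing across data centers (the data center
# names and IP addresses are hypothetical; the real Odnoklassniki setup is
# described in the linked article). The portal's A record is built from the
# fronts of every data center still in rotation; withdrawing a data center
# removes its fronts from the answers users receive.
FRONT_IPS = {
    "dc1": ["198.51.100.10", "198.51.100.11"],
    "dc2": ["203.0.113.10", "203.0.113.11"],
    "dc3": ["192.0.2.10", "192.0.2.11"],
}

WITHDRAWN = {"dc2"}  # data centers taken out of rotation, e.g. the one on fire

def a_records() -> list[str]:
    """IPs to return for the portal's A record: fronts of healthy DCs only."""
    return [ip
            for dc, ips in FRONT_IPS.items() if dc not in WITHDRAWN
            for ip in ips]

if __name__ == "__main__":
    print(a_records())
```

Automatic removal of a failed data center would additionally require health checks feeding the withdrawn set.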

The fire has not yet affected us in any way: neither users nor equipment have been touched. Is this an accident? The first section of our “Accident Action Plan” defines the concept of “accident”, and the section ends as follows:
«If there is any doubt whether it is an accident or not, then it is an accident!»

14:53. An accident coordinator is appointed.

The coordinator is the person who handles communication between all participants, assesses the scale of the accident, works from the “Accident Action Plan”, brings in the necessary people, makes sure the recovery is completed and, most importantly, delegates any tasks. In other words, this is the person who manages the entire process of dealing with the accident.

Bargaining

15:01. We are starting to turn off servers that are not tied to production.
15:03. We gracefully shut down all redundant services.
This includes not only the fronts (which users no longer reach at this point) and their auxiliary services (business logic, caches, etc.), but also various databases with a replication factor of 2 or more (Cassandra, the binary data store, cold storage, NewSQL, etc.).
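For one of these, Cassandra, a graceful shutdown might look roughly like the sketch below; the host list and service name are assumptions, not our actual tooling.

```python
# Sketch of a graceful shutdown of Cassandra nodes before powering servers
# off (the host list and the systemd unit name are assumptions). "nodetool
# drain" flushes memtables to disk and stops accepting traffic, which makes
# the later restart much faster than recovering from a hard power-off.
import subprocess

HOSTS = ["cass-01.dc.example", "cass-02.dc.example"]  # assumption: this DC's nodes

def drain_and_stop(host: str) -> None:
    # Flush data and stop listening for client and gossip traffic.
    subprocess.run(["ssh", host, "nodetool", "drain"], check=True)
    # Stop the service cleanly (unit name is an assumption).
    subprocess.run(["ssh", host, "sudo", "systemctl", "stop", "cassandra"], check=True)

if __name__ == "__main__":
    for h in HOSTS:
        drain_and_stop(h)
```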
15:06. Word comes in that fire is threatening one of the data center halls. We have no equipment in that hall, but the fact that fire can spread from the roof into the halls greatly changes the picture of what is happening.
(Later it turned out that there was no physical threat to the hall, since it was hermetically sealed off from the roof. The threat was only to the cooling system of this hall.)
15:07. We allow the execution of commands on servers in accelerated mode without additional checks (without our favorite calculator).
15:08. The temperature in the rooms is within the normal range.
15:12. An increase in temperature in the halls was recorded.
15:13. More than half of the servers in the data center are turned off. We continue.
15:16. A decision was made to turn off all equipment.
15:21. We begin cutting power to stateless servers without properly shutting down the application and operating system.
15:23. A separate group responsible for MS SQL is assigned (there are few such servers, and services depend on them only slightly, but the recovery procedure takes more time and is more complicated than, say, for Cassandra).

Depression

15:25. Information arrives about a power outage in four halls out of 16 (No. 6, 7, 8, 9). Our equipment is in two of them, halls No. 7 and 8. There is no news about our other two halls (No. 1 and 3).
Usually during a fire the power supply is cut immediately, but in this case, thanks to the coordinated work of the firefighters and the data center's technical staff, it was cut not everywhere and not at once, but only where necessary.
(Later it turned out that the power in halls 8 and 9 had not actually been cut.)
15:28. We start restoring MS SQL databases from backups in other data centers.
How long will it take? Is there enough network bandwidth along the entire route?
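Questions like these are answered with back-of-envelope arithmetic; the numbers below are purely illustrative assumptions, not figures from the incident:

```python
# Back-of-envelope estimate for copying MS SQL backups between data centers
# (all numbers are illustrative assumptions, not figures from the incident).
backup_size_tb = 2.0   # total size of the backups to transfer
link_gbit_s = 10.0     # bandwidth of the narrowest link along the route
utilization = 0.6      # share of that link we can realistically claim

size_bits = backup_size_tb * 1e12 * 8
seconds = size_bits / (link_gbit_s * 1e9 * utilization)
print(f"~{seconds / 3600:.1f} hours to copy the backups")  # ~0.7 hours here
```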
15:37. We detect the loss of some network segments.
The management and production networks are physically isolated from each other. If the production network is available, you can log in to the server, stop the application and shut down the OS. If it is not available, you can get in through IPMI, stop the application and shut down the OS. If neither network is available, you can do nothing. “Thanks, Captain Obvious!”, you think.
“And in general, there is an awful lot of fuss here,” you might also think.
The thing is that servers, even without a fire, generate a huge amount of heat. More precisely, when there is cooling they generate heat, and when there is none they create a hellish inferno that at best will melt some of the equipment and shut down the rest, and at worst... start a fire inside the hall, which is almost guaranteed to destroy everything.


15:39. We detect problems with the conf database.

The conf database is the backend for the service of the same name, which all production applications use to quickly change settings. Without this database we cannot manage the portal's operation, but the portal itself keeps working.

15:41. Temperature sensors on Core network equipment record readings close to the maximum allowable. This is a box that occupies an entire rack and ensures the operation of all networks inside the data center.


15:42. The issue tracker and wiki are unavailable; we switch to the standby instances.
These are not production, but during an accident the availability of any knowledge base can be critical.
15:50. One of the monitoring systems goes down.
There are several of them, responsible for different aspects of the services. Some are configured to work autonomously within each data center (that is, they monitor only their own data center), others consist of distributed components that transparently survive the loss of any data center.
In this case, the anomaly detection system for business logic indicators, which works in master-standby mode, stopped working. We switched to the standby.

Acceptance

15:51. All servers except MS SQL have been powered off via IPMI, without a correct shutdown.
Are you ready to bulk manage servers via IPMI if necessary?
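If the answer is "not yet", a bulk power-off over the management network can be scripted along these lines; the BMC hostnames and credentials below are assumptions:

```python
# Sketch of a bulk power-off via IPMI over the management network
# (BMC hostnames and credentials are assumptions; ipmitool must be installed).
import subprocess
from concurrent.futures import ThreadPoolExecutor

BMC_HOSTS = ["ipmi-srv-001.example", "ipmi-srv-002.example"]  # assumption
USER, PASSWORD = "admin", "secret"                            # assumption

def power_off(bmc: str) -> tuple[str, bool]:
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc,
           "-U", USER, "-P", PASSWORD, "chassis", "power", "off"]
    result = subprocess.run(cmd, capture_output=True)
    return bmc, result.returncode == 0

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=50) as pool:
        for bmc, ok in pool.map(power_off, BMC_HOSTS):
            print(bmc, "OK" if ok else "FAILED")
```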

At this point the rescue of equipment in the data center is essentially complete. Everything that could be done has been done. Some colleagues can take a break.
16:13. Word arrives that the freon pipes of the rooftop air conditioners have burst; this will delay bringing the data center back up after the fire is extinguished.
16:19. According to data received from the technical staff of the data center, the temperature increase in the halls has stopped.
17:10. The conf database is back up. Now we can change application settings.
Why is this so important if everything is fault-tolerant and works even without one data center?
First, not everything is fault-tolerant. There are various secondary services that do not yet survive a data center failure well enough, and there are databases running in master-standby mode. The ability to manage settings lets us do everything necessary to minimize the impact of the accident on users even under difficult conditions.
Second, it became clear that the data center would not be fully restored within the next few hours, so we had to take measures so that the long-term unavailability of replicas did not lead to additional trouble, such as disks overflowing in the remaining data centers.
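A simple way to keep an eye on that risk is to estimate the time until the disks fill up; the mount point and fill rate in the sketch below are assumptions:

```python
# Rough estimate of how long until a replica host's data partition overflows
# (the mount point and the observed fill rate are assumptions).
import shutil

MOUNT = "/var/lib/data"    # assumption: data partition on a replica host
growth_gb_per_hour = 50.0  # assumption: fill rate observed during the outage

usage = shutil.disk_usage(MOUNT)
free_gb = usage.free / 1e9
print(f"free: {free_gb:.0f} GB, ~{free_gb / growth_gb_per_hour:.1f} hours until full")
```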
17:29. Pizza time! We employ people, not robots.


Rehabilitation

18:02. The temperature has stabilized in halls No. 8 (ours), 9, 10 and 11. One of the halls that remain offline (No. 7) contains our equipment, and the temperature there keeps rising.
18:31. The go-ahead is given to power up the equipment in halls No. 1 and 3; these halls were not affected by the fire.

Servers in halls No. 1, 3 and 8 are now being brought up, starting with the most critical ones. The correct operation of all running services is being checked. Hall No. 7 is still a problem.

18:44. The data center's technical staff discover that in hall No. 7 (which contains only our equipment) many servers have not been powered off. According to our data, 26 servers remain on there. After rechecking, we find 58.
20:18. The data center's technical staff start blowing air through the hall that has no air conditioning, using mobile air ducts laid through the corridors.
23:08. We send the first admin home. Someone has to sleep tonight in order to continue the work tomorrow. After that we release another group of admins and developers.
02:56. We have launched everything that could be launched. We run a big check of all services with autotests.


03:02. Air conditioning in the last, 7th hall has been restored.
03:36. We bring the data center's fronts back into DNS rotation. From this moment, user traffic starts to arrive.
We send most of the admin team home, but leave a few people on.

Small FAQ:
Q: What happened from 18:31 to 02:56?
A: Following the “Accident Action Plan”, we launch all services, starting with the most important ones. The coordinator hands out each service in the chat to a free administrator, who checks whether the OS and the application have started, whether there are errors, and whether the metrics are normal. Once the launch is complete, the admin reports in the chat that he is free and receives a new service from the coordinator.
The process is further slowed down by failed hardware. Even if the OS shutdown and the power-off went well, some servers do not come back because of suddenly failed disks, memory or chassis. When power is lost, the failure rate goes up.
Q: Why can't you just start everything at once and then fix whatever shows up in monitoring?
A: Everything has to be done gradually, because there are dependencies between services. And everything has to be checked immediately, without waiting for monitoring, because it is better to deal with problems right away than to wait for them to get worse.
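In code, this gradual, dependency-aware launch boils down to walking the service graph in topological order; the services and dependencies in the sketch below are hypothetical simplifications:

```python
# Sketch of a dependency-aware launch of a data center's services
# (the service names and the dependency graph are hypothetical).
from graphlib import TopologicalSorter

# service -> the services it depends on (assumption: a simplified portal graph)
DEPS = {
    "conf-db": set(),
    "cassandra": set(),
    "business-logic": {"conf-db", "cassandra"},
    "fronts": {"business-logic"},
}

def start_and_check(service: str) -> None:
    # In reality: start the service, look at logs, error rates and metrics,
    # then report back to the coordinator. Here we only print the step.
    print(f"starting {service} ... checked, OK")

# static_order() yields every service after all of its dependencies.
for service in TopologicalSorter(DEPS).static_order():
    start_and_check(service)
```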

07:40. The last admin (the coordinator) goes to sleep. The first day's work is done.
08:09. The first developers, data center engineers and administrators (including the new coordinator) resume the recovery work.
09:37. We start bringing up hall No. 7 (the last one).
At the same time, we continue to finish what was left undone in the other halls: replacing disks, memory and servers, fixing everything that is “burning” in monitoring, switching roles back in master-standby schemes, and other little things, of which there are nevertheless quite a lot.
17:08. We allow all regular work with production.
21:45. The work of the second day is completed.
09:45. Today is Friday. There are still quite a few minor problems in monitoring. The weekend is coming and everyone wants to relax. We keep mass-repairing everything we can. Routine admin tasks that could be postponed have been postponed. A new coordinator takes over.
15:40. Suddenly half of the Core network equipment stack in ANOTHER data center restarts. The fronts there are taken out of rotation to minimize risk. There is no impact on users. It later turns out to be a faulty chassis. The coordinator is now handling two accidents at once.
17:17. The network in the other data center is restored and everything has been checked. That data center goes back into rotation.
18:29. The third day's work, and with it the recovery from the accident, is complete.

Afterword

On 04.04.2013, the day of the 404 error, Odnoklassniki survived its biggest crash: for three days the portal was completely or partially unavailable. All that time, more than 100 people from different cities and different companies (thanks again!), working remotely and on site in the data centers, manually and automatically repaired thousands of servers.
We drew our conclusions. To prevent this from happening again, we carried out extensive work, and we continue it to this day.

What are the main differences between this accident and the 404 incident?

  • We have an “Accident Action Plan”. Once a quarter we run an exercise: we play out an emergency that a group of administrators (each in turn) must resolve using the plan. Lead system administrators take turns in the role of coordinator.
  • Once a quarter, in test mode, we isolate each data center in turn at the LAN and WAN level, which lets us identify bottlenecks in good time.
  • There are fewer failing drives, because we have tightened our rules: fewer running hours and stricter SMART thresholds.
  • We completely abandoned BerkeleyDB, an old and unstable database that required a lot of time to recover after a server restart.
  • We reduced the number of MS SQL servers and our dependence on those that remain.
  • We have our own cloud, one-cloud, to which we have been actively migrating all services for the past two years. The cloud greatly simplifies the whole application lifecycle and, in the event of an accident, provides unique tools such as:
    • correct stop of all applications in one click;
    • simple migration of applications from failed servers;
    • automatic ranked (in order of service priority) launch of the entire data center.

The accident described in this article was the largest since the 404 day. Of course, not everything went smoothly. For example, while the fire-damaged data center was unavailable, a disk failed on one of the servers in another data center, leaving only one of the three replicas in the Cassandra cluster available; because of this, 4.2% of mobile app users could not log in. Users who were already logged in kept working. In total, the accident revealed more than 30 problems, from banal bugs to shortcomings in service architecture.

But the most important difference between this accident and the 404 one is that while we were dealing with the consequences of the fire, users kept texting and making video calls in TamTam, playing games, listening to music, giving each other gifts, watching videos, series and TV channels in OK, and streaming in OK Live.

And how do your accidents go?

Source: habr.com
