Preparing a DRP: don't forget to account for the meteorite

Even during a disaster, there is always time for a cup of tea.

A DRP (disaster recovery plan) is something you ideally never need. But if migrating beavers in mating season gnaw through your main fiber-optic line, or the junior admin drops the production database, you definitely want a pre-made plan for dealing with the whole mess.

While panicked customers are calling tech support and the junior is hunting for the cyanide, you calmly open the red envelope and start putting things in order.

In this post, I want to share recommendations on how to write a DRP and what it should contain. We will:

  1. Learn to think like a villain.
  2. Examine the benefits of a cup of tea during the apocalypse.
  3. Work out a convenient DRP structure.
  4. See how to test it.

Which companies might benefit from this?

It is hard to draw a precise line where an IT department starts to need this. I would say you are guaranteed to need a DRP if:

  • Stopping a server or an application, or losing a database, would cause significant losses for the business as a whole.
  • You have a full-fledged IT department: a proper unit of the company with its own budget, not just a few tired employees pulling cable, removing viruses, and refilling printers.
  • You have a realistic budget for at least partial redundancy in the event of an emergency.

If the IT department has spent months begging for a couple of HDDs for an old backup server, you are unlikely to pull off a full-fledged failover of a downed service to spare capacity. Even then, the documentation will not be superfluous.

Documentation is important

Start with documentation. Say your service runs on a Perl script written three generations of admins ago, and nobody knows how it works. The accumulated technical debt and the lack of documentation will inevitably shoot you not only in the knee but in other limbs as well; it is only a matter of time.

Once you have a good description of the service components in hand, pull up the incident statistics. Almost certainly most incidents will be completely typical. For example, a disk fills up from time to time and takes the node down until someone cleans it manually. Or the client-facing service becomes unavailable because someone forgot to renew the certificate again, and automatic renewal via Let's Encrypt was never set up or refused to work.
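Both of those typical failures are cheap to catch before they turn into incidents. Here is a minimal monitoring sketch in Python; the paths, hosts, and thresholds are hypothetical and not from the article, and in practice you would plug these checks into your existing monitoring rather than run them by hand.

```python
import shutil
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Hypothetical values: adjust paths, hosts and thresholds to your service.
DISK_PATHS = ["/", "/var/lib/myservice"]   # volumes that tend to fill up
TLS_HOSTS = [("example.com", 443)]         # endpoints with public certificates
DISK_LIMIT = 0.90                          # alert above 90% usage
CERT_WARN = timedelta(days=14)             # alert two weeks before expiry


def check_disks() -> list[str]:
    alerts = []
    for path in DISK_PATHS:
        usage = shutil.disk_usage(path)
        used = usage.used / usage.total
        if used > DISK_LIMIT:
            alerts.append(f"{path} is {used:.0%} full")
    return alerts


def check_certs() -> list[str]:
    alerts = []
    ctx = ssl.create_default_context()
    for host, port in TLS_HOSTS:
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
        )
        if expires - datetime.now(timezone.utc) < CERT_WARN:
            alerts.append(f"certificate for {host} expires {expires:%Y-%m-%d}")
    return alerts


if __name__ == "__main__":
    for alert in check_disks() + check_certs():
        print("WARNING:", alert)
```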

Think like a saboteur

The hardest part is predicting the accidents that have never happened before but could potentially ruin your service completely. Here we usually play villains with colleagues: grab plenty of coffee and something tasty and lock yourselves in a meeting room. Just make sure the people you lock in are the engineers who built the target service or work with it regularly. Then, on a whiteboard or on paper, start sketching every possible horror that could happen to your service. There is no need to detail it down to a specific cleaning lady pulling out cables; a scenario like "loss of local network integrity" is enough.

Usually, most typical emergencies fit into the following types:

  • Network failure
  • OS service failure
  • Application failure
  • Hardware failure
  • Virtualization failure

Just go through each type and see what applies to your service. For example, the Nginx daemon may die and refuse to come back up: that is an OS-level failure. A rare edge case that drives your web application into a broken state is an application failure. While working through this stage, it is important to work out how to diagnose the problem: how do you tell a hung virtual interface from a dead Cisco switch or a full network outage? This matters because you want to quickly find the people responsible and keep pulling on their tail until the incident is fixed.
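To make that diagnosis reproducible rather than a matter of intuition, it can help to keep a tiny triage script next to the DRP. The sketch below is only an illustration: the gateway address, the systemd unit name, and the health URL are assumptions, not anything from the article. It checks the three layers in order (network, OS service, application), so the person on call can see at a glance whose tail to pull.

```python
import socket
import subprocess
import urllib.request

# Hypothetical targets: replace with your gateway, unit name and health endpoint.
GATEWAY = ("10.0.0.1", 22)               # any host/port that proves L3 connectivity
SERVICE = "nginx"                        # systemd unit backing the application
HEALTH_URL = "http://localhost/healthz"  # application-level health check


def network_ok() -> bool:
    """Network layer: can we open a TCP connection to the gateway at all?"""
    try:
        with socket.create_connection(GATEWAY, timeout=3):
            return True
    except OSError:
        return False


def service_ok() -> bool:
    """OS layer: is the systemd unit running?"""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", SERVICE], check=False
    )
    return result.returncode == 0


def application_ok() -> bool:
    """Application layer: does the service actually answer a health request?"""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    if not network_ok():
        print("Looks like a NETWORK failure: call the network team.")
    elif not service_ok():
        print(f"Unit {SERVICE} is down: OS/service-level failure.")
    elif not application_ok():
        print("Process is up but not answering: application failure.")
    else:
        print("All three layers look healthy.")
```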

Once the typical problems are written down, pour more coffee and start considering the strangest scenarios, where some parameter drifts far outside the norm. For example:

  • What happens if the clock on the active node moves back a minute relative to the other nodes in the cluster? (See the drift-check sketch after this list.)
  • What if the clock jumps forward, and what if by 10 years?
  • What happens if a cluster node suddenly loses the network during synchronization?
  • What happens if two nodes cannot agree on leadership because they are temporarily isolated from each other on the network?
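One cheap way to probe the first two questions is to compare the local clock with what the other nodes report. Below is a minimal sketch; it assumes the peers expose plain HTTP, and the node URLs are placeholders. It reads each peer's Date header and flags any offset above a threshold.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical peer list: any HTTP endpoint on the other cluster nodes will do.
PEERS = ["http://node2.example.com", "http://node3.example.com"]
MAX_SKEW_SECONDS = 5.0


def clock_skew(url: str) -> float:
    """Return local_time - peer_time in seconds, based on the HTTP Date header."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        peer_time = parsedate_to_datetime(resp.headers["Date"])
    return (datetime.now(timezone.utc) - peer_time).total_seconds()


if __name__ == "__main__":
    for peer in PEERS:
        skew = clock_skew(peer)
        status = "OK" if abs(skew) <= MAX_SKEW_SECONDS else "SKEW!"
        print(f"{peer}: {skew:+.1f}s {status}")
```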

The reverse approach helps a lot at this stage. Take the most stubborn team member with the sickest imagination and give them the task of staging, in the shortest possible time, an act of sabotage that takes the service down. If it is hard to diagnose, even better. You will not believe the weird and brilliant ideas engineers come up with when asked to break something. And if you promise them a test stand to do it on, they will be thrilled.

What is this DRP of yours?!

So, you have defined the threat model. You have accounted for the locals who cut fiber-optic cables in search of copper and for the military radar that knocks out the radio relay link strictly on Fridays at 16:46. Now you need to figure out what to do with all of it.

Your task is to write those same red envelopes that get opened in an emergency. Assume right away that when (not if!) everything goes sideways, the only person nearby will be the most inexperienced trainee, whose hands will be shaking from the horror of what is happening. Look at how emergency instructions are handled in medical offices, for example what to do in case of anaphylactic shock. The staff know the protocols by heart, but when a person nearby starts dying, people very often grab helplessly at everything at once. That is exactly why a clear instruction hangs on the wall with steps like "open such-and-such package" and "inject so many units of the drug intravenously."

It is hard to think in an emergency! There should be instructions simple enough to follow on spinal reflex alone.

A good DRP consists of a few simple blocks:

  1. Whom to notify that an incident has started. This is important for parallelizing the recovery work as much as possible.
  2. How to diagnose the problem: run a trace, check systemctl status servicename, and so on.
  3. How much time may be spent on each stage. If a manual fix does not fit within the SLA window, the virtual machine gets killed and restored from yesterday's backup.
  4. How to make sure the incident is actually over.

Remember that the DRP kicks in when the service has failed completely and ends with recovery, even at reduced capacity. Simply losing redundancy should not trigger the DRP. You can even prescribe a cup of tea in the DRP. Seriously. Statistically, many incidents go from unpleasant to catastrophic because staff panic and rush to fix something, killing the only live node with data or finishing off the cluster in the process. As a rule, five minutes for a cup of tea gives you just enough time to calm down and analyze what is happening.

Do not confuse the DRP with the system passport! Do not overload it with extra data. Just make it quick and convenient to jump, via hyperlinks, to the relevant section of the documentation and read about the relevant parts of the service architecture in expanded form. The DRP itself should contain only direct instructions on where and how to connect, with specific commands ready for copy-paste.
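One way to keep those copy-paste instructions reviewable is to store each red envelope as structured data and render it on demand. The sketch below is purely illustrative: the contacts, time budgets, and commands are invented, and restore-vm in particular is a placeholder for whatever your hypervisor or backup tooling actually provides.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str    # what to do, in plain words
    command: str   # exact command to copy-paste
    minutes: int   # time budget before escalating to the next step


# Hypothetical plan for a "web frontend is down" envelope.
NOTIFY = ["+1-555-0100 (on-call lead)", "#incident channel"]
STEPS = [
    Step("Check the unit state", "systemctl status nginx", 5),
    Step("Try a clean restart", "systemctl restart nginx", 10),
    # "restore-vm" is a placeholder: substitute your hypervisor/backup tooling.
    Step("Give up on the node, restore yesterday's VM backup",
         "restore-vm --name web01 --from latest", 30),
]

if __name__ == "__main__":
    print("Notify first:", ", ".join(NOTIFY))
    for i, step in enumerate(STEPS, start=1):
        print(f"{i}. {step.action} (budget: {step.minutes} min)")
        print(f"   $ {step.command}")
```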

How to test correctly

Make sure that any responsible employee can complete every item. At the most crucial moment it may turn out that the engineer has no access rights to the required system, no passwords for the required account, or no idea what "Connect to the service management console through a proxy at the head office" means. Each item should be as simple as possible.

Wrong - "Go to virtualization and reboot the dead node"
Correctly - "Connect via the web interface to virt.example.com, in the node section, reload the node that is causing the error."

Avoid ambiguity. Remember the frightened intern.
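The access problems described above are far cheaper to discover on a calm afternoon than during an outage. Here is a minimal pre-flight sketch, with placeholder host names and ports, that simply checks whether the workstation you would grab in an emergency can reach every system mentioned in the plan.

```python
import socket

# Hypothetical systems from the DRP: (host, port) pairs the responder must reach.
REQUIRED = {
    "virtualization console": ("virt.example.com", 443),
    "head-office proxy": ("proxy.example.com", 3128),
    "database node": ("db01.example.com", 5432),
}

if __name__ == "__main__":
    for name, (host, port) in REQUIRED.items():
        try:
            with socket.create_connection((host, port), timeout=3):
                print(f"OK    {name} ({host}:{port})")
        except OSError as exc:
            print(f"FAIL  {name} ({host}:{port}): {exc}")
```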

Be sure to test the DRP. It is not a plan for show; it is what will let you and your clients get out of a critical situation quickly. It is best to test it on several levels:

  • One expert and several trainees work on a test stand that imitates the real service as closely as possible. The expert breaks the service in various ways (a small drill sketch follows this list) and lets the trainees restore it by following the DRP. Every problem, ambiguity, and error in the documentation is recorded. After the training, the DRP is amended and simplified wherever it was unclear.
  • Testing on the real service. In practice you can never build a perfect copy of a real service, so a couple of times a year you should deliberately shut down some servers, break connections, and stage other accidents from the threat list to evaluate the recovery procedure. A 10-minute planned outage in the middle of the night is better than a sudden multi-hour failure at peak load with data loss.
  • Real incident response. Yes, this is also part of the testing. If an accident happens that was not on the threat list, the DRP must be supplemented and refined based on the results of the investigation.
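For the test-stand drills in the first point, the "various ways to break the service" can themselves be scripted, so the trainees never know in advance what failed. A minimal sketch, assuming a disposable test stand with systemd units and root access; the unit names are made up, and this should obviously never be pointed at production.

```python
import random
import subprocess
from datetime import datetime

# Units that exist on the hypothetical test stand. Never run this against production.
UNITS = ["nginx", "postgresql", "myservice-worker"]


def break_something() -> str:
    """Stop one random unit and log what was broken, so the drill can be reviewed later."""
    victim = random.choice(UNITS)
    subprocess.run(["systemctl", "stop", victim], check=True)
    with open("drill.log", "a") as log:
        log.write(f"{datetime.now().isoformat()} stopped {victim}\n")
    return victim


if __name__ == "__main__":
    broken = break_something()
    print(f"Drill started: {broken} is down. Hand the DRP to the trainees.")
```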

Key Points

  1. If bullshit can happen, it will, and it will pick the most catastrophic scenario to do so.
  2. Make sure you have the resources for failover.
  3. Make sure you have backups, that they are created automatically, and that they are regularly checked for consistency.
  4. Work through the typical threat scenarios.
  5. Give engineers the chance to invent non-standard ways to take the service down.
  6. The DRP should be simple, dumb instructions. All complex diagnostics come only after service to customers has been restored, even if it is running on reserve capacity.
  7. List the key phone numbers and contacts in the DRP.
  8. Regularly test employees on their understanding of the DRP.
  9. Stage planned accidents in production. Test stands cannot replace everything.


Source: habr.com
