Come flood or high water, 1C must keep working! Negotiating DR with the business

Imagine: you maintain the IT infrastructure of a large shopping center. It starts to rain in the city. Torrents of rain break through the roof, and water floods the retail spaces ankle-deep. Let's hope your server room is not in the basement, otherwise trouble is unavoidable.

This story is not fiction but a composite of a couple of events from 2020. In large companies, a disaster recovery plan (DRP) is always at hand for such a case. In corporations, business continuity specialists are responsible for it. But in small and medium-sized companies, such problems fall to the IT department. You have to work out the business logic yourself, understand what can fail and where, devise protection, and implement it.

It's great if the IT specialist can negotiate with the business and discuss the need for protection. But more than once I have seen a company skimp on a disaster recovery (DR) solution because it considered it redundant. When an accident hit, a long recovery threatened losses, and the business was not ready. You can repeat "I told you so" as much as you like, but the IT department still has to restore the services.

Speaking as an architect, I will explain how to avoid this situation. In the first part of the article I will cover the preparatory work: how to discuss with the customer the three questions that determine the choice of protection tools:

  • What are we protecting?
  • What are we protecting from?
  • How much do we protect? 

In the second part, we will look at the options for answering the question of how to protect yourself. I will give case examples of how different customers build their protection.

What we protect: identifying critical business functions

It is best to start by discussing the disaster recovery plan with the business customer. The main difficulty here is finding a common language. The customer usually does not care how the IT solution works. What matters to them is whether the service can perform its business functions and bring in money. For example, if the site is up but the payment system is down, no money comes in from customers, and the IT specialists still end up taking the blame.

An IT professional may experience difficulties in such negotiations for several reasons:

  • The IT service is not fully aware of the role of the information system in the business. For example, if there is no available description of business processes or a transparent business model. 
  • The process does not depend entirely on the IT service. For example, when part of the work is performed by contractors, and the IT specialists have no direct influence on them.

I would structure the conversation like this: 

  1. We explain to the business that accidents happen to everyone, and recovery takes time. It is best to illustrate this with real situations: how it happens and what the consequences can be.
  2. We show that not everything depends on the IT service, but that we are ready to help with an action plan within our area of responsibility.
  3. We ask the business customer to answer: if an apocalypse happens, which process should be restored first? Who takes part in it, and how?

    We need a simple answer from the business, for example: the call center must keep registering requests 24/7.

  4. We ask one or two users of the system to describe this process in detail. 
    It is better to enlist the help of an analyst, if your company has one.

    To begin with, the description may look like this: the call center receives requests by phone, by email, and through messages from the site. It then enters them into 1C through the web interface, and from there production picks them up in such-and-such a way.

  5. Then we look at which hardware and software solutions support the process. For comprehensive protection, we consider three levels:
    • applications and systems within the site (software level),
    • the platform the systems run on (infrastructure level),
    • the network (it is often forgotten).

  6. We identify possible points of failure: the system nodes on which the service depends. We note separately the nodes that are supported by other companies: telecom operators, hosting providers, data centers, and so on. With this list, you can return to the business customer for the next step.

What we protect against: risks

Next, we find out from the business customer which risks we are protecting against first. All risks can be roughly divided into two groups:

  • loss of time due to service downtime;
  • data loss due to physical impacts, human error, etc.

The business is afraid of losing both data and time, since both lead to lost money. So once again we ask questions for each risk group:

  • Can we estimate, for this process, how much lost data and lost time cost in money?
  • What data can we not afford to lose?
  • Where can we not afford downtime?
  • Which events are most likely and threaten us the most?

After the discussion, we will understand how to prioritize points of failure. 
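
One simple way to turn the answers into a priority list is to score each point of failure by likelihood and impact. This is only a sketch: the 1 to 5 scales, the multiplication rule, and the example nodes are assumptions for illustration, not a method prescribed in this article.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    likelihood: int  # 1 (rare) .. 5 (frequent), agreed with the business (assumed scale)
    impact: int      # 1 (minor) .. 5 (business stops), agreed with the business (assumed scale)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(points):
    """Return points of failure sorted from highest to lowest score."""
    return sorted(points, key=lambda p: p.score, reverse=True)


# Hypothetical points of failure for illustration only.
points = [
    FailurePoint("internet uplink", likelihood=3, impact=3),
    FailurePoint("1C database server", likelihood=2, impact=5),
    FailurePoint("office power supply", likelihood=4, impact=4),
]

for p in prioritize(points):
    print(f"{p.score:>2}  {p.name}")
```

The nodes at the top of the list are the ones to discuss with the business first.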

How much we protect: RPO and RTO

When the critical points of failure are clear, we calculate the RTO and RPO.

As a reminder, RTO (recovery time objective) is the allowable time from the moment of the accident until the service is fully restored. In business terms, this is the acceptable downtime. If we know how much money the process brings in, we can calculate the loss from each minute of downtime and, from that, the allowable loss.

RPO (recovery point objective) is the acceptable data recovery point. It defines the period for which we can afford to lose data. From a business point of view, data loss can lead to fines, for example. Such losses can also be converted into money.
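
The conversion of downtime into money can be sketched in a few lines. All figures here are hypothetical: a process that works 24/7, brings in 1,440,000 per day, and a business that is prepared to lose at most 150,000 per incident. The function names are illustrative as well.

```python
def loss_per_minute(revenue_per_day: float, working_minutes_per_day: int) -> float:
    """Average money lost for each minute the process is down."""
    return revenue_per_day / working_minutes_per_day


def allowable_downtime_minutes(acceptable_loss: float, per_minute: float) -> float:
    """Downtime the business can tolerate for a given acceptable loss."""
    return acceptable_loss / per_minute


# Hypothetical figures: a 24/7 process and a loss limit named by the business.
per_minute = loss_per_minute(revenue_per_day=1_440_000, working_minutes_per_day=24 * 60)
rto_minutes = allowable_downtime_minutes(acceptable_loss=150_000, per_minute=per_minute)

print(per_minute)   # 1000.0
print(rto_minutes)  # 150.0
```

With these assumed numbers, the business can tolerate about two and a half hours of downtime, and that becomes the target RTO. The same arithmetic applies to RPO if the cost of losing an hour of data, such as a fine, is known.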

Recovery time must be calculated for the end user: how long until they can log into the system. So first add up the recovery times of all links in the chain. A common mistake here is to take the provider's RTO from the SLA and forget about the remaining terms of the sum.

Let's look at a specific example. The user opens 1C, and the system reports a database error. The user contacts the system administrator. The database is in the cloud, so the administrator reports the problem to the service provider. Let's say all this communication takes 15 minutes. In the cloud, a database of this size is restored from a backup in an hour, so the RTO on the service provider's side is one hour. But this is not the final figure: for the user, the 15 minutes spent detecting the problem are added to it.

Next, the system administrator needs to check that the database is correct, connect it to 1C, and start the services. This takes another hour, which means the RTO on the administrator's side is already 2 hours and 15 minutes. The user needs another 15 minutes to log in and check that the necessary transactions have appeared. The total service recovery time in this example is 2 hours 30 minutes.
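
The same calculation as a short sketch. The step durations are the ones from the example above; the step labels and the function name are only illustrative.

```python
from datetime import timedelta

# Links of the recovery chain from the example, in order.
steps = [
    ("user reports the error, admin contacts the provider", timedelta(minutes=15)),
    ("provider restores the database from a backup", timedelta(hours=1)),
    ("admin checks the database, connects it to 1C, starts services", timedelta(hours=1)),
    ("user logs in and checks the transactions", timedelta(minutes=15)),
]


def end_to_end_rto(steps):
    """Sum the duration of every link, not just the provider's SLA figure."""
    return sum((duration for _, duration in steps), timedelta())


total = end_to_end_rto(steps)
print(total)  # 2:30:00
```

Only the second step is covered by the provider's SLA. The other three are the terms that tend to be forgotten.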

These calculations will show the business which external factors the recovery time depends on. For example, if the office is flooded, the first thing to do is find the leak and fix it. That takes time, and it does not depend on IT.

How we protect: choosing tools for different risks

After discussing all the points, the customer already understands what an accident costs the business. Now you can choose tools and discuss the budget. Using client cases as examples, I will show which tools we offer for different tasks.

Let's start with the first group of risks: losses due to service downtime. Solutions for this task should provide a good RTO.

  1. Host the application in the cloud 

    For starters, you can simply move to the cloud, where the provider has already taken care of high availability. Virtualization hosts are clustered, power and network are redundant, data is stored on fault-tolerant storage, and the service provider bears financial responsibility for downtime.

    For example, you can host a virtual machine with a database in the cloud. The application will connect to the database from outside over a dedicated channel or from within the same cloud. If one of the cluster servers has problems, the VM will restart on a neighboring server in less than 2 minutes. After that, the DBMS will start in it, and in a few minutes the database will become available.

    RTO: measured in minutes. You can fix these terms in the agreement with the provider.
    Price: calculate the cost of cloud resources for your application.
    What it does not protect against: massive failures at the provider's site, for example, due to city-wide accidents.

  2. Cluster the application  

    If you want to improve RTO, you can strengthen the previous option and immediately place a clustered application in the cloud.

    You can implement a cluster in active-passive or active-active mode. We create several VMs based on the requirements of the vendor. For greater reliability, we distribute them to different servers and storage systems. If the server with one of the databases fails, the backup node takes over the load in a few seconds.

    RTO: measured in seconds.
    Price: slightly more expensive than a regular cloud, since clustering requires additional resources.
    What it does not protect against: massive site failures, as before. But local failures will be shorter.

    From practice: A retailer had several information systems and websites. All databases were hosted locally at the company's office. Nobody thought about DR until the office lost power several times in a row. Customers were unhappy with the site outages.

    The availability problem was solved by moving to the cloud. In addition, we managed to optimize the database load by balancing traffic between nodes.

  3. Move to a disaster-proof cloud

    If even a natural disaster at the main site must not interrupt work, you can choose a disaster-proof cloud. In this option, the provider stretches the virtualization cluster across 2 data centers. Constant synchronous one-to-one replication runs between the data centers. The channels between the data centers are redundant and follow different routes, so such a cluster is not afraid of network problems.

    RTO: approaches 0.
    Price: the most expensive option in the cloud.
    What it does not protect against: data corruption and human error, so it is recommended to make backups in parallel.

    From practice: One of our clients developed a comprehensive disaster recovery plan. Here is the strategy they chose:

    • A disaster-proof cloud protects the application from failures at the infrastructure level. 
    • A two-level backup protects against human error. There are two types of backups: "cold" and "hot". A "cold" backup is kept powered off, and deploying it takes time. A "hot" backup is ready to go and recovers faster. It is stored on a dedicated storage system. A third copy is written to tape and stored in another room.

    Once a week, the client tests the protection and checks that all backups are restorable, including those on tape. Every year, the company tests the entire disaster-proof cloud.

  4. Organize replication to another site 

    Another option for surviving global problems at the main site is geo-redundancy. In other words, you create backup virtual machines at a site in another city. Dedicated DR solutions are suitable for this: at our company we use VMware vCloud Availability (vCAV). With it, you can set up protection between several cloud provider sites or recover to the cloud from an on-premises site. I have already described the scheme of working with vCAV in more detail here

    RPO and RTO: from 5 minutes. 

    Price: more expensive than the first option, but cheaper than hardware-based replication in a disaster-proof cloud. The price consists of the cost of the vCAV license, administration fees, the cost of cloud resources, and reserved resources under the PAYG model (10% of the cost of running resources for powered-off VMs).

    From practice: The client kept 6 virtual machines with different databases in our cloud in Moscow. At first, protection was provided by backup: some of the backups were stored in the cloud in Moscow, and some at our St. Petersburg site. Over time, the databases grew, and restoring from a backup became too time-consuming.

    We added replication based on VMware vCloud Availability to the backups. Replicas of the virtual machines are stored at the backup site in St. Petersburg and are updated every 5 minutes. If a failure occurs at the main site, employees switch to the virtual machine replica in St. Petersburg on their own and continue working with it.

All the solutions discussed provide high availability but do not protect against data loss caused by ransomware or accidental human error. For that, we need backups that will provide the desired RPO.

5. Don't forget about backup

Everyone knows that you need to make backups, even if you have the coolest disaster recovery solution. So I will just briefly recall a few points.

Strictly speaking, backup is not DR. Here is why:

  • It takes a long time. If the data is measured in terabytes, restoring will take more than an hour. You need to restore the copy, assign a network, check that it boots, and verify that the data is intact. So a good RTO is achievable only when there is little data.
  • Data may not be recovered on the first attempt, so you need to budget time for a second one. For example, sometimes we do not know exactly when the data was lost. Suppose the loss was noticed at 15:00, and copies are made every hour. Starting from 15:00, we go through the recovery points: 14:00, 13:00, and so on. If the system is important, we try to minimize the age of the restore point. But if the needed data is not in the freshest backup, we take the next point, and that costs additional time.
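
The second point can be sketched as a walk through restore points from newest to oldest. Everything except the 15:00 detection time and the hourly schedule is assumed for illustration: the date, the 40 minutes per restore attempt, the moment the data was actually lost (12:30), and the `has_needed_data` check, which in practice is a manual inspection.

```python
from datetime import datetime, timedelta


def find_restore_point(points, has_needed_data, restore_time):
    """Try restore points newest first; return (point, total time spent)."""
    spent = timedelta()
    for point in sorted(points, reverse=True):
        spent += restore_time  # every attempt costs a full restore
        if has_needed_data(point):
            return point, spent
    return None, spent


noticed_at = datetime(2020, 6, 1, 15, 0)  # hypothetical date
# Hourly copies before the loss was noticed: 14:00, 13:00, 12:00, 11:00.
points = [noticed_at - timedelta(hours=h) for h in range(1, 5)]

# Assumption: the data was actually lost at 12:30, so only older copies are good.
lost_at = datetime(2020, 6, 1, 12, 30)
point, spent = find_restore_point(
    points,
    has_needed_data=lambda p: p < lost_at,
    restore_time=timedelta(minutes=40),
)

print(point)  # 2020-06-01 12:00:00
print(spent)  # 2:00:00
```

Under these assumptions, three attempts are needed, so the real recovery time is three times the duration of a single restore.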

At the same time, the backup schedule can provide the desired RPO. For backups, it is important to provide geo-redundancy in case of problems at the main site. It is recommended to store some of the backups separately.

The final disaster recovery plan should have at least 2 tools:  

  • One of options 1-4, which will protect systems from failures and outages.
  • Backup to protect data from loss. 

It is also worth arranging a backup communication channel in case the main Internet provider fails. And voila: a minimal DR setup is ready.

Source: habr.com
