AERODISK Engine: Disaster recovery. Part 2. Metrocluster


Hello, Habr readers! In the previous article we covered a simple disaster recovery tool in AERODISK ENGINE storage systems: replication. In this article we will dive into a more complex and interesting topic: the metrocluster, an automated disaster protection tool for two data centers that lets the data centers work in active-active mode. We will tell, show, break, and fix.

As usual, theory first

A metrocluster is a cluster stretched across several sites within a city or region. The word "cluster" clearly hints that the complex is automated, that is, switching of cluster nodes on failure (failover) happens automatically.

This is the main difference between a metrocluster and regular replication: automation of operations. In the event of certain incidents (data center failure, broken channels, etc.), the storage system performs the necessary actions on its own to keep the data available. With regular replicas, these actions are performed entirely or partially by hand by the administrator.

What does it do?

The main goal customers pursue with metrocluster implementations is to minimize RTO (Recovery Time Objective), that is, the time it takes to restore IT services after a failure. With conventional replication, the recovery time will always be longer than with a metrocluster. Why? Very simple. The administrator must be at the workplace and switch replication manually, while the metrocluster does this automatically.

If you do not have a dedicated on-duty administrator who never sleeps, eats, smokes, or gets sick, and who watches the state of the storage system 24 hours a day, there is no way to guarantee that an administrator will be available for manual switching during a failure.

Accordingly, without a metrocluster or an immortal 99th-level admin on the duty roster, RTO equals the switching time of all systems plus the maximum period after which the administrator is guaranteed to start working with the storage system and related systems.

Thus, we come to the obvious conclusion that a metrocluster should be used when the RTO requirement is minutes, not hours or days. That is, when, even after a total data center failure, the IT department must restore the business's access to IT services within minutes or even seconds.

How does it work?

At the lower level, the metrocluster uses the synchronous data replication mechanism described in the previous article (see the link in the introduction). Since the replication is synchronous, the requirements for it are correspondingly strict:

  • fiber optics as the physical medium, 10 Gigabit Ethernet (or faster);
  • no more than 40 kilometers between the data centers;
  • optical channel latency between the data centers (between the storage systems) of up to 5 milliseconds (optimally 2).

All these requirements are advisory: the metrocluster will work even if they are not met, but you must understand that the price of non-compliance is slower operation of both storage systems in the metrocluster.
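As a quick sanity check during planning, the latency requirement can be probed with a short script. Below is a minimal sketch (hostnames and ports are hypothetical; a TCP connect to the management SSH port is only a rough proxy for the optical channel delay, not an official measurement tool):

```python
import socket
import time

def replication_link_latency_ms(host: str, port: int = 22,
                                samples: int = 5) -> float:
    """Average TCP connect round-trip time to the remote storage, in ms."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=1.0):
            pass
        total += (time.perf_counter() - start) * 1000.0
    return total / samples

def link_ok_for_metrocluster(latency_ms: float) -> bool:
    # The article's threshold: up to 5 ms, optimally within 2 ms.
    return latency_ms <= 5.0
```

A proper check should still be ordered from the fiber provider, as discussed in the planning section below.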

So, a synchronous replica is used to transfer data between the storage systems. But how do replicas switch automatically and, most importantly, how do we avoid split-brain? For this, one level above, an additional entity is used: the arbiter.

How does the arbiter work and what is its task?

The arbiter is a small virtual machine or hardware cluster that must be launched at a third site (for example, in an office) with access to the storage systems via ICMP and SSH. After starting the arbiter, set its IP, and then, from the storage side, specify its address plus the addresses of the remote controllers participating in the metrocluster. After that, the arbiter is ready to work.

The arbiter constantly monitors all storage systems in the metrocluster, and if a particular storage system becomes unavailable, then, after confirmation of the unavailability from another cluster member (one of the "live" storage systems), it decides to start the procedure for switching the replication rules and the mapping.

A very important point: the arbiter must always be located at a site different from those hosting the storage systems, that is, neither in data center 1, where storage 1 is located, nor in data center 2, where storage 2 is installed.

Why? Because only this way can the arbiter, with the help of either surviving storage system, unambiguously and accurately determine the loss of either of the two sites where the storage systems are installed. Any other placement of the arbiter can result in split-brain.

Now let's dive into the details of the arbiter's job.

Several services run on the arbiter and constantly poll all storage controllers. If a poll result differs from the previous one (available/unavailable), it is written to a small database that also runs on the arbiter.

Consider the logic of the arbiter in more detail.

Step 1. Determining unavailability. The event signaling a storage system failure is the absence of a ping from both controllers of the same storage system for 5 seconds.

Step 2. Starting the switching procedure. Once the arbiter realizes that one of the storage systems is unavailable, it sends a request to the "live" storage system to make sure the "dead" storage system is really dead.

After receiving this command from the arbiter, the second (live) storage system additionally checks the availability of the fallen first storage system and, if it is absent, confirms the arbiter's guess: the storage system is indeed unavailable.

After receiving this confirmation, the arbiter starts the remote procedure for switching replication and raising the mapping for those replicas that were active (Primary) on the fallen storage system: it sends a command to the second storage system to promote these replicas from Secondary to Primary and raise the mapping. The second storage system performs these procedures and then provides access to the lost LUNs from itself.

Why is the additional verification needed? For quorum. The majority of the total odd number (3) of cluster members must confirm the fall of a cluster node; only then is the decision certainly right. This avoids erroneous switching and, consequently, split-brain.

Step 2 takes about 5-10 seconds, so, including the time needed to detect unavailability (5 seconds), the LUNs of the fallen storage system become automatically available from the live storage system within 10-15 seconds of the failure.
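The two steps above can be sketched as plain decision logic. This is our illustrative reconstruction of the described behavior, not the actual arbiter code:

```python
from dataclasses import dataclass

PING_TIMEOUT_S = 5  # a storage system is declared suspect after 5 s of silence

@dataclass
class Controller:
    name: str
    last_ping: float  # monotonic timestamp of the last successful ping

@dataclass
class StorageSystem:
    name: str
    controllers: list

def is_unavailable(storage: StorageSystem, now: float) -> bool:
    """Step 1: a storage system counts as failed only when BOTH of its
    controllers have been silent for PING_TIMEOUT_S seconds."""
    return all(now - c.last_ping >= PING_TIMEOUT_S for c in storage.controllers)

def decide_failover(dead: StorageSystem, peer_confirms: bool) -> str:
    """Step 2: act only after the surviving storage system confirms the
    failure, giving a 2-of-3 quorum and preventing split-brain."""
    if peer_confirms:
        return f"promote {dead.name} replicas on the peer and raise mapping"
    return "no action: quorum not reached"
```

Note how a single missing confirmation blocks the switchover entirely: that is the quorum rule from the text in code form.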

Clearly, to avoid losing the connection with the hosts, you must also configure the timeouts on the hosts correctly. The recommended timeout is at least 30 seconds. This prevents the host from dropping the connection to the storage during the failover and ensures there is no I/O interruption.
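A back-of-the-envelope check of that recommendation, using the timings from the text (the 15-second safety margin is our own assumption, added for load during failover):

```python
# Figures from the article: 5 s to detect the failure, 5-10 s to switch.
DETECTION_S = 5
SWITCHOVER_MAX_S = 10
SAFETY_MARGIN_S = 15   # assumed headroom, not an article figure

def recommended_host_timeout_s() -> int:
    """The host-side I/O timeout must comfortably exceed the total
    failover window; the article recommends at least 30 seconds."""
    return max(30, DETECTION_S + SWITCHOVER_MAX_S + SAFETY_MARGIN_S)
```

So the 30-second recommendation lines up with the worst-case failover window plus margin.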

Wait a second: if the metrocluster is so good, why do we need regular replication at all?

In fact, everything is not so simple.

Consider the pros and cons of the metrocluster

So, we realized that the obvious advantages of the metrocluster compared to conventional replication are:

  • Full automation, ensuring minimal recovery time in the event of a disaster;
  • And that's it :-).

And now, attention, cons:

  • Solution cost. Although the metrocluster in AERODISK systems does not require additional licensing (the same license as for the replica is used), the solution will still cost more than synchronous replication alone. You will need to meet all the requirements for a synchronous replica, plus the metrocluster requirements for additional switching and an additional site (see metrocluster planning);
  • The complexity of the solution. A metrocluster is much more complex than a regular replica and requires much more attention and effort for planning, configuration, and documentation.

In the end: a metrocluster is certainly a very technological and good solution for when you really need an RTO of seconds or minutes. But if there is no such task and an RTO of hours is OK for the business, there is no point firing a cannon at sparrows. Plain old replication is enough, since a metrocluster brings additional costs and complicates the IT infrastructure.

Metrocluster planning

This section does not claim to be a comprehensive tutorial on metrocluster design; it only outlines the main areas you should work through if you decide to build such a system. So, during an actual metrocluster implementation, be sure to consult the manufacturer of the storage system (that is, us) and of the other systems involved.

Sites

As stated above, a metrocluster requires a minimum of three sites: two data centers, where the storage and related systems run, and a third site, where the arbiter runs.

The recommended distance between the data centers is no more than 40 kilometers. A greater distance is very likely to cause additional latency, which is highly undesirable in a metrocluster. Recall that latency should be under 5 milliseconds, ideally within 2.

It is also advisable to verify latency during planning. Any reasonably mature provider of fiber between data centers can organize a quality check fairly quickly.

As for latency to the arbiter (that is, between the third site and the first two), the recommended threshold is up to 200 milliseconds, so a regular corporate VPN connection over the Internet will do.

Switching and network

Unlike the replication scheme, where it is enough to connect the storage systems from the different sites, the metrocluster scheme requires connecting hosts to both storage systems at both sites. To make the difference clearer, both schemes are shown below.

[Diagram: host connection scheme with regular replication]

[Diagram: host connection scheme with a metrocluster]

As the diagram shows, the site 1 hosts see both storage system 1 and storage system 2, and, conversely, the site 2 hosts see both storage system 2 and storage system 1. That is, each host sees both storage systems. This is a prerequisite for the metrocluster to operate.

Of course, there is no need to run an optical cord from each host to the other data center; there would not be enough ports and cords. All these connections must go through Ethernet 10G+ or Fibre Channel 8G+ switches (FC is only for connecting hosts to storage for I/O; the replication channel is currently available only over IP, i.e. Ethernet 10G+).

Now a few words about the network topology. An important point is the correct configuration of subnets. You must immediately define several subnets for the following types of traffic:

  • A replication subnet, over which data is synchronized between the storage systems. There may be several of them; it does not matter here, it all depends on the current (already implemented) network topology. If there are two, routing between them obviously must be configured;
  • Storage subnets, through which hosts access the storage resources (for iSCSI). There should be one such subnet in each data center;
  • Control subnets, that is, three routable subnets at the three sites from which the storage systems are managed; the arbiter also lives there.

We do not consider subnets for accessing host resources here, since they are highly dependent on tasks.
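When drawing up the addressing plan, it is worth verifying that none of the subnets overlap. A small sketch using the standard library; the addressing plan itself is purely hypothetical:

```python
import ipaddress

# Hypothetical addressing plan; the real subnets depend on your topology.
subnets = {
    "replication":  ipaddress.ip_network("10.10.10.0/24"),
    "storage_dc1":  ipaddress.ip_network("10.10.20.0/24"),
    "storage_dc2":  ipaddress.ip_network("10.10.30.0/24"),
    "mgmt_dc1":     ipaddress.ip_network("10.10.40.0/24"),
    "mgmt_dc2":     ipaddress.ip_network("10.10.50.0/24"),
    "mgmt_arbiter": ipaddress.ip_network("10.10.60.0/24"),
}

def overlapping_pairs(nets: dict) -> list:
    """Return every pair of named subnets that overlap."""
    names = list(nets)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if nets[a].overlaps(nets[b])]
```

An empty result means the plan is clean; any returned pair is a candidate for the network conflicts described below.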

Separating the different kinds of traffic into different subnets is extremely important (it is especially important to separate the replica traffic from the I/O traffic): if you mix everything into one "thick" subnet, that traffic becomes impossible to manage, and across two data centers it can also cause various network conflicts. We will not go deeper into this question in this article; you can read about planning a network stretched between data centers on the resources of network equipment manufacturers, where it is described in great detail.

Arbiter Configuration

The arbiter needs access to all control interfaces of the storage systems via the ICMP and SSH protocols. You should also think about the fault tolerance of the arbiter. There is a nuance here.

Arbiter fault tolerance is highly desirable, but not required. What happens if the arbiter crashes at the wrong time?

  • The operation of the metrocluster in normal mode will not change, because the arbiter does not affect the metrocluster's normal operation at all (its task is to switch the load between the data centers in time);
  • However, if the arbiter for one reason or another falls and "sleeps through" a data center failure, no switching will occur, because there will be no one to give the necessary switching commands and organize a quorum. In that case, the metrocluster turns into a regular replication scheme that has to be switched manually during a disaster, which affects RTO.

What follows from this? If you really need a minimal RTO, you must ensure the fault tolerance of the arbiter. There are two options:

  • Run the arbiter virtual machine on a fault-tolerant hypervisor, since all mature hypervisors support fault tolerance;
  • If deploying a proper cluster at the third site (a conditional office) is impractical and there is no existing hypervisor cluster, we provide a hardware version of the arbiter: a 2U box containing two ordinary x86 servers that can survive a local failure.

We strongly recommend ensuring the fault tolerance of the arbiter, even though the metrocluster does not need it in normal mode. As both theory and practice show, if you are building a truly reliable disaster-tolerant infrastructure, it is better to play it safe: better to protect yourself and your business from the "law of meanness", that is, from the simultaneous failure of the arbiter and one of the sites hosting a storage system.

Solution architecture

Considering the requirements above, we get the following general solution architecture.

[Diagram: general metrocluster solution architecture]

LUNs should be evenly distributed across the two sites to avoid severe congestion. At the same time, when sizing, both data centers must provide not only double the volume (needed to store the data on two storage systems simultaneously) but also double the performance in IOPS and MB/s, to prevent application degradation if one of the data centers fails.

Separately, we note that with the proper approach to sizing (that is, provided we allow for proper headroom in IOPS and MB/s, as well as the necessary CPU and RAM resources), a failure of one of the storage systems in the metrocluster will not cause a serious performance drop during temporary operation on a single storage system.

This is because, while both sites are working simultaneously, synchronous replication "eats" half of the write performance, since each transaction must be written to two storage systems (similar to RAID-1/10). So when one of the storage systems fails, the replication effect temporarily (until the failed storage system comes back) disappears, and we get a twofold increase in write performance. Once the LUNs of the failed storage system are restarted on the working storage system, this twofold gain disappears again, because the LUNs of the other storage system add their load, and we return to the same performance level we had before the "fall", only now within a single site.
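The write-performance arithmetic above can be expressed as a toy model (the per-array ceiling is a made-up figure; the halving mirrors the RAID-1/10 analogy in the text):

```python
ARRAY_MAX_WRITE_IOPS = 100_000  # hypothetical per-array write ceiling

def available_write_iops(phase: str) -> int:
    """Write performance through the failure phases described above."""
    if phase == "normal":           # every write lands on both arrays
        return ARRAY_MAX_WRITE_IOPS // 2
    if phase == "partner_down":     # replication overhead gone: 2x boost
        return ARRAY_MAX_WRITE_IOPS
    if phase == "after_takeover":   # partner's LUNs now share this array
        return ARRAY_MAX_WRITE_IOPS // 2
    raise ValueError(f"unknown phase: {phase}")
```

The model shows why, with proper headroom, the surviving site ends up at the same effective level it had before the failure.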

With competent sizing, it is possible to provide conditions under which users will not notice the failure of an entire storage system at all. But, once again, this requires very careful sizing, for which, by the way, you can contact us for free :-).

Setting up a metrocluster

Setting up a metrocluster is very similar to setting up regular replication, which we described in the previous article, so let's focus on the differences. We set up a lab bench based on the architecture above, only in the minimal version: two storage systems connected to each other over 10G Ethernet, two 10G switches, and one host that looks through the switches at both storage systems with 10G ports. The arbiter runs in a virtual machine.

[Diagram: lab bench layout]

When configuring the virtual IP (VIP) for a replica, select the VIP type for a metrocluster.

[Screenshot: selecting the metrocluster VIP type]

We created two replication links for two LUNs and distributed them across the two storage systems: LUN TEST is Primary on storage1 (the METRO link), and LUN TEST2 is Primary on storage2 (the METRO2 link).

[Screenshot: the METRO and METRO2 replication links]

For them, we configured two identical targets (in our case iSCSI, but FC is also supported; the configuration logic is the same).

storage1:

[Screenshot: target configuration on storage1]

storage2:

[Screenshot: target configuration on storage2]

For the replication links, mappings were created on each storage system.

storage1:

[Screenshot: mapping on storage1]

storage2:

[Screenshot: mapping on storage2]

We set up multipath and presented the LUNs to the host.

[Screenshots: multipath devices on the host]

Setting up the arbiter

Nothing special needs to be done on the arbiter itself: just turn it on at the third site, set its IP, and configure access to it via ICMP and SSH. The configuration itself is performed from the storage systems. It is enough to configure the arbiter once on any of the storage controllers in the metrocluster; the settings will propagate to all controllers automatically.
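Before relying on the arbiter, it is convenient to smoke-test its connectivity to the storage management interfaces. A rough sketch (the ping flags assume Linux; a TCP connect to port 22 merely checks that SSH is reachable, not that authentication works):

```python
import socket
import subprocess

def ssh_reachable(host: str, timeout: float = 3.0) -> bool:
    """Check that a storage management interface answers on the SSH port."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False

def icmp_reachable(host: str) -> bool:
    """ICMP check via the system ping utility (Linux flags; raw ICMP
    sockets would require root privileges)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0
```

Run both checks from the arbiter against every controller's control interface before enabling the services.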

Go to Remote Replication >> Metrocluster (on any controller) >> the "Configure" button.

[Screenshot: the Metrocluster configuration menu]

We enter the IP of the arbiter, as well as the addresses of the control interfaces of the two remote storage controllers.

[Screenshot: entering the arbiter and remote controller addresses]

After that, enable all the services (the "Restart Everything" button). If you reconfigure later, the services must be restarted for the settings to take effect.

[Screenshot: restarting the metrocluster services]

We check that all services are running.

[Screenshot: service status]

This completes the metrocluster setup.

Crash test

The crash test in our case will be quite simple and fast, since the replication functionality (switching, consistency, etc.) was covered in the previous article. So, to test the reliability of the metrocluster, it is enough to check the automation of failure detection and switching, and the absence of write losses (I/O stops).

To do this, we emulate a complete failure of one storage system by physically powering off both of its controllers, after first starting to copy a large file to a LUN that should then be activated on the other storage system.

[Screenshot: file copy to the test LUN in progress]
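To quantify the pause rather than eyeball the copy, the test can use a probe that times every flushed write; the longest stall approximates the failover window seen by the host. A sketch (the target path is whatever mount point the metrocluster LUN is presented at on your host):

```python
import os
import time

def measure_longest_stall(path: str, duration_s: float = 60.0) -> float:
    """Write 1 MiB blocks with an fsync after each one and report the
    longest single-write stall in seconds. Run it against a file on the
    metrocluster LUN while powering off one storage system."""
    block = b"\0" * (1 << 20)
    longest = 0.0
    deadline = time.monotonic() + duration_s
    with open(path, "wb") as f:
        while time.monotonic() < deadline:
            start = time.monotonic()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
            longest = max(longest, time.monotonic() - start)
    return longest
```

With the 10-15 second switchover promised above and a 30-second host timeout, the longest stall should stay well under the timeout.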

We power off one storage system. On the second storage system, we see alerts and log messages saying that the connection with the neighboring system has been lost. If SMTP or SNMP notifications are configured, the administrator receives the corresponding alerts.

[Screenshot: alerts on the surviving storage system]

Exactly 10 seconds later (visible in both screenshots), the METRO replication link (the one that was Primary on the fallen storage system) automatically became Primary on the working storage system. Using the existing mapping, LUN TEST remained available to the host; the write rate dipped a little (within the promised 10 percent) but did not stop.

[Screenshots: the METRO link promoted to Primary; the copy continues]

The test completed successfully.

Summing up

The current implementation of the metrocluster in AERODISK Engine N-series storage systems fully solves the problem of eliminating or minimizing downtime of IT services and ensuring their 24/7/365 operation with minimal labor costs.

You may say, of course, that all this is theory and ideal laboratory conditions, and so on... BUT we have a number of completed projects in which we implemented the disaster recovery functionality, and the systems work perfectly. One of our fairly well-known customers, which uses just two storage systems in a disaster-tolerant configuration, has already agreed to publish information about the project, so in the next part we will cover a production deployment.

Thanks, looking forward to a productive discussion.

Source: habr.com
