Disaster Resilient Cloud: How It Works

Hey Habr!

After the New Year holidays, we relaunched the disaster-resilient cloud built on two sites. Today we will tell you how it works and show what happens to client virtual machines when individual cluster elements fail and when an entire site goes down (spoiler: they're fine).

The disaster-resilient cloud's storage system at the OST site.

What's inside

Under the hood, the cluster runs Cisco UCS servers with the VMware ESXi hypervisor, two INFINIDAT InfiniBox F2240 storage systems, Cisco Nexus network equipment, and Brocade SAN switches. The cluster spans two sites, OST and NORD, so each data center has an identical set of equipment. This is, in fact, what makes it disaster-resilient.

Within one site, the main elements are also duplicated (hosts, SAN switches, network equipment).
The two sites are connected by dedicated fiber-optic routes, which are also redundant.

A few words about storage. We built the first version of the disaster-resilient cloud on NetApp. This time we chose INFINIDAT, and here's why:

  • Active-Active replication option. It allows a virtual machine to remain healthy even if one of the storage systems fails completely. We will cover replication in more detail below.
  • Three disk controllers to improve system resiliency. Usually there are two.
  • Ready-made solution. The rack arrived already assembled; it only needed to be connected to the network and configured.
  • Attentive technical support. INFINIDAT engineers constantly analyze storage logs and events, install new firmware versions, and help with configuration.

Here are some photos from unpacking:


How it works

The cloud is already fault-tolerant within a single site: it protects the client from isolated hardware and software failures. The disaster-resilient cloud goes further and protects against massive failures within one site: for example, the loss of a storage system (or of an SDS cluster, which happens often 🙂), massive errors in the storage network, and so on. And most importantly, such a cloud saves the day when an entire site becomes unavailable due to a fire, a blackout, a hostile takeover, or an alien landing.

In all these cases, the client virtual machines continue to run, and here's why.

The cluster is designed so that any ESXi host with client virtual machines can access either of the two storage systems. If the storage system at the OST site fails, the virtual machines keep working: the hosts they run on simply fetch their data from the storage system at NORD.

This is what the connection diagram of the cluster looks like.

This is possible because an Inter-Switch Link (ISL) is configured between the SAN fabrics of the two sites: the Fabric A SAN switch at OST is connected to the Fabric A SAN switch at NORD, and likewise for the Fabric B switches.

And for all these SAN fabric intricacies to pay off, Active-Active replication is configured between the two storage systems: data is written almost simultaneously to the local and the remote storage system, with RPO=0. In other words, the original data lives on one storage system and its replica on the other. Replication works at the level of storage volumes, and those volumes hold the VM data (its disks, configuration file, swap file, and so on).
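To make the RPO=0 point concrete, here is a minimal sketch of synchronous Active-Active replication in plain Python (an illustration of the idea, not INFINIDAT's actual protocol): the host gets a write acknowledgment only after both arrays have committed the block, so an acknowledged write can never be missing on the surviving array.

    # Simplified model of synchronous (Active-Active) replication.
    # Illustrates the RPO=0 idea; not INFINIDAT's actual protocol.

    class Array:
        """A storage array that keeps committed writes."""
        def __init__(self, name):
            self.name = name
            self.blocks = {}                    # lba -> data

        def commit(self, lba, data):
            self.blocks[lba] = data
            return True                         # acknowledge the write

    def replicated_write(local, remote, lba, data):
        """Acknowledge the host only after BOTH arrays committed the block."""
        ok_local = local.commit(lba, data)
        ok_remote = remote.commit(lba, data)
        return ok_local and ok_remote           # no ack -> the write doesn't count

    ost, nord = Array("OST"), Array("NORD")
    assert replicated_write(ost, nord, lba=42, data=b"vm-disk-block")
    assert ost.blocks[42] == nord.blocks[42]    # both sites hold the same data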

The ESXi host sees the primary volume and its replica as a single disk device (Storage Device). There are 24 paths from the ESXi host to each disk device: 12 of them lead to the local storage system (optimal paths) and the remaining 12 to the remote one (non-optimal paths). Under normal conditions, ESXi accesses data on the local storage system via the optimal paths. When that storage system fails, ESXi loses the optimal paths and switches to the non-optimal ones. Here is how it looks on the diagram.

Diagram of the disaster-tolerant cluster.
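Here is a rough Python model of the path-selection behaviour just described: prefer the optimal paths to the local array and fall back to the non-optimal ones only when the optimal paths disappear. The names and structures are illustrative; this is not the ESXi multipathing API.

    # Toy model of the multipathing behaviour described above.
    from dataclasses import dataclass

    @dataclass
    class Path:
        target: str          # which array the path leads to: "OST" or "NORD"
        optimal: bool        # True for paths to the local array
        alive: bool = True

    # 24 paths per disk device: 12 optimal (local), 12 non-optimal (remote).
    paths = [Path("OST", optimal=True) for _ in range(12)] + \
            [Path("NORD", optimal=False) for _ in range(12)]

    def pick_path(paths):
        alive = [p for p in paths if p.alive]
        if not alive:
            # all paths down: the VM cannot reach its disks,
            # and vSphere HA restarts it on another host
            raise RuntimeError("all paths down")
        optimal = [p for p in alive if p.optimal]
        return (optimal or alive)[0]             # prefer optimal, else any alive

    print(pick_path(paths).target)               # OST: local array, optimal path

    for p in paths:                              # the local (OST) array fails...
        if p.target == "OST":
            p.alive = False
    print(pick_path(paths).target)               # NORD: failover to non-optimal paths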

All client networks are brought to both sites through a common network fabric. A Provider Edge (PE) router runs at each site, and the client's networks are terminated on it. The PEs are joined into a single cluster. If the PE at one site fails, all traffic is redirected to the second site, so the virtual machines at the site left without a PE remain reachable for the client over the network.

Now let's see what happens to client virtual machines during various failures. We'll start with the mildest cases and finish with the most serious one, the failure of an entire site. In the examples, OST is the primary site and NORD is the backup site holding the data replicas.

What happens to the client virtual machine if…

The replication link fails. Replication between the storage systems of the two sites stops.
ESXi hosts keep working with their local disk devices only (via the optimal paths).
The virtual machines continue to run.


An ISL (Inter-Switch Link) break occurs. An unlikely case: it would take some crazed excavator digging up several optical routes at once, even though they run along independent paths and enter the sites through different inlets. But still. In this case, the ESXi hosts lose half of their paths and can access only their local storage systems. The replicas are still being written, but the hosts cannot reach them.

Virtual machines are working properly.


A SAN switch fails at one of the sites. ESXi hosts lose some of their paths to the storage systems. The hosts at the site where the switch failed will work through only one of their HBAs.

Virtual machines continue to work normally.


All SAN switches fail at one of the sites. Let's say this misfortune strikes the OST site. In that case, the ESXi hosts at this site lose all paths to their disk devices. The standard VMware vSphere HA mechanism kicks in: it restarts all virtual machines of the OST site at NORD within at most 140 seconds.

Virtual machines running on the hosts of the NORD site are working normally.
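As a side note, whether vSphere HA is armed on the cluster can be checked through the vSphere API. Below is a sketch using the pyVmomi SDK; the vCenter address and credentials are placeholders, and a real setup would verify TLS certificates instead of disabling the check.

    # Check that vSphere HA (DAS) is enabled on every cluster (pyVmomi sketch).
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()            # lab use only
    si = SmartConnect(host="vcenter.example.local",   # placeholder vCenter
                      user="administrator@vsphere.local",
                      pwd="***", sslContext=ctx)
    content = si.RetrieveContent()

    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    for cluster in view.view:
        das = cluster.configurationEx.dasConfig       # vSphere HA settings
        print(cluster.name, "HA enabled:", das.enabled)

    Disconnect(si)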


An ESXi host fails at one site. The vSphere HA mechanism comes into play again: virtual machines from the failed host are restarted on other hosts, either at the same site or at the remote one. The restart time is up to 1 minute.

If all ESXi hosts of the OST site fail, there are no other options: the VMs are restarted at the other site. The restart time is the same.


A storage system fails at one site. Say the storage system at the OST site goes down. Then the ESXi hosts of the OST site switch to working with the storage replicas at NORD. After the failed storage system comes back online, a forced resynchronization takes place, and the OST ESXi hosts again start accessing their local storage system.

The virtual machines keep working normally the whole time.


An entire site fails. In this case, all virtual machines are restarted at the standby site via the vSphere HA mechanism. The VM restart time is 140 seconds. All network settings of the virtual machines are preserved, and they remain reachable for the client over the network.

So that the machines can be restarted at the backup site without problems, each site is kept only half full. The other half is a reserve in case all the virtual machines have to move over from the affected site.
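The capacity math behind that reserve is simple. Here is an illustrative calculation (the numbers are made up): as long as each site stays at or below 50% load, the surviving site can absorb everything from the failed one.

    # Illustrative capacity headroom check; the figures are invented.
    site_capacity = 1000                 # usable resources per site, arbitrary units
    max_load_per_site = 0.5 * site_capacity

    ost_load, nord_load = 420, 480       # current load, both under the 50% cap
    assert ost_load <= max_load_per_site and nord_load <= max_load_per_site

    # If OST fails completely, NORD must carry both loads:
    assert ost_load + nord_load <= site_capacity     # everything fits on one site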


This is the kind of failure a disaster-resilient cloud based on two data centers protects against.

This pleasure is not cheap: in addition to the primary resources, you need a reserve at the second site. So such a cloud is used for business-critical services whose prolonged downtime would mean large financial and reputational losses, and for information systems subject to disaster-tolerance requirements from regulators or internal company policies.


