Hands-free admin = hyperconvergence?

This is a common myth in the server hardware world. In practice, hyperconverged solutions (where everything lives in one box) are useful for quite a few things. Historically, the first such architectures were built by Amazon and Google for their own services. The idea was to build a compute farm out of identical nodes, each with its own disks, tie it all together with a layer of cluster software (a hypervisor) and carve it into virtual machines. The main goal was a minimum of effort to maintain a single node and a minimum of pain when scaling: you simply bought another thousand or two of the same servers and racked them next to the existing ones. In practice, that scale is the exception; far more often we are talking about a smaller number of nodes and a slightly different architecture.

But the upside stays the same: remarkable ease of scaling and management. The downside: different workloads consume resources differently, so one node ends up with plenty of local disk while another is short on RAM, and with mixed workloads overall resource utilization drops.

In the end you pay roughly 10-15% extra for the convenience of administration, which is what spawned the myth in the headline. We spent a long time looking for where this technology fits best, and we found it. The thing is, Cisco had no storage arrays of its own but wanted a complete server offering, so it built Cisco HyperFlex, a solution with local storage on the nodes.

This unexpectedly turned out to be a very good fit for backup (disaster recovery) data centers. Below I'll explain why and how, and show the cluster test results.

Where it is needed

Hyperconvergence is:

  1. Transferring disks to computing nodes.
  2. Full integration of the storage subsystem with the virtualization subsystem.
  3. Transfer of, or integration with, the network subsystem.

This bundle lets you implement many storage features at the virtualization layer and manage everything from a single window.

In our company, projects for designing backup data centers are in great demand, and a hyperconverged solution is often chosen because of the wealth of replication options (up to a metro cluster) it offers out of the box.

A backup data center usually means a remote facility at a site on the other side of the city, or even in another city, that lets you restore critical systems after a partial or complete failure of the primary data center. Production data is continuously replicated there, either at the application level or at the block-device (storage array) level.

So below I'll describe the design of the system and the tests, and then walk through a couple of real-life scenarios with cost figures.

Tests

Our instance consists of four servers, each with ten 960 GB SSDs. A dedicated disk is used for write caching and for storing the service virtual machine. The solution itself is version four. The first version was frankly raw (judging by the reviews), the second was still raw, the third was already reasonably stable, and this one can be called a proper release now that the public beta test is over. During testing I didn't hit any problems; everything works like clockwork.

Changes in v4: a lot of bugs have been fixed.

Initially the platform only worked with the VMware ESXi hypervisor and supported a small number of nodes. The deployment process didn't always finish successfully, some steps had to be restarted, upgrades from older versions were problematic, data in the GUI wasn't always displayed correctly (though I'm still not thrilled with the performance graphs), and there were occasional issues at the boundary with the virtualization layer.

Now those teething problems have been fixed, HyperFlex supports both ESXi and Hyper-V, and in addition it is possible to:

  1. Create a stretched cluster.
  2. Create a cluster for branch offices without Fabric Interconnect, from two to four nodes (you buy only the servers).
  3. Work with external storage systems.
  4. Use containers and Kubernetes.
  5. Create availability zones.
  6. Integrate with VMware SRM if the built-in functionality doesn't suit you.

The architecture doesn't differ much from the main competitors' solutions; nobody reinvented the wheel here. It all runs on the VMware or Hyper-V virtualization platform, hosted on Cisco's own UCS servers. Some people hate the platform for the relative complexity of the initial setup, the abundance of knobs and the non-trivial system of templates and dependencies, while others have grasped its Zen, bought into the idea and no longer want to work with any other servers.

We'll look at the VMware flavor, since the solution was originally built for it and offers more functionality there; Hyper-V support was added along the way to keep up with competitors and meet market expectations.

The cluster is built from servers stuffed with disks. There are disks for data storage (SSD or HDD, to taste and need) and one SSD disk for caching. When data is written to a datastore, it first lands on the caching layer (a dedicated SSD and the RAM of the service VM). In parallel, the block is sent to other nodes in the cluster (how many depends on the cluster replication factor). Only after every node confirms a successful write is the acknowledgment sent to the hypervisor and then to the VM. The written data is deduplicated, compressed and destaged to the storage disks in the background, always in large sequential chunks, which reduces the load on the storage disks.
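
To make the write path easier to follow, here is a minimal conceptual sketch in Python. It is not HyperFlex code: the node names, the hash-based replica placement and the acknowledgment logic are illustrative assumptions that only mirror the sequence described above (cache, replicate, acknowledge, then destage with deduplication and compression in the background).

```python
# Conceptual sketch of the write path described above; NOT HyperFlex source code.
# Node names, hash-based placement and the ack logic are illustrative assumptions.
import hashlib
import zlib

NODES = ["node1", "node2", "node3", "node4"]
REPLICATION_FACTOR = 3

def replica_nodes(block_id: str) -> list[str]:
    """Pick RF nodes for a block, independent of where the VM runs."""
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

def write_block(block_id: str, data: bytes, cache: dict) -> str:
    targets = replica_nodes(block_id)
    # 1. Land the block on the caching tier (SSD + controller-VM RAM) of every target node.
    for node in targets:
        cache.setdefault(node, {})[block_id] = data  # stands in for the network send
    # 2. Only after all replicas confirm is the write acknowledged to the hypervisor and the VM.
    return f"ack to VM after {len(targets)} replicas confirmed"

def destage(cache: dict, capacity_tier: dict) -> None:
    """Background destaging: deduplicate, compress, then write to the capacity disks."""
    for node, blocks in cache.items():
        for block_id, data in blocks.items():
            fingerprint = hashlib.sha256(data).hexdigest()  # dedupe key
            if fingerprint not in capacity_tier.setdefault(node, {}):
                capacity_tier[node][fingerprint] = zlib.compress(data)
        blocks.clear()

cache, capacity = {}, {}
print(write_block("vmdk-42:block-7", b"some data" * 100, cache))
destage(cache, capacity)
```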

Deduplication and compression are always on and cannot be disabled. Data is read either straight from the storage disks or from the RAM cache. In a hybrid configuration, reads are additionally cached on the SSD.

The data is not tied to the current location of the virtual machine and is distributed evenly across the nodes. This approach loads all disks and network interfaces equally. The obvious downside: read latency cannot be minimized, because there is no guarantee the data is local. But I consider that a small sacrifice compared to the benefits gained; besides, network latencies have become so low that they barely affect the overall result.

A special service VM, the Cisco HyperFlex Data Platform controller, created on each storage node, is responsible for all of the disk subsystem logic. In our configuration the service VM was allocated eight vCPUs and 72 GB of RAM, which is not exactly a little. As a reminder, the host itself has 28 physical cores and 512 GB of RAM.

The service VM gets direct access to the physical disks because the SAS controller is passed through to the VM. Communication with the hypervisor happens through a special IOVisor module that intercepts I/O operations, and through an agent that can send commands to the hypervisor API. The agent is responsible for HyperFlex snapshots and clones.

In the hypervisor, the disk resources are mounted as NFS or SMB shares (depending on the hypervisor type; guess which is which). Under the hood it is a distributed file system that adds features of grown-up, full-fledged storage arrays: thin volume provisioning, compression and deduplication, snapshots using Redirect-on-Write technology, and synchronous/asynchronous replication.
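
A short generic illustration of the Redirect-on-Write idea (not HyperFlex internals): a snapshot only freezes the current block map, and every new write is redirected to a fresh location, so nothing is copied at snapshot time.

```python
# Generic Redirect-on-Write illustration; the class and field names are made up.
class RoWVolume:
    def __init__(self):
        self.block_map = {}   # logical block -> physical location of the live data
        self.snapshots = {}   # snapshot name -> frozen copy of the block map
        self._next_physical = 0

    def write(self, lba: int, data: bytes, storage: dict) -> None:
        loc = self._next_physical        # always redirect the write to a new location
        self._next_physical += 1
        storage[loc] = data
        self.block_map[lba] = loc        # old data stays put, still referenced by snapshots

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = dict(self.block_map)   # copy pointers only, no data

storage = {}
vol = RoWVolume()
vol.write(0, b"v1", storage)
vol.snapshot("before-upgrade")
vol.write(0, b"v2", storage)
assert storage[vol.snapshots["before-upgrade"][0]] == b"v1"   # snapshot still sees v1
assert storage[vol.block_map[0]] == b"v2"                     # live volume sees v2
```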

The service VM provides access to the web interface for managing the HyperFlex subsystem. There is vCenter integration, and most day-to-day tasks can be performed from it, but datastores, for example, are more convenient to carve out from the separate web UI, whether you have already switched to the fast HTML5 interface or still use the full-fledged Flash client with complete integration. In the service web UI you can see the performance figures and the detailed system status.

There is another type of node in the cluster: compute nodes. These are rack or blade servers without built-in drives. VMs can run on these servers while their data is stored on the servers with disks. From the data-access point of view there is no difference between node types, because the architecture abstracts away the physical location of the data. The maximum ratio of compute nodes to storage nodes is 2:1.

Using compute nodes increases flexibility when scaling cluster resources: we don't have to buy extra nodes with disks if all we need is CPU/RAM. We can also add a blade chassis and save on rack space.

As a result, we have a hyperconverged platform with the following features:

  • Up to 64 nodes per cluster (up to 32 storage nodes).
  • The minimum number of nodes in a cluster is three (two for an Edge cluster).
  • Data redundancy mechanism: mirroring with replication factor 2 or 3.
  • Metro cluster.
  • Asynchronous VM replication to another HyperFlex cluster.
  • Orchestration of VM failover to a remote data center.
  • Native snapshots using Redirect-on-Write technology.
  • Up to 1 PB of usable space with replication factor 3 and no deduplication. We don't count replication factor 2, because it is not an option for serious production (a quick capacity sketch follows this list).
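
To get a feel for what mirroring does to capacity, here is a back-of-the-envelope sketch using the test bench from this article; it deliberately ignores metadata, reserve space and the cache/system drives, which real sizing tools do subtract.

```python
# Rough usable-capacity estimate under mirroring; ignores metadata and reserve overhead.
def usable_capacity_tb(nodes: int, drives_per_node: int, drive_tb: float,
                       replication_factor: int) -> float:
    raw_tb = nodes * drives_per_node * drive_tb
    return raw_tb / replication_factor

# The test bench from this article: 4 nodes x 10 x 0.96 TB data drives, RF 3.
print(usable_capacity_tb(4, 10, 0.96, 3))   # ~12.8 TB before deduplication/compression
```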

Another huge plus is ease of management and deployment. All the complexity of configuring the UCS servers is handled by a specialized VM prepared by Cisco engineers.

Test bench configuration:

  • 2 x Cisco UCS Fabric Interconnect 6248UP as the management cluster and network components (48 unified ports operating in 10G Ethernet or 16G FC mode).
  • Four Cisco UCS HXAF240 M4 servers.

Server characteristics:

CPU: 2 x Intel Xeon E5-2690 v4

RAM: 16 x 32 GB DDR4-2400 RDIMM / PC4-19200 / dual rank / x4 / 1.2 V

Network: UCSC-MLOM-CSC-02 (VIC 1227), 2 x 10G Ethernet ports

Storage HBA: Cisco 12G Modular SAS Pass-Through Controller

Storage disks: 1 x SSD Intel S3520 120 GB, 1 x SSD Samsung MZ-IES800D, 10 x SSD Samsung PM863a 960 GB

More configuration options: in addition to the selected hardware, the following options are currently available:

  • HXAF240c M5.
  • One or two CPUs, from Intel Xeon Silver 4110 to Intel Xeon Platinum 8260Y; second-generation Scalable processors are available.
  • 24 memory slots, with modules from 16 GB RDIMM 2666 to 128 GB LRDIMM 2933.
  • From 6 to 23 data drives, one cache drive, one system drive and one boot drive.

Capacity Drives

  • HX-SD960G61X-EV 960GB 2.5 Inch Enterprise Value 6G SATA SSD (1X endurance).
  • HX-SD38T61X-EV 3.8TB 2.5 inch Enterprise Value 6G SATA SSD (1X endurance).

Caching Drives

  • HX-NVMEXPB-I375 375GB 2.5 inch Intel Optane Drive, Extreme Perf & Endurance.
  • HX-NVMEHW-H1600* 1.6TB 2.5 inch Ent. Perf. NVMe SSD (3X endurance).
  • HX-SD400G12TX-EP 400GB 2.5 inch Ent. Perf. 12G SAS SSD (10X endurance).
  • HX-SD800GBENK9** 800GB 2.5 inch Ent. Perf. 12G SAS SED SSD (10X endurance).
  • HX-SD16T123X-EP 1.6TB 2.5 inch Enterprise Performance 12G SAS SSD (3X endurance).

System/Log Drives

  • HX-SD240GM1X-EV 240GB 2.5 inch Enterprise Value 6G SATA SSD (Requires upgrade).

Boot Drives

  • HX-M2-240GB 240GB SATA M.2 SSD.

Network connection via 40G, 25G or 10G Ethernet ports.

The FI can be HX-FI-6332 (40G), HX-FI-6332-16UP (40G) or HX-FI-6454 (40G/100G).

The test itself

To test the disk subsystem, I used HCIBench 2.2.1. This is a free utility that automates generating load from several virtual machines. The load itself is produced by plain fio.

Our cluster consists of four nodes with replication factor 3; all disks are flash.

For testing, I created four datastores and eight virtual machines. For write tests, it is assumed that the caching disk is not full.
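
For orientation, here is roughly how one test point could be reproduced with fio directly. The file path, size and runtime below are my assumptions; HCIBench actually generates and distributes its own fio job files across the worker VMs.

```python
# Hypothetical fio invocation mirroring one point of the results table below.
import subprocess

def run_fio(test_file: str, block_size: str, queue_depth: int, read_pct: int) -> None:
    subprocess.run([
        "fio",
        "--name=hcibench-like",
        f"--filename={test_file}",   # a test file on the datastore under test (assumed path)
        "--ioengine=libaio",
        "--direct=1",                # bypass the guest page cache
        f"--rw={'randread' if read_pct == 100 else 'randwrite'}",
        f"--bs={block_size}",
        f"--iodepth={queue_depth}",
        "--numjobs=1",
        "--time_based",
        "--runtime=300",
        "--size=20G",
    ], check=True)

# For example, the 4K / QD 128 / 100% random read point:
run_fio("/mnt/testds/vm01.dat", "4k", 128, 100)
```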

The test results are as follows:

100% Read, 100% Random (latency / IOPS):

Block | QD 128            | QD 256            | QD 512            | QD 1024           | QD 2048
4K    | 0.59 ms / 213804  | 0.84 ms / 303540  | 1.36 ms / 374348  | 2.47 ms / 414116  | 4.86 ms / 420180
8K    | 0.67 ms / 188416  | 0.93 ms / 273280  | 1.70 ms / 299932  | 2.72 ms / 376484  | 5.47 ms / 373176
16K   | 0.77 ms / 164116  | 1.12 ms / 228328  | 1.90 ms / 268140  | 3.96 ms / 258480  | -
32K   | 1.07 ms / 119292  | 1.79 ms / 142888  | 3.56 ms / 143760  | -                 | -
64K   | 1.84 ms / 69440   | 3.60 ms / 71008   | 7.26 ms / 70404   | -                 | -

0% Read (100% Write), 100% Random (latency / IOPS):

Block | QD 128            | QD 256            | QD 512            | QD 1024           | QD 2048
4K    | 2.22 ms / 57408   | 3.09 ms / 82744   | 5.02 ms / 101824  | 8.75 ms / 116912  | 17.2 ms / 118592
8K    | 3.10 ms / 41148   | 4.70 ms / 54396   | 7.09 ms / 72192   | 12.77 ms / 80132  | -
16K   | 3.80 ms / 33640   | 6.97 ms / 36696   | 11.35 ms / 45060  | -                 | -
32K   | 7.17 ms / 17810   | 11.96 ms / 21396  | -                 | -                 | -
64K   | 11.37 ms / 11248  | -                 | -                 | -                 | -

Past a certain queue depth there is no further increase in performance, and sometimes even a slight degradation: we run into the limits of the network, the controllers or the disks.

  • Sequential read 4432 MB/s.
  • Sequential write 804 MB/s.
  • If one controller fails (a virtual machine or host failure), performance drops by half.
  • If a storage disk fails, performance drops by about a third. Rebuilding the disk consumes up to 5% of the resources of each controller.

On small blocks we are limited by the performance of the controller (virtual machine): its CPU is 100% loaded. As the block size grows, we run into the port bandwidth; 10 Gbit/s is not enough to unlock the potential of an all-flash system. Unfortunately, the configuration of the demo stand did not allow testing at 40 Gbit/s.
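
A rough sanity check of that observation, using the 64K random-read numbers from the table above and assuming the I/O is spread evenly across the four nodes (real traffic also carries replica and metadata overhead, so the per-node figure is a lower bound):

```python
# Back-of-the-envelope check of the "we hit the 10G port" observation.
def per_node_gbit(iops: int, block_bytes: int, nodes: int) -> float:
    return iops * block_bytes * 8 / nodes / 1e9

# ~71,000 IOPS cluster-wide at 64K random read (QD 256) on four nodes:
print(per_node_gbit(71_000, 64 * 1024, 4))   # ~9.3 Gbit/s, right at 10GbE line rate
```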

My impression from the tests and from studying the architecture: the algorithm that spreads data across all hosts gives scalable, predictable performance, but it is also a limitation on reads, because more could be squeezed out of the local disks. A faster network could help here; for example, 40 Gbit/s Fabric Interconnects are available.

The single disk used for caching and deduplication may also be a limitation: in this stand we are effectively writing to four SSDs. It would be great to be able to increase the number of caching disks and see the difference.

Real-world use

To build a backup data center, you can take one of two approaches (we are not considering simply placing backups at a remote site):

  1. Active-Passive. All applications are hosted in the primary data center. Replication is synchronous or asynchronous. If the primary data center goes down, we have to activate the backup one, either manually, with scripts or with orchestration tools. Here the RPO is comparable to the replication frequency, and the RTO depends on the reaction time and skill of the administrator and on how well the failover plan was developed and rehearsed.
  2. Active-Active. In this case there is only synchronous replication; the availability of the data centers is determined by a quorum/arbiter placed strictly on a third site. RPO = 0, and RTO can approach 0 (if the application allows it) or equal the time it takes to fail over a node in the virtualization cluster. At the virtualization level, a stretched (metro) cluster is created, which requires Active-Active storage.

Usually we see customers who already have a classic-storage architecture in the primary data center, so we design another one for replication. As I mentioned, Cisco HyperFlex offers asynchronous replication and stretched virtualization clusters out of the box. At the same time, we don't need a dedicated midrange-or-higher storage array with expensive replication features and Active-Active data access across two arrays.

Scenario 1: a primary and a backup data center, with a virtualization platform based on VMware vSphere. All production systems sit in the primary data center, and virtual machine replication is done at the hypervisor level, which lets us avoid keeping powered-on VMs in the backup data center. Databases and special applications are replicated with their built-in tools, and those VMs stay powered on. If the primary data center fails, we start the systems in the backup data center. Assume we have about 100 virtual machines. While the primary data center is operational, the backup one can run test environments and other systems that can be shut down when the primary data center fails over. Two-way replication is also possible; in terms of hardware, nothing changes.

With a classic architecture, we would put in each data center a hybrid storage array with FibreChannel access, tiering, deduplication and compression (but not inline), 8 servers per site, 2 FibreChannel switches and 10G Ethernet. For replication and failover management in the classic architecture we can use VMware tools (vSphere Replication + SRM) or third-party tools, which are a bit cheaper and sometimes more convenient.

The classic option is shown in the diagram below.

[Figure: classic architecture diagram]

With Cisco HyperFlex, the architecture looks like this:

[Figure: Cisco HyperFlex architecture diagram]

For HyperFlex I used servers with more CPU/RAM, because part of the resources goes to the HyperFlex controller VM; in terms of CPU and memory I even over-provisioned the HyperFlex configuration slightly, so as not to play along with Cisco and to guarantee resources for the rest of the VMs. In exchange, we can drop the FibreChannel switches, and we don't need Ethernet ports for every server: local traffic is switched inside the FI.

The result is the following configuration for each data center:

          | Classic architecture                                            | HyperFlex
Servers   | 8 x 1U server (384 GB RAM, 2 x Intel Gold 6132, FC HBA)         | 8 x HX240C-M5L (512 GB RAM, 2 x Intel Gold 6150, 3.2 TB SSD, 10 x 6 TB NL-SAS)
Storage   | Hybrid storage with FC front-end (20 TB SSD, 130 TB NL-SAS)     | -
LAN       | 2 x Ethernet switch 10G, 12 ports                                | -
SAN       | 2 x FC switch 32/16 Gb, 24 ports                                 | 2 x Cisco UCS FI 6332
Licenses  | VMware Ent Plus, plus replication and/or VM failover orchestration | VMware Ent Plus

For HyperFlex I did not include replication software licenses, since replication is available out of the box.

For the classic architecture I chose a vendor that has established itself as a high-quality yet inexpensive manufacturer. For both options I applied the standard discount for the specific solution, and the output was real-world prices.

The Cisco HyperFlex solution turned out to be 13% cheaper.

Scenario 2: building two active data centers. In this scenario we design a stretched cluster on VMware.

The classic architecture consists of virtualization servers, a SAN (FC protocol) and two storage arrays that can read and write to a volume stretched between them. On each array we provision the full usable capacity, since each holds a complete copy of the data.

[Figure: classic stretched-cluster architecture diagram]

With HyperFlex, we simply create a stretch cluster with the same number of nodes on both sites, using a 2+2 replication factor.
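
To illustrate what the 2+2 replication factor means for capacity, here is a quick calculation based on the scenario 2 HyperFlex configuration below; it ignores metadata and reserve overhead.

```python
# Illustrative stretched-cluster sizing: replication factor 2+2 keeps two copies per site.
def stretch_usable_tb(nodes_total: int, drives_per_node: int, drive_tb: float) -> float:
    raw_tb = nodes_total * drives_per_node * drive_tb
    return raw_tb / 4   # 2 copies per site x 2 sites

# Scenario 2 below: 16 x HX240C-M5L with 12 x 3.8 TB SSD each
print(stretch_usable_tb(16, 12, 3.8))   # ~182 TB before deduplication/compression
```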

[Figure: HyperFlex stretched-cluster diagram]

The result is the following configuration:

          | Classic architecture                                                   | HyperFlex
Servers   | 16 x 1U server (384 GB RAM, 2 x Intel Gold 6132, FC HBA, 2 x 10G NIC)  | 16 x HX240C-M5L (512 GB RAM, 2 x Intel Gold 6132, 1.6 TB NVMe, 12 x 3.8 TB SSD, VIC 1387)
Storage   | 2 x all-flash storage (150 TB SSD)                                      | -
LAN       | 4 x Ethernet switch 10G, 24 ports                                       | -
SAN       | 4 x FC switch 32/16 Gb, 24 ports                                        | 4 x Cisco UCS FI 6332
Licenses  | VMware Ent Plus                                                         | VMware Ent Plus

In all the calculations I did not include the network infrastructure, data center costs, etc.: they are the same for the classic architecture and for the HyperFlex solution.

In terms of cost, HyperFlex came out 5% more expensive. It's worth noting that the CPU/RAM resources ended up skewed in Cisco's favor, because in its configuration I populated the memory controller channels evenly. The cost is slightly higher, but not by an order of magnitude, which clearly shows that hyperconvergence is not necessarily a "toy for the rich" and can compete with the standard approach to building a data center. It may also be interesting for those who already have Cisco UCS servers and the corresponding infrastructure.

Among the pluses we get: no costs for administering the SAN and storage arrays, inline compression and deduplication, a single point of support (virtualization, servers and storage are one and the same), rack space savings (though not in all scenarios), and simpler operations.

As for support, you get it from a single vendor, Cisco. Judging by my experience with Cisco UCS servers, I like it; I haven't had to open a case for HyperFlex, everything just worked. The engineers respond quickly and can solve not only typical problems but also complex edge cases. Sometimes I come to them with questions like "Is it possible to do this and that?" or "I configured something here and it refuses to work. Help!", and they will patiently find the right guide and point out the correct steps rather than answer "We only solve hardware problems."

Source: habr.com
