In this article, I would like to talk about the specifics of using AccelStor All Flash arrays together with one of the most popular virtualization platforms, VMware vSphere, and in particular to focus on the parameters that will help you get the maximum effect from such a powerful tool as All Flash.
AccelStor NeoSapphire™ All Flash Arrays are high-performance SSD-based storage systems available with Fibre Channel or iSCSI host interfaces.
The entire process of deployment and subsequent configuration of the AccelStor array and the VMware vSphere virtualization system can be divided into several stages:
- Implementing the connection topology and configuring the SAN network;
- Setting up the All Flash array;
- Setting up the ESXi hosts;
- Setting up the virtual machines.
The sample hardware consisted of AccelStor NeoSapphire™ Fibre Channel and iSCSI arrays; the base software is VMware vSphere 6.7U1.
Before deploying the systems described in this article, it is highly recommended that you read the VMware documentation on performance-related issues.
Topology of connection and configuration of SAN network
The main components of a SAN network are HBAs in ESXi hosts, SAN switches, and array nodes. A typical topology for such a network would look like this:
The term Switch here means either a dedicated physical switch or set of switches (Fabric), or a device shared with other services (a VSAN in the case of Fibre Channel and a VLAN in the case of iSCSI). Using two independent switches/Fabrics eliminates a single point of failure.
Directly connecting hosts to the array, although supported, is strongly discouraged. The performance of All Flash arrays is quite high, and to reach maximum speed you need to use all ports of the array; therefore, at least one switch between the hosts and the NeoSapphire™ is mandatory.
Having two ports per host HBA is also a requirement for maximum performance and fault tolerance.
When using the Fibre Channel interface, zoning must be configured to eliminate possible collisions between initiators and targets. Zones are built on the principle of "one initiator port - one or more array ports."
If an iSCSI connection goes through a switch shared with other services, it is imperative to isolate iSCSI traffic in a separate VLAN. It is also highly recommended to enable Jumbo Frames (MTU = 9000) to increase the packet size on the network and thereby reduce transmission overhead. Remember, though, that for correct operation the MTU must be changed on every network component along the "initiator-switch-target" chain.
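A quick way to verify that Jumbo Frames really work along the whole chain is to send a non-fragmentable packet of almost full MTU size from the host to the array; in this sketch vmk1 and 192.168.10.10 are placeholders for your iSCSI VMkernel port and an array port IP:

# 8972 bytes of payload = 9000 MTU minus IP and ICMP headers; -d forbids fragmentation
vmkping -I vmk1 -d -s 8972 192.168.10.10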
Configuring an All Flash Array
The array is delivered to customers with drive groups already formed, so no steps are required to combine the drives into a single structure; it is enough to create volumes of the required size and quantity.
For convenience, there is a batch-creation feature that creates several volumes of a given size at once. By default, "thin" volumes are created, since this allows more rational use of the available storage space (including thanks to Space Reclamation support). In terms of performance, the difference between "thin" and "thick" volumes does not exceed 1%. However, if you want to "squeeze all the juice" out of the array, you can always convert any "thin" volume to a "thick" one. Keep in mind that this operation is irreversible.
Then it remains to “publish” the created volumes and set access rights to them from the hosts using ACLs (IP addresses for iSCSI and WWPN for FC) and physical separation by array ports. For iSCSI models, this is done by creating a Target.
For FC models, publishing occurs by creating a LUN for each array port.
To speed up the configuration process, hosts can be combined into groups. Moreover, if a multiport FC HBA is used on a host (which is most often the case in practice), the system automatically determines that the ports of such an HBA belong to a single host, since their WWPNs differ by one digit. Batch creation of Targets/LUNs is also supported for both interfaces.
An important note for the iSCSI interface: to increase performance, create several targets per volume right away, since the queue depth on a target cannot be changed and will effectively become a bottleneck.
Setting up ESXi hosts
On the ESXi host side, the basic configuration follows the expected scenario. Procedure for an iSCSI connection (a command-line sketch follows the list):
- Add a Software iSCSI Adapter (not required if it has already been added, or if a Hardware iSCSI Adapter is used);
- Create a vSwitch through which the iSCSI traffic will pass, and add a physical uplink and a VMkernel port to it;
- Add the array addresses to Dynamic Discovery;
- Create a Datastore.
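A rough sketch of the same steps from the ESXi shell is shown below. All names are placeholders: vSwitch1, Storage01, vmk1, vmnic2, the vmhba65 adapter name and the IP addresses must be adjusted to your environment, and the MTU options apply only if Jumbo Frames are used.

# Enable the Software iSCSI Adapter (skip if a hardware adapter is used)
esxcli iscsi software set --enabled=true
# Dedicated vSwitch with one uplink; MTU 9000 for Jumbo Frames
esxcli network vswitch standard add --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic2
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
# Port group and VMkernel port for iSCSI traffic
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=Storage01
esxcli network ip interface add --interface-name=vmk1 --portgroup-name=Storage01 --mtu=9000
esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=192.168.10.21 --netmask=255.255.255.0 --type=static
# Bind the VMkernel port to the software iSCSI adapter, add the array to Dynamic Discovery and rescan
esxcli iscsi networkportal add --adapter=vmhba65 --nic=vmk1
esxcli iscsi adapter discovery sendtarget add --adapter=vmhba65 --address=192.168.10.10
esxcli storage core adapter rescan --adapter=vmhba65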
Some important notes:
- In the general case you can, of course, use an existing vSwitch, but a dedicated vSwitch makes managing the host settings much easier.
- Management and iSCSI traffic must be separated onto different physical links and/or VLANs to avoid performance issues.
- The IP addresses of the VMkernel ports and the corresponding All Flash array ports must be within the same subnet, again for performance reasons.
- To ensure fault tolerance according to VMware rules, the vSwitch must have at least two physical uplinks.
- If Jumbo Frames are used, the MTU must be changed for both the vSwitch and the VMkernel port.
- It is also worth recalling that, according to VMware recommendations, Teaming and Failover must be configured for the physical adapters used for iSCSI traffic. In particular, each VMkernel port should work through only one uplink, with the second uplink set to unused mode. For fault tolerance, you need to add two VMkernel ports, each working through its own uplink:
| VMkernel Adapter (vmk#) | Physical Network Adapter (vmnic#) |
|---|---|
| vmk1 (Storage01) | Active Adapters: vmnic2; Unused Adapters: vmnic3 |
| vmk2 (Storage02) | Active Adapters: vmnic3; Unused Adapters: vmnic2 |
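The same Teaming and Failover policy can be applied from the command line; a minimal sketch for the two port groups in the table (port group and vmnic names are the placeholders used above, and uplinks listed neither as active nor as standby end up in unused mode):

esxcli network vswitch standard portgroup policy failover set --portgroup-name=Storage01 --active-uplinks=vmnic2
esxcli network vswitch standard portgroup policy failover set --portgroup-name=Storage02 --active-uplinks=vmnic3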
Fibre Channel connections do not require any preliminary steps; you can create a Datastore right away.
After creating the Datastore, make sure that the Round Robin policy is used for the paths to the Target/LUN, as it is the most performant.
By default, VMware configures this policy as: 1000 requests through the first path, the next 1000 requests through the second path, and so on. Such interaction between a host and a two-controller array will be unbalanced. Therefore, we recommend setting the Round Robin policy IOPS limit to 1 via Esxcli/PowerCLI.
Parameters for Esxcli:
- List available LUNs
esxcli storage nmp device list
- Copy Device Name
- Edit Round Robin Policy
esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device="Device_ID"
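If the host sees many AccelStor volumes, the same setting can be applied in a loop; a sketch that walks over every naa.* device reported by NMP (narrow the filter if not all of these devices belong to the array):

for dev in $(esxcli storage nmp device list | grep -E '^naa\.'); do
    # Switch the device to Round Robin and set the IOPS limit to 1
    esxcli storage nmp device set --device=$dev --psp=VMW_PSP_RR
    esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops=1 --device=$dev
done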
Most modern applications are designed to exchange large data packets in order to maximize bandwidth utilization and reduce CPU load. Therefore, by default ESXi passes I/O requests to the storage device in chunks of up to 32767 KB. However, for a number of scenarios, exchanging smaller portions is more productive. For AccelStor arrays, these scenarios are:
- The VM uses UEFI instead of Legacy BIOS
- vSphere Replication is used
For such scenarios, it is recommended to change the value of the Disk.DiskMaxIOSize parameter to 4096.
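Disk.DiskMaxIOSize is a host-wide advanced setting, so it can be changed either in the vSphere Client (Advanced System Settings) or from the ESXi shell:

# Limit the maximum I/O size passed down to storage to 4096 KB (default 32767 KB)
esxcli system settings advanced set --option=/Disk/DiskMaxIOSize --int-value=4096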
For iSCSI connections, it is recommended to change the Login Timeout parameter to 30 (the default is 5) to improve connection stability, and to disable the DelayedAck packet acknowledgment delay. Both options are available in the vSphere Client: Host → Configure → Storage → Storage Adapters → Advanced Options for the iSCSI adapter.
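Both parameters can also be set from the ESXi shell for the software iSCSI adapter; in this sketch vmhba65 is a placeholder (list adapters with esxcli iscsi adapter list, and verify the exact parameter keys with esxcli iscsi adapter param get):

esxcli iscsi adapter param set --adapter=vmhba65 --key=LoginTimeout --value=30
esxcli iscsi adapter param set --adapter=vmhba65 --key=DelayedAck --value=false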
A rather subtle point is the number of volumes used for datastores. Clearly, for ease of management there is a temptation to create one large volume for the entire capacity of the array. However, having several volumes, and accordingly several datastores, has a beneficial effect on overall performance (more on queues a little later in the text). Therefore, we recommend creating at least two volumes.
Until quite recently, VMware advised limiting the number of virtual machines per datastore, again in order to get the highest possible performance. Now, however, especially with the spread of VDI, this problem is no longer as acute. But that does not change the old rule: distribute I/O-intensive virtual machines across different datastores. And to determine the optimal number of virtual machines per volume, there is nothing better than load testing the All Flash AccelStor array within your own infrastructure.
Setting up virtual machines
When setting up virtual machines, there are no special requirements, or rather, they are quite ordinary:
- Using the highest possible VM version (compatibility)
- Set the RAM size carefully when virtual machines are densely packed, for example in VDI (by default, a swap file commensurate with the RAM size is created at startup; it consumes usable capacity and affects final performance)
- Use the most I/O-efficient adapter types: VMXNET 3 for the network and PVSCSI for SCSI
- Use the Thick Provision Eager Zeroed disk type for maximum performance and Thin Provisioning for the most efficient use of storage space (see the vmkfstools sketch after this list)
- If possible, limit the operation of non-I/O-critical machines using Virtual Disk Limit
- Be sure to install VMware Tools
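For reference, an Eager Zeroed disk can also be created from the ESXi shell with vmkfstools; the size, datastore path and file name below are placeholders:

# Create a 100 GB eager-zeroed thick virtual disk
vmkfstools --createvirtualdisk 100G --diskformat eagerzeroedthick /vmfs/volumes/Datastore01/vm01/vm01_1.vmdk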
Notes on Queues
The queue (Outstanding I/Os) is the number of I/O requests (SCSI commands) waiting to be processed at any given time by a specific device/application. When a queue overflows, QFULL errors are generated, which ultimately translates into higher latency. With disk (spindle) storage systems, in theory, the deeper the queue, the higher their performance. However, you should not overdo it, because it is easy to run into QFULL. With All Flash systems, on the one hand, everything is somewhat simpler: the array has latencies that are orders of magnitude lower, so most often there is no need to tune the queue sizes separately. On the other hand, in some use cases (a strong skew in I/O requirements between specific virtual machines, maximum-performance tests, and so on) it is important, if not to change the queue parameters, then at least to understand which figures can be achieved and, most importantly, how.
On the AccelStor All Flash array itself there are no queue limits for volumes or I/O ports. If necessary, even a single volume can receive all the resources of the array. The only queue limitation applies to iSCSI targets. It is for this reason that we pointed out above the need to create several (ideally up to 8) targets per volume to overcome this limit. We also repeat that AccelStor arrays are very high-performance solutions; therefore, you should use all the interface ports of the system to achieve maximum speed.
From the ESXi host side, the situation is completely different. The host itself applies the practice of equal access to resources for all participants. Therefore, there are separate IO queues to the guest OS and HBA. The queues to the guest OS are combined from the queues to the virtual SCSI adapter and the virtual disk:
The queue to the HBA depends on the specific type / vendor:
The overall performance of the virtual machine will be determined by the lowest Queue Depth limit value among the host components.
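To see which limits are actually in effect on a host, the per-device values can be inspected, and if necessary adjusted, from the ESXi shell; naa.xxxxxxxx below is a placeholder for a device identifier taken from esxcli storage nmp device list:

# Shows, among other fields, "Device Max Queue Depth" and "No of outstanding IOs with competing worlds"
esxcli storage core device list --device=naa.xxxxxxxx
# Raise the number of outstanding I/Os allowed when several VMs share the datastore (DSNRO)
esxcli storage core device set --device=naa.xxxxxxxx --sched-num-req-outstanding=64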
Using these values, we can estimate the performance achievable in a particular configuration. For example, suppose we want to know the theoretical performance of a virtual machine (independent of block size) with a latency of 0.5 ms. Then: IOPS = (1000 / latency) * Outstanding I/Os (Queue Depth limit)
Examples
Example 1
- FC Emulex HBA Adapter
- One VM per datastore
- VMware Paravirtual SCSI Adapter
Here Queue Depth limit is defined by Emulex HBA. Therefore IOPS = (1000/0.5)*32 = 64K
Example 2
- VMware iSCSI Software Adapter
- One VM per datastore
- VMware Paravirtual SCSI Adapter
Here the Queue Depth limit is already defined by the Paravirtual SCSI Adapter. Therefore IOPS = (1000/0.5)*64 = 128K
Top models of AccelStor All Flash arrays are capable of noticeably higher figures than those in the examples above.
As a result, with the correct configuration of all the described components of the virtual data center, you can get very impressive results in terms of performance.
(Performance chart: 4K Random, 70% Read / 30% Write)
In fact, the real world is much more complex than a simple formula can describe. A single host always runs multiple virtual machines with different configurations and I/O requirements. And I/O processing is handled by the host processor, whose power is not infinite. So unlocking the full potential of the same array may require several hosts.
Source: habr.com