AERODISK ENGINE N2 Storage Crash Test, Strength Test

Hi all! With this article, AERODISK opens its blog on Habr. Hurrah, comrades!

Previous articles on Habr covered the architecture and basic configuration of our storage systems. In this article we look at a question that has not been covered before but is asked often: the fault tolerance of AERODISK ENGINE storage systems. Our team will do everything it can to make an AERODISK storage system stop working, that is, to break it.

As it happens, articles about the history of our company, about our products, and about an example of a successful deployment are already on Habr, for which many thanks to our partners, TS Solution and Softline.

So I will not practice my copy-paste skills here and will simply give links to the originals of these articles:

I also want to share some good news. But I'll start, of course, with a problem. As a young vendor, on top of other costs, we constantly run into the fact that many engineers and administrators simply do not know how to operate our storage system properly.
Clearly, managing most storage systems looks roughly the same from the administrator's point of view, but each manufacturer has its own quirks. We are no exception.

So, to make it easier to train IT specialists, we decided to dedicate this year to free training. To that end we are opening a network of AERODISK Competence Centers in many large Russian cities, where any interested technical specialist can take a course completely free of charge and receive a certificate in administering AERODISK ENGINE storage systems.

In each Competence Center we will install a full demo stand consisting of an AERODISK storage system and a physical server, on which our instructor will run in-person training. We will publish the schedule of each Competence Center as it opens. For now, the center in Nizhny Novgorod is already open and Krasnodar is next in line. You can sign up for training using the links below. Here is what is currently known about cities and dates:

  • Nizhny Novgorod (ALREADY OPEN - you can sign up here: https://aerodisk.promo/nn/);
    Until April 16, 2019, you can visit the center at any time during working hours, and on April 16, 2019, a large training course will be held.
  • Krasnodar (OPENING SOON - sign up here: https://aerodisk.promo/krsnd/);
    From April 9 to April 25, 2019, you can visit the center at any time during working hours, and on April 25, 2019, a large training course will be held.
  • Ekaterinburg (COMING SOON, follow the announcements on our website or on Habr);
    May-June 2019.
  • Novosibirsk (follow the announcements on our website or on Habr);
    October 2019.
  • Krasnoyarsk (follow the announcements on our website or on Habr);
    November 2019.

And, of course, if Moscow is not far from you, you can visit our Moscow office at any time and take the same training.

That's it. Done with marketing, on to the technology!

On Habr, we will regularly publish technical articles about our products, load tests, comparisons, usage notes and interesting deployments.

ACHTUNG! After reading this article you might say: well, of course the vendor tests itself so that everything goes off with a bang, in greenhouse conditions, and so on. My answer: nothing of the kind! Unlike our foreign competitors, we are right here, close to you, and you can always come to us (in Moscow or at any Competence Center) and test our storage system any way you like. So there is no point in our bending the results to fit an ideal picture of the world, because we are very easy to check. For those who are too lazy or too busy to travel, we can organize remote testing; we have a dedicated lab for this. Get in touch.

ACHTUNG-2! This is not a load test, because here we care only about fault tolerance. In a couple of weeks we will prepare a more powerful stand and run load tests on the storage system, publishing the results here (requests for tests are welcome, by the way).

So, let's go break things.

Test stand

Our stand consists of the following hardware:

  • 1 x Aerodisk Engine N2 storage system (2 controllers, 64 GB cache, 8 x FC 8 Gb/s ports, 4 x Ethernet 10 Gb/s SFP+ ports, 4 x Ethernet 1 Gb/s ports), with the following disks installed:
    • 4 x SAS SSD drives, 900 GB;
    • 12 x SAS 10k drives, 1.2 TB;
  • 1 x physical server with Windows Server 2016 (2 x Xeon E5 2667 v3, 96 GB RAM, 2 x FC 8 Gb/s ports, 2 x Ethernet 10 Gb/s SFP+ ports);
  • 2 x SAN 8G switches;
  • 2 x LAN 10G switches.

We connected the server to the storage system through the switches, over both FC and 10G Ethernet. The diagram of the stand is below.

[Figure: diagram of the test stand]

The components we need, such as MPIO and the iSCSI initiator, are installed on Windows Server.
Zoning is configured on the FC switches, the corresponding VLANs are configured on the LAN switches, and MTU 9000 is set on the storage ports, the switches and the host (our documentation describes how to do all this, so we will not repeat it here).
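
A common way to confirm that MTU 9000 really works end to end is to send a ping that is not allowed to fragment and whose payload exactly fills a jumbo frame. The sketch below only does the arithmetic and builds the Windows command line; it is an illustration, not a step from the AERODISK documentation, and the target address is a documentation-range placeholder standing in for a storage iSCSI port.

```python
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a given MTU.
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Return the biggest ping payload that still fits into a single packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def windows_ping_command(target_ip: str, mtu: int = 9000) -> str:
    """Build a Windows ping command: -f sets Don't Fragment, -l sets the payload size."""
    return f"ping -f -l {max_icmp_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder address (assumption), not a value from the article.
    print(windows_ping_command("192.0.2.10"))
```

If the command built this way gets replies, every hop on the path accepts 9000-byte packets; a "needs to be fragmented" error points to a port or switch still running a smaller MTU.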

Test methodology

The crash test plan is as follows:

  • FC and Ethernet port failure check.
  • Power failure check.
  • Controller failure check.
  • Check for disk failure in a group/pool.

All tests will be performed under a synthetic load generated with IOmeter. In parallel, we will run the same tests while copying large files to the storage system.

The IOmeter configuration is as follows:

  • Read/Write - 70/30
  • Block - 128k (we decided to hammer the storage system with large blocks)
  • Number of threads - 128 (which is close to a production workload)
  • Full Random
  • Number of Workers - 4 (2 for FC, 2 for iSCSI)

The test has the following goals:

  1. Make sure that the synthetic load and the copy process are not interrupted and produce no errors under the various failure modes.
  2. Make sure that the switchover of ports, controllers and so on is sufficiently automated and requires no administrator action when a failure occurs (this applies to failover; failback is, of course, a separate matter).
  3. Make sure that the information shown in the logs is correct.

Host and storage preparation

On the storage system, we configured block access over the FC and Ethernet ports (FC and iSCSI, respectively). The guys from TS Solution described how to do this in detail in a previous article (https://habr.com/ru/company/tssolution/blog/432876/). And, of course, the manuals and training courses are still there.

We set up a hybrid group using all the drives we have. Two SSDs were added as cache and two SSDs as an additional storage tier (Online-tier). We grouped the 12 SAS 10k disks into RAID-60P (triple parity) so that we could test the failure of three disks in a group at once. One disk was left as a hot spare for auto-replacement.

We connected two LUNs (one via FC, one via iSCSI).

Both LUNs are owned by the Engine-0 controller.

Let's start the test

We start IOmeter with the configuration above.

We record a throughput of 1.8 GB/s and a latency of 3 milliseconds. There are no errors (Total Error Count is 0).
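
To put the throughput figure in terms of operations per second, it can be divided by the block size from the IOmeter configuration. The sketch below is a back-of-the-envelope calculation, assuming that 1.8 GB/s means decimal gigabytes and that 128k means 128 KiB; the article does not state either, so treat the result as approximate.

```python
from typing import Tuple

BLOCK_BYTES = 128 * 1024  # 128k block from the IOmeter configuration (assumed KiB)
READ_SHARE = 0.70         # 70/30 read/write mix


def iops_from_throughput(throughput_bytes_per_s: float,
                         block_bytes: int = BLOCK_BYTES) -> float:
    """Convert a throughput figure into I/O operations per second."""
    return throughput_bytes_per_s / block_bytes


def split_read_write(total: float, read_share: float = READ_SHARE) -> Tuple[float, float]:
    """Split a total (IOPS or bytes per second) into read and write parts."""
    return total * read_share, total * (1.0 - read_share)


if __name__ == "__main__":
    throughput = 1.8e9  # assumption: decimal gigabytes per second
    iops = iops_from_throughput(throughput)
    reads, writes = split_read_write(iops)
    print(f"{iops:.0f} IOPS total, {reads:.0f} reads/s, {writes:.0f} writes/s")
```

Under these assumptions the stand handles on the order of fourteen thousand 128k operations per second, roughly 70% of them reads.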

At the same time, we start copying two large 100 GB files from the host's local drive C to the FC and iSCSI LUNs (drives E and G in Windows), using other interfaces.

[Screenshot: file copy progress, with the copy to the FC LUN at the top and the copy to the iSCSI LUN at the bottom]

Test #1: Disabling the I/O ports

We approach the storage system from behind :) and with a flick of the wrist pull all the FC and 10G Ethernet cables out of the Engine-0 controller. It is as if a cleaner with a mop walked by and decided to wash the floor right where the cables were dangling (that is, the controller keeps running, but its I/O ports are dead).

We look at IOmeter and the file copy. Throughput dropped to 0.5 GB/s but quickly returned to its previous level (in about 4-5 seconds). There are no errors.
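
A dip like this can also be measured from a log of per-second throughput samples instead of by eye. The sketch below is a minimal illustration: the function name, the 90% threshold and the sample values are assumptions made for the example, not measurements from our stand.

```python
from typing import List


def longest_dip_seconds(samples_mb_s: List[float], baseline_mb_s: float,
                        threshold: float = 0.9, interval_s: float = 1.0) -> float:
    """Return the duration of the longest run of samples below threshold * baseline."""
    limit = baseline_mb_s * threshold
    longest = 0
    current = 0
    for value in samples_mb_s:
        if value < limit:
            current += 1                     # still inside a dip
            longest = max(longest, current)
        else:
            current = 0                      # throughput recovered
    return longest * interval_s


if __name__ == "__main__":
    # Made-up per-second samples in MB/s, shaped like a short failover dip.
    samples = [1800, 1800, 500, 600, 900, 1400, 1750, 1800, 1800]
    print(longest_dip_seconds(samples, baseline_mb_s=1800))  # prints 4.0
```

With the illustrative samples above, four consecutive readings fall below 90% of the baseline, so the dip lasts 4 seconds.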

The file copy did not stop. There is a dip in speed, but it is not critical at all (from 840 MB/s down to 720 MB/s).

We look at the storage logs and see messages about the ports becoming unavailable and the group being moved automatically.

The information panel also tells us that not all is well with the FC ports.

The storage system survived the I/O port failure successfully.

Test #2: Disabling the storage controller

Almost immediately (having first plugged the cables back into the storage system), we decided to finish off the storage system by pulling a controller out of the chassis.

Again we approach the storage system from behind (we liked it) :) and this time pull out the Engine-1 controller, which at that moment owns the RDG (the group moved to it in the previous test).

In IOmeter the picture is as follows. I/O stopped for about 5 seconds. No errors accumulate.

After 5 seconds, I/O resumed at about the same throughput, but with a latency of 35 milliseconds (latency returned to normal after a couple of minutes). As the screenshots show, the Total Error Count value is 0, that is, there were no write or read errors.

We look at our file copy. It did not stop; there was a slight performance dip, but overall everything returned to the same ~800 MB/s.

We go to the storage system and see the information panel complaining that the Engine-1 controller is unavailable (of course, we killed it).

We also see a similar entry in the logs.

The storage system also survived the controller failure successfully.

Test #3: Turning off the power supply

Just in case, we restarted the file copy but did not stop IOmeter.
We pull out the power supply unit.

Another alert appeared in the storage system's information panel.

In the sensors menu we also see that the sensors associated with the pulled power supply have turned red.

The storage system keeps working. The failure of a power supply unit does not affect its operation in any way: from the host's point of view, the copy speed and the IOmeter figures remained unchanged.

The power failure test passed successfully.

Before the final test, we decided to bring the storage system back to life a little: we put the controller and the power supply unit back and tidied up the cables, which the storage system happily reported with green icons in its health panel.

Test #4: Failure of three disks in a group

Before this test, we performed one more preparatory step. The ENGINE storage system offers a very useful feature: different rebuild policies. TS Solution has written about this feature before, but let's recall the essence. The storage administrator can set the resource allocation priority for a rebuild: towards I/O performance (a longer rebuild, but no performance dip), towards rebuild speed (a faster rebuild, but reduced performance), or balanced. Since storage performance during a disk group rebuild is always a headache for the administrator, we will test the policy that favors I/O performance at the expense of rebuild speed.
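
The trade-off between the three policies can be pictured with a toy model. The sketch below is not how ENGINE schedules rebuilds: the policy names, the bandwidth shares and the numbers in the usage example are all hypothetical and chosen only to show the direction of the effect.

```python
from enum import Enum
from typing import Tuple


class RebuildPolicy(Enum):
    """Hypothetical names for the three policies described in the text."""
    IO_PRIORITY = "io_priority"            # favor host I/O, slower rebuild
    BALANCED = "balanced"
    REBUILD_PRIORITY = "rebuild_priority"  # favor rebuild speed, slower host I/O


# Assumed fraction of backend bandwidth given to the rebuild (illustrative only).
REBUILD_SHARE = {
    RebuildPolicy.IO_PRIORITY: 0.1,
    RebuildPolicy.BALANCED: 0.5,
    RebuildPolicy.REBUILD_PRIORITY: 0.9,
}


def estimate(policy: RebuildPolicy, backend_mb_s: float,
             rebuild_mb: float) -> Tuple[float, float]:
    """Return (rebuild time in seconds, bandwidth left for host I/O in MB/s)."""
    share = REBUILD_SHARE[policy]
    rebuild_seconds = rebuild_mb / (backend_mb_s * share)
    host_mb_s = backend_mb_s * (1.0 - share)
    return rebuild_seconds, host_mb_s


if __name__ == "__main__":
    # 1000 MB/s of backend bandwidth and 100000 MB to rebuild are made-up inputs.
    for policy in RebuildPolicy:
        seconds, host = estimate(policy, backend_mb_s=1000.0, rebuild_mb=100_000.0)
        print(f"{policy.value}: rebuild ~{seconds:.0f} s, host I/O ~{host:.0f} MB/s")
```

In this model the I/O-priority policy leaves the most bandwidth to the host and takes the longest to rebuild, which is the behavior we chose to test.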

Now let's check disk failures. We again start writing to the LUNs (files and IOmeter). Since our group has triple parity (RAID-60P), the system must survive the failure of three disks. After the failure, auto-replacement should kick in: the hot spare should take the place of one of the failed disks in the RDG, and a rebuild should start on it.
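
The expected behavior can be written down as a small state model. This is a simplified sketch, not the ENGINE implementation: it treats the group as a single triple-parity set (ignoring how RAID-60P stripes data internally), and the class and field names are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class RaidGroup:
    """Toy model of a parity group with hot spares."""
    parity: int          # number of simultaneously missing members the group tolerates
    hot_spares: int = 0  # spare disks waiting outside the group
    failed: int = 0      # failed members not yet replaced
    rebuilding: int = 0  # spares that joined the group and are still being rebuilt

    def fail_disks(self, count: int) -> None:
        """Mark the given number of member disks as failed."""
        self.failed += count

    def is_online(self) -> bool:
        """Data stays available while missing members do not exceed the parity count."""
        return self.failed + self.rebuilding <= self.parity

    def auto_replace(self) -> int:
        """Put available spares in place of failed members; return how many were used."""
        used = min(self.hot_spares, self.failed)
        self.hot_spares -= used
        self.failed -= used
        self.rebuilding += used
        return used


if __name__ == "__main__":
    group = RaidGroup(parity=3, hot_spares=1)
    group.fail_disks(3)
    print("online after 3 failures:", group.is_online())
    group.auto_replace()
    print("failed:", group.failed, "rebuilding:", group.rebuilding,
          "spares left:", group.hot_spares)
```

With triple parity and one hot spare, pulling three disks should leave the group online with two members still failed and one being rebuilt onto the spare.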

Let's begin. First, through the storage interface, we highlight the disks we want to pull (so as not to pull the hot spare by mistake).

We check the indication on the hardware. Everything is OK: the three disks are highlighted.

And we pull out these three disks.

Let's see what's on the host. And there... nothing special happened.

The copy speed (it is higher than at the beginning because the cache has warmed up) and the IOmeter figures change little when the disks are pulled and the rebuild starts (within 5-10%).

Let's see what's on the storage system.

In the group status we see that the rebuild has started and is close to completion.

In the RDG structure view, you can see that two disks are in red status and one has already been replaced. The hot spare is gone: it replaced the third failed disk. The rebuild took several minutes, file writes did not stop when the three disks failed, and I/O performance did not change much.

The disk failure test definitely passed successfully.

Conclusion

At this point we decided to stop abusing the storage system. To sum up:

  • FC port failure check - passed
  • Ethernet port failure check - passed
  • Controller failure check - passed
  • Power failure check - passed
  • Disk failure in a group/pool check - passed

None of the failures stopped the writes or caused errors in the synthetic load. Of course, there was a performance dip (we know how to overcome it, and will do so soon), but since it lasts only seconds, it is quite acceptable. Conclusion: fault tolerance worked as expected for all AERODISK storage components, and there are no single points of failure.

Obviously, we cannot test every failure scenario in one article, but we tried to cover the most common ones. So please send us your comments, wishes for future publications and, of course, fair criticism. We will be glad to discuss (or better yet, come to the training; just in case, I repeat the schedule below). Until the next tests!

  • Nizhny Novgorod (ALREADY OPEN - you can sign up here: https://aerodisk.promo/nn/);
    Until April 16, 2019, you can visit the center at any time during working hours, and on April 16, 2019, a large training course will be held.
  • Krasnodar (OPENING SOON - sign up here: https://aerodisk.promo/krsnd/);
    From April 9 to April 25, 2019, you can visit the center at any time during working hours, and on April 25, 2019, a large training course will be held.
  • Ekaterinburg (COMING SOON, follow the announcements on our website or on Habr);
    May-June 2019.
  • Novosibirsk (follow the announcements on our website or on Habr);
    October 2019.
  • Krasnoyarsk (follow the announcements on our website or on Habr);
    November 2019.

Source: habr.com
