Why It's Important to Validate Software on Your High-Availability Storage (99.9999%)

Which firmware version is the most “correct” and “stable”? If a storage system promises 99.9999% availability, does that mean it will run smoothly even without software updates? Or, on the contrary, should you always install the latest firmware to get maximum availability? We will try to answer these questions based on our own experience.

A small introduction

We all understand that every piece of software, whether an operating system or a device driver, often contains flaws, bugs and other “features” that may never show up before the end of the equipment's service life, or that surface only under certain conditions. The number and severity of such issues depend on the complexity (functionality) of the software and on the quality of testing during its development.

Users often either stay on the factory firmware (the famous “if it works, don't touch it”) or always install the latest version (assuming that the latest is the most stable). We take a different approach: we read the release notes for everything used in mClouds equipment and carefully select the appropriate firmware for each piece of equipment.

We arrived at this approach through experience, as they say. Using an incident from our own operations, we will explain why the promised 99.9999% reliability of a storage system means nothing if you do not keep track of software updates and release notes in a timely manner. Our case is relevant to users of storage systems from any vendor, since a similar situation can occur with any manufacturer's hardware.

Choosing a New Storage System

At the end of last year, an interesting storage system was added to our infrastructure: the junior model of the IBM FlashSystem 5000 line, which at the time of purchase was called Storwize V5010e. It is now sold as FlashSystem 5010, but it is in fact the same hardware with the same Spectrum Virtualize software inside.

A single management system is, by the way, the main distinguishing feature of IBM FlashSystem. On the entry-level models it is practically the same as on the more powerful ones. Choosing a particular model only determines the hardware base, whose characteristics make it possible to use certain functionality or provide a higher level of scalability. The software identifies the hardware and provides the functionality that is necessary and sufficient for that platform.

IBM FlashSystem 5010

Briefly about our model, the 5010. It is an entry-level dual-controller block storage system. It can hold NL-SAS, SAS and SSD drives. NVMe drives are not supported, since this model is positioned for tasks that do not require NVMe performance.

The storage system was purchased to hold archival information and data that is not accessed frequently. The standard feature set was therefore enough for us: tiering (Easy Tier) and thin provisioning. Performance of 1000-2000 IOPS on NL-SAS disks also suited us well.

Our experience: how we failed to update the firmware on time

Now about the software update itself. At the time of purchase, the system was running a slightly outdated version of the Spectrum Virtualize software, namely 8.2.1.3.

We studied the firmware release notes and planned to update to 8.2.1.9. Had we been a little quicker, this article would not exist: the bug would not have occurred on the more recent firmware. However, for various reasons, the update of this system was delayed.

As a result, a slight delay in updating led to an extremely unpleasant situation, the one described at this link: https://www.ibm.com/support/pages/node/6172341

Yes, the APAR (Authorized Program Analysis Report) HU02104 applied to exactly that firmware version. It manifests as follows: under load, in certain circumstances, the cache starts to overflow, and the system then enters a protective mode in which it disables I/O for the pool. In our case it looked like three disks of a RAID 6 group going offline. The outage lasts 6 minutes, after which access to the volumes in the pool is restored.
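To put the 6-minute outage in perspective, here is a small back-of-the-envelope calculation of how much downtime per year a 99.9999% availability figure allows. It is only an illustrative sketch: the function name is ours, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, leap years ignored


def downtime_budget_seconds(availability_percent: float) -> float:
    """Maximum downtime per year that still meets the availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


budget = downtime_budget_seconds(99.9999)
outage = 6 * 60  # the 6-minute pool outage, in seconds

print(f"Yearly downtime budget at 99.9999%: {budget:.1f} s")
print(f"One 6-minute outage equals {outage / budget:.1f} years of that budget")
```

In other words, "six nines" leaves roughly half a minute of downtime per year, so a single occurrence of this bug uses up more than a decade of the promised budget.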

For those who are not familiar with the structure and naming of logical entities in IBM Spectrum Virtualize, here is a brief description.

The structure of the logical elements of the storage system

Disks are combined into groups called MDisks (Managed Disks). An MDisk can be a classic RAID (0, 1, 10, 5, 6) or a virtualized one, DRAID (Distributed RAID). DRAID increases array performance, because all disks in the group are used, and reduces rebuild time, because only certain blocks need to be restored rather than all the data from the failed disk.

Distribution of data blocks across disks when using Distributed RAID (DRAID) in RAID-5 mode.

And this diagram shows the logic of the DRAID rebuild in case of failure of one disk:

The logic of the DRAID rebuild in case of failure of one disk

Next, one or more MDisks form a so-called pool. Within a single pool, it is not recommended to use MDisks with different RAID/DRAID levels on disks of the same type. We will not go into detail here, because we plan to cover this in one of the following articles. The pool, in turn, is divided into volumes, which are presented to hosts over one block access protocol or another.

So, as a result of the situation described in APAR HU02104, the logical failure of three disks made the MDisk inoperable, which in turn caused the failure of the pool and the corresponding volumes.
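The chain "disks, MDisk, pool, volumes" and the way a failure propagates along it can be sketched with a toy model. This is not how Spectrum Virtualize is implemented; the class names, object names and the rule "a pool is offline if any of its MDisks is offline" are simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Simultaneous drive failures that a classic RAID level can survive.
FAULT_TOLERANCE = {"raid0": 0, "raid5": 1, "raid6": 2}


@dataclass
class MDisk:
    name: str
    raid_level: str
    offline_drives: int = 0

    def is_online(self) -> bool:
        # The array survives only while failures stay within its tolerance.
        return self.offline_drives <= FAULT_TOLERANCE[self.raid_level]


@dataclass
class Pool:
    name: str
    mdisks: List[MDisk]
    volumes: List[str] = field(default_factory=list)

    def is_online(self) -> bool:
        # Simplifying assumption: one offline MDisk takes the whole pool offline.
        return all(mdisk.is_online() for mdisk in self.mdisks)

    def offline_volumes(self) -> List[str]:
        return [] if self.is_online() else list(self.volumes)


# Hypothetical names; three drives of a RAID 6 group logically offline.
mdisk = MDisk("mdisk0", "raid6", offline_drives=3)
pool = Pool("pool0", [mdisk], ["vol_archive_1", "vol_archive_2"])

print(pool.is_online())
print(pool.offline_volumes())
```

With two offline drives the same RAID 6 MDisk would stay online; the third one exceeds the tolerance, and every volume in the pool becomes unavailable.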

Since these systems are quite "smart", they can be connected to the IBM Storage Insights cloud-based monitoring service, which automatically submits a service request to IBM support when a problem occurs. A ticket is created, and IBM specialists run diagnostics remotely and contact the system's user.

Thanks to this, the issue was resolved quite quickly, and support promptly recommended updating our system to the previously selected firmware 8.2.1.9, in which the problem had already been fixed by that time. The corresponding release notes confirm this.
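A check like "does this firmware version already contain the fix?" is easy to get wrong if version strings are compared as text. Below is a minimal sketch that compares them numerically. The thresholds 8.2.1.9 and 8.3.1.0 are the ones named in this article; the function names and the rule for versions outside the 8.2 stream are our assumptions, so always confirm against the release notes.

```python
from typing import Tuple

Version = Tuple[int, ...]

# First versions with the HU02104 fix, as named in this article.
FIXED_IN_8_2: Version = (8, 2, 1, 9)
FIXED_IN_8_3: Version = (8, 3, 1, 0)


def parse_version(text: str) -> Version:
    """Turn '8.2.1.3' into (8, 2, 1, 3) so that parts compare as numbers."""
    return tuple(int(part) for part in text.split("."))


def has_fix(version: str) -> bool:
    v = parse_version(version)
    if v[:2] == (8, 2):
        return v >= FIXED_IN_8_2
    # Assumption: every other stream is compared against 8.3.1.0.
    return v >= FIXED_IN_8_3


for candidate in ("8.2.1.3", "8.2.1.9", "8.2.1.11", "8.3.0.1", "8.3.1.0"):
    print(candidate, "has fix" if has_fix(candidate) else "lacks fix")
```

Note that 8.3.0.1 has a higher version number than 8.2.1.9 and still lacks the fix, and that 8.2.1.11 is newer than 8.2.1.9 even though a plain string comparison would say otherwise.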

Results and our recommendations

As the saying goes, "all's well that ends well." The firmware bug did not turn into a serious problem: the servers were restored as quickly as possible and without data loss. Some customers had to restart their virtual machines, but overall we had been prepared for worse consequences, since we make daily backups of all infrastructure elements and client machines.

This confirmed for us that even reliable systems with a promised availability of 99.9999% require attention and timely maintenance. Based on this incident, we drew a number of conclusions and would like to share our recommendations:

  • Be sure to keep an eye on updates, study the Release Notes for fixes for potentially critical issues, and make scheduled updates in a timely manner.

    This is an organizational and rather obvious point that, it would seem, hardly needs emphasizing. Yet it is quite easy to stumble on this "level ground". In fact, it was exactly this point that caused the troubles described above. Draw up the update schedule carefully and follow it no less carefully. This point is mostly about discipline.

  • It is always best to keep the system on a current software version. Note that the current version is the one with the later release date, not necessarily the one with the higher version number.

    For example, IBM keeps at least two software releases current for its storage systems. At the time of writing, these are 8.2 and 8.3. Updates for 8.2 come out first, and a similar update for 8.3 usually follows with a slight delay.

    Release 8.3 has a number of functional advantages, for example the ability to expand an MDisk (in DRAID mode) by adding one or more new disks (available since version 8.3.1). This is fairly basic functionality, but unfortunately it is not available in 8.2.

  • If for some reason an upgrade is not possible, then for Spectrum Virtualize versions earlier than 8.2.1.9 and 8.3.1.0 (where the bug described above applies), IBM technical support recommends reducing the risk by limiting system performance at the pool level, as shown in the figure below (the screenshot was taken in the Russian-language version of the GUI). The value of 10000 IOPS is given as an example and depends on the characteristics of your system.

IBM storage performance limit

  • Calculate the load on the storage system correctly and avoid overloading it. To do this, you can use the IBM sizer (if you have access to it), the help of partners, or third-party resources. Be sure to understand the load profile of the storage system, because performance in MB/s and IOPS varies greatly depending on at least the following parameters:

    • operation type: read or write,

    • operation block size,

    • the percentage of reads and writes in the total I/O stream.

    The speed of operations is also affected by how data blocks are read: sequentially or in random order. When an application performs multiple data access operations, there is also the concept of dependent operations, which is worth taking into account as well. Performance counter data from the OS, storage system, servers and hypervisors can help you see all of this, as can an understanding of how applications, DBMSs and other "consumers" of disk resources work.

  • And finally, be sure to have up-to-date, working backups. The backup schedule should be based on RPO values that are acceptable to the business, and the integrity of the backups must be checked periodically (quite a few backup software vendors have implemented automated verification in their products) to ensure an acceptable RTO.
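To illustrate the load-profile recommendation in the list above: the same IOPS figure translates into very different throughput depending on the block size, since throughput is IOPS multiplied by block size. The sketch below uses the 2000 IOPS figure mentioned earlier in the article; the block sizes are arbitrary examples, not measurements from our system.

```python
def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Throughput in MiB/s for a given IOPS rate and block size in KiB."""
    return iops * block_size_kib / 1024


# Example block sizes only; real workloads usually mix several sizes.
for block_kib in (4, 64, 256):
    rate = throughput_mib_s(2000, block_kib)
    print(f"{block_kib:>3} KiB blocks at 2000 IOPS: {rate:.1f} MiB/s")
```

This is why an IOPS number on its own says little about whether a system is overloaded: it has to be read together with the block size and the read/write mix.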

Thank you for reading to the end.
We are ready to answer your questions in the comments. We also invite you to subscribe to our Telegram channel, where we hold regular promotions (discounts on IaaS and giveaways of promo codes for up to 100% off VPS), post interesting news and announce new articles on the Habr blog.

Source: habr.com
