Industry Trends in Mass Storage Systems

Today we'll talk about how best to store data in a world where fifth-generation networks, genome scanners and self-driving cars produce more data in a day than all of humanity generated before the industrial revolution.

Our world generates ever more information. Some of it is fleeting and is lost as quickly as it is collected. Some must be kept longer, and some is meant to last "for centuries" - at least that is how it looks from the present. Information flows settle in data centers at such a speed that any new approach, any technology designed to meet this endless "demand," rapidly becomes obsolete.

40 years of distributed storage development

The first network storage systems in the form familiar to us appeared in the 1980s. Many of you have come across NFS (Network File System), AFS (Andrew File System) or Coda. A decade later, fashion and technology changed, and distributed file systems gave way to cluster storage systems based on GPFS (General Parallel File System), CFS (Clustered File Systems) and StorNext. These were built on block storage of classical architecture, on top of which a software layer created a single file system. These and similar solutions are still in use, occupy their niche and remain quite in demand.

At the turn of the millennium, the distributed storage paradigm changed somewhat, and systems with the SN (Shared-Nothing) architecture took the lead. There was a transition from cluster storage to storage on separate nodes - as a rule, classic servers running software that provides reliable storage; HDFS (Hadoop Distributed File System) and GFS (Google File System), for example, are built on such principles.

Closer to 2010, the concepts underlying distributed storage systems increasingly began to be reflected in full-fledged commercial products, such as VMware vSAN, Dell EMC Isilon, and our Huawei OceanStor. Behind the mentioned platforms is no longer a community of enthusiasts, but specific vendors who are responsible for the functionality, support, service maintenance of the product and guarantee its further development. Such solutions are most in demand in several areas.

Telecom operators

Telecom operators are perhaps among the oldest consumers of distributed storage systems. The diagram shows which groups of applications produce the bulk of the data. OSS (Operations Support Systems), MSS (Management Support Services) and BSS (Business Support Systems) are three complementary software layers required for delivering services to subscribers, financial reporting to the provider, and operational support to the operator's engineers.

Often, the data of these layers is strongly intermixed, and to avoid accumulating unnecessary copies, distributed storage is used to hold the entire volume of information coming from the live network. The storage is combined into a common pool, which all services access.

Our calculations show that the transition from classic to distributed storage systems saves up to 70% of the budget, simply by abandoning dedicated hi-end storage arrays in favor of conventional classical-architecture servers (usually x86) running specialized software. Cellular operators have been acquiring such solutions in significant volumes for quite some time. In particular, Russian operators have been using such products from Huawei for more than six years.

Admittedly, a number of tasks cannot be handled by distributed systems - for example, those with elevated performance requirements or a need for compatibility with legacy protocols. But at least 70% of the data that an operator processes can be placed in a distributed pool.

Banking sphere

Any bank runs many diverse IT systems, from card processing to the core automated banking system. This infrastructure also works with a huge amount of information, yet most tasks - development, testing, office-process automation and the like - do not require elevated storage performance or reliability. Classic storage systems can still be used here, but every year it is less and less profitable. Moreover, classic storage offers no flexibility in spending storage resources, whose performance must be sized for peak load.

With distributed storage systems, the nodes - which are in fact ordinary servers - can be repurposed at any time, for example into a server farm used as a computing platform.

Data lakes

The diagram above lists typical consumers of data lake services. These can be e-government services (for example, "Gosuslugi"), enterprises that have undergone digitalization, financial institutions, etc. All of them need to work with large volumes of heterogeneous information.

Operating classic storage systems for such tasks is inefficient, since they require both high-performance access to block databases and regular access to libraries of scanned documents stored as objects. An ordering system accessed through a web portal, for example, may also be tied in. Implementing all this on a classic storage platform would require a large set of equipment for different tasks. A single horizontally scalable universal storage system can easily cover all the tasks listed above: you just need to create several pools in it with different storage characteristics.

Generators of new information

The amount of information stored in the world is growing by about 30% per year. This is good news for storage vendors, but what is and will be the main source of this data?
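
A 30% annual growth rate compounds faster than intuition suggests; a rough, purely illustrative calculation:

```python
# Back-of-the-envelope: data volume growing ~30% per year.
# The starting volume is normalized to 1.0; figures are illustrative only.
volume = 1.0
for year in range(1, 6):
    volume *= 1.30
    print(f"year {year}: {volume:.2f}x")
```

After five years the stored volume roughly triples (1.3^5 ≈ 3.71x), which is why any fixed storage estate is outgrown sooner than planned.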

Ten years ago, social networks became such generators, requiring the creation of a large number of new algorithms, hardware solutions, and so on. Now there are three main drivers of storage growth. The first is cloud computing: approximately 70% of companies currently use cloud services in one way or another - email systems, backups, and other virtualized entities.

Fifth-generation networks are becoming the second driver, bringing new speeds and new volumes of data transfer. According to our forecasts, the widespread adoption of 5G will lead to a drop in demand for flash memory cards. No matter how much memory a phone has, it still runs out, and if the gadget has a 100-megabit channel, there is no need to store photos locally.

The third group of reasons for the growing demand for storage systems includes the rapid development of artificial intelligence, the transition to big data analytics, and the trend towards automating everything that can be automated.

A distinctive feature of the "new traffic" is that it is unstructured. We need to store this data without defining its format in advance; the format is needed only for subsequent reading. For example, to determine the available loan size, a bank's scoring system might look at the photos you posted on social networks, determining how often you visit the seaside and restaurants, while also studying the extracts from your medical documents available to it. Such data is comprehensive on the one hand, yet lacks homogeneity on the other.

An ocean of unstructured data

What problems does the emergence of "new data" entail? The first, of course, is the sheer amount of information and its estimated storage period. A single modern autonomous driverless car generates up to 60 TB of data every day from all its sensors and mechanisms. To develop new driving algorithms, this information must be processed within the same day, otherwise it will begin to pile up. At the same time, it should be kept for a very long time - decades - so that conclusions can later be drawn from large analytical samples.

A single device for sequencing genetic data produces about 6 TB per day. And the data collected with it is not meant to be deleted at all; hypothetically, it should be stored forever.

Finally, there are the fifth-generation networks again. Beyond the information they transmit, such networks are themselves huge data generators: activity logs, call records, intermediate results of machine-to-machine interactions, etc.

All this requires the development of new approaches and algorithms for storing and processing information. And such approaches are emerging.

Technologies of the new era

Three groups of solutions designed to cope with new requirements for information storage systems can be distinguished: the introduction of artificial intelligence, the technical evolution of storage media, and innovations in the field of system architecture. Let's start with AI.

In the new Huawei solutions, artificial intelligence is already used at the level of the storage itself, which is equipped with an AI processor that allows the system to independently analyze its state and predict failures. If the storage system is connected to a service cloud that has significant computing capabilities, artificial intelligence can process more information and improve the accuracy of its hypotheses.

In addition to failures, such AI is able to predict the future peak load and the time remaining until capacity is exhausted. This allows you to optimize performance and scale the system before any unwanted events occur.

Now about the evolution of data carriers. The first flash drives were made using SLC (Single-Level Cell) technology. SLC devices were fast, reliable and stable, but small in capacity and very expensive. Growth in volume and reduction in price came through certain technical concessions that lowered the speed, reliability and lifespan of the drives. Nevertheless, the trend did not affect the storage systems themselves, which, thanks to various architectural tricks, became both more productive and more reliable overall.

But why were All-Flash class storage systems needed at all? Wasn't it enough simply to replace the old HDDs in a running system with SSDs of the same form factor? No: new systems were required to use all the resources of the new SSDs efficiently, which was simply impossible in the older ones.

Huawei, for example, has developed a number of technologies to solve this problem, one of which is FlashLink, which made it possible to optimize the disk-controller interactions as much as possible.

Intelligent identification made it possible to decompose data into several streams and cope with a number of undesirable phenomena, such as WA (write amplification). At the same time, new recovery algorithms, in particular RAID 2.0+, increased the speed of the rebuild, reducing its time to completely insignificant values.

Failures, high fill levels, garbage collection - these factors also no longer affect storage system performance, thanks to special refinement of the controllers.

Block data stores, meanwhile, are preparing to meet NVMe. Recall that the classic scheme for organizing data access worked like this: the processor accessed the RAID controller via the PCI Express bus, which in turn interacted with mechanical disks via SCSI or SAS. Using NVMe on the backend significantly sped up the whole process, but had one drawback: the drives had to be connected directly to the processor to give it direct memory access.

The next phase of technology development, which we are seeing now, is NVMe-oF (NVMe over Fabrics). Huawei block technologies already support FC-NVMe (NVMe over Fibre Channel), and NVMe over RoCE (RDMA over Converged Ethernet) is on the way. The test models are fully functional, with a few months left before their official presentation. Note that all this will also appear in distributed systems, where "lossless Ethernet" will be in great demand.

An additional way to optimize distributed storage has been the complete rejection of data mirroring. Huawei solutions no longer keep n full copies, as in conventional RAID 1, but switch entirely to EC (erasure coding). A special mathematical package computes parity blocks at a certain interval, which make it possible to restore the data in case of loss.

Deduplication and compression mechanisms become mandatory. If in classic storage systems we are limited by the number of processors installed in controllers, then in distributed horizontally scalable storage systems, each node contains everything you need: disks, memory, processors and interconnect. These resources are enough for deduplication and compression to have a minimal impact on performance.
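
The idea can be sketched in a few lines of Python - a toy content-addressed store, not any vendor's actual implementation - in which duplicate chunks are stored only once and every stored chunk is compressed:

```python
# Toy content-addressed deduplicating store: chunks are keyed by their
# SHA-256 digest, so identical chunks are stored once; each stored
# chunk is additionally compressed with zlib.
import hashlib
import zlib

store = {}  # digest -> compressed chunk


def put(chunk: bytes) -> str:
    digest = hashlib.sha256(chunk).hexdigest()
    if digest not in store:            # dedup: skip already-seen chunks
        store[digest] = zlib.compress(chunk)
    return digest


def get(digest: str) -> bytes:
    return zlib.decompress(store[digest])


refs = [put(b"same block"), put(b"same block"), put(b"other block")]
assert len(store) == 2                 # two unique chunks stored, not three
assert get(refs[0]) == b"same block"
```

In a real distributed system the hashing, lookup and compression run on every node's own CPUs, which is why the overhead stays local instead of bottlenecking on a pair of controllers.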

Finally, a word on hardware optimization. Here, the load on the central processors was reduced with the help of additional dedicated chips (or dedicated blocks within the processor itself) that play the role of a TOE (TCP/IP Offload Engine) or take on the math of EC, deduplication, and compression.

New approaches to data storage are embodied in a disaggregated (distributed) architecture. In centralized storage systems, a server farm is connected via Fibre Channel to a SAN with many arrays. The disadvantages of this approach are difficulties with scaling and with providing a guaranteed level of service (in terms of performance or latency). Hyperconverged systems use the same hosts for both storing and processing information. This gives almost unlimited scope for scaling, but entails high costs for maintaining data integrity.

Unlike both of the above, a disaggregated architecture splits the system into a compute farm and a horizontally scalable storage system. This provides the advantages of both architectures and allows nearly unlimited scaling of exactly the element whose performance falls short.

From integration to convergence

A classic task, whose relevance has only grown over the past 15 years, is the need to simultaneously provide block storage, file access, object access, the operation of a big data farm, and so on. The icing on the cake can be, for example, a backup system writing to magnetic tape.

At the first stage, only the management of these services could be unified: heterogeneous storage systems were tied into specialized software, through which the administrator allocated resources from the available pools. But since those pools differed in hardware, migrating load between them was impossible. At a higher level of integration, consolidation took place at the gateway level: where shared file access existed, it could be served through different protocols.

The most advanced convergence method available to us now involves creating a universal hybrid system - exactly what our OceanStor 100D is meant to be. Universal access uses the same hardware resources, logically divided into different pools but allowing load migration. All of this can be managed through a single console. In this way, we managed to implement the concept of "one data center - one storage system."

The cost of storing information now determines many architectural decisions. And although it can safely be put at the forefront, we are discussing "live" storage with active access today, so performance must also be taken into account. Another important property of next-generation distributed systems is unification: no one wants several disparate systems managed from different consoles. All these qualities are embodied in Huawei's new product series, OceanStor Pacific.

Next generation mass storage

OceanStor Pacific meets the six-nines (99.9999%) reliability requirement and can be used to build a HyperMetro-class data center. With up to 100 km between two data centers, the systems demonstrate an additional latency of 2 ms, which makes it possible to build any disaster-tolerant solutions on them, including those with quorum servers.
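
Six nines is a demanding target; a quick calculation shows how little downtime it actually permits per year:

```python
# "Six nines" availability leaves roughly half a minute of downtime a year.
availability = 0.999999
seconds_per_year = 365 * 24 * 3600          # 31,536,000 s (non-leap year)
downtime_s = (1 - availability) * seconds_per_year
print(f"allowed downtime: {downtime_s:.1f} s/year")  # about 31.5 s
```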

The products of the new series are versatile in terms of protocols. Already, OceanStor 100D supports block, object, and Hadoop access; file access will be implemented in the near future. There is no need to keep multiple copies of the data if it can be served through different protocols.

It would seem: what does the concept of a "lossless network" have to do with storage? The point is that distributed storage systems are built on top of a fast network that supports the appropriate algorithms and the RoCE mechanism. AI Fabric, the artificial intelligence system supported by our switches, helps further increase network speed and reduce latency. The performance gain of storage systems with AI Fabric activated can reach 20%.

What is the new OceanStor Pacific distributed storage node? The 5U form-factor solution holds 120 drives and can replace three classic nodes, more than halving the required rack space. And by abandoning full copies of the data, drive efficiency rises significantly (up to +92%).
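
The gain from dropping full copies is easy to see in a rough comparison of usable-capacity efficiency; the 22+2 EC layout below is an illustrative assumption, not a confirmed OceanStor Pacific parameter:

```python
# Usable-capacity efficiency: triple replication vs a wide erasure-coding
# layout. The 22+2 scheme is an illustrative assumption, not a vendor spec.
def efficiency(data_blocks, parity_blocks):
    return data_blocks / (data_blocks + parity_blocks)

replication_3x = 1 / 3                    # one usable copy out of three stored
ec_22_plus_2 = efficiency(22, 2)          # ~0.917, i.e. ~92% usable
print(f"3x replication: {replication_3x:.1%}")
print(f"EC 22+2:        {ec_22_plus_2:.1%}")
```

The wider the EC stripe, the higher the usable fraction, at the cost of more nodes participating in every rebuild.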

We are used to software-defined storage being special software installed on a classic server. But now, to achieve optimal parameters, this architectural solution also requires special nodes. Each node consists of two ARM-based servers managing an array of 3.5-inch drives.

These servers are poorly suited for hyperconverged solutions: firstly, there are few applications for ARM, and secondly, it is difficult to maintain load balance. We suggest moving to separated storage instead: a computing cluster of classic or rack servers operates separately but is connected to OceanStor Pacific storage nodes, which also perform their direct tasks. And this approach justifies itself.

Take, for example, a classic hyperconverged big data storage solution occupying 15 server racks. If the load is distributed between separate OceanStor Pacific compute servers and storage nodes, the number of required racks is halved! This reduces the cost of operating the data center and lowers the total cost of ownership. In a world where the volume of stored information grows by 30% per year, such advantages are not to be passed up.

***

For more information about Huawei solutions and their application scenarios, visit our website or contact company representatives directly.

Source: habr.com
