Today we'll talk about how best to store data in a world where fifth-generation networks, genome scanners and self-driving cars produce more data in a day than all of humanity generated before the industrial revolution.
Our world generates more and more information. Some of it is fleeting and is lost as quickly as it is collected; some should be stored longer; and some is meant to last "for centuries", at least as it appears from today's vantage point. Information flows settle in data centers at such a rate that any new approach, any technology designed to meet this endless "demand", quickly becomes obsolete.
40 years of distributed storage development
The first network storage systems in the form familiar to us appeared in the 1980s. Many of you have come across NFS (Network File System), AFS (Andrew File System), or Coda. A decade later, fashions and technology changed, and distributed file systems gave way to cluster storage systems based on GPFS (General Parallel File System), CFS (Clustered File System), and StorNext. These were built on block storage of classical architecture, on top of which a software layer created a single file system. These and similar solutions are still in use, occupy their niche, and remain in demand.
At the turn of the millennium, the distributed storage paradigm changed somewhat, and systems with a shared-nothing (SN) architecture took the lead. There was a transition from cluster storage to storage on separate nodes, which, as a rule, were ordinary servers running software that provides reliable storage; HDFS (Hadoop Distributed File System) and GFS (Google File System), for example, are built on these principles.
Closer to 2010, the concepts underlying distributed storage systems increasingly found their way into full-fledged commercial products, such as VMware vSAN, Dell EMC Isilon, and our own solutions.
Telecom operators
Perhaps among the oldest consumers of distributed storage systems are telecom operators. The diagram shows which groups of applications produce the bulk of the data. OSS (Operations Support Systems), MSS (Management Support Systems), and BSS (Business Support Systems) are three complementary software layers required for delivering services to subscribers, financial reporting to the provider, and operational support to the operator's engineers.
Often, the data of these layers is heavily intermixed, and to avoid accumulating unnecessary copies, distributed storage is used to hold the entire volume of information coming from the live network. The storage is combined into a common pool accessed by all services.
Our calculations show that moving from classical to distributed storage systems saves up to 70% of the budget simply by abandoning dedicated high-end storage arrays in favor of ordinary classical-architecture servers (usually x86) running specialized software. Cellular operators have been buying such solutions in significant volumes for quite some time. In particular, Russian operators have been using such products from Huawei for more than six years.
True, some tasks cannot be handled by distributed systems, for example, those with elevated performance requirements or a need for compatibility with older protocols. But at least 70% of the data an operator processes can be placed in a distributed pool.
Banking sphere
Any bank runs many diverse IT systems, from card processing to the core banking system. This infrastructure also works with a huge amount of information, yet most tasks, such as development, testing, and office-process automation, do not require elevated storage performance or reliability. Classical storage systems can still be used here, but every year this is less and less cost-effective. Moreover, they offer no flexibility in allocating storage resources, whose performance is sized for peak load.
When distributed storage is used, its nodes, which are in fact ordinary servers, can be repurposed at any time, for example, into a server farm used as a computing platform.
Data lakes
The diagram above shows a list of typical service consumers.
Operating classical storage systems for such tasks is inefficient, since they require both high-performance block access for databases and regular access to libraries of scanned documents stored as objects. An ordering system behind a web portal, for example, may also be tied in. Implementing all of this on a classical storage platform would require a large set of equipment for different tasks. A single scale-out universal storage system easily covers all of these tasks: you just need to create several pools in it with different storage characteristics.
Generators of new information
The amount of information stored in the world is growing by about 30% per year. This is good news for storage vendors, but what is and will be the main source of this data?
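To put that growth rate in perspective, here is a quick back-of-the-envelope calculation (the 30% figure is from the text; the 1 PB starting volume is an arbitrary example):

```python
# Compound 30%-per-year growth: doubling time and ten-year projection.
import math

rate = 0.30
doubling_years = math.log(2) / math.log(1 + rate)
print(f"Data volume doubles roughly every {doubling_years:.1f} years")

# Growth of a hypothetical 1 PB archive:
for year in (1, 5, 10):
    print(f"after {year:2d} years: {1.0 * (1 + rate) ** year:.1f} PB")
```

At 30% per year, volumes double roughly every 2.6 years, so a decade multiplies the starting archive almost fourteenfold.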
Ten years ago, social networks became such generators, requiring the creation of a large number of new algorithms, hardware solutions, and so on. Now there are three main drivers of storage growth. The first is cloud computing: currently, approximately 70% of companies use cloud services in one form or another, whether email systems, backups, or other virtualized entities.
Fifth-generation networks are the second driver, bringing new speeds and new volumes of data transfer. According to our forecasts, widespread adoption of 5G will lead to a drop in demand for flash memory cards. However much memory a phone has, it still runs out, and if the device has a 100-megabit channel, there is no need to store photos locally.
The third group of reasons demand for storage systems is growing includes the rapid development of artificial intelligence, the shift to big-data analytics, and the trend toward automating everything that can be automated.
A key feature of this "new traffic" is that it is largely unstructured.
An ocean of unstructured data
What problems does the emergence of "new data" bring? The first, of course, is the sheer volume of information and its expected retention period. A single modern driverless car generates up to 60 TB of data per day from all of its sensors and mechanisms. To develop new driving algorithms, this information must be processed within the same day, otherwise it will begin to pile up. At the same time, it should be kept for a very long time, decades, so that conclusions can later be drawn from large analytical samples.
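The 60 TB/day figure from the text scales up quickly. A rough sketch of annual raw volume for a fleet (the fleet size of 1,000 vehicles is a hypothetical assumption, not from the source):

```python
# Back-of-the-envelope: yearly raw data from a driverless-car fleet.
tb_per_car_per_day = 60     # figure from the text
fleet = 1000                # hypothetical fleet size
days = 365

raw_tb = tb_per_car_per_day * fleet * days
print(f"Raw data per year: {raw_tb:,} TB (~{raw_tb / 1e6:.1f} EB)")
```

Even a modest fleet thus produces on the order of exabytes per year before any replication or backup overhead.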
A single genome-sequencing device produces about 6 TB per day. And the data collected with it is never meant to be deleted; hypothetically, it should be stored forever.
Finally, fifth-generation networks once again. Beyond the information they transmit, such networks are themselves huge data generators: activity logs, call records, intermediate results of machine-to-machine interactions, and so on.
All this requires the development of new approaches and algorithms for storing and processing information. And such approaches are emerging.
Technologies of the new era
Three groups of solutions designed to cope with new requirements for information storage systems can be distinguished: the introduction of artificial intelligence, the technical evolution of storage media, and innovations in the field of system architecture. Let's start with AI.
In Huawei's new solutions, artificial intelligence is already used at the level of the storage system itself, which is equipped with an AI processor that lets the system analyze its own state and predict failures. If the storage system is connected to a service cloud with significant computing capacity, the AI can process more information and improve the accuracy of its hypotheses.
Beyond failures, such AI can predict future peak load and the time remaining until capacity is exhausted. This makes it possible to optimize performance and scale the system before any unwanted events occur.
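The simplest form of such a capacity forecast is a trend extrapolation. A minimal sketch of the idea (a real AI-driven system would use far richer models; the function name and sample data here are purely illustrative):

```python
# Least-squares trend on capacity usage, projected to the full mark.
def days_until_full(samples, capacity_tb):
    """samples: list of (day, used_tb) measurements; returns days to capacity."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    if slope <= 0:
        return None  # usage is flat or shrinking; no exhaustion in sight
    return (capacity_tb - intercept) / slope

history = [(0, 400), (30, 430), (60, 470), (90, 505)]  # day, TB used
print(f"Projected full in ~{days_until_full(history, 1000):.0f} days")
```

Having this number ahead of time is what lets an administrator (or the system itself) add nodes before the pool actually fills up.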
Now for the evolution of storage media. The first flash drives were made using SLC (Single-Level Cell) technology: devices based on it were fast, reliable, and stable, but had small capacity and were very expensive. Capacity growth and falling prices were achieved through technical trade-offs that reduced drive speed, reliability, and lifespan. Nevertheless, this trend did not affect the storage systems themselves, which, thanks to various architectural tricks, became both more productive and more reliable overall.
But why were All-Flash class storage systems needed at all? Wasn't it enough to simply replace the old HDDs in a running system with new SSDs of the same form factor? They were required in order to use all the resources of the new SSDs efficiently, which was simply impossible in older systems.
Huawei, for example, has developed a number of technologies to solve this problem.
Intelligent stream identification made it possible to separate data into several streams and cope with a number of undesirable phenomena.
Drive failure, overflow, and garbage collection also no longer affect storage performance, thanks to special refinements in the controllers.
And block data stores are preparing to meet NVMe.
The next phase of technology development we are seeing now is NVMe-oF (NVMe over Fabrics). Huawei's block products already support FC-NVMe (NVMe over Fibre Channel), and NVMe over RoCE (RDMA over Converged Ethernet) is on the way. Test models are fully functional, with just a few months left before their official launch. Note that all of this will also appear in distributed systems, where "lossless Ethernet" will be in great demand.
An additional way to optimize distributed storage was the complete rejection of data mirroring. Huawei solutions no longer keep n copies, as in familiar RAID 1, but switch entirely to an erasure-coding mechanism.
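Mirroring stores n full copies, while erasure coding keeps k data fragments plus m parity fragments and can rebuild the data after losing up to m of them. A rough sketch of the usable-capacity difference (the k+m parameters below are common illustrative values, not Huawei's exact scheme):

```python
# Usable fraction of raw capacity: n-copy mirroring vs. erasure coding.
def mirror_efficiency(copies):
    # n copies of every block: only 1/n of raw capacity is usable
    return 1 / copies

def ec_efficiency(k, m):
    # k data + m parity fragments: k/(k+m) of raw capacity is usable
    return k / (k + m)

print(f"3-way mirror: {mirror_efficiency(3):.0%} usable")
print(f"EC 4+2:       {ec_efficiency(4, 2):.0%} usable")
print(f"EC 10+2:      {ec_efficiency(10, 2):.0%} usable")
```

Both a 3-way mirror and EC 4+2 survive two lost fragments, yet the mirror leaves only a third of raw capacity usable while EC 4+2 leaves two thirds, which is where the capacity savings come from.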
Deduplication and compression are becoming mandatory. In classical storage systems we are limited by the number of processors installed in the controllers, whereas in distributed, horizontally scalable storage each node contains everything needed: disks, memory, processors, and interconnect. These resources are enough for deduplication and compression to have minimal impact on performance.
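The core idea of block-level deduplication can be shown in a few lines: fingerprint each fixed-size block and store a given fingerprint only once. This is a minimal sketch assuming SHA-256 fingerprints; production systems add compression, variable-length chunking, and persistent fingerprint indexes:

```python
# Fixed-block deduplication by content hashing (illustrative sketch).
import hashlib

def dedup_store(data, block_size=4096):
    store = {}    # fingerprint -> unique block contents
    recipe = []   # ordered fingerprints needed to reassemble the object
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # keep each unique block once
        recipe.append(fp)
    return store, recipe

data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096   # repeated content
store, recipe = dedup_store(data)
print(f"logical blocks: {len(recipe)}, unique blocks stored: {len(store)}")
```

Here four logical blocks shrink to two stored blocks, and the original object is recoverable by replaying the recipe against the store.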
Finally, hardware optimization methods. Here, the load on the central processors was reduced with the help of additional dedicated chips (or dedicated blocks inside the processor itself) that take over specific tasks.
New approaches to data storage are embodied in a disaggregated (distributed) architecture. In centralized storage systems, there is a server farm connected via Fibre Channel to a storage area network.
Unlike both of the above, a disaggregated architecture splits the system into a compute fabric and a horizontally scalable storage layer. This combines the advantages of both architectures and allows almost unlimited scaling of just the element whose performance is lacking.
From integration to convergence
A classic task, whose relevance has only grown over the past 15 years, is the need to simultaneously provide block storage, file access, object access, operation of a big-data farm, and so on. The icing on the cake might be, for example, a backup system using magnetic tape.
At the first stage, only the management of these services could be unified: heterogeneous storage systems were fronted by specialized software through which the administrator distributed resources from the available pools. But since these pools differed in hardware, migrating load between them was impossible. At a higher level of integration, consolidation took place at the gateway level: with shared file access, data could be served through different protocols.
The most advanced convergence method available to us now involves creating a universal hybrid system, which is exactly what our new products are meant to be.
The cost of storing information now drives many architectural decisions. And although it can safely be put at the forefront, we are discussing "live" storage with active access today, so performance must also be considered. Another important property of next-generation distributed systems is unification: no one wants several disparate systems managed from different consoles. All these qualities are embodied in the new series of Huawei products.
Next generation mass storage
OceanStor Pacific meets six-nines (99.9999%) reliability requirements and can serve as the basis for a HyperMetro-class data center. With up to 100 km between two data centers, the systems demonstrate an additional latency of 2 ms, which makes it possible to build any disaster-tolerant solutions on them, including those with quorum servers.
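To see what "six nines" means in practice, it helps to convert availability into allowed downtime per year:

```python
# Allowed downtime per year for a given availability level.
minutes_per_year = 365 * 24 * 60   # 525,600 minutes

for label, avail in [("three nines", 0.999),
                     ("five nines", 0.99999),
                     ("six nines", 0.999999)]:
    downtime = minutes_per_year * (1 - avail)
    print(f"{label} ({avail:.6f}) -> {downtime:.2f} min/year")
```

Six nines works out to roughly half a minute of downtime per year, versus almost nine hours for the three nines typical of ordinary services.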
The products of the new series are versatile in terms of protocols. OceanStor 100D already supports block, object, and Hadoop access, with file access to be implemented in the near future. There is no need to keep multiple copies of data if they can be served through different protocols.
What, you might ask, does the concept of a "lossless network" have to do with storage? The point is that distributed storage systems are built on a fast network supporting the appropriate algorithms and the RoCE mechanism. The AI system supported by our switches helps further increase network speed and reduce latency.
What is the new OceanStor Pacific distributed storage node? The 5U form-factor solution holds 120 drives and can replace three classic nodes, freeing up more than half the rack space. Thanks to the rejection of stored copies, drive efficiency increases significantly (up to +92%).
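The density implied by 120 drives in 5U is easy to estimate. A rough sketch using the figures from the text (the 42U rack and 16 TB per drive are hypothetical assumptions for illustration):

```python
# Raw capacity per rack for a 5U, 120-drive node (illustrative assumptions).
drives_per_node = 120   # from the text
node_height_u = 5       # from the text
rack_u = 42             # assumed standard rack
drive_tb = 16           # assumed drive capacity

nodes_per_rack = rack_u // node_height_u
raw_per_rack_pb = nodes_per_rack * drives_per_node * drive_tb / 1000
print(f"{nodes_per_rack} nodes per rack, ~{raw_per_rack_pb:.2f} PB raw per rack")
```

Under these assumptions a single rack holds on the order of 15 PB raw, before erasure-coding overhead is subtracted.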
We are used to software-defined storage being special software installed on a classic server. But to achieve optimal parameters, this architectural solution now also requires special nodes. The OceanStor Pacific node consists of two servers based on ARM processors that manage an array of 3.5-inch drives.
These servers are poorly suited for hyperconverged solutions: first, there are few applications for ARM, and second, it is difficult to maintain load balance. Instead, we suggest separate storage: a computing cluster of classic or rack servers operates on its own but connects to OceanStor Pacific storage nodes, which also perform their direct tasks. And this approach pays off.
Take, for example, a classic hyperconverged big-data storage solution occupying 15 server racks. If you distribute the load between separate OceanStor Pacific compute servers and storage nodes, the number of required racks is halved! This reduces data-center operating costs and lowers the total cost of ownership. In a world where stored information grows by 30% per year, such advantages are not to be passed up.
***
For more information about Huawei solutions and their application scenarios, please visit our website.
Source: habr.com