AERODISK vAIR hyperconverged solution. The basis is the ARDFS file system

Hello, Habr readers! This article opens a series about the hyperconverged system AERODISK vAIR that we have developed. Initially we wanted to tell everything about everything right in the first article, but the system is rather complex, so we will eat the elephant one piece at a time.

We will start with the history of how the system came to be, then dive into the ARDFS file system that underpins vAIR, and also talk a little about the positioning of this solution on the Russian market.

In future articles we will describe the individual architectural components in more detail (cluster, hypervisor, load balancer, monitoring system, etc.) and the configuration process, cover licensing, show crash tests separately and, of course, write about load testing and sizing. We will also devote a separate article to the community version of vAIR.

Isn't Aerodisk all about storage systems? Or why did we get into hyperconvergence at all?

The idea of creating our own hyperconverged system first came to us around 2010. At that time there was neither Aerodisk nor any similar solutions (commercial boxed hyperconverged systems) on the market. Our task was as follows: from a set of servers with local disks, connected by an Ethernet interconnect, we had to build stretched storage and run virtual machines and a software network on it. All of this had to be implemented without a storage array (because there was simply no money for an array and its surrounding infrastructure, and we had not yet invented our own storage system).

We tried a lot of open source solutions and eventually solved the problem, but the solution was very complex and hard to reproduce. Moreover, it fell into the category of "Works? Don't touch!" So, having solved that particular problem, we did not develop the idea of turning the result of our work into a full-fledged product any further.

After that episode we set the idea aside, but we kept the feeling that the task was entirely solvable and that the benefits of such a solution were more than obvious. Later, the HCI products released by foreign companies only confirmed this feeling.

So in mid-2016 we returned to this task as part of creating a full-fledged product. At that time we had no relationships with investors, so we had to buy a development test bench with our own, not very big, money. Having collected used servers and switches on Avito, we set to work.

The main initial task was to create our own file system, simple as it might be, that could automatically and evenly distribute data in the form of virtual blocks across an arbitrary number of cluster nodes connected by an Ethernet interconnect. At the same time, the FS had to scale well and easily and be independent of adjacent systems, i.e. it had to be separable from vAIR as "just storage".

[Figure: the first vAIR concept]

We deliberately declined to use ready-made open source solutions for organizing stretched storage (Ceph, Gluster, Lustre and the like) in favor of our own development, since we already had plenty of project experience with them. Of course, these solutions are excellent in themselves, and before working on Aerodisk we completed more than one integration project with them. But it is one thing to implement a specific task for one customer, train the staff and perhaps buy support from a large vendor, and quite another to create an easily replicable product that will be used for various tasks that we, as the vendor, may not even know about. For the second goal the existing open source products did not suit us, so we decided to write a distributed file system ourselves.
Two years later, with the help of several developers (who combined work on vAIR with work on the classic storage system), we achieved a certain result.

By 2018 we had written a very simple file system and supplemented it with the necessary surrounding functionality. The system combined physical (local) disks from different servers into one flat pool over the internal interconnect and "cut" them into virtual blocks; block devices with varying degrees of fault tolerance were then created from the virtual blocks, and virtual machines were created and run on them using the KVM hypervisor.

We didn't agonize over the name of the file system and succinctly called it ARDFS (guess what it stands for))

This prototype looked good (not visually, of course, there was no visual design yet) and showed good results in terms of performance and scaling. After the first real result we turned this into a proper project, organizing a full-fledged development environment and a separate team that worked only on vAIR.

By that time the general architecture of the solution had matured, and it has not undergone major changes since.

Dive into the ARDFS file system

ARDFS is the backbone of vAIR and provides distributed, fault-tolerant storage for the entire cluster. One of the distinguishing features of ARDFS (though not the only one) is that it does not use any additional dedicated servers for metadata and management. This was originally conceived to simplify the configuration of the solution and to improve its reliability.

Storage Structure

Across all cluster nodes, ARDFS organizes a logical pool out of all the available disk space. It is important to understand that a pool is not yet data or formatted space, but simply markup: any node with vAIR installed is, when added to the cluster, automatically added to the shared ARDFS pool, and its disk resources automatically become shared across the entire cluster (and available for future data storage). This approach allows nodes to be added and removed on the fly without any serious impact on the already running system. In other words, the system is very easy to scale with "bricks", adding or removing nodes in the cluster as needed.

On top of the ARDFS pool, virtual disks (storage objects for virtual machines) are created, built from virtual blocks 4 MB in size. Data is stored directly on virtual disks, and the fault tolerance scheme is also set at the virtual disk level.
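
To make the pool/virtual-disk relationship easier to picture, here is a minimal illustrative sketch in Python. It is not ARDFS code: all class and field names are invented for this example, and the real system obviously does much more (placement, replication, metadata, and so on).

```python
# Illustrative model of an ARDFS-style pool and virtual disk.
# NOT the real ARDFS implementation; names and structures are invented
# purely to visualise the concepts described above.

VIRTUAL_BLOCK_SIZE = 4 * 1024 * 1024  # virtual disks are built from 4 MB blocks

class Node:
    """A cluster node contributing its local disk space to the shared pool."""
    def __init__(self, name, raw_capacity_bytes):
        self.name = name
        self.raw_capacity_bytes = raw_capacity_bytes

class Pool:
    """A logical pool: markup over all nodes' disk space, not data yet."""
    def __init__(self):
        self.nodes = []

    def add_node(self, node):
        # Nodes can be added on the fly; their space immediately becomes
        # available for future virtual disks.
        self.nodes.append(node)

    def raw_capacity(self):
        return sum(n.raw_capacity_bytes for n in self.nodes)

class VirtualDisk:
    """A storage object for VMs, carved into 4 MB virtual blocks."""
    def __init__(self, pool, size_bytes, scheme="RF-2"):
        self.pool = pool
        self.scheme = scheme  # fault tolerance is set per virtual disk
        self.block_count = -(-size_bytes // VIRTUAL_BLOCK_SIZE)  # ceiling division

pool = Pool()
for i in range(4):
    pool.add_node(Node(f"node{i + 1}", 10 * 2**40))  # four nodes, 10 TB each

vdisk = VirtualDisk(pool, size_bytes=100 * 2**30, scheme="RF-2")
print(pool.raw_capacity() // 2**40, "TB raw,", vdisk.block_count, "virtual blocks")
```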

As you might have guessed, for fault tolerance of the disk subsystem we do not use the concept of RAID (Redundant Array of Independent Disks) but RAIN (Redundant Array of Independent Nodes): fault tolerance is measured, automated and managed in terms of nodes, not disks. Disks, of course, are also a storage object; like everything else they are monitored, and all the standard operations can be performed on them, including building a local hardware RAID, but the cluster operates in terms of nodes.

In a situation where you really want RAID (for example, a scenario that must survive multiple failures on a small cluster), nothing prevents you from using local RAID controllers and building stretched storage with a RAIN architecture on top of them. This scenario is quite realistic and is supported by us, so we will discuss it in an article about typical vAIR usage scenarios.

Storage fault tolerance schemes

There can be two fault tolerance schemes for virtual disks in vAIR:

1) Replication factor, or simply replication: this fault tolerance method is as simple as a stick and a rope. Synchronous replication is performed between nodes with a factor of 2 (2 copies per cluster) or 3 (3 copies, respectively). RF-2 allows a virtual disk to withstand the failure of one node in the cluster, but "eats up" half of the usable capacity, while RF-3 withstands the failure of 2 nodes but reserves 2/3 of the usable capacity for its needs. This scheme is very similar to RAID-1: a virtual disk configured with RF-2 is resistant to the failure of any single node of the cluster. In this case the data will remain intact and even I/O will not stop. When the failed node returns to service, automatic data recovery/synchronization begins.

Below are examples of how RF-2 and RF-3 data is distributed in normal mode and in a failure situation.

Suppose we have a virtual machine with 8 MB of unique (useful) data running on 4 vAIR nodes. Such a small volume is, of course, unlikely in reality, but for a scheme that reflects the logic of ARDFS this example is the easiest to understand. A and B are 4 MB virtual blocks containing unique virtual machine data. RF-2 creates two copies of each of these blocks, A1+A2 and B1+B2, respectively. These copies are "laid out" across the nodes so that identical data never ends up on the same node, i.e. copy A1 will not sit on the same node as copy A2, and likewise for B1 and B2.
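
A rough sketch of that placement rule (two copies of one block never share a node) might look like the hypothetical placer below. It only illustrates the constraint from the example above, not the actual ARDFS placement algorithm.

```python
# Hypothetical sketch of RF-style replica placement: each virtual block gets
# `rf` copies, and no two copies of the same block may land on the same node.
# Not the ARDFS algorithm, just an illustration of the rule.
import itertools

def place_replicas(blocks, nodes, rf=2):
    assert rf <= len(nodes), "cannot place more copies than there are nodes"
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block in blocks:
        chosen = []
        while len(chosen) < rf:
            candidate = next(node_cycle)
            if candidate not in chosen:      # copies never share a node
                chosen.append(candidate)
        placement[block] = chosen
    return placement

nodes = ["node1", "node2", "node3", "node4"]
print(place_replicas(["A", "B"], nodes, rf=2))
# -> {'A': ['node1', 'node2'], 'B': ['node3', 'node4']}
```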

[Diagram: RF-2 block placement across four nodes]

If one of the nodes fails (for example, node No. 3, which holds copy B1), this copy is automatically activated on a node that does not hold its second copy (that is, copy B2).

[Diagram: RF-2 placement after the failure of node No. 3]

Thus, under the RF-2 scheme the virtual disk (and, accordingly, its VM) easily survives the failure of one node in the cluster.

The replication scheme, for all its simplicity and reliability, suffers from the same ailment as RAID-1: too little usable space.

2) Erasure coding (also known as "redundant coding" or "redundancy coding") exists precisely to solve the problem above. EC is a redundancy scheme that provides high data availability with less disk space overhead than replication. The operating principle of this mechanism is similar to RAID 5, 6, 6P.

When encoding, the EC process divides a virtual block (4 MB by default) into several smaller "data chunks", depending on the EC scheme (for example, a 2+1 scheme splits each 4 MB block into two 2 MB chunks). It then generates "parity chunks" for the "data chunks", each no larger than one of the previously separated chunks. When decoding, EC regenerates the missing chunks by reading the "surviving" data from across the cluster.
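
For the simplest 2+1 case, the single parity chunk can be computed as a bytewise XOR of the two data chunks, much like RAID-5, and any one lost chunk can then be rebuilt from the other two. The sketch below illustrates only this idea; schemes with more than one parity chunk need a Reed-Solomon-style code, and nothing here claims to match the actual ARDFS encoder.

```python
# Sketch of a 2+1 erasure-coding step on one virtual block, using bytewise XOR
# for the single parity chunk (RAID-5-like). Purely illustrative.

def encode_2_plus_1(block: bytes):
    """Split one virtual block into two data chunks and one XOR parity chunk."""
    assert len(block) % 2 == 0
    half = len(block) // 2
    d1, d2 = block[:half], block[half:]
    parity = bytes(a ^ b for a, b in zip(d1, d2))   # the AP/BP chunk in the text
    return d1, d2, parity

def reconstruct(surviving_chunk: bytes, parity: bytes) -> bytes:
    """Any single lost chunk equals the XOR of the surviving chunk and parity."""
    return bytes(a ^ b for a, b in zip(surviving_chunk, parity))

# Demo on a tiny block; ARDFS would do this per 4 MB virtual block.
block = bytes(range(64)) * 2
d1, d2, p = encode_2_plus_1(block)
assert reconstruct(d2, p) == d1     # "the node holding d1 failed"
assert reconstruct(d1, p) == d2     # "the node holding d2 failed"
```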

For example, a virtual disk with a 2+1 EC scheme implemented on 4 cluster nodes will survive the failure of one node just as easily as RF-2. At the same time, the overhead will be lower: in particular, the usable-capacity factor for RF-2 is 2, while for EC 2+1 it is 1.5.

To put it more simply, the virtual block is divided into 2 to 8 "chunks" (why 2 to 8, see below), and for these chunks parity "chunks" of the same size are calculated.

As a result, data and parity are evenly distributed across all nodes of the cluster. As with replication, ARDFS automatically distributes data among nodes so that identical data (data chunks and their parity) are never stored on the same node, eliminating the chance of losing data because the data and its parity happen to end up on a single storage node that then fails.

Below is an example with the same 8 MB virtual machine and 4 nodes, but with the EC 2+1 scheme.

Blocks A and B are each divided into two 2 MB chunks (two because the scheme is 2+1), i.e. into A1+A2 and B1+B2. Unlike a replica, A1 is not a copy of A2; it is virtual block A split into two parts, and the same applies to block B. In total we get two sets of 4 MB, each containing two two-megabyte chunks. Then, for each of these sets, parity is calculated with a size of no more than one chunk (i.e. 2 MB), which gives two additional parity chunks (AP and BP). In total we have 4×2 MB of data + 2×2 MB of parity.

Next, the chunks are "laid out" across the nodes so that data never shares a node with its parity, i.e. A1 and A2 will not sit on the same node as AP.

[Diagram: EC 2+1 chunk and parity placement across four nodes]

In the event of a failure of one node (say, again the third one), the lost chunk B1 will be automatically reconstructed from the BP parity stored on node No. 2 (together with the surviving chunk B2) and activated on a node that does not hold the B parity chunk, i.e. BP. In this example, that is node No. 1.

[Diagram: EC 2+1 reconstruction after the failure of node No. 3]

I'm sure the reader is wondering:

“Everything you have described has long been implemented both by competitors and in open source solutions. How is your implementation of EC in ARDFS different?”

This is where the interesting features of ARDFS come in.

Erasure coding with a focus on flexibility

From the start we provided a rather flexible EC X+Y scheme, where X is a number from 2 to 8 and Y is a number from 1 to 8, but always less than or equal to X. This scheme is provided for the sake of flexibility. Increasing the number of data chunks (X) into which the virtual block is divided reduces the overhead, i.e. increases the usable space.
Increasing the number of parity chunks (Y) increases the reliability of the virtual disk: the larger the Y value, the more nodes in the cluster can fail. Of course, increasing the amount of parity reduces the usable capacity, but that is the price of reliability.

The dependence of performance on the EC scheme is almost direct: the more "chunks", the lower the performance; a balanced approach is, of course, needed here.

This approach allows administrators to configure the stretched storage as flexibly as possible. Within the ARDFS pool, you can use any fault tolerance schemes and their combinations, which, in our opinion, is also very useful.

Below is a table comparing several (not all possible) RF and EC schemes.

[Table: comparison of several RF and EC schemes]

The table shows that even the most extreme combination, EC 8+7, which tolerates the simultaneous loss of up to 7 nodes in a cluster, "eats up" less usable space (1.875 versus 2) than standard replication while protecting 7 times better. This makes the mechanism, although more complex, much more attractive in situations where maximum reliability must be ensured despite a shortage of disk space. At the same time, you need to understand that every "plus one" to X or Y means additional performance overhead, so you have to choose very carefully within the triangle of reliability, capacity efficiency and performance. For this reason we will devote a separate article to sizing erasure coding.
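
The capacity figures quoted here follow directly from the definitions (RF-N keeps N full copies, EC X+Y keeps X data chunks plus Y parity chunks per virtual block), so they can be reproduced with a few lines. This is just arithmetic over the schemes mentioned above, not vAIR internals.

```python
# Usable-capacity factor (raw space consumed per unit of useful data) and the
# number of tolerated node failures, derived from the scheme definitions.

def capacity_factor_rf(n: int) -> float:
    """RF-N stores N full copies of every virtual block."""
    return float(n)

def capacity_factor_ec(x: int, y: int) -> float:
    """EC X+Y stores X data chunks plus Y parity chunks per virtual block."""
    assert 2 <= x <= 8 and 1 <= y <= 8 and y <= x   # the X/Y limits stated above
    return (x + y) / x

schemes = {
    "RF-2":   (capacity_factor_rf(2), 1),     # survives 1 node failure
    "RF-3":   (capacity_factor_rf(3), 2),     # survives 2 node failures
    "EC 2+1": (capacity_factor_ec(2, 1), 1),
    "EC 8+7": (capacity_factor_ec(8, 7), 7),  # 1.875, survives 7 node failures
}
for name, (factor, survives) in schemes.items():
    print(f"{name}: capacity factor {factor:.3f}, survives {survives} node failure(s)")
```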

Reliability and autonomy of the file system

ARDFS runs locally on all cluster nodes and synchronizes them by its own means over dedicated Ethernet interfaces. Importantly, ARDFS synchronizes not only the data but also the storage metadata on its own. While working on ARDFS we studied a number of existing solutions in parallel and found that many of them synchronize the file system metadata through an external distributed DBMS. We also use a DBMS for synchronization, but only for configurations, not for file system metadata (more on this and other related subsystems in the next article).

Synchronizing FS metadata through an external DBMS is, of course, a workable solution, but then the consistency of the data stored in ARDFS would depend on the external DBMS and its behavior (and, frankly, it is a capricious lady), which in our opinion is bad. Why? If the FS metadata gets corrupted, you can say goodbye to the FS data itself, so we decided to take a more complicated but more reliable path.

We built the metadata synchronization subsystem for ARDFS ourselves, and it lives completely independently of the adjacent subsystems, so no other subsystem can corrupt ARDFS data. In our opinion this is the most reliable and correct way, but whether that is actually so, time will tell. This approach also brings an additional advantage: ARDFS can be used separately from vAIR, simply as stretched storage, which we will certainly make use of in future products.

As a result, by developing ARDFS we got a flexible and reliable file system that gives you a choice: you can save on capacity, go all in on performance, or make the storage ultra-reliable for a modest cost while accepting lower performance.

Together with a simple licensing policy and a flexible delivery model (looking ahead: vAIR is licensed per node and delivered either as software or as a hardware and software appliance), this makes it possible to tailor the solution precisely to a wide range of customer requirements and then easily maintain that balance.

Who needs this miracle?

On the one hand, one could say that there are already players on the market with serious solutions in the field of hyperconvergence, and ask why we are getting into it at all. This statement seems to be true, BUT...

On the other hand, when we go out into the field and talk to customers, we and our partners see that this is not the case at all. There are plenty of tasks for hyperconvergence: somewhere people simply did not know that such solutions existed, somewhere they seemed too expensive, somewhere tests of alternative solutions were unsuccessful, and somewhere buying them is forbidden altogether because of sanctions. On the whole the field turned out to be unplowed, so we went off to till the virgin soil))).

When is a storage system better than HCI?

While working with the market, we are often asked when it is better to use the classic scheme with storage systems and when it is better to use hyperconvergence. Many companies producing HCI systems (especially those that have no storage systems in their portfolio) say: "SAN is becoming obsolete, hyperconvergence only!" This is a bold statement, but it does not quite reflect reality.

In truth, the storage market is indeed moving towards hyperconverged and similar solutions, but there is always a “but”.

Firstly, data centers and IT infrastructures built according to the classic scheme with storage systems cannot be rebuilt overnight, so modernizing and extending such infrastructures will remain a legacy task for another 5-7 years.

Secondly, the infrastructures being built right now are, for the most part (in the Russian Federation), still built according to the classic scheme with storage systems. Not because people do not know about hyperconvergence, but because the HCI market is new, solutions and standards have not yet settled, IT specialists have not yet been trained, there is little experience, and data centers have to be built here and now. This trend will last another 3-5 years (and then leave yet another legacy, see point 1).

Thirdly, there is a purely technical limitation: a small additional write latency of about 2 milliseconds (not counting the local cache, of course), which is the price of distributed storage.

And let's not forget about large physical servers that love vertical scaling of the disk subsystem.

There are many necessary and popular tasks for which storage systems behave better than HCI. Here, of course, those manufacturers who have no storage systems in their product portfolio will disagree with us, but we are ready to argue our case. Naturally, as the developers of both products, we will compare storage systems and HCI in one of our future publications and clearly demonstrate which is better under which conditions.

And where will hyperconverged solutions work better than storage?

Based on the theses above, three obvious conclusions can be drawn:

  1. Where an additional 2 milliseconds of write latency, which invariably shows up in any product (we are not talking about synthetic benchmarks now; nanoseconds can be shown on synthetics), is not critical, hyperconvergence is a good fit.
  2. Where the load from large physical servers can be turned into many small virtual ones and distributed among the nodes, hyperconvergence will also work well.
  3. Where horizontal scaling is more important than vertical scaling, HCI will do fine there too.

What kinds of tasks are these?

  1. All standard infrastructure services (directory services, mail, EDMS, file servers, small and medium ERP and BI systems, etc.). We call this "general computing".
  2. Cloud provider infrastructure, where you need to scale out quickly and in a standardized way and easily "slice up" a large number of virtual machines for customers.
  3. Virtual desktop infrastructure (VDI), where many small user virtual machines are launched and quietly "float" inside a uniform cluster.
  4. Branch networks, where each branch needs a standard, fault-tolerant, yet inexpensive infrastructure of 15-20 virtual machines.
  5. Any distributed computing (big data services, for example), where the load goes not "in depth" but "in breadth".
  6. Test environments, where small additional latencies are acceptable but budgets are tight, because these are tests.

At the moment these are exactly the tasks for which we made AERODISK vAIR, and these are the ones we focus on (successfully so far). Perhaps this will change soon, since the world does not stand still.

So…

This concludes the first part of a large series of articles. In the next article we will talk about the architecture of the solution and the components it uses.

We welcome questions, suggestions and constructive debate.

Source: habr.com
