Brief comparison of SDS architectures, or finding the right storage platform (Gluster vs Ceph vs Virtuozzo Storage)

This article is meant to help you choose the right solution and understand the differences between SDS products such as Gluster, Ceph and Vstorage (Virtuozzo).

The text links to articles that cover specific problems in more detail, so the descriptions here are kept as brief as possible: key points only, without filler or introductory material that you can easily find on the Internet yourself.

Of course, each of these topics deserves tons of text, but these days fewer and fewer people like to read a lot, so you can skim this quickly and make a choice; if something is unclear, follow the links or google the unfamiliar terms. Think of this article as a transparent wrapper around those deep topics, showing the filling: the key points of each solution.

Gluster

Let's start with Gluster, which is actively used by vendors of hyperconverged platforms with open-source SDS for virtual environments. You can find it on the Red Hat website in the storage section, where you are offered a choice of two SDS options: Gluster or Ceph.

Gluster consists of a stack of translators, the services that do all the work of distributing files and so on. A brick is a service that serves a single disk; a volume is a pool that combines these bricks. Next comes the service that distributes files into groups using a DHT (distributed hash table) function. We will leave the sharding service out of this description, since the links below cover the problems associated with it.
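As a toy illustration (plain Python, not Gluster code; the server and brick names are made up), this is roughly how a DHT-style hash sends a whole file to exactly one replica group:

    import zlib

    # Two replica groups (subvolumes), each a pair of bricks on different servers.
    replica_groups = [
        ["server1:/brick1", "server2:/brick1"],  # group 0
        ["server3:/brick1", "server4:/brick1"],  # group 1
    ]

    def place_file(path: str) -> list[str]:
        """Hash the file name and pick a single replica group for the whole file."""
        group_index = zlib.crc32(path.encode()) % len(replica_groups)
        return replica_groups[group_index]

    print(place_file("/vm-images/disk-001.qcow2"))
    print(place_file("/vm-images/disk-002.qcow2"))

The point is that the placement unit is the entire file, not a block of it, which is what the problems below follow from.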


On write, the file is placed in its entirety into a brick, and its copy is written in parallel to a brick on a second server. The next file will then be written to the second group of two (or more) bricks on different servers.

If the files are roughly the same size and the volume consists of only one group, everything is fine, but under other conditions the descriptions point to the following problems:

  • space in the groups is used unevenly; it depends on file sizes, and if there is not enough room in a group to write a file, you get an error: the file is not written and is not redistributed to another group;
  • when a single file is written, IO goes to only one group while the others sit idle;
  • you cannot get the IO of the entire volume when writing a single file;
  • the overall concept looks less efficient because data is not split into blocks, which would make it easier to balance and to solve the problem of even distribution; instead the whole file ends up in one brick (a toy simulation of this imbalance follows below).
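Here is the toy simulation promised above (plain Python with invented sizes and capacities, not a model of real Gluster behavior): whole files are hashed to a single group, so fill levels drift apart and a write can fail even while other groups still have free space.

    import random
    import zlib

    GROUP_CAPACITY_GB = 500
    groups = [0, 0, 0]  # used space per replica group, in GB

    def write_file(name: str, size_gb: int) -> bool:
        """A whole file goes to the group its name hashes to, or fails outright."""
        g = zlib.crc32(name.encode()) % len(groups)
        if groups[g] + size_gb > GROUP_CAPACITY_GB:
            return False  # not redistributed to another group, even if space exists there
        groups[g] += size_gb
        return True

    random.seed(1)
    rejected = sum(
        not write_file(f"vm-{i}.qcow2", random.choice([5, 20, 120])) for i in range(40)
    )
    print("per-group usage, GB:", groups)
    print("rejected writes:", rejected)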

From the official description of the architecture one also involuntarily gets the impression that Gluster works as file storage on top of classic hardware RAID. There have been attempts to shard files into blocks, but all of this is an add-on that imposes a performance penalty on the existing architectural approach, plus the use of performance-limiting freely redistributable components such as FUSE. There are no metadata services, which limits the performance and fault tolerance of the storage when files are distributed into blocks. Better performance can be observed with the "Distributed Replicated" configuration, and at least six nodes are needed to organize a reliable replica 3 with optimal load distribution.

These conclusions are also supported by published accounts of real-world experience with Gluster and by comparisons with Ceph, as well as by descriptions of how people arrived at this more productive and more reliable "Distributed Replicated" configuration.
[Image: load distribution when writing two files across two Gluster replica groups of three servers each]

The picture shows the load distribution when writing two files: copies of the first file are spread over the first three servers, which are combined into the volume 0 group, while three copies of the second file are placed on the second group (volume 1) of three servers. Each server has one disk.

The general conclusion is that you can use Gluster, but with the understanding that there will be limitations in performance and fault tolerance which create difficulties under certain conditions in a hyperconverged solution, where resources are also needed for the compute loads of virtual environments.

There are also published Gluster performance figures that can be achieved under certain conditions, at the cost of reduced fault tolerance.

Ceph

Now let's look at Ceph, based on the architecture descriptions I managed to find. There is also a comparison of GlusterFS and Ceph from which you can immediately see that Ceph is best deployed on separate servers, since under load its services consume all of the hardware's resources.

Ceph's architecture is more complex than Gluster's and includes services such as metadata services, but the whole component stack is quite complex and not very flexible to use in a virtualization solution. Data is stored in blocks, which looks more efficient, but there are losses and added latency in the hierarchy of all the services (components) under certain loads and failure conditions; see, for example, the following article.

Judging by the architecture description, the heart of Ceph is CRUSH, which chooses where data is placed. Next comes the PG, the hardest abstraction (logical group) to understand. PGs exist to make CRUSH more effective: their main purpose is to group objects in order to reduce resource consumption and improve performance and scalability. Addressing objects directly, one by one, without combining them into PGs would be very costly. An OSD is a service for each individual disk.
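A minimal sketch of the idea (simplified Python, not real CRUSH or Ceph code; PG_NUM and the object names are invented): any number of objects collapse onto a fixed set of PGs, and only the PGs then need to be mapped to disks.

    import zlib

    PG_NUM = 128  # number of placement groups in the pool (made-up figure)

    def object_to_pg(object_name: str) -> int:
        """Collapse any number of objects onto a fixed, small set of PGs."""
        return zlib.crc32(object_name.encode()) % PG_NUM

    for obj in ("rbd_data.1f2a.0000000001", "rbd_data.1f2a.0000000002"):
        print(obj, "-> pg", object_to_pg(obj))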


A cluster can have one or many data pools, for different purposes and with different settings. Pools are divided into placement groups, and placement groups store the objects that clients access. Here the logical level ends and the physical one begins: each placement group is assigned one primary disk and several replica disks (how many depends on the pool's replication factor). In other words, at the logical level an object is stored in a particular placement group, and at the physical level on the disks assigned to it; those disks can physically be located on different nodes or even in different data centers.
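Continuing the same sketch (a seeded shuffle stands in for CRUSH here, purely for illustration): each PG maps to an ordered set of OSDs, with the first acting as primary and the rest holding replicas.

    import random

    OSDS = [f"osd.{i}" for i in range(6)]  # one OSD service per disk
    REPLICATION_FACTOR = 3

    def pg_to_osds(pg: int) -> list[str]:
        """Deterministically map a PG to an ordered OSD set: primary first, then replicas."""
        return random.Random(pg).sample(OSDS, REPLICATION_FACTOR)

    for pg in range(3):
        primary, *replicas = pg_to_osds(pg)
        print(f"pg {pg}: primary={primary}, replicas={replicas}")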

In this scheme, placement groups look like a level needed for the flexibility of the whole solution, but at the same time they look like an extra link in the chain, which involuntarily suggests a loss of performance. When writing data, the system has to split it into these groups and then, at the physical level, onto the primary disk and the replica disks. The hash function works when an object is looked up or inserted, but there is a side effect: very high costs and restrictions on rebuilding the hash when a disk is added or removed. Another hash problem is that the data location is pinned down and cannot be changed: if a disk comes under increased load, the system has no way to avoid writing to it by choosing another disk, because the hash function obliges the data to be placed according to the rule, no matter how badly the disk behaves. That is also why Ceph eats a lot of memory when rebuilding PGs during self-healing or storage expansion. The conclusion is that Ceph works well (albeit slowly), but only while there is no scaling, no failures and no updates.
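To make the rebalancing point concrete, here is a deliberately naive demo with plain modulo placement (real CRUSH is much smarter and moves far fewer objects than this); what it illustrates is that placement is a pure function of the cluster map, not of current load, so changing the disk count forces data to move.

    import zlib

    def placement(obj: str, n_disks: int) -> int:
        """Purely hash-determined placement: the disk is a function of the object name."""
        return zlib.crc32(obj.encode()) % n_disks

    objects = [f"obj-{i}" for i in range(100_000)]
    moved = sum(1 for o in objects if placement(o, 10) != placement(o, 11))
    print(f"{moved / len(objects):.0%} of objects would have to move after adding one disk")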

Of course, there are ways to improve performance with caching and cache tiering, but that requires good hardware and there will still be losses. Overall, though, Ceph looks more tempting than Gluster in terms of performance. Also, when using either of these products you have to account for an important factor: they demand a high level of competence, experience and professionalism with a strong emphasis on Linux, since it is very important to deploy, configure and maintain everything properly, which places even more responsibility and burden on the administrator.

Vstorage

Even more interesting is the architecture of Virtuozzo Storage (Vstorage), which can run alongside a hypervisor on the same nodes, on the same hardware, but it is very important to configure everything correctly to achieve good performance. That is, deploying this product out of the box on any configuration, ignoring the recommendations that follow from the architecture, is very easy, but not very productive.

So what can coexist with the services of the kvm-qemu hypervisor on the storage side? Just a few services forming a compact, optimal hierarchy of components: a client service mounted via FUSE (modified, not open source), the MDS metadata service, and the CS chunk service for data blocks, which at the physical level corresponds to a single disk, and that's it. In terms of speed, it is of course optimal to use a fault-tolerant scheme with two replicas, but if you use caching and journals on SSDs, then erasure coding (EC, or RAID6-style coding) can be pushed to decent speeds in a hybrid scheme, or even better on all-flash. EC has one drawback: when a single data block changes, the parity sums have to be recalculated. To avoid the cost of this operation, Ceph delays writes to EC pools, which can cause performance problems on certain requests, for example when all blocks need to be read; in Virtuozzo Storage, modified blocks are written using the "log-structured file system" approach, which minimizes the cost of parity calculations. To roughly estimate the options with and without EC acceleration there is a calculator; the figures are approximate, depending on the accuracy factor of the hardware vendor, but the result of the calculation helps a lot in planning the configuration.
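A tiny illustration of the EC penalty mentioned above, using a single XOR parity (RAID5-style) instead of the double parity of RAID6/EC schemes; the block contents and sizes are made up.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # A stripe of two data blocks plus one parity block.
    d0, d1 = b"\x11" * 4, b"\x22" * 4
    parity = xor_blocks(d0, d1)

    # Overwriting d0 in place: read old d0 and old parity, recompute, write both back.
    new_d0 = b"\x33" * 4
    new_parity = xor_blocks(xor_blocks(parity, d0), new_d0)  # 2 extra reads + 1 extra write
    assert new_parity == xor_blocks(new_d0, d1)
    print("in-place update of one block forces a parity read-modify-write")

A log-structured scheme sidesteps this read-modify-write by writing whole new stripes and computing parity once per stripe, which is the trick attributed to Vstorage above.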

The simple layout of the storage components does not mean that they do not consume hardware resources, but if you calculate all the costs in advance, you can count on them working side by side with the hypervisor.
There is a chart comparing the hardware resource consumption of the Ceph and Virtuozzo Storage services.

[Image: comparison of hardware resource consumption by Ceph and Virtuozzo Storage services]

While Gluster and Ceph could be compared using existing articles, picking out the most important points from them, with Virtuozzo it is more difficult. There are not many articles on this product, and information can only be gleaned from the documentation, in English or in Russian, if we consider Vstorage as the storage used in hyperconverged solutions of companies such as Rosplatforma and Acronis.

So I will try to help by describing this architecture myself, which means a little more text, but understanding the documentation on your own takes a lot of time, and the existing documentation is really usable only as a reference, by scanning the table of contents or searching by keyword.

Consider the write process in a hybrid hardware configuration with the components described above. The write starts on the node from which the client initiated it (the FUSE mount-point service), but the master metadata service (MDS) component will, of course, point the client directly to the required chunk service (CS, the block storage service); in other words, the MDS does not take part in the write itself, it simply directs it to the right chunk service. By analogy, it is like water being poured into barrels, where each barrel is a 256 MB block of data.
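Conceptually (this is not the real Vstorage API; the chunk map and node names are invented for illustration), the write path looks something like this: ask the MDS where the 256 MB chunk lives, then push the data straight to the chunk servers.

    CHUNK_SIZE = 256 * 1024 * 1024  # each "barrel" is a 256 MB chunk

    # Hypothetical chunk map that the MDS would keep: chunk index -> chunk servers (CS).
    chunk_map = {
        0: ["cs@node1", "cs@node2", "cs@node3"],
        1: ["cs@node2", "cs@node4", "cs@node5"],
    }

    def write(offset: int, data: bytes) -> None:
        chunk_index = offset // CHUNK_SIZE
        replicas = chunk_map[chunk_index]   # the only thing asked of the MDS: "where?"
        for cs in replicas:                 # the payload goes straight to the chunk servers
            print(f"send {len(data)} bytes of chunk {chunk_index} to {cs}")

    write(offset=300 * 1024 * 1024, data=b"x" * 4096)  # falls into chunk 1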

[Image: data written as 256 MB chunks ("barrels") distributed across cluster nodes]

So one disk holds a certain number of such barrels, that is, the disk capacity divided by 256 MB. Each copy goes to one node, the second copy almost in parallel to another node, and so on. If we have three replicas and SSDs for caching (read cache and write journals), then the write is acknowledged once the journal entry is written to the SSD, while the flush from SSD to HDD continues in parallel, effectively in the background. With three replicas, the write commits after confirmation from the SSD of the third node. It may seem that you could sum the write speeds of three SSDs and divide by three to get the write speed of one replica, but the copies are written in parallel and network latency is usually higher than SSD latency, so in practice write performance depends on the network. For this reason, to see real IOPS you have to load the whole Vstorage correctly and methodically, that is, test a realistic workload rather than RAM and cache, taking into account the right data block size, number of threads and so on.
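Some back-of-the-envelope arithmetic for the points above (the disk size and latencies are assumptions, not measurements): how many 256 MB barrels one data disk is carved into, and why the commit latency is set by the slowest SSD-journal acknowledgement plus the network rather than by SSD speed alone.

    CHUNK_MB = 256

    # How many 256 MB "barrels" a single 4 TB data disk is carved into (illustrative size).
    disk_gb = 4000
    print("chunks per disk:", disk_gb * 1024 // CHUNK_MB)

    # Commit latency with 3 replicas: the client waits for the slowest SSD-journal ack,
    # so network round-trips, not raw SSD speed, set the floor (numbers are assumptions).
    ssd_journal_ms = 0.1
    network_rtt_ms = [0.0, 0.4, 0.5]  # local node, node 2, node 3
    commit_ms = max(rtt + ssd_journal_ms for rtt in network_rtt_ms)
    print(f"write acknowledged after ~{commit_ms:.1f} ms")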

The write journal on the SSD mentioned above works so that as soon as data lands in it, the service immediately reads it and writes it out to the HDD. There are several metadata services (MDS) per cluster, and their number is determined by a quorum that operates according to the Paxos algorithm. From the client's point of view, the FUSE mount point is a cluster storage folder that is visible to all cluster nodes simultaneously; each node has a client mounted this way, so every node has access to the storage.

For any of the approaches described above, performance depends heavily on configuring the network properly at the planning and deployment stage, with load balancing via link aggregation and adequate channel bandwidth; with aggregation it is important to choose the right hashing mode and frame sizes. There is also one very significant difference from the SDS described above: FUSE with fast path technology in Virtuozzo Storage, which, together with the upgraded FUSE itself and unlike other open-source solutions, significantly adds IOPS and does not limit you in horizontal or vertical scaling. Overall, compared to the architectures described above, this one looks more powerful, but for that pleasure you of course have to buy licenses, unlike Ceph and Gluster.

Summing up, we can rank the three: first place in performance and architectural reliability goes to Virtuozzo Storage, second to Ceph and third to Gluster.

The criteria by which Virtuozzo Storage comes out on top: an optimal set of architectural components, FUSE upgraded with the fast path approach, flexible hardware configurations, lower resource consumption and the ability to share nodes with compute (virtualization), that is, it is fully suited to the hyperconverged solution it ships in. Second place goes to Ceph, because its architecture is more performant than Gluster's thanks to operating with blocks, and because of more flexible scenarios and the ability to work in larger clusters.

I would like to write a comparison of vSAN, Storage Spaces Direct, Vstorage and Nutanix Storage, to test Vstorage on HPE and Huawei hardware, and to cover scenarios for integrating Vstorage with external hardware storage systems, so if you liked the article it would be nice to get feedback from you; that would boost the motivation for new articles that take your comments and wishes into account.

Source: habr.com
