This article will help you choose the right software-defined storage (SDS) solution for your needs and understand the differences between systems such as Gluster, Ceph, and Vstorage (Virtuozzo).
The text links to articles that cover certain problems in more depth, so the descriptions here are kept as brief as possible, focusing on key points without padding or introductory material that you can easily find on the Internet yourself.
Each of these topics, of course, deserves tons of text, but few people want to read that much these days. So this article lets you read quickly and make a choice, and if something is unclear, you can follow the links or look up the unfamiliar terms. Think of it as a transparent wrapper around these deep topics, showing the filling: the main key points of each solution.
Gluster
Let's start with Gluster, which is actively used by vendors of hyperconverged platforms with open-source SDS for virtual environments. It can be found on the Red Hat website in the storage section, where you can choose between two SDS options: Gluster or Ceph.
Gluster consists of a stack of translators, the services that do all the work of distributing files. A brick is a service that serves a single disk; a volume is a pool that combines these bricks. Files are then distributed into groups of bricks by the DHT (distributed hash table) function. We will leave the sharding service out of this description, since the links below cover the problems associated with it.
On write, a file is placed entirely into one brick, and its copy is written in parallel to a brick on a second server. The next file is then written to a second group of two (or more) bricks on different servers.
If the files are roughly the same size and the volume consists of only one group, everything is fine; under other conditions, the following problems arise:
- space in the groups is utilized unevenly, depending on file sizes; if a group lacks space for a file, the write fails with an error and the file is not redistributed to another group;
- when writing a single file, IO goes to only one group while the rest sit idle;
- you cannot get the IO of the entire volume when writing a single file;
- the overall design looks less efficient because data is not split into blocks, which would make balancing and even distribution easier; instead, a whole file always lands in a single brick.
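The whole-file placement problem above can be sketched in a few lines. This is a toy model, not Gluster's actual hash function: the point is only that a file hashes to exactly one replica group, so a large file can overflow its group and fail even while other groups have plenty of free space. All names and capacity numbers here are hypothetical.

```python
import hashlib

GROUP_CAPACITY = 100  # capacity units per replica group (hypothetical)

def pick_group(filename: str, n_groups: int) -> int:
    """Map a file name to a single replica group, DHT-style."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return h % n_groups

def place(files: dict, n_groups: int):
    used = [0] * n_groups
    errors = []
    for name, size in files.items():
        g = pick_group(name, n_groups)
        if used[g] + size > GROUP_CAPACITY:
            errors.append(name)  # no rebalancing to another group: write fails
        else:
            used[g] += size
    return used, errors

# One big VM image can collide with another in the same group,
# even though the other group may be nearly empty.
files = {"vm-disk.img": 90, "log-a": 5, "log-b": 5, "vm2.img": 80}
used, errors = place(files, n_groups=2)
print("space used per group:", used, "failed writes:", errors)
```

Block-level placement would instead spread each file's blocks across all groups, which is exactly the balancing advantage the list above describes.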
From the official description
These conclusions are also supported by descriptions of real-world usage experience.
The picture shows the load distribution when writing two files: copies of the first file are spread over the first three servers, which are combined into volume group 0, and three copies of the second file are placed on the second group (volume 1) of three servers. Each server has one disk.
The general conclusion is that you can use Gluster, but with the understanding that there will be limitations in performance and fault tolerance, which create difficulties under certain conditions in a hyperconverged solution, where resources are also needed for the compute loads of the virtual environments.
There are also some Gluster performance figures that can be achieved under certain conditions, with certain limitations.
Now let's look at Ceph, based on the architecture descriptions that I managed to find.
Architecture
According to the architecture description, the heart of Ceph is CRUSH, the algorithm that chooses where to place data. Next comes the PG (placement group), the hardest abstraction to understand. PGs exist to make CRUSH more effective: their main purpose is to group objects in order to reduce resource consumption and improve performance and scalability. Addressing objects directly, one by one, without combining them into PGs, would be very costly. OSD is a service for each individual disk.
A cluster can have one or many data pools, for different purposes and with different settings. Pools are divided into placement groups, and placement groups store the objects that clients access. Here the logical level ends and the physical one begins: each placement group has one primary disk and several replica disks (how many depends on the pool's replication factor). In other words, at the logical level an object is stored in a specific placement group, and at the physical level on the disks assigned to it. Those disks can be physically located on different nodes or even in different data centers.
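The two-level addressing just described can be sketched as follows. This is a simplified stand-in, not the real CRUSH algorithm: the object name hashes to a PG, and the PG is then deterministically mapped to an ordered set of OSDs, the first of which acts as the primary. All numbers and the object name are hypothetical.

```python
import hashlib

PG_NUM = 8              # placement groups in the pool (hypothetical)
OSDS = list(range(6))   # six disks in the cluster
REPLICAS = 3            # pool replication factor

def object_to_pg(name: str) -> int:
    """Logical level: object name -> placement group."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % PG_NUM

def pg_to_osds(pg: int) -> list:
    """Physical level: stand-in for CRUSH, picks REPLICAS distinct OSDs;
    the first OSD in the list is the primary."""
    start = pg % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

pg = object_to_pg("rbd_data.1234")
print("pg:", pg, "-> osds:", pg_to_osds(pg))
```

The benefit of the PG layer is visible here: the cluster only has to track PG_NUM placements instead of one placement per object.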
In this scheme, placement groups look like a level necessary for the flexibility of the whole solution, but at the same time they look like an extra link in the chain, which suggests a loss of performance: when writing data, the system must split it into these groups and then, at the physical level, send it to the primary disk and the replica disks. The hash function works for locating and inserting an object, but there is a side effect: very high costs and restrictions on rebuilding the hash (when a disk is added or removed). Another hash problem is that data placement is rigidly fixed and cannot be changed. If a disk comes under increased load, the system has no way to avoid writing to it by choosing another disk; the hash function obliges the data to be laid out according to its rule, no matter how overloaded the disk is. This is why Ceph consumes a lot of memory when rebuilding PGs during self-healing or storage expansion. The conclusion is that Ceph works well (albeit slowly), but only while there is no scaling, failure, or upgrade in progress.
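The cost of "rebuilding the hash" is easy to demonstrate with the most naive placement scheme. This sketch uses plain modulo hashing (cruder than CRUSH, which is specifically designed to reduce this movement, but the same pressure exists there in weaker form): adding a single disk remaps the vast majority of objects, and that mass movement is the kind of rebalancing work that eats memory and IO.

```python
import hashlib

def place(name: str, n_disks: int) -> int:
    """Naive hash placement: object name -> disk index."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return h % n_disks

objects = [f"obj-{i}" for i in range(10_000)]
before = {o: place(o, 10) for o in objects}   # cluster of 10 disks
after = {o: place(o, 11) for o in objects}    # one disk added

moved = sum(1 for o in objects if before[o] != after[o])
print(f"{moved / len(objects):.0%} of objects moved")  # roughly 90%
```

With 10 → 11 disks, only objects whose hash happens to agree modulo both sizes stay put, so around nine out of ten objects must be physically relocated.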
Of course, there are options for improving performance through caching and cache tiering, but you need good hardware and there will still be losses. Overall, though, Ceph looks more tempting than Gluster in terms of performance. When using either of these products, you must also take an important factor into account: a high level of competence, experience, and professionalism is required, with a strong emphasis on Linux, since it is very important to deploy, configure, and maintain everything correctly, which places even more responsibility and burden on the administrator.
Vstorage
Even more interesting is the architecture of Virtuozzo Storage (Vstorage).
Vstorage can coexist with the services of the KVM/QEMU hypervisor, and it consists of just a few services arranged in a compact, well-chosen hierarchy of components: the client service, mounted via FUSE (modified, not open source); the MDS metadata service; and the chunk service (CS), which stores data blocks and at the physical level corresponds to one disk. That's it. In terms of speed, the optimal fault-tolerant scheme is of course two replicas, but if you use caching and journals on SSD disks, then erasure coding (EC, comparable to RAID 6) can be decently accelerated on a hybrid scheme, or even better on all-flash. EC has a drawback: when one data block changes, the parity sums must be recalculated. To avoid the cost of this operation, Ceph writes to EC pools with a delay, and performance problems can occur on certain requests, for example when all blocks need to be read; Virtuozzo Storage instead writes modified blocks using the log-structured file system approach, which minimizes the cost of parity calculations. There are figures for roughly estimating the options with and without EC acceleration.
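The EC update penalty mentioned above can be shown with the simplest erasure code, a single XOR parity block (RAID-5-like; production EC uses Reed-Solomon, but the update cost has the same shape). Changing one data block forces a read-modify-write of the parity: read the old block and old parity, recompute, write both back. This is exactly the overhead that a log-structured write path sidesteps by writing whole new stripes.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    """Full-stripe parity: XOR of all data blocks."""
    p = blocks[0]
    for b in blocks[1:]:
        p = xor_blocks(p, b)
    return p

data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# In-place update of one block: parity delta = old parity ^ old ^ new.
# Cost: two reads (old block, old parity) + two writes (new block, new parity).
new_block = b"DDDD"
p = xor_blocks(xor_blocks(p, data[1]), new_block)
data[1] = new_block
assert p == parity(data)  # delta-updated parity matches a full recompute
print("parity updated consistently")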
The simple layout of storage components does not mean that these components consume no hardware resources. There is a scheme comparing the consumption of hardware resources by Ceph and Virtuozzo Storage services.
While Gluster and Ceph could be compared using older articles, pulling the most important points from them, with Virtuozzo it is more difficult: there are not many articles on this product, and information can only be gleaned from the documentation.
I will try to help with a description of this architecture, so there will be a bit more text here; understanding the documentation on your own takes a lot of time, and the existing documentation is best used as a reference, by skimming the table of contents or searching by keyword.
Consider the write process in a hybrid hardware configuration with the components described above. The write starts on the node where the client (the FUSE mount point service) initiated it, but the Metadata Service (MDS) master directs the client straight to the desired chunk service (the CS block storage service); that is, MDS does not participate in the write itself, it simply points the client to the right chunk service. In general, you can draw an analogy with pouring water into barrels, where each barrel is a 256 MB block of data.
In other words, one disk holds a certain number of such barrels: the disk's capacity divided by 256 MB. One copy goes to one node, the second is written almost in parallel to another node, and so on. If we have three replicas and SSD disks for caching (read cache and write journals), then the write acknowledgement comes after the journal is written to the SSD, while the flush from SSD to HDD continues in the background; with three replicas, the write commit comes after confirmation from the SSD of the third node. It may seem that dividing the sum of the write speeds of three SSDs by three would give the write speed of one replica, but the copies are written in parallel, and network latency is usually higher than the SSD's, so in practice write performance depends on the network. For this reason, to see real IOPS, you need to load the entire Vstorage correctly.
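The "barrels" description above amounts to chunking plus per-chunk replica placement, which can be modeled in a few lines. The 256 MB chunk size and three replicas come from the text; the round-robin node selection is purely illustrative, not the real CS placement logic, and the node names are hypothetical.

```python
CHUNK = 256 * 1024 * 1024   # 256 MB "barrel", per the text
REPLICAS = 3
NODES = ["node1", "node2", "node3", "node4"]

def chunks_for(file_size: int) -> int:
    """Number of 256 MB chunks needed for a file (ceiling division)."""
    return -(-file_size // CHUNK)

def placement(file_size: int):
    """Illustrative plan: each chunk gets REPLICAS copies on distinct nodes."""
    plan = []
    for c in range(chunks_for(file_size)):
        nodes = [NODES[(c + r) % len(NODES)] for r in range(REPLICAS)]
        plan.append(nodes)
    return plan

print(chunks_for(1 * 1024**3))    # a 1 GiB file -> 4 chunks
print(placement(1 * 1024**3)[0])  # the 3 replica nodes for chunk 0
```

Because each chunk's replicas land on different nodes, the write commit for a chunk waits on the slowest of its three target nodes, which is why network latency dominates, as noted above.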
The write journal on SSD mentioned above works so that as soon as data lands in it, the service immediately reads it and writes it to HDD. There are several metadata services (MDS) per cluster, and their number is determined by a quorum that works according to the Paxos algorithm. From the client's point of view, the FUSE mount point is a cluster storage folder simultaneously visible to all cluster nodes; each node has a client mounted this way, so every node has access to the storage.
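The MDS count matters because Paxos-style quorums need a strict majority to make progress: with N metadata services, the cluster keeps working as long as a majority is alive. A quick illustration of the arithmetic (greatly simplified; Paxos itself is far more involved):

```python
def quorum(n: int) -> int:
    """Smallest strict majority of n members."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many members can be down while a quorum still exists."""
    return n - quorum(n)

for n in (1, 3, 5):
    print(f"{n} MDS: quorum={quorum(n)}, tolerates {tolerated_failures(n)} down")
```

This is why odd MDS counts are the natural choice: going from 3 to 4 members raises the quorum without tolerating any additional failures.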
For any of the approaches described above, it is very important to configure the network properly at the planning and deployment stage: balancing through link aggregation and the right bandwidth of the network channel. In aggregation, it is important to choose the right hashing mode and frame sizes. There is also a very strong difference from the SDS described above: FUSE with fast-path technology in Virtuozzo Storage. This modernized FUSE, unlike that of the other open-source solutions, significantly increases IOPS and avoids limits on horizontal or vertical scaling. Overall, compared to the architectures described above, this one looks more powerful, but for such pleasure you of course need to buy licenses, unlike with Ceph and Gluster.
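Why the aggregation hashing mode matters can be sketched as follows. A layer3+4-style transmit hash policy spreads flows across the aggregated links by hashing the IP and port tuple, whereas a layer2-only policy can pin all traffic between two hosts to a single link. This is a simplified stand-in for the kernel's bonding hash, not its exact formula, and the addresses and ports are hypothetical.

```python
import hashlib

def pick_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              n_links: int) -> int:
    """Layer3+4-style flow hashing: one flow always maps to one link,
    but different flows between the same hosts can use different links."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{dst_port}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % n_links

# Many flows between the same two storage nodes, varying only the
# source port, spread over a 2-link aggregate:
links = {pick_link("10.0.0.1", "10.0.0.2", p, 3260, 2)
         for p in range(1024, 1100)}
print("links in use:", sorted(links))
```

A layer2 policy would hash only the MAC pair, collapsing all of these flows onto one link, which is exactly the kind of imbalance that starves a replicated write path.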
Summing up, the top three looks like this: first place in performance and reliability of architecture goes to Virtuozzo Storage, second to Ceph, and third to Gluster.
The criteria by which Virtuozzo Storage comes out on top: an optimal set of architecture components; FUSE modernized with the fast-path approach; a flexible range of hardware configurations; lower resource consumption; and the ability to share nodes with compute (virtualization), which makes it fully suited to the hyperconverged solution it ships in. Second place goes to Ceph, because its architecture is more productive than Gluster's thanks to operating on blocks, and because of its more flexible scenarios and ability to work in larger clusters.
I would like to write a comparison of vSAN, Storage Spaces Direct, Vstorage, and Nutanix Storage, to test Vstorage on HPE and Huawei hardware, and to cover scenarios for integrating Vstorage with external hardware storage systems. So if you liked the article, feedback would be welcome; it would boost the motivation for new articles that take your comments and wishes into account.
Source: habr.com