He doesn't tell you

With Rook growing in popularity, I want to talk about the pitfalls and problems that await you along the way.

About me: I have been administering ceph since the hammer release and founded the t.me/ceph_ru community on Telegram.

So as not to make unfounded claims, I will refer to posts about ceph problems that were well received on Habr (judging by their ratings). I have run into most of the problems described in those posts myself. Links to the material used are at the end of the post.

Ceph comes up in a post about Rook for a reason: Rook is essentially ceph wrapped in kubernetes, which means it inherits all of ceph's problems. Let's start with those.

Simplified cluster management

One of Rook's advertised advantages is the convenience of managing ceph through kubernetes.

However, ceph has more than 1000 configuration parameters, and only a small part of them can be edited through Rook.

Example on Luminous
> ceph daemon mon.a config show | wc -l
1401

Rook is positioned as a convenient way to install and update ceph
Installing ceph without Rook is not a problem: an ansible playbook can be written in 30 minutes. Updating is where most of the problems are.

Quote from Krok

Example: crush tunables do not work correctly after upgrading from hammer to jewel

> ceph osd crush show-tunables
{
...
"straw_calc_version": 1,
"allowed_bucket_algs": 22,
"profile": "unknown",
"optimal_tunables": 0,
...
}

But there are problems even between minor versions.

Example: the 12.2.6 update put clusters into a HEALTH_ERR state with nominally broken PGs
ceph.com/releases/v12-2-8-released

So don't update, but wait and test instead? Yet convenient updates are one of the reasons we use Rook in the first place.

Complexity of cluster disaster recovery in Rook

Example: an OSD keeps crashing with a stream of errors. You suspect that one of the config parameters is to blame and want to change the config for that specific daemon, but you can't, because you have kubernetes and a DaemonSet.

There is no alternative. ceph tell osd.Num injectargs does not work because the OSD is down.

Debugging complexity

Some tuning and performance tests require connecting directly to the OSD daemon's socket. With Rook, you first have to find the right container, then get into it, discover that the tooling you need for debugging is missing, and get very upset.

Difficulty of bringing up OSDs sequentially

Example: an OSD is killed by OOM, rebalancing starts, and then the next OSDs go down.

Solution: bring up the OSDs one at a time, wait until each is fully back in the cluster, and only then bring up the next one. (More details in the talk Ceph. Anatomy of a disaster.)

On a baremetal installation this is simply done by hand. With Rook and one OSD per node there are no particular problems either; the trouble with sequential bring-up starts when there is more than one OSD per node.

Of course, these problems are solvable, but we adopt Rook to simplify things and end up with more complexity.
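The one-at-a-time procedure from the solution above can be sketched roughly as follows. This is only an illustration in Python, not something Rook or ceph ships: the function name raise_osds_one_by_one and both callbacks are hypothetical. How an OSD is started and what counts as "fully included in the cluster" are deliberately left to the caller, since that differs between baremetal and Rook.

```python
import time
from typing import Callable, Iterable, List


def raise_osds_one_by_one(
    osd_ids: Iterable[int],
    start_osd: Callable[[int], None],
    osd_is_ready: Callable[[int], bool],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Start OSDs strictly one after another.

    start_osd and osd_is_ready are hypothetical callbacks supplied by the
    caller; this sketch only enforces the ordering and a per-OSD timeout.
    Returns the ids that were started and became ready.
    """
    started: List[int] = []
    for osd_id in osd_ids:
        start_osd(osd_id)
        waited = 0.0
        # Do not touch the next OSD until this one reports ready.
        while not osd_is_ready(osd_id):
            if waited >= timeout_seconds:
                raise TimeoutError(
                    f"osd.{osd_id} not ready after {timeout_seconds}s; "
                    f"already started: {started}"
                )
            sleep(poll_seconds)
            waited += poll_seconds
        started.append(osd_id)
    return started
```

On baremetal, start_osd could wrap something like systemctl start ceph-osd@<id>, and osd_is_ready could check that the OSD is up and that recovery has settled. Under Rook, both callbacks depend on how your operator and pods are set up, which is exactly where the extra complexity comes from.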

Difficulty of choosing limits for ceph daemons

For a baremetal installation of ceph, it is fairly easy to calculate the resources a cluster needs: there are formulas and there are studies. With weaker CPUs you will still have to run a number of performance tests and find out what NUMA is, but it is still easier than with Rook.

With Rook, in addition to memory limits, which can be calculated, you also have to decide on a CPU limit.

And here you will have to sweat over performance tests. Set the limits too low and you get a slow cluster; set no limit and you get heavy CPU usage during rebalancing, which will hurt your applications in kubernetes.

Networking Issues v1

For ceph, a 2x10 Gbit network is recommended: one link for client traffic, the other for ceph's internal needs (rebalancing). With ceph on baremetal this separation is easy to configure. With Rook, network separation will cause you problems, because not every cluster configuration lets you attach two different networks to a pod.

Networking Issues v2

If you give up on separating the networks, then during rebalancing ceph traffic will saturate the entire link, and your applications in kubernetes will slow down or crash. You can lower the ceph rebalance speed, but a longer rebalance raises the risk that a second node drops out of the cluster because of disk failures or OOM, and at that point the cluster is guaranteed to go read-only.
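To get a feel for this trade-off, here is a back-of-the-envelope sketch. It only divides the amount of data by the link speed, so it gives a lower bound: disks, protocol overhead and ceph's own throttling are ignored. The function name transfer_hours and the 4 TB figure are my own illustrative assumptions, not measurements.

```python
def transfer_hours(data_bytes: float, link_gbit_per_s: float) -> float:
    """Lower bound, in hours, on the time to move data_bytes over a link."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    bits = data_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    lost_node_bytes = 4e12  # hypothetical: 4 TB has to be re-replicated
    for gbit in (10, 1):  # full 10 Gbit/s link vs. rebalance throttled to 1 Gbit/s
        hours = transfer_hours(lost_node_bytes, gbit)
        print(f"{gbit:>2} Gbit/s: at least {hours:.1f} h")
```

With these assumed numbers the window during which a second failure is dangerous grows from under an hour to almost nine hours.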

Long rebalancing means long application slowdowns

Quote from Ceph. Anatomy of a disaster.

Test cluster performance:

A 4 KB write operation takes 1 ms, the performance is 1000 operations / second in 1 thread.

A 4 MB operation (the size of an object) takes 22 ms, with a performance of 45 operations/second.

Therefore, when one of three failure domains fails, the cluster stays degraded for some time, and half of the hot objects end up at different versions across replicas, so half of the write operations will start with a forced recovery.

Let's roughly estimate the forced recovery time, that is, the cost of a write operation to a degraded object.

First we read 4 MB in 22 ms, then write it in 22 ms, and then write the 4 KB of actual data in 1 ms. That is 45 ms in total per write to a degraded object on an SSD, against a normal 1 ms: a 45-fold drop in performance.

The higher the percentage of degraded objects, the worse everything becomes.

It turns out that rebalancing speed is critical for the correct operation of the cluster.
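The arithmetic from the quote is easy to reproduce. In the sketch below, degraded_write_ms repeats the calculation from the talk, while mean_write_ms is my own simplifying assumption (a plain weighted average for a single writer thread) and does not come from the talk. Both function names are made up for illustration.

```python
def degraded_write_ms(object_read_ms: float, object_write_ms: float,
                      small_write_ms: float) -> float:
    """Cost of one write to a degraded object: read the whole object,
    write it back, then write the actual small block."""
    return object_read_ms + object_write_ms + small_write_ms


def mean_write_ms(degraded_fraction: float, normal_ms: float,
                  degraded_ms: float) -> float:
    """Assumed model: average latency as a weighted mix of normal and
    degraded writes, for a single thread."""
    if not 0.0 <= degraded_fraction <= 1.0:
        raise ValueError("degraded_fraction must be between 0 and 1")
    return (1.0 - degraded_fraction) * normal_ms + degraded_fraction * degraded_ms


if __name__ == "__main__":
    slow = degraded_write_ms(22, 22, 1)  # figures from the quoted test cluster
    print(f"degraded write: {slow} ms, slowdown x{slow / 1:.0f}")
    print(f"half of hot objects degraded: {mean_write_ms(0.5, 1, slow)} ms on average")
```

Under this simple model, with half of the hot objects degraded, the average write takes 23 ms instead of 1 ms.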

Server-specific settings for ceph

ceph sometimes needs specific host tuning.

Example: sysctl settings and jumbo frames. Some of these settings can negatively affect your other workloads.

The real need for Rook remains in question

If you are in the cloud, you already have storage from your cloud provider, which is much more convenient.

If you run your own servers, managing ceph is more convenient without kubernetes.

Are you renting servers from some low-cost hosting provider? Then you will have a lot of fun with the network, its latency and throughput, which clearly hurts ceph.

Bottom line: implementing kubernetes and implementing storage are different tasks with different inputs and different solutions. Mixing them means making a potentially dangerous trade-off in favor of one or the other. Combining these solutions is very difficult even at the design stage, and the operation phase still lies ahead.

Materials used:

Post #1 But you say Ceph ... is it really that good?
Post #2 Ceph. Anatomy of a disaster

Source: habr.com
