Our hands are not for boredom: restoring a Rook cluster in K8s

We have already explained how and why we like Rook: it makes working with storage in Kubernetes clusters much easier. However, this simplicity comes with some complexity of its own. We hope this article helps you understand those complexities before they show up in your cluster.

To make it more interesting to read, let's start with the consequences of a hypothetical problem in the cluster.

"Everything is lost!"

Imagine that you once set up and launched Rook in your K8s cluster, and it has worked well ever since, but at one “fine” moment the following happens:

New pods cannot mount RBD images from Ceph.
Commands like lsblk and df do not work on Kubernetes nodes. This automatically means that something is wrong with the RBD images mounted on the nodes. They cannot be read, which indicates that the monitors are unavailable...
Yes, there are no working monitors in the cluster. Moreover, there are not even OSD pods or MGR pods.

When was the rook-ceph-operator pod started? More recently than it was deployed. Why? The Rook operator decided to create a new cluster... How can we now restore the old cluster and the data in it?

To begin with, let's take the longer, more interesting path: a thoughtful investigation into the “insides” of Rook and a step-by-step restoration of its components. Of course, there is also a shorter, proper way: using backups. As you know, admins are divided into two types: those who do not make backups yet, and those who already do... But more on that after the investigation.

A little practice, or the long way

Let's look around and restore the monitors

So, let's look at the list of ConfigMaps: it contains rook-ceph-config and rook-config-override, which are needed for redundancy. They appear when the cluster is deployed successfully.

NB: In newer versions, after this PR was merged, ConfigMaps are no longer an indicator of a successful cluster deployment.

Before going further, we need a hard reboot of all servers that have mounted RBD images (ls /dev/rbd*). It must be done through sysrq (or "on foot" in the data center). The reason is that the mounted RBDs have to be released, and a regular reboot will not do: it will try, unsuccessfully, to unmount them cleanly.

The theater begins with the coat rack, and a Ceph cluster begins with its monitors. Let's look at them.

Rook mounts the following entities in the monitor pod:

Volumes:
 rook-ceph-config:
   Type:      ConfigMap (a volume populated by a ConfigMap)
   Name:      rook-ceph-config
 rook-ceph-mons-keyring:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  rook-ceph-mons-keyring
 rook-ceph-log:
   Type:          HostPath (bare host directory volume)
   Path:          /var/lib/rook/kube-rook/log
 ceph-daemon-data:
   Type:          HostPath (bare host directory volume)
   Path:          /var/lib/rook/mon-a/data
Mounts:
  /etc/ceph from rook-ceph-config (ro)
  /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
  /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
  /var/log/ceph from rook-ceph-log (rw)

Let's see what's in the rook-ceph-mons-keyring secret:

kind: Secret
data:
 keyring: LongBase64EncodedString=

We decode it and get a regular keyring with capabilities for the admin and the monitors:

[mon.]
       key = AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
       caps mon = "allow *"
[client.admin]
       key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"
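
The decoding step can be sketched in Python. This is only an illustration: the function name decode_keyring is ours (not part of Rook or Ceph), and it assumes the value of the keyring field has already been copied out of the Secret as a string. It relies on the keyring being INI-like, as in the output above.

```python
import base64
import configparser


def decode_keyring(b64_value: str) -> dict:
    """Decode a base64-encoded Secret value and parse the keyring inside it.

    Returns {entity: {option: value}}, e.g. {"mon.": {"key": "...", ...}}.
    """
    text = base64.b64decode(b64_value).decode("utf-8")
    # interpolation=None so that a '%' in a value is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    # Keyring options are indented; strip each line so that configparser
    # never mistakes an option for a continuation of the previous one.
    parser.read_string("\n".join(line.strip() for line in text.splitlines()))
    return {section: dict(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    # Placeholder keyring, NOT a real key.
    sample = '[mon.]\n\tkey = c2FtcGxlLW1vbi1rZXk=\n\tcaps mon = "allow *"\n'
    encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")
    for entity, options in decode_keyring(encoded).items():
        print(entity, options["key"])
```

Note that the capability values keep their surrounding quotes, exactly as they appear in the file.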

Let's remember that. Now let's look at the keyring in the rook-ceph-admin-keyring secret:

kind: Secret
data:
 keyring: anotherBase64EncodedString=

What's in it?

[client.admin]
       key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"

The same. Let's look further... Here, for example, is the rook-ceph-mgr-a-keyring secret:

[mgr.a]
       key = AQBZR19dbVeaIhBBXFYyxGyusGf8x1bNQunuew==
       caps mon = "allow *"
       caps mds = "allow *"
       caps osd = "allow *"

In the end, we find a few more secrets in the rook-ceph-mon Secret:

kind: Secret
data:
 admin-secret: AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
 cluster-name: a3ViZS1yb29r
 fsid: ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==
 mon-secret: AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==

And this is the original list of keys from which all the secrets described above are derived.

As is known (see dataDirHostPath in the documentation), Rook stores this data in two places. So let's go to the nodes and look at the keyrings in the directories that are mounted into the monitor and OSD pods. To do this, we find /var/lib/rook/mon-a/data/keyring on the nodes and see:
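
The cluster-name and fsid values above are plain base64, so they are easy to inspect. A small sketch follows; the helper name decode_secret_fields is ours. The two key fields are left out on purpose, because they are shown above in already decoded form.

```python
import base64


def decode_secret_fields(data: dict) -> dict:
    """Base64-decode every value of a Secret's data map.

    Trailing whitespace is stripped, since a value may have been stored
    with a final newline.
    """
    return {
        name: base64.b64decode(value).decode("utf-8").strip()
        for name, value in data.items()
    }


fields = {
    "cluster-name": "a3ViZS1yb29r",
    "fsid": "ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==",
}
decoded = decode_secret_fields(fields)
print(decoded["cluster-name"])  # kube-rook
print(decoded["fsid"])          # ffbb9bd3-84d9-4d95-7352-af836a8bd42a
```

The decoded cluster name matches the kube-rook directory seen in the HostPath volumes of the monitor pod.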

# cat /var/lib/rook/mon-a/data/keyring
[mon.]
       key = AXAbS19d8NNUXOBB+XyYwXqXI1asIzGcGlzMGg==
       caps mon = "allow *"

Surprise: the key here is different from the one in the Secret.

What about the admin keyring? It is there as well:

# cat /var/lib/rook/kube-rook/client.admin.keyring
[client.admin]
       key = AXAbR19d8GGSMUBN+FyYwEqGI1aZizGcJlHMLgx= 
       caps mds = "allow *"
       caps mon = "allow *"
       caps osd = "allow *"
       caps mgr = "allow *"

Here is the problem. Some failure occurred: the cluster was recreated... except that in reality it was not.

It becomes clear that the secrets hold newly generated keyrings, and they are not from our old cluster. Therefore, we:

take the monitor keyring from the file /var/lib/rook/mon-a/data/keyring (or from a backup);
replace the keyring in the rook-ceph-mons-keyring secret;
write the admin and monitor keys into the rook-ceph-mon Secret;
delete the controllers of the monitor pods.
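
This kind of comparison is easy to automate. Below is a rough sketch that pulls the key of every entity out of two keyring texts and reports the entities whose keys differ. The function names are ours, and the two inputs are assumed to have been collected by hand: one decoded from the Secret, one read from the node.

```python
import re


def extract_keys(keyring_text: str) -> dict:
    """Return {entity: key} for every [entity] section of a keyring."""
    keys = {}
    entity = None
    for raw in keyring_text.splitlines():
        line = raw.strip()
        header = re.fullmatch(r"\[(.+)\]", line)
        if header:
            entity = header.group(1)
            continue
        # Split on the first '=' only: the key itself may end with '='.
        name, sep, value = line.partition("=")
        if entity and sep and name.strip() == "key":
            keys[entity] = value.strip()
    return keys


def mismatched_entities(secret_keyring: str, disk_keyring: str) -> list:
    """List entities present in both keyrings whose keys differ."""
    in_secret = extract_keys(secret_keyring)
    on_disk = extract_keys(disk_keyring)
    common = in_secret.keys() & on_disk.keys()
    return sorted(e for e in common if in_secret[e] != on_disk[e])
```

Called with the monitor keyring from the Secret and the one from /var/lib/rook/mon-a/data/keyring shown above, mismatched_entities would return ["mon."].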
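
For the second step, the old keyring has to be base64-encoded again before it goes back into the Secret. A minimal sketch, assuming the Secret keeps the keyring under the keyring field as shown earlier; the function name is ours, and applying the result (for example as a merge patch with kubectl patch) is left to the reader.

```python
import base64
import json


def build_keyring_patch(keyring_text: str) -> str:
    """Build a JSON merge patch that replaces the `keyring` field of a Secret.

    `keyring_text` is the plain keyring recovered from the node or a backup.
    """
    encoded = base64.b64encode(keyring_text.encode("utf-8")).decode("ascii")
    return json.dumps({"data": {"keyring": encoded}})


if __name__ == "__main__":
    # Placeholder content, NOT a real key.
    print(build_keyring_patch('[mon.]\n\tkey = c2FtcGxlLWtleQ==\n\tcaps mon = "allow *"\n'))
```

Keep in mind that rook-ceph-mons-keyring held both the [mon.] and [client.admin] sections, so the text passed in should follow the same layout.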

The miracle is not long in coming: the monitors appear and start up. Hooray, that's a start!

Restoring the OSDs

We go into the rook-operator pod: running ceph mon dump shows that all monitors are in place, and ceph -s shows that they are in quorum. However, if we look at the OSD tree (ceph osd tree), we see something strange: OSDs have started to appear, but they are empty. It turns out that they also need to be restored somehow. But how?

Meanwhile, the rook-ceph-config and rook-config-override ConfigMaps we need so much have appeared, along with many other ConfigMaps named like rook-ceph-osd-$nodename-config. Let's look at them:

kind: ConfigMap
data:
 osd-dirs: '{"/mnt/osd1":16,"/mnt/osd2":18}'

Everything is wrong, everything is mixed up!

Let's scale the operator down to zero, delete the generated OSD Deployments and fix these ConfigMaps. But where do we get the correct mapping of OSDs to nodes?

Let's dig into the /mnt/osd[1-2] directories on the nodes again, hoping to find a clue there.
The /mnt/osd1 directory contains 2 subdirectories: osd0 and osd16. Is the latter exactly the ID that is specified in the ConfigMap (16)?
Let's check the sizes and see that osd0 is much larger than osd16.

We conclude that osd0 is the OSD we need, the one that was specified as /mnt/osd1 in the ConfigMap (because we use directory-based OSDs).
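
The same size check can be scripted per node. The sketch below picks, for each data directory, the osdN subdirectory that holds the most data and prints a value in the osd-dirs format shown above. The function names are ours, and "largest means oldest" is only a heuristic, so the result should be verified before it goes into a ConfigMap.

```python
import json
import os
import re


def dir_size(path: str) -> int:
    """Total size in bytes of the regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def guess_osd_id(data_dir: str):
    """Return the ID of the largest osdN subdirectory, or None if there is none."""
    sizes = {}
    for entry in os.listdir(data_dir):
        match = re.fullmatch(r"osd(\d+)", entry)
        full = os.path.join(data_dir, entry)
        if match and os.path.isdir(full):
            sizes[int(match.group(1))] = dir_size(full)
    return max(sizes, key=sizes.get) if sizes else None


def build_osd_dirs(data_dirs: list) -> str:
    """Build the JSON string for the osd-dirs key from the guessed IDs."""
    mapping = {}
    for data_dir in data_dirs:
        osd_id = guess_osd_id(data_dir)
        if osd_id is not None:
            mapping[data_dir] = osd_id
    return json.dumps(mapping, separators=(",", ":"))


if __name__ == "__main__":
    # Directories taken from the ConfigMap above; skip those missing on this host.
    print(build_osd_dirs([d for d in ("/mnt/osd1", "/mnt/osd2") if os.path.isdir(d)]))
```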

Step by step, we check all nodes and edit the ConfigMaps. After all that, we can start the Rook operator pod and read its logs. And they look wonderful:

I am the cluster operator;
I found disks on nodes;
I found monitors;
monitors became friends, i.e. formed a quorum;
running OSD deployments...

Let's go back to the Rook pod and check whether the cluster is alive... yes, we were slightly wrong in our conclusions about the OSD names on some nodes! No big deal: we corrected the ConfigMaps again, removed the unnecessary directories left by the new OSDs and arrived at the long-awaited HEALTH_OK state!

Let's check the images in the pool:

# rbd ls -p kube
pvc-9cfa2a98-b878-437e-8d57-acb26c7118fb
pvc-9fcc4308-0343-434c-a65f-9fd181ab103e
pvc-a6466fea-bded-4ac7-8935-7c347cff0d43
pvc-b284d098-f0fc-420c-8ef1-7d60e330af67
pvc-b6d02124-143d-4ce3-810f-3326cfa180ae
pvc-c0800871-0749-40ab-8545-b900b83eeee9
pvc-c274dbe9-1566-4a33-bada-aabeb4c76c32
…

Everything is in place - the cluster is saved!

I'm lazy and make backups, or the fast way

If backups for Rook were made, then the recovery procedure becomes much simpler and boils down to the following:

Scale the Rook operator deployment down to zero;
Delete all deployments except the Rook operator;
Restore all secrets and ConfigMaps from the backup;
Restore the contents of the /var/lib/rook/mon-* directories on the nodes;
Restore (if they were lost) the CephCluster, CephFilesystem, CephBlockPool, CephNFS and CephObjectStore CRDs;
Scale the Rook operator deployment back to 1.
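
The order of these steps can be written down as a small plan generator. This is a sketch under several assumptions: the namespace, the list of deployments and the backup file names (secrets.yaml, configmaps.yaml, ceph-resources.yaml) are hypothetical and depend on how the backup was made. It only prints kubectl commands and runs nothing.

```python
import shlex

OPERATOR = "rook-ceph-operator"


def restore_plan(namespace: str, deployments: list, backup_dir: str) -> list:
    """Return the kubectl commands for the restore, in the order listed above."""
    kubectl = ["kubectl", "-n", namespace]
    plan = [kubectl + ["scale", "deployment", OPERATOR, "--replicas=0"]]
    # Delete every deployment except the operator itself.
    for name in deployments:
        if name != OPERATOR:
            plan.append(kubectl + ["delete", "deployment", name])
    # Hypothetical backup file names; adjust to the actual backup layout.
    plan.append(kubectl + ["apply", "-f", f"{backup_dir}/secrets.yaml"])
    plan.append(kubectl + ["apply", "-f", f"{backup_dir}/configmaps.yaml"])
    # The /var/lib/rook/mon-* directories are restored on the nodes themselves,
    # outside of kubectl, before the operator is started again.
    plan.append(kubectl + ["apply", "-f", f"{backup_dir}/ceph-resources.yaml"])
    plan.append(kubectl + ["scale", "deployment", OPERATOR, "--replicas=1"])
    return [shlex.join(command) for command in plan]


if __name__ == "__main__":
    for line in restore_plan("kube-rook", [OPERATOR, "rook-ceph-mon-a"], "backup"):
        print(line)
```

Printing the plan first and running the commands by hand keeps a human in the loop, which seems prudent for a procedure that deletes deployments.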

Useful Tips

Make backups!

And to avoid situations when you need to restore from them:

Before large-scale maintenance on the cluster that involves restarting servers, scale the Rook operator down to zero so that it does not do anything unwanted.
Add nodeAffinity for the monitors in advance.
Consider configuring the ROOK_MON_HEALTHCHECK_INTERVAL and ROOK_MON_OUT_TIMEOUT timeouts in advance.

Instead of a conclusion

There is no denying that Rook, as an additional “layer” in the overall scheme of storage in Kubernetes, both simplifies a lot and adds new complexities and potential problems to the infrastructure. What remains is a “small” matter: making a balanced, informed choice between these risks on the one hand and the benefits the solution brings in your particular case on the other.

By the way, a section called "Adopt an existing Rook Ceph cluster into a new Kubernetes cluster" was recently added to the Rook documentation. It describes in more detail what needs to be done to move existing data to a new Kubernetes cluster or to bring back a cluster that has collapsed for one reason or another.