Tips & tricks for working with Ceph in loaded projects

When using Ceph as network storage in projects with varying load, we run into tasks that do not look simple or trivial at first glance. For example:

  • migrating data from an old Ceph cluster to a new one, partially reusing the old servers in the new cluster;
  • solving the problem of disk space allocation in Ceph.

When dealing with such tasks, we need to remove an OSD correctly, without losing data - which matters especially for large volumes of data. This is what the article is about.

The methods described below are relevant for any version of Ceph. They also take into account that Ceph may store a large amount of data: to prevent data loss and other problems, some actions are split into several smaller ones.

A brief introduction to OSDs

Since two of the three recipes below deal with OSDs (Object Storage Daemons), before diving into the practical part let's briefly recall what an OSD is in Ceph and why it is so important.

First of all, a Ceph cluster consists of many OSDs: the more of them there are, the more free storage there is in Ceph. From this, the main function of an OSD is easy to understand: it stores the data of Ceph objects on the file systems of the cluster nodes and provides network access to that data (for reads, writes and other requests).

Replication is configured at the same level, by copying objects between different OSDs. This is where various problems can arise, and their solutions are discussed below.

Case number 1. Safely removing an OSD from a Ceph cluster without losing data

The need to remove an OSD usually comes from removing a server from the cluster - for example, to replace it with another server, which is exactly what happened to us and prompted this article. So the end goal of the procedure is to remove all OSDs and mons from a given server so that it can be safely stopped.

For convenience, and to avoid mistyping the target OSD while running commands, let's define a separate variable holding the number of the OSD to be removed. Below, ${ID} stands for the number of the OSD we are working with.
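
For example (the number here is purely illustrative; substitute the OSD you are actually removing):

ID=1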

Let's look at the state before starting work:

root@hv-1 ~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.46857 root default
-3       0.15619      host hv-1
-5       0.15619      host hv-2
 1   ssd 0.15619      osd.1     up     1.00000  1.00000
-7       0.15619      host hv-3
 2   ssd 0.15619      osd.2     up     1.00000  1.00000

To start removing an OSD, you need to smoothly reweight it down to zero. This way we reduce the amount of data on the OSD by rebalancing it to the other OSDs. To do this, run the following commands:

ceph osd reweight osd.${ID} 0.98
ceph osd reweight osd.${ID} 0.88
ceph osd reweight osd.${ID} 0.78

...and so on down to zero.

Smooth rebalancing is required so as not to lose data. This is especially true if the OSD holds a large amount of data. To make sure everything went well after each reweight command, you can run ceph -s, or run ceph -w in a separate terminal window to watch the changes in real time.
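
If there are many such steps, they can be wrapped in a simple loop. This is only a rough sketch (it assumes bash, that ${ID} is already set, and a conservative pause between steps; adjust the weights and the check to your cluster):

for w in 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0; do
    ceph osd reweight osd.${ID} ${w}
    # wait until no recovery/backfill activity is reported before the next step
    while ceph status | grep -Eq 'backfill|recover|degraded'; do
        sleep 60
    done
done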

When the OSD is "empty", you can proceed with the standard removal procedure. First, mark the OSD as down:

ceph osd down osd.${ID}

"Pull out" OSD from the cluster:

ceph osd out osd.${ID}

Stop the OSD service and unmount its partition in the FS:

systemctl stop ceph-osd@${ID}
umount /var/lib/ceph/osd/ceph-${ID}

Remove OSD from CRUSH map:

ceph osd crush remove osd.${ID}

Delete the OSD user:

ceph auth del osd.${ID}

And finally, remove the OSD itself:

ceph osd rm osd.${ID}

Note: If you are using Ceph Luminous or higher, then the above OSD removal steps can be reduced to two commands:

ceph osd out osd.${ID}
ceph osd purge osd.${ID}

If you now run ceph osd tree, you should see that the server where the work was done no longer has the OSDs the operations above were performed on:

root@hv-1 ~ # ceph osd tree
ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
-1       0.46857      root default
-3       0.15619      host hv-1
-5       0.15619      host hv-2
-7       0.15619      host hv-3
 2   ssd 0.15619      osd.2    up     1.00000  1.00000

Note in passing that the cluster state will change to HEALTH_WARN, and the number of OSDs and the amount of available disk space will decrease.

The following steps are required if you want to stop the server completely and, accordingly, remove it from Ceph. In that case it is important to remember that all OSDs on this server must be removed before shutting it down.

If there are no OSDs left on this server, then after deleting them you need to exclude the server hv-2 from the CRUSH map by running the following command:

ceph osd crush rm hv-2

Remove the mon from the server hv-2 by running the command below on another server (in this case, on hv-1):

ceph-deploy mon destroy hv-2
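
If the cluster is not managed with ceph-deploy, the monitor can be removed manually instead. A rough equivalent (the service name and data path below are the usual defaults; verify them on your system):

systemctl stop ceph-mon@hv-2              # on hv-2 itself
ceph mon remove hv-2                      # on any node with admin access
rm -rf /var/lib/ceph/mon/ceph-hv-2        # on hv-2, once the mon has left the quorum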

After that, you can stop the server and proceed with subsequent actions (its redeployment, etc.).

Case number 2. Disk space allocation in an already created Ceph cluster

I will start the second story with a preface about PGs (Placement Groups). The main role of a PG in Ceph is to aggregate Ceph objects and to replicate them across OSDs. The formula for calculating the required number of PGs can be found in the relevant section of the Ceph documentation, where the question is also analyzed with specific examples.
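
For reference, the widely used rule of thumb from the documentation looks roughly like this (the result is then rounded to a nearby power of two):

Total PGs = (number of OSDs * 100) / replication size

For example, with 9 OSDs and a replication size of 3: 9 * 100 / 3 = 300, which would typically be rounded to 256 or 512 depending on the expected growth of the pool.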

So: one of the common problems in operating Ceph is an unbalanced number of OSDs and PGs between the pools in Ceph.

Firstly, because of this you can end up with too many PGs assigned to a small pool, which is simply an inefficient use of disk space in the cluster. Secondly, in practice there is a more serious problem: data overflow on one of the OSDs. This causes the cluster to go first into the HEALTH_WARN state and then into HEALTH_ERR. The reason is that when Ceph calculates the available amount of data (you can see it as MAX AVAIL in the output of ceph df, per pool), it relies on the amount of space available on the OSDs. If at least one OSD runs out of space, no more data can be written until the data is properly redistributed among all OSDs.
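
To see which OSD is actually filling up, the per-OSD utilization can be checked like this:

ceph df          # per-pool usage, including MAX AVAIL
ceph osd df      # per-OSD utilization (%USE) and variance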

It should be noted that these problems are largely resolved at the Ceph cluster configuration stage. One of the tools you can use is Ceph PGCalc: with its help, the required number of PGs can be calculated reliably. However, it is also useful in situations where the Ceph cluster is already configured incorrectly. It is worth clarifying that, as part of the fix, you will most likely need to reduce the number of PGs, and this feature is not available in older versions of Ceph (it only appeared in the Nautilus release).

So, let's imagine the following picture: the cluster has the HEALTH_WARN status because one of the OSDs is running out of space. This is indicated by the error HEALTH_WARN: 1 near full osd. Below is an algorithm for getting out of this situation.

First of all, you need to redistribute the existing data among the remaining OSDs. We already performed a similar operation in the first case, when we "drained" a node; the only difference is that now we only need to reduce reweight slightly. For example, to 0.95:

ceph osd reweight osd.${ID} 0.95

This frees up disk space on the OSD and clears the warning in ceph health. However, as already mentioned, this problem mostly occurs because of incorrect Ceph configuration at the initial stages: it is very important to reconfigure the cluster so that the problem does not come back in the future.

In our particular case, it all came down to:

  • too high a replication_count value in one of the pools;
  • too many PGs in one pool and too few in another (see the commands right after this list for checking the current values).
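
Before changing anything, it is worth looking at the current values for each pool ($pool_name is a placeholder):

ceph osd pool ls detail                # size, pg_num and other parameters for all pools
ceph osd pool get $pool_name size
ceph osd pool get $pool_name pg_num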

Let's use the already mentioned calculator. It clearly shows what needs to be entered, and there is nothing complicated about it. Having set the necessary parameters, we get a table of recommendations from the calculator:

Note: If you are setting up a Ceph cluster from scratch, another useful feature of the calculator will be the generation of commands that will create pools from scratch with the parameters specified in the table.

The last column, Suggested PG Count, is the one to focus on. In our case the second column is also useful: it shows the replication factor, and we decided to change the replication factor as well.

So, the first thing to change is the replication parameters - it is worth doing this first, because by reducing the replication factor we free up disk space. As the command runs, you will notice that the amount of available disk space increases:

ceph osd pool set $pool_name size $replication_size
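
For example, reducing the replication factor of a hypothetical pool from 3 to 2 might look like this (also make sure min_size still makes sense for the new size):

ceph osd pool set $pool_name size 2
ceph osd pool get $pool_name min_size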

And after it completes, change the pg_num and pgp_num parameters as follows:

ceph osd pool set $pool_name pg_num $pg_number
ceph osd pool set $pool_name pgp_num $pg_number

Important: we must change the number of PGs in each pool sequentially, and not touch the values in other pools until the "Degraded data redundancy" and "n-number of pgs degraded" warnings disappear.

You can also check that everything went well in the output of the ceph health detail and ceph -s commands.
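
A quick way to make sure the new values have actually been applied to the pool:

ceph osd pool get $pool_name pg_num
ceph osd pool get $pool_name pgp_num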

Case number 3. Migrating a virtual machine from LVM to Ceph RBD

In situations where a project uses virtual machines hosted on rented bare-metal servers, the question of fault-tolerant storage often comes up, and it is highly desirable that this storage has enough space. Another common situation: a virtual machine with local storage lives on the server and its disk needs to be expanded, but there is nowhere to grow because the server has no free disk space left.

The problem can be solved in different ways - for example, by migrating to another server (if there is one) or by adding new disks to the server. But this is not always possible, so migrating from LVM to Ceph can be a great solution. By choosing this option we also simplify further migration between servers, since there is no need to move local storage from one hypervisor to another. The only snag is that the VM has to be stopped for the duration of the work.

The following recipe is taken from an article on this blog, and its instructions have been tested in practice. By the way, a method of seamless migration is also described there; in our case it simply was not needed, so we did not verify it. If that is critical for your project, we will be glad to hear about your results in the comments.

Let's move on to the practical part. In the example, we are using virsh and, accordingly, libvirt. First, make sure that the Ceph pool where the data will be migrated is connected to libvirt:

virsh pool-dumpxml $ceph_pool

The pool description must contain the connection details for Ceph, including the authorization data.
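
If the pool is not yet connected to libvirt, it usually comes down to defining a secret with the Ceph key and an rbd-type pool. A rough sketch (the file names and the client.libvirt user are assumptions for this example):

virsh secret-define --file secret.xml                # secret.xml references the Ceph client user
virsh secret-set-value --secret sec-ret-uu-id --base64 "$(ceph auth get-key client.libvirt)"
virsh pool-define ceph-pool.xml                      # pool of type 'rbd' pointing at the monitors
virsh pool-start $ceph_pool
virsh pool-autostart $ceph_pool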

Next, the LVM image is converted to a Ceph RBD image. The execution time depends mainly on the size of the image:

qemu-img convert -p -O rbd /dev/main/$vm_image_name rbd:$ceph_pool/$vm_image_name
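
After the conversion it is worth checking that the image has actually appeared in the pool, for example:

rbd ls $ceph_pool
rbd info $ceph_pool/$vm_image_name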

After the conversion, the LVM image remains in place; it will come in handy if migrating the VM to RBD fails and you have to roll back the changes. To be able to roll back quickly, let's also make a backup of the virtual machine's configuration file:

virsh dumpxml $vm_name > $vm_name.xml
cp $vm_name.xml ${vm_name}_backup.xml

... and edit the original ($vm_name.xml). Find the block describing the disk (it starts with the line <disk type='file' device='disk'> and ends with </disk>) and bring it to the following form:

<disk type='network' device='disk'>
  <driver name='qemu'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='sec-ret-uu-id'/>
  </auth>
  <source protocol='rbd' name='$ceph_pool/$vm_image_name'>
    <host name='10.0.0.1' port='6789'/>
    <host name='10.0.0.2' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>

Let's go over some details:

  1. In the source element, the rbd protocol and the address of the storage in Ceph are specified (the Ceph pool name plus the name of the RBD image that was created at the first stage).
  2. In the secret element, the type ceph is specified, along with the UUID of the secret used for the connection. The UUID can be found with the virsh secret-list command.
  3. In the host elements, the addresses of the Ceph monitors are specified (a small lookup sketch follows this list).
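
The values for the secret UUID and the monitor addresses can be looked up, for example, like this:

virsh secret-list     # UUID of the libvirt secret for Ceph
ceph mon dump         # lists the monitors and their addresses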

After editing the configuration file and completing the LVM to RBD conversion, you can apply the modified configuration file and start the virtual machine:

virsh define $vm_name.xml
virsh start $vm_name

Now it's time to check that the virtual machine has started correctly: you can do this, for example, by connecting to it over SSH or via virsh.
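
A couple of quick checks from the hypervisor side:

virsh domstate $vm_name       # should report 'running'
virsh domblklist $vm_name     # the disk source should now point to the RBD image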

If the virtual machine is working correctly and you have not found any other problems, then you can remove the LVM image that is no longer in use:

lvremove main/$vm_image_name

Conclusion

We have encountered all the cases described in practice - we hope that the instructions will help other administrators solve similar problems. If you have comments or other similar stories from the experience of operating Ceph, we will be glad to see them in the comments!

Source: habr.com
