Experience with CEPH

When there is more data than fits on one disk, it's time to think about RAID. As a child, I often heard from my elders: "one day RAID will be a thing of the past, object storage will flood the world, and you won't even know what CEPH is," so the first thing I did in my independent life was to build my own cluster. The purpose of the experiment was to get acquainted with the internal structure of CEPH and to understand its scope of application: how justified is deploying CEPH in a medium-sized business, or even a small one? After several years of operation and a couple of irreversible data losses came an understanding of the subtleties: not everything is so simple. The peculiarities of CEPH stand in the way of its wide adoption, and because of them my experiments hit a dead end. Below is a description of all the steps taken, the results obtained and the conclusions drawn. If knowledgeable people share their experience and explain some points, I will be grateful.

Note: Commentators have pointed out serious errors in some of the assumptions, requiring a revision of the entire article.

CEPH strategy

A CEPH cluster combines an arbitrary number K of disks of arbitrary size and stores data on them, replicating each piece (4 MB by default) a given number of times, N.

Consider the simplest case of two identical disks. You can either build a RAID 1 from them or a cluster with N=2; the result will be the same. If there are three disks and they are of different sizes, it is easy to assemble a cluster with N=2: some of the data will sit on disks 1 and 2, some on 1 and 3, and some on 2 and 3, whereas RAID cannot do this (you could assemble such a RAID, but it would be a perversion). With even more disks it becomes possible to build a RAID 5; CEPH has an analogue, erasure_code, but it contradicts the developers' early concepts and is therefore not considered here. RAID 5 assumes a small number of disks, all of them in good condition: if one fails, the rest must hold out until the disk is replaced and the data is rebuilt onto it. CEPH, with N>=3, encourages the use of old disks: if you keep a few good disks to hold one copy of the data and store the remaining two or three copies on a large number of old disks, the information will be safe, because as long as the new disks are alive there is no problem, and if one of them breaks, the simultaneous failure of three disks with a service life of more than five years, preferably from different servers, is an extremely unlikely event.
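
For illustration, this is roughly how a replicated pool with three copies is created (a minimal sketch using standard ceph commands; the pool name test-pool and the PG count are arbitrary and not part of the original setup):

# create a pool with 32 placement groups and keep three copies of every object
ceph osd pool create test-pool 32
ceph osd pool set test-pool size 3
# keep serving I/O while at least two copies remain available
ceph osd pool set test-pool min_size 2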

There is a subtlety in the distribution of copies. By default, the data is divided into placement groups (PGs, roughly 100 per disk), and each PG is replicated onto some set of disks. Suppose K=6 and N=2; then if any two disks fail, data is essentially guaranteed to be lost, because there will almost certainly be at least one PG that lives exactly on those two disks, and the loss of a single group makes all the data in the pool inaccessible. If instead the disks are split into three pairs and data is allowed to live only on disks within one pair, such a distribution is just as resistant to the failure of any single disk, but if two disks fail the probability of data loss is not 100% but only 3/15 (of the C(6,2)=15 possible two-disk failures, only the 3 that take out a whole pair lose data), and even with three failed disks it is only 12/20 (only those of the C(6,3)=20 triples that contain a complete pair). So entropy in data distribution does not help fault tolerance. Also note that for a file server, free RAM greatly improves responsiveness: the more memory in each node, and the more memory across all nodes, the faster it is. This is an undeniable advantage of a cluster over a single server and, even more so, over a hardware NAS, which has a very small amount of memory built in.
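
To see where a particular object and its PG actually ended up, the placement can be queried directly (a sketch; test-pool and the object name are hypothetical):

# brief list of all PGs and the OSDs they are mapped to
ceph pg dump pgs_brief
# show the PG and the set of OSDs serving one specific object
ceph osd map test-pool some-object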

It follows that CEPH looks like a good way to build a reliable, scalable storage system for tens of terabytes out of outdated equipment with minimal investment (some spending is, of course, still required, but it is small compared to commercial storage systems).

Cluster Implementation

For the experiment, let's take a decommissioned computer: an Intel DQ57TM board with an Intel Core i3-540 and 16 GB of RAM. We organize four 2 TB disks into something like RAID 10; after a successful test we will add a second node and the same number of disks.

Install Linux. The distribution has to be customizable and stable; Debian and SUSE fit the requirements. SUSE has a more flexible installer that allows any package to be disabled; unfortunately, I could not figure out which ones could be thrown out without damaging the system. So we install Debian via debootstrap buster. The min-base option installs a broken system that lacks drivers, and the difference in size compared to the full base is not big enough to bother. Since the work is done on a physical machine, I want to take snapshots, just as on virtual machines. Either LVM or btrfs (or xfs, or zfs - the difference is not great) provides that. Snapshots are not LVM's strong point, so I install btrfs, with the bootloader in the MBR: there is no point in wasting 50 MB of disk on a FAT partition when the bootloader can be squeezed into the 1 MB area after the partition table and all the space given to the system. The installation took 700 MB on disk. How much the base SUSE installation takes I do not remember - I think around 1.1 or 1.4 GB.
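
Once the system is installed, a snapshot can be taken with btrfs itself, for example like this (a sketch assuming the subvolume layout created by the debootstrap listing below, mounted under /mnt with a snapshot directory at the top level):

# read-only snapshot of the root subvolume, named by date
btrfs subvolume snapshot -r /mnt/@ /mnt/snapshot/@-$(date +%F)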

Install CEPH. We ignore version 12 from the Debian repository and add 15.2.3 directly from the project site. We follow the instructions from the "Install CEPH manually" section with the following caveats:

  • Before adding the repository, you must install gnupg wget ca-certificates.
  • After adding the repository, but before setting up the cluster, the instructions omit the actual package installation, which has to be done by hand: apt -y --no-install-recommends install ceph-common ceph-mon ceph-osd ceph-mds ceph-mgr
  • During the installation of CEPH, for reasons unknown, it tries to pull in lvm2. That would be no great loss in itself, but the lvm2 installation fails, so CEPH does not install either.

    This patch helped:

    cat << EOF >> /var/lib/dpkg/status
    Package: lvm2
    Status: install ok installed
    Priority: important
    Section: admin
    Installed-Size: 0
    Maintainer: Debian Adduser Developers <[email protected]>
    Architecture: all
    Multi-Arch: foreign
    Version: 113.118
    Description: No-install
    EOF
    

Cluster Overview

ceph-osd - responsible for storing data on a disk. For each disk a network service is started that accepts and executes read and write requests for objects. Two partitions are created on the disk. One of them contains information about the cluster, the disk number and the cluster keys; this kilobyte of information is created once when the disk is added and I never saw it change afterwards. The second partition has no file system and stores CEPH's binary data. Automatic installation in earlier versions created a 100 MB xfs partition for the service information; I converted the disk to MBR and allocated only 16 MB - the service does not complain. I think xfs could be replaced with ext without any problem. This partition is mounted in /var/lib/..., where the service reads the OSD information and also finds a link to the block device holding the binary data. Theoretically, the auxiliary files could be placed directly in /var/lib/... and the whole disk given over to data. When an OSD is created via ceph-deploy, a rule is automatically created to mount the partition in /var/lib/..., and the ceph user is granted the right to read the required block device. With a manual installation you have to do this yourself; the documentation does not mention it. It is also advisable to set the osd memory target parameter so that there is enough physical memory.
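
Besides the entry in ceph.conf shown in the listing below, the same limit can apparently also be set at run time through the config subsystem (a sketch; the 1.5 GB value matches the one used in my config, and osd.0 is just an example daemon):

# limit the memory an OSD aims to use for its caches (1.5 GB)
ceph config set osd osd_memory_target 1610612736
# check what a specific OSD actually got
ceph config get osd.0 osd_memory_target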

ceph-mds. At a low level, CEPH is object storage. The block storage capability boils down to saving each 4 MB block as an object. File storage works on the same principle. Two pools are created, one for metadata and one for data, and they are combined into a file system. At that moment some kind of record is created, so if you delete the file system but keep both pools, you will not be able to restore it. There is a procedure for extracting files block by block; I have not tested it. The ceph-mds service is responsible for access to the file system, and each file system requires a separate instance of the service. There is an "index" option that allows a semblance of several file systems within one - also not tested.
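
Creating such a file system comes down to roughly the following (a sketch based on the standard commands; the pool names and PG counts are illustrative):

# one pool for metadata, one for data
ceph osd pool create cephfs_metadata 32
ceph osd pool create cephfs_data 32
# combine them into a file system; a ceph-mds instance is needed to actually serve it
ceph fs new cephfs cephfs_metadata cephfs_data
ceph fs status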

ceph-mon - this service keeps the map of the cluster. It includes information about all OSDs, the algorithm for distributing PGs across OSDs and, most importantly, information about all objects (the details of this mechanism are not clear to me: there is a /var/lib/ceph/mon/.../store.db directory containing a large file of 26 MB, and with 105K objects in the cluster that works out to slightly more than 256 bytes per object - I think the monitor keeps a list of all objects and the PGs in which they lie). Damage to this directory results in the loss of all data in the cluster. From this I concluded that CRUSH shows how PGs are placed across OSDs, while how objects are placed across PGs is stored centrally inside the database, however much the developers avoid that word. As a consequence, first, we cannot install the system on a flash drive in RO mode, since the database is constantly being written to and an extra disk (hardly more than 1 GB) is needed for it, and second, a real-time copy of this database is required. If there are several monitors, fault tolerance is provided automatically, but in our case there is only one monitor, two at most. There is a theoretical procedure for rebuilding a monitor from OSD data; I resorted to it three times for various reasons, and three times there were no error messages - and no data either. Unfortunately, this mechanism does not work. Either we run a miniature partition on each OSD and assemble a RAID to store the database, which will probably have a very bad effect on performance, or we allocate at least two reliable physical media, preferably USB sticks, so as not to occupy ports.
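
The per-object estimate above can be reproduced with two commands (a sketch; the numbers will of course differ on another cluster):

# size of the monitor database
du -sh /var/lib/ceph/mon/ceph-rbd1/store.db
# object counts per pool
ceph df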

rados-gw - exports object storage over the S3 protocol and the like. It creates a lot of pools, and it is not clear why. I did not experiment with it much.

ceph-mgr - installing this service starts several modules. One of them is the autoscaler, which cannot be removed. It strives to maintain the "correct" number of PGs per OSD. If you want to control the ratio manually, you can disable scaling for each pool, but then the module crashes with a division by zero and the cluster status becomes ERROR. The module is written in Python, and commenting out the offending line in it shuts it down. I am too lazy to recall the details.
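
Per-pool scaling is turned off like this (a sketch; test-pool is a placeholder, and the option mirrors the osd pool default pg autoscale mode line in my ceph.conf below):

# stop the autoscaler from touching this pool
ceph osd pool set test-pool pg_autoscale_mode off
# see what the autoscaler currently thinks about all pools
ceph osd pool autoscale-status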

List of sources used:

CEPH Installation
Recovery from a complete monitor failure

Script listings:

Installing the system via debootstrap

blkdev=sdb1
mkfs.btrfs -f /dev/$blkdev
mount /dev/$blkdev /mnt
cd /mnt
for i in {@,@var,@home}; do btrfs subvolume create $i; done
mkdir snapshot @/{var,home}
for i in {var,home}; do mount -o bind @${i} @/$i; done
debootstrap buster @ http://deb.debian.org/debian; echo $?
for i in {dev,proc,sys}; do mount -o bind /$i @/$i; done
cp /etc/bash.bashrc @/etc/

chroot /mnt/@ /bin/bash
blkdev=sdb1   # the variable from the outer shell is not visible inside the chroot
echo rbd1 > /etc/hostname
passwd
uuid=`blkid | grep $blkdev | cut -d '"' -f 2`
cat << EOF > /etc/fstab
UUID=$uuid / btrfs noatime,nodiratime,subvol=@ 0 1
UUID=$uuid /var btrfs noatime,nodiratime,subvol=@var 0 2
UUID=$uuid /home btrfs noatime,nodiratime,subvol=@home 0 2
EOF
cat << EOF >> /var/lib/dpkg/status
Package: lvm2
Status: install ok installed
Priority: important
Section: admin
Installed-Size: 0
Maintainer: Debian Adduser Developers <[email protected]>
Architecture: all
Multi-Arch: foreign
Version: 113.118
Description: No-install

Package: sudo
Status: install ok installed
Priority: important
Section: admin
Installed-Size: 0
Maintainer: Debian Adduser Developers <[email protected]>
Architecture: all
Multi-Arch: foreign
Version: 113.118
Description: No-install
EOF

apt -yq install --no-install-recommends linux-image-amd64 bash-completion ed btrfs-progs grub-pc iproute2 ssh  smartmontools ntfs-3g net-tools man
exit
grub-install --boot-directory=@/boot/ /dev/$blkdev
init 6

Create a cluster

apt -yq install --no-install-recommends gnupg wget ca-certificates
echo 'deb https://download.ceph.com/debian-octopus/ buster main' >> /etc/apt/sources.list
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt update
apt -yq install --no-install-recommends ceph-common ceph-mon

echo 192.168.11.11 rbd1 >> /etc/hosts
uuid=`cat /proc/sys/kernel/random/uuid`
cat << EOF > /etc/ceph/ceph.conf
[global]
fsid = $uuid
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
mon allow pool delete = true
mon host = 192.168.11.11
mon initial members = rbd1
mon max pg per osd = 385
osd crush update on start = false
#osd memory target = 2147483648
osd memory target = 1610612736
osd scrub chunk min = 1
osd scrub chunk max = 2
osd scrub sleep = .2
osd pool default pg autoscale mode = off
osd pool default size = 1
osd pool default min size = 1
osd pool default pg num = 1
osd pool default pgp num = 1
[mon]
mgr initial modules = dashboard
EOF

ceph-authtool --create-keyring ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *' --cap mgr 'allow *'
cp ceph.client.admin.keyring /etc/ceph/
ceph-authtool --create-keyring bootstrap-osd.ceph.keyring --gen-key -n client.bootstrap-osd --cap mon 'profile bootstrap-osd' --cap mgr 'allow r'
cp bootstrap-osd.ceph.keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph-authtool ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
ceph-authtool ceph.mon.keyring --import-keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
monmaptool --create --add rbd1 192.168.11.11 --fsid $uuid monmap
rm -R /var/lib/ceph/mon/ceph-rbd1/*
ceph-mon --mkfs -i rbd1 --monmap monmap --keyring ceph.mon.keyring
chown ceph:ceph -R /var/lib/ceph
systemctl enable ceph-mon@rbd1
systemctl start ceph-mon@rbd1
ceph mon enable-msgr2
ceph status

# dashboard

apt -yq install --no-install-recommends ceph-mgr ceph-mgr-dashboard python3-distutils python3-yaml
mkdir /var/lib/ceph/mgr/ceph-rbd1
ceph auth get-or-create mgr.rbd1 mon 'allow profile mgr' osd 'allow *' mds 'allow *' > /var/lib/ceph/mgr/ceph-rbd1/keyring
systemctl enable ceph-mgr@rbd1
systemctl start ceph-mgr@rbd1
ceph config set mgr mgr/dashboard/ssl false
ceph config set mgr mgr/dashboard/server_port 7000
ceph dashboard ac-user-create root 1111115 administrator
systemctl stop ceph-mgr@rbd1
systemctl start ceph-mgr@rbd1
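
# (not in the original listing) after the restart the dashboard should answer over
# plain HTTP on port 7000 of the monitor host, here http://192.168.11.11:7000,
# with the user created above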

Adding OSD (part)

apt install ceph-osd

osdnum=`ceph osd create`
mkdir -p /var/lib/ceph/osd/ceph-$osdnum
mkfs -t xfs /dev/sda1
mount -t xfs /dev/sda1 /var/lib/ceph/osd/ceph-$osdnum
cd /var/lib/ceph/osd/ceph-$osdnum
ceph auth get-or-create osd.$osdnum mon 'profile osd' mgr 'profile osd' osd 'allow *' > /var/lib/ceph/osd/ceph-$osdnum/keyring
ln -s /dev/disk/by-partuuid/d8cc3da6-02  block
ceph-osd -i $osdnum --mkfs
#chown ceph:ceph /dev/sd?2
chown ceph:ceph -R /var/lib/ceph
systemctl enable ceph-osd@$osdnum
systemctl start ceph-osd@$osdnum
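
# (not in the original listing) verify that the new OSD came up and joined the cluster
ceph osd tree
ceph -s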

Summary

The main marketing advantage of CEPH is CRUSH, the algorithm for calculating where data is located. The monitors distribute this algorithm to clients, and clients then query the required node and the required OSD directly; with CRUSH there is no centralization. It is a small file that you could even print out and hang on the wall. Practice has shown that CRUSH is not an exhaustive map: destroying and re-creating the monitors while keeping all the OSDs and CRUSH is not enough to restore the cluster. From this I conclude that each monitor stores some metadata about the whole cluster. The small amount of this metadata imposes no restriction on the size of the cluster, but it has to be kept safe, which rules out saving a disk by installing the system on a flash drive and excludes clusters with fewer than three nodes. Add to that the developers' aggressive policy regarding optional features and a complete lack of minimalism. The documentation is at the level of "thanks for what there is, but it is very, very meager." The ability to interact with the services at a low level is provided, but the documentation covers the topic so superficially that the realistic answer is more likely no than yes. There is practically no chance of recovering data from an emergency situation.
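
The "file you can print" can indeed be extracted and decompiled into readable text with standard tools (shown as a sketch):

# fetch the compiled CRUSH map from the cluster
ceph osd getcrushmap -o crush.bin
# decompile it into a human-readable text file
crushtool -d crush.bin -o crush.txt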

Options for further action: abandon CEPH and use plain multi-disk btrfs (or xfs, zfs); learn more about CEPH, enough to operate it under the conditions described; or try writing my own storage as an exercise in professional development.

Source: habr.com
