Setting up the Linux kernel for GlusterFS



Questions come up from time to time about Gluster's recommendations for kernel tuning and whether such tuning is needed at all.

Such a need rarely arises: on most workloads the kernel performs very well. There is a downside, though. Historically, the Linux kernel readily consumes a lot of memory if given the opportunity, mostly for caching, which is its main way of improving performance.

In most cases, this works fine, but under heavy load it can lead to problems.

We have a lot of experience with memory-intensive systems such as CAD, EDA and the like, which would start to slow down under heavy load, and occasionally we ran into problems with Gluster as well. After carefully watching memory usage and disk latency over many days, we saw memory exhaustion, huge iowait, kernel errors (kernel oops), hangs and so on.

This article is the result of many tuning experiments performed in various situations. Thanks to these parameters, not only has overall responsiveness improved, but the cluster has also become significantly more stable.

When it comes to memory tuning, the first place to look is the virtual memory (VM) subsystem, which has a large number of options that can be confusing.

vm.swappiness

The vm.swappiness parameter determines how readily the kernel uses swap (paging) compared to RAM. In the kernel source it is also described as a "tendency to steal mapped memory". A high swappiness value means the kernel will be more inclined to swap out mapped pages; a low value means the opposite: the kernel will page out less. In other words, the higher vm.swappiness is, the more the system will use swap.

Heavy use of swap is undesirable, since huge blocks of data are being moved in and out of RAM. Many people argue that the swappiness value should be large, but in my experience setting it to "0" leads to better performance.

You can read more here: lwn.net/Articles/100978

But, again, these settings should be applied with care and only after testing the specific application. For heavily loaded streaming applications this parameter should be set to "0"; changing it to "0" improves system responsiveness.
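
For example, assuming a standard sysctl setup, the change can be applied at runtime and made persistent like this (the file name under /etc/sysctl.d/ is just an illustration):

# Runtime change: tell the kernel to avoid swapping as much as possible.
sysctl -w vm.swappiness=0
# Persist the setting across reboots (the file name is arbitrary):
echo "vm.swappiness = 0" >> /etc/sysctl.d/99-gluster-tuning.conf
sysctl -p /etc/sysctl.d/99-gluster-tuning.conf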

vm.vfs_cache_pressure

This setting controls the amount of memory the kernel spends on caching directory and inode objects (dentries and inodes).

With the default value of 100, the kernel tries to reclaim the dentry and inode caches on a "fair" basis relative to the page cache and swap cache. Decreasing vfs_cache_pressure makes the kernel prefer to keep dentry and inode caches. At "0", the kernel will never reclaim dentries and inodes due to memory pressure, which can easily lead to an out-of-memory condition. Increasing vfs_cache_pressure above 100 makes the kernel prioritize reclaiming dentries and inodes.

With GlusterFS, many users with large amounts of data and many small files easily end up consuming a significant amount of RAM on the server for inode/dentry caching, which can degrade performance, since the kernel has to crawl through these data structures on a system with, say, 40 GB of memory. Setting this value above 100 has helped many users achieve fairer caching and better kernel responsiveness.
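
As a sketch, assuming a stock sysctl setup (the value 200 below is only an illustration, not a tested recommendation), the current value can be checked and raised like this; slabtop shows how much memory the dentry/inode slabs actually occupy:

sysctl vm.vfs_cache_pressure            # show the current value (default is 100)
sysctl -w vm.vfs_cache_pressure=200     # illustrative value above 100
slabtop -o | grep -E 'dentry|inode'     # RAM used by dentry/inode caches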

vm.dirty_background_ratio and vm.dirty_ratio

The first parameter (vm.dirty_background_ratio) defines the percentage of memory that may be filled with dirty pages before background flushing to disk begins. Until this percentage is reached, no pages are flushed; once flushing starts, it runs in the background without interrupting the running processes.

The second parameter (vm.dirty_ratio) defines the percentage of memory that may be occupied by dirty pages before forced flushing begins. Once this threshold is reached, all processes become synchronous (blocked) and are not allowed to continue until the I/O they requested has actually completed and the data is on disk. Under heavy I/O this causes a problem, because data is no longer cached and every process doing I/O is blocked waiting for it. This leads to a large number of hung processes, high load, system instability and poor performance.

Decreasing these settings causes data to be flushed to disk more frequently instead of piling up in RAM. This can help memory-heavy systems where flushing 45-90 GB of page cache to disk would otherwise be routine, which causes huge latency for front-end applications and hurts overall responsiveness and interactivity.
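
A minimal sketch, assuming a standard sysctl setup; the exact percentages are only an illustration and must be tested for the specific workload:

# Illustrative values: start background writeback earlier and hit the hard limit sooner.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
# The current amount of dirty memory and writeback in flight can be watched here:
grep -E '^(Dirty|Writeback):' /proc/meminfo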

"1" > /proc/sys/vm/pagecache

The page cache stores the data of files and executable programs, that is, pages with the actual contents of files or block devices. This cache is used to reduce the number of disk reads. A value of "1" means that 1% of RAM is used for the cache, so more reads will come from disk than from RAM. It is not necessary to change this setting, but if you are paranoid about controlling the page cache, you can use it.
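
Note that /proc/sys/vm/pagecache is not available on every kernel, so a cautious sketch would check for it first; page cache usage itself can always be read from /proc/meminfo:

# Only touch the knob if this kernel actually provides it.
if [ -f /proc/sys/vm/pagecache ]; then
    echo "1" > /proc/sys/vm/pagecache
fi
# Current page cache and buffer usage:
grep -E '^(Cached|Buffers):' /proc/meminfo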

"deadline" > /sys/block/sdc/queue/scheduler

The I/O scheduler is the Linux kernel component that handles read and write queues. In theory, "noop" is the better choice for a smart RAID controller, because Linux knows nothing about the physical geometry of the disks, so it is more efficient to let the controller, which knows the geometry well, process requests as quickly as possible. But in practice "deadline" seems to improve performance. You can read more about schedulers in the Linux kernel source documentation: linux/Documentation/block/*iosched.txt. I have also seen read throughput increase during mixed workloads (with many writes).
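
As an illustration (sdc is just the example device used above), the available and active schedulers can be checked before switching:

cat /sys/block/sdc/queue/scheduler        # the active scheduler is shown in brackets
echo "deadline" > /sys/block/sdc/queue/scheduler
# On newer kernels that use blk-mq, the equivalent scheduler is called mq-deadline.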

"256" > /sys/block/sdc/queue/nr_requests

This is the number of I/O requests buffered before they are passed to the scheduler. Some controllers have an internal queue (queue_depth) larger than the I/O scheduler's nr_requests, so the I/O scheduler has little chance to properly prioritize and merge requests. For the deadline and CFQ schedulers it is better when nr_requests is twice the controller's internal queue depth. Merging and reordering requests helps the scheduler stay more responsive under heavy load.
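
A quick way to compare the two queues (sdc is again only an example; the queue_depth file is exposed for SCSI-type devices):

cat /sys/block/sdc/device/queue_depth     # device/controller internal queue
cat /sys/block/sdc/queue/nr_requests      # scheduler queue
echo "256" > /sys/block/sdc/queue/nr_requests   # per the 2 x queue_depth rule of thumb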

echo "16" > /proc/sys/vm/page-cluster

The page-cluster parameter controls how many pages are swapped in a single attempt. Its value is a base-2 logarithm: the default of "3" means 8 pages at a time, while "4" means 16 pages, which matches a RAID stripe size of 64 KB with 4 KB pages. It makes no sense with swappiness = 0, but if you set swappiness to 10 or 20, this value will help you when the RAID stripe size is 64K.
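
If you prefer to derive the value instead of hard-coding it, a small sketch using plain shell arithmetic computes the logarithm from the stripe size and the system page size:

stripe_kb=64                                  # RAID stripe size in KB
page_kb=$(( $(getconf PAGESIZE) / 1024 ))     # usually 4
pages=$(( stripe_kb / page_kb ))              # 16 pages per stripe
pc=0; while [ $(( 1 << pc )) -lt "$pages" ]; do pc=$(( pc + 1 )); done
echo "vm.page-cluster = $pc"                  # prints 4 for a 64 KB stripe and 4 KB pages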

blockdev --setra 4096 /dev/<devname> (sdb, hdc or dev_mapper)

The default block device settings on many RAID controllers often result in terrible performance. Adding the option above sets read-ahead to 4096 512-byte sectors. For streaming operations at least, this increases speed by filling the on-board disk cache with read-ahead data during the time the kernel spends preparing I/O; the cache may then already contain the data that the next read requests. Too much read-ahead can kill random I/O on large files if it eats up potentially useful disk time or loads data beyond the cache.
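
To see the effect before and after (sdc stands in for the device name; 4096 sectors of 512 bytes is 2 MiB, which sysfs reports in KB):

blockdev --getra /dev/sdc                  # current read-ahead, in 512-byte sectors
blockdev --setra 4096 /dev/sdc
cat /sys/block/sdc/queue/read_ahead_kb     # should now show 2048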

Below are a few more recommendations at the filesystem level, but they have not been tested yet. Make sure your filesystem knows the stripe size and the number of disks in the array, for example that it is a RAID5 array of six disks with a 64K stripe (effectively five data disks, since one disk's worth of capacity is used for parity). These recommendations are based on theoretical assumptions and compiled from various blogs/articles by RAID experts.

-> ext4, 5 data disks, 64K stripe, units are 4K filesystem blocks
mkfs -t ext4 -E stride=$((64/4)) /dev/<devname>
-> xfs, 5 data disks, 64K stripe, units are 512-byte sectors
mkfs -t xfs -d sunit=$((64*2)),swidth=$((5*64*2)) /dev/<devname>

For large files, consider increasing the stripe sizes listed above.
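
If in doubt, you can check what geometry the filesystem actually recorded (the /bricks/brick1 mount point below is only an example):

tune2fs -l /dev/<devname> | grep -iE 'stride|stripe'    # ext4: RAID stride / stripe width
xfs_info /bricks/brick1 | grep -E 'sunit|swidth'        # xfs: run against the mount point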

ATTENTION! Everything described above is highly subjective and depends on the type of application. This article does not guarantee any improvements without prior testing of the relevant applications by the user. It should only be applied when there is a need to improve overall system responsiveness, or if it solves a current problem.

Source: habr.com
