How to increase read speed from HBase up to 3 times and from HDFS up to 5 times

High performance is one of the key requirements when working with big data. Our data loading management team at Sberbank pumps almost all transactions into our Hadoop-based Data Cloud, so we deal with really large information flows. Naturally, we are always looking for ways to improve performance, and in this post we want to tell how we managed to patch the HBase RegionServer and the HDFS client, which made it possible to significantly increase the speed of read operations.

However, before moving on to the essence of the improvements, it is worth talking about the limitations that, in principle, cannot be bypassed as long as the data sits on HDDs.

Why HDD and Fast Random Access Reads Are Incompatible
As you know, HBase, like many other databases, stores data in blocks of several tens of kilobytes. By default it is about 64 KB. Now imagine that we need only 100 bytes and ask HBase to return them by a certain key. Since the block size in HFiles is 64 KB, the read will be about 640 times larger (just a minute!) than necessary.

Next, since the request goes through HDFS and its ShortCircuitCache mechanism (which allows direct access to files, bypassing the DataNode), this results in reading a whole 1 MB from disk. However, this can be adjusted with the parameter dfs.client.read.shortcircuit.buffer.size, and in many cases it makes sense to reduce this value, for example to 126 KB.

Let's say we do this. But in addition, when we start reading data through the Java API, with functions such as FileChannel.read, and ask the operating system to read the specified amount of data, it reads twice as much "just in case", i.e. 256 KB in our case. This is because Java offers no easy way to set the FADV_RANDOM flag to prevent this readahead.

As a result, to get our 100 bytes, about 2600 times more data is read under the hood. The solution would seem obvious: reduce the block size to a kilobyte, set the mentioned flag, and enjoy a great speedup. But the trouble is that by halving the block size, we also halve the number of bytes read from the disk per unit of time, so the rate of useful records barely changes.
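To make the amplification chain concrete, here is a back-of-the-envelope sketch. The 128 KB buffer is a rounded stand-in for the 126 KB mentioned above; the class and constant names are purely illustrative, not HBase/HDFS API:

```java
// Read amplification for a 100-byte request, following the chain above:
// a 64 KB HFile block, a ~128 KB ShortCircuitCache buffer, and the OS
// readahead doubling it because FADV_RANDOM cannot be set from plain Java.
public class ReadAmplification {
    static final int REQUEST_BYTES = 100;
    static final int HBASE_BLOCK = 64 * 1024;     // HFile block size
    static final int SSC_BUFFER = 128 * 1024;     // dfs.client.read.shortcircuit.buffer.size, reduced from 1 MB
    static final int OS_READAHEAD_FACTOR = 2;     // "just in case" doubling by the OS

    public static void main(String[] args) {
        int blockAmplification = HBASE_BLOCK / REQUEST_BYTES;                      // 655, i.e. the ~640x above
        int totalAmplification = SSC_BUFFER * OS_READAHEAD_FACTOR / REQUEST_BYTES; // 2621, i.e. the ~2600x
        System.out.println("block-level amplification: " + blockAmplification + "x");
        System.out.println("total amplification:       " + totalAmplification + "x");
    }
}
```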

Some gain from setting the FADV_RANDOM flag can be obtained, but only with high concurrency and a block size of 128 KB, and even then it is at most a couple of tens of percent:


The tests were carried out on 100 files, each 1 GB in size and located on 10 HDDs.

Let's calculate what we can count on at such a speed:
Suppose we read from 10 disks at 280 MB/s in total, i.e. 3 million reads of 100 bytes each. But as we remember, the data we need is 2600 times less than what is actually read. Dividing 3 million by 2600, we get about 1100 records per second.
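The same estimate in code (a sketch with the article's numbers; 2600 is the amplification factor derived earlier):

```java
// Random-read ceiling for 10 HDDs: ~280 MB/s of aggregate throughput,
// of which only 1/2600 is data we actually asked for.
public class HddCeiling {
    static long usefulReadsPerSec(long aggregateBytesPerSec, int requestBytes, int amplification) {
        return aggregateBytesPerSec / requestBytes / amplification;
    }

    public static void main(String[] args) {
        long bytesPerSec = 280L * 1024 * 1024;  // 10 disks, ~28 MB/s each
        // ~3 million 100-byte chunks read per second, only ~1100 of them useful
        System.out.println(usefulReadsPerSec(bytesPerSec, 100, 2600) + " useful reads/sec");
    }
}
```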

Disappointing, isn't it? Such is the nature of random access to data on an HDD, regardless of the block size. This is the physical limit of random access, and no database can squeeze out more under such conditions.

How then do databases achieve much higher speeds? To answer this question, let's see what happens in the following picture:


Here we see that for the first few minutes the speed is indeed about a thousand records per second. However, because much more is read than was requested, the data ends up in the buff/cache of the operating system (Linux), and the speed grows to a more decent 60 thousand per second.

From here on we will deal with accelerating access only to the data that is in the OS cache or sits in SSD/NVMe storage of comparable access speed.

In our case, we will run the tests on a stand of 4 servers, each configured as follows:

CPU: Xeon E5-2680 v4 @ 2.40GHz 64 threads.
Memory: 730 GB.
java version: 1.8.0_111

And here the key point is the amount of data in the tables to be read. If you read from a table that fits entirely into the HBase cache, reading never even reaches the buff/cache of the operating system, because HBase by default allocates 40% of its memory to a structure called BlockCache. Essentially this is a ConcurrentHashMap, where the key is the file name plus the offset of the block, and the value is the actual data at this offset.

Thus, when reading only from this structure, we see great speed, around a million requests per second. But let's imagine that we cannot dedicate hundreds of gigabytes of memory to the database alone, because a lot of other useful things run on these servers.

For example, in our case the BlockCache on one RS is about 12 GB. We run two RSs per node, i.e. 96 GB is allocated for BlockCache across all 4 nodes. Meanwhile, there is much more data: say, 4 tables of 130 regions each, with files of 800 MB compressed by FAST_DIFF, i.e. a total of 410 GB (pure data, without taking the replication factor into account).

Thus, BlockCache holds only about 23% of the total data, which is much closer to the real conditions of what is called BigData. And here the most interesting part begins: obviously, the fewer cache hits, the worse the performance, because on a miss a lot of work has to be done, all the way down to system calls. This cannot be avoided, so let's consider a completely different aspect: what happens to the data inside the cache?

Let's simplify the situation and assume we have a cache that holds only 1 object. Here is what happens when we try to work with a data volume 3 times larger than the cache. We will have to:

1. Place block 1 in the cache
2. Remove block 1 from the cache
3. Place block 2 in the cache
4. Remove block 2 from the cache
5. Place block 3 in the cache

5 actions performed! This situation can hardly be called normal; in fact, we are forcing HBase to do a pile of completely useless work. It constantly reads data from the OS cache and puts it into its BlockCache, only to throw it out almost immediately because a new portion of data has arrived. The animation at the beginning of the post shows the essence of the problem: the Garbage Collector is off the charts, the atmosphere is heating up, and little Greta in faraway, hot Sweden is getting upset. And we IT people really don't like it when children are sad, so we start thinking about what can be done about it.
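The five steps above can be sketched with a toy cache (a minimal illustration, not HBase code; a LinkedHashMap stands in for BlockCache):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy single-slot LRU illustrating the thrashing described above: pushing
// blocks 1..3 through a cache of capacity 1 costs 3 placements + 2 evictions.
public class ThrashDemo {
    static int countOperations(int nBlocks) {
        final int[] ops = {0};
        Map<Integer, byte[]> cache = new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                if (size() > 1) { ops[0]++; return true; }  // an eviction is work too
                return false;
            }
        };
        for (int block = 1; block <= nBlocks; block++) {
            cache.put(block, new byte[64 * 1024]);  // a placement
            ops[0]++;
        }
        return ops[0];
    }

    public static void main(String[] args) {
        System.out.println("operations: " + countOperations(3));  // prints "operations: 5"
    }
}
```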

But what if we put not all blocks into the cache, but only a certain percentage of them, so that the cache does not overflow? Let's start by simply adding a few lines of code at the top of the BlockCache put method:

  public void cacheBlock(BlockCacheKey cacheKey, Cacheable buf, boolean inMemory) {
    if (cacheDataBlockPercent != 100 && buf.getBlockType().isData()) {
      if (cacheKey.getOffset() % 100 >= cacheDataBlockPercent) {
        return;
      }
    }
...

The meaning is as follows: offset is the position of the block in the file, and its last two digits are approximately uniformly distributed from 00 to 99. Therefore, we cache only those blocks that fall into the range we need and skip the rest.
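The decision can be sketched in isolation like this (a standalone re-implementation of the check, not the HBase class itself):

```java
// Sketch of the caching decision from the patch above: a block is cached
// only if the last two digits of its file offset fall below the threshold.
public class CacheSampling {
    static boolean shouldCache(long offset, int cacheDataBlockPercent) {
        return cacheDataBlockPercent == 100 || offset % 100 < cacheDataBlockPercent;
    }

    public static void main(String[] args) {
        int percent = 20, cached = 0, total = 100_000;
        for (long offset = 0; offset < total; offset++) {
            if (shouldCache(offset, percent)) cached++;
        }
        // For offsets whose last two digits are uniformly distributed,
        // exactly cacheDataBlockPercent of the data blocks get cached.
        System.out.println(100.0 * cached / total + "% cached");  // prints "20.0% cached"
    }
}
```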

For example, set cacheDataBlockPercent = 20 and see what happens:


The result is there. The graphs below make it clear why this acceleration happened: we save a lot of GC resources by not doing the Sisyphean labor of placing data in the cache only to immediately throw it away:


At the same time, CPU utilization grows, but much less than performance:


It is also worth noting that the blocks stored in BlockCache differ. Most of them, about 95%, are data blocks. The rest is metadata such as Bloom filters or LEAF_INDEX blocks, etc. There is little of this data, but it is very valuable, because before turning to the data itself, HBase consults the meta to understand whether it needs to search here at all and, if so, where exactly the block of interest is.

That is why the code checks the condition buf.getBlockType().isData(): thanks to it, metadata is kept in the cache in any case.

Now let's increase the load and slightly tune the feature at the same time. In the first test we set the cutoff percentage to 20, and BlockCache was somewhat underutilized. Now let's set it to 23% and add 100 threads every 5 minutes to see when saturation occurs:


Here we see that the original version almost immediately hits a ceiling of about 100 thousand requests per second, whereas the patch accelerates up to 300 thousand. At the same time, it is clear that further acceleration is no longer so "free": CPU utilization is growing too.

However, this is not a very elegant solution, since we do not know in advance what percentage of blocks should be cached; it depends on the load profile. Therefore, a mechanism was implemented to automatically adjust this parameter depending on the activity of read operations.

Three options have been added to control this:

hbase.lru.cache.heavy.eviction.count.limit - sets how many times the cache eviction process must run before we start using the optimization (i.e. skipping blocks). By default it equals MAX_INT = 2147483647, which effectively means the feature will never kick in at that value: the eviction process starts every 5-10 seconds (depending on the load), and 2147483647 * 10 / 60 / 60 / 24 / 365 = 680 years. However, we can set this parameter to 0 and make the feature work immediately after start.

However, this parameter also has a useful side. If our load is such that short-term reads (say, during the day) and long-term reads (at night) constantly alternate, then we can make the feature turn on only during long read operations.

For example, we know that short-term reads usually last about 1 minute. There is no need to start throwing out blocks then; the cache will not have time to go stale, so we can set this parameter to, say, 10. This means the optimization will start working only after a long active read has begun, i.e. after about 100 seconds. Thus, during a short-term read all blocks will get into the cache and remain available (except those evicted by the standard algorithm), while during long-term reads the feature turns on and we get much better performance.

hbase.lru.cache.heavy.eviction.mb.size.limit - sets how many megabytes we would like to place into the cache (and, naturally, evict) in 10 seconds. The feature will try to reach this value and maintain it. The point is this: if we shove gigabytes into the cache, gigabytes will have to be evicted, and as we saw above, that is very expensive. However, you should not set it too small either, as this will lead to premature exit from the block-skipping mode. For powerful servers (about 20-40 physical cores) it is optimal to set about 300-400 MB; for the middle class (~10 cores), 200-300 MB; for weak systems (2-5 cores), 50-100 MB may be normal (not tested on such systems).

Let's see how this works. Suppose we set hbase.lru.cache.heavy.eviction.mb.size.limit = 500, there is some read load, and every ~10 seconds we calculate how many bytes were evicted using the formula:

Overhead = Freed Bytes Sum (MB) * 100 / Limit (MB) - 100;

If in fact 2000 MB were evicted, then Overhead is equal to:

2000 * 100 / 500 - 100 = 300%

The algorithm tries to keep Overhead within a few tens of percent, so the feature will reduce the percentage of cached blocks, thereby implementing an auto-tuning mechanism.

However, if the load has dropped and, say, only 200 MB are evicted, Overhead becomes negative (the so-called overshooting):

200 * 100 / 500 - 100 = -60%

Then the feature, on the contrary, will increase the percentage of cached blocks until Overhead becomes positive.
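Both cases in code (a sketch of the formula above, not the actual HBase method):

```java
// The Overhead formula from the text, evaluated for both situations:
// heavy eviction (2000 MB freed vs a 500 MB limit) and overshooting (200 MB).
public class OverheadCalc {
    static int overheadPercent(long freedMb, long limitMb) {
        return (int) (freedMb * 100 / limitMb) - 100;
    }

    public static void main(String[] args) {
        System.out.println(overheadPercent(2000, 500)); // 300: reduce the caching percentage
        System.out.println(overheadPercent(200, 500));  // -60: backpressure, cache more blocks
    }
}
```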

Below is an example of how this looks on real data. There is no point trying to reach 0%; it is impossible. Values of about 30-100% are very good and help avoid premature exit from the optimization mode during short-term bursts.

hbase.lru.cache.heavy.eviction.overhead.coefficient - sets how fast we would like to get the result. If we know for sure that our reads are mostly long and we don't want to wait, we can increase this coefficient and get high performance sooner.

For example, suppose we set this coefficient to 0.01. Overhead (see above) will be multiplied by it, and the percentage of cached blocks will be reduced by the result. Say Overhead = 300% and the coefficient = 0.01; then the percentage of cached blocks will be reduced by 3%.

A similar "Backpressure" logic is also implemented for negative values of Overhead (overshooting). Since short-term fluctuations in the volume of reads and evictions are always possible, this mechanism allows avoiding premature exit from the optimization mode. Backpressure has inverted logic: the stronger the overshooting, the more blocks are cached.
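A sketch of how the coefficient turns Overhead into a step change of the caching percentage; the clamping values (a 15% cap and a 1% floor) come from the implementation code below:

```java
// How the coefficient converts Overhead into a reduction of the caching
// percentage, mirroring the clamping in the RegionServer patch.
public class TuningStep {
    static int applyStep(int cachePercent, int overheadPercent, double coefficient) {
        int change = (int) (overheadPercent * coefficient);
        change = Math.min(15, change);              // don't be greedy: avoids premature exit
        change = Math.max(0, change);               // never increase here
        return Math.max(1, cachePercent - change);  // keep at least 1% of blocks cached
    }

    public static void main(String[] args) {
        System.out.println(applyStep(100, 300, 0.01));  // 97: Overhead 300% * 0.01 = -3%
        System.out.println(applyStep(20, 1781, 0.01));  // 5: the step 17 is clamped to 15
    }
}
```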


Implementation code

        LruBlockCache cache = this.cache.get();
        if (cache == null) {
          break;
        }
        freedSumMb += cache.evict() / 1024 / 1024;
        /*
         * Sometimes we read more data than can fit into BlockCache,
         * which causes a high rate of evictions.
         * This in turn leads to heavy Garbage Collector work.
         * So a lot of blocks are put into BlockCache but never read,
         * wasting a lot of CPU resources.
         * Here we analyze how many bytes were freed and decide
         * whether the time has come to reduce the number of cached blocks.
         * It helps avoid putting too many blocks into BlockCache
         * when evict() is very active, and saves CPU for other jobs.
         * More details: https://issues.apache.org/jira/browse/HBASE-23887
         */

        // First of all we have to check how much time has passed
        // since the previous evict() was launched.
        // It should be almost the same interval (+/- 10s),
        // so that we get comparable volumes of freed bytes each time.
        // 10s because this is the default period to run evict() (see above this.wait)
        long stopTime = System.currentTimeMillis();
        if ((stopTime - startTime) > 1000 * 10 - 1) {
          // Here we have to figure out which situation we are in.
          // We have the limit "hbase.lru.cache.heavy.eviction.mb.size.limit"
          // and can calculate the overhead relative to it.
          // We will use this information to decide
          // how to change the percentage of cached blocks.
          freedDataOverheadPercent =
            (int) (freedSumMb * 100 / cache.heavyEvictionMbSizeLimit) - 100;
          if (freedSumMb > cache.heavyEvictionMbSizeLimit) {
            // Now we are above the limit.
            // But maybe we should ignore it because it will end quite soon.
            heavyEvictionCount++;
            if (heavyEvictionCount > cache.heavyEvictionCountLimit) {
              // It has been going on for a long time and we have to reduce
              // the number of cached blocks now. So we calculate here how many
              // blocks we want to skip. It depends on:
              // 1. Overhead - if the overhead is big, we can reduce the number
              // of cached blocks more aggressively.
              // 2. How fast we want to get the result. If we know that our
              // heavy reading will last a long time, we don't want to wait and can
              // increase the coefficient to get good performance sooner.
              // But if we are not sure, we can do it slowly, which helps prevent
              // premature exit from this mode. So, a higher coefficient
              // gives better performance when heavy reading is stable,
              // while a lower coefficient adapts better
              // when the reading pattern is changing.
              int change =
                (int) (freedDataOverheadPercent * cache.heavyEvictionOverheadCoefficient);
              // But practice shows that a 15% reduction is quite enough.
              // We are not greedy (greed could lead to premature exit).
              change = Math.min(15, change);
              change = Math.max(0, change); // it should never be negative, but check just in case
              // So this is the key point: here we reduce the % of cached blocks
              cache.cacheDataBlockPercent -= change;
              // If we go down too deep we have to stop: at least 1% should remain.
              cache.cacheDataBlockPercent = Math.max(1, cache.cacheDataBlockPercent);
            }
          } else {
            // Well, we have got overshooting.
            // Maybe it is just a short-term fluctuation and we can stay in this mode.
            // This helps avoid premature exit during short-term fluctuations.
            // If the overshooting is less than 90%, we will try to increase the
            // percentage of cached blocks and hope it is enough.
            if (freedSumMb >= cache.heavyEvictionMbSizeLimit * 0.1) {
              // Simple logic: more overshooting - more cached blocks (backpressure)
              int change = (int) (-freedDataOverheadPercent * 0.1 + 1);
              cache.cacheDataBlockPercent += change;
              // But it can't be more than 100%, so check it.
              cache.cacheDataBlockPercent = Math.min(100, cache.cacheDataBlockPercent);
            } else {
              // Looks like the heavy reading is over.
              // Just exit this mode.
              heavyEvictionCount = 0;
              cache.cacheDataBlockPercent = 100;
            }
          }
          LOG.info("BlockCache evicted (MB): {}, overhead (%): {}, " +
            "heavy eviction counter: {}, " +
            "current caching DataBlock (%): {}",
            freedSumMb, freedDataOverheadPercent,
            heavyEvictionCount, cache.cacheDataBlockPercent);

          freedSumMb = 0;
          startTime = stopTime;
        }

Now let's look at all this with a real example. We have the following test scenario:

  1. Start doing Scan (25 threads, batch = 100)
  2. After 5 minutes, add multi-gets (25 threads, batch = 100)
  3. After 5 minutes, turn off multi-gets (only scan remains again)

We do two runs: first with hbase.lru.cache.heavy.eviction.count.limit = 10000 (which effectively disables the feature), and then with limit = 0 (which enables it).

In the logs below, we see how the feature turns on and brings Overhead down to 14-71%. From time to time the load drops, which triggers Backpressure, and HBase caches more blocks again.

RegionServer log
evicted (MB): 0, ratio 0.0, overhead (%): -100, heavy eviction counter: 0, current caching DataBlock (%): 100
evicted (MB): 0, ratio 0.0, overhead (%): -100, heavy eviction counter: 0, current caching DataBlock (%): 100
evicted (MB): 2170, ratio 1.09, overhead (%): 985, heavy eviction counter: 1, current caching DataBlock (%): 91 < start
evicted (MB): 3763, ratio 1.08, overhead (%): 1781, heavy eviction counter: 2, current caching DataBlock (%): 76
evicted (MB): 3306, ratio 1.07, overhead (%): 1553, heavy eviction counter: 3, current caching DataBlock (%): 61
evicted (MB): 2508, ratio 1.06, overhead (%): 1154, heavy eviction counter: 4, current caching DataBlock (%): 50
evicted (MB): 1824, ratio 1.04, overhead (%): 812, heavy eviction counter: 5, current caching DataBlock (%): 42
evicted (MB): 1482, ratio 1.03, overhead (%): 641, heavy eviction counter: 6, current caching DataBlock (%): 36
evicted (MB): 1140, ratio 1.01, overhead (%): 470, heavy eviction counter: 7, current caching DataBlock (%): 32
evicted (MB): 913, ratio 1.0, overhead (%): 356, heavy eviction counter: 8, current caching DataBlock (%): 29
evicted (MB): 912, ratio 0.89, overhead (%): 356, heavy eviction counter: 9, current caching DataBlock (%): 26
evicted (MB): 684, ratio 0.76, overhead (%): 242, heavy eviction counter: 10, current caching DataBlock (%): 24
evicted (MB): 684, ratio 0.61, overhead (%): 242, heavy eviction counter: 11, current caching DataBlock (%): 22
evicted (MB): 456, ratio 0.51, overhead (%): 128, heavy eviction counter: 12, current caching DataBlock (%): 21
evicted (MB): 456, ratio 0.42, overhead (%): 128, heavy eviction counter: 13, current caching DataBlock (%): 20
evicted (MB): 456, ratio 0.33, overhead (%): 128, heavy eviction counter: 14, current caching DataBlock (%): 19
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 15, current caching DataBlock (%): 19
evicted (MB): 342, ratio 0.32, overhead (%): 71, heavy eviction counter: 16, current caching DataBlock (%): 19
evicted (MB): 342, ratio 0.31, overhead (%): 71, heavy eviction counter: 17, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.3, overhead (%): 14, heavy eviction counter: 18, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.29, overhead (%): 14, heavy eviction counter: 19, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.27, overhead (%): 14, heavy eviction counter: 20, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.25, overhead (%): 14, heavy eviction counter: 21, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.24, overhead (%): 14, heavy eviction counter: 22, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.22, overhead (%): 14, heavy eviction counter: 23, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.21, overhead (%): 14, heavy eviction counter: 24, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.2, overhead (%): 14, heavy eviction counter: 25, current caching DataBlock (%): 19
evicted (MB): 228, ratio 0.17, overhead (%): 14, heavy eviction counter: 26, current caching DataBlock (%): 19
evicted (MB): 456, ratio 0.17, overhead (%): 128, heavy eviction counter: 27, current caching DataBlock (%): 18 < added gets (but table the same)
evicted (MB): 456, ratio 0.15, overhead (%): 128, heavy eviction counter: 28, current caching DataBlock (%): 17
evicted (MB): 342, ratio 0.13, overhead (%): 71, heavy eviction counter: 29, current caching DataBlock (%): 17
evicted (MB): 342, ratio 0.11, overhead (%): 71, heavy eviction counter: 30, current caching DataBlock (%): 17
evicted (MB): 342, ratio 0.09, overhead (%): 71, heavy eviction counter: 31, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.08, overhead (%): 14, heavy eviction counter: 32, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.07, overhead (%): 14, heavy eviction counter: 33, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.06, overhead (%): 14, heavy eviction counter: 34, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.05, overhead (%): 14, heavy eviction counter: 35, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.05, overhead (%): 14, heavy eviction counter: 36, current caching DataBlock (%): 17
evicted (MB): 228, ratio 0.04, overhead (%): 14, heavy eviction counter: 37, current caching DataBlock (%): 17
evicted (MB): 109, ratio 0.04, overhead (%): -46, heavy eviction counter: 37, current caching DataBlock (%): 22 < back pressure
evicted (MB): 798, ratio 0.24, overhead (%): 299, heavy eviction counter: 38, current caching DataBlock (%): 20
evicted (MB): 798, ratio 0.29, overhead (%): 299, heavy eviction counter: 39, current caching DataBlock (%): 18
evicted (MB): 570, ratio 0.27, overhead (%): 185, heavy eviction counter: 40, current caching DataBlock (%): 17
evicted (MB): 456, ratio 0.22, overhead (%): 128, heavy eviction counter: 41, current caching DataBlock (%): 16
evicted (MB): 342, ratio 0.16, overhead (%): 71, heavy eviction counter: 42, current caching DataBlock (%): 16
evicted (MB): 342, ratio 0.11, overhead (%): 71, heavy eviction counter: 43, current caching DataBlock (%): 16
evicted (MB): 228, ratio 0.09, overhead (%): 14, heavy eviction counter: 44, current caching DataBlock (%): 16
evicted (MB): 228, ratio 0.07, overhead (%): 14, heavy eviction counter: 45, current caching DataBlock (%): 16
evicted (MB): 228, ratio 0.05, overhead (%): 14, heavy eviction counter: 46, current caching DataBlock (%): 16
evicted (MB): 222, ratio 0.04, overhead (%): 11, heavy eviction counter: 47, current caching DataBlock (%): 16
evicted (MB): 104, ratio 0.03, overhead (%): -48, heavy eviction counter: 47, current caching DataBlock (%): 21 < interrupt gets
evicted (MB): 684, ratio 0.2, overhead (%): 242, heavy eviction counter: 48, current caching DataBlock (%): 19
evicted (MB): 570, ratio 0.23, overhead (%): 185, heavy eviction counter: 49, current caching DataBlock (%): 18
evicted (MB): 342, ratio 0.22, overhead (%): 71, heavy eviction counter: 50, current caching DataBlock (%): 18
evicted (MB): 228, ratio 0.21, overhead (%): 14, heavy eviction counter: 51, current caching DataBlock (%): 18
evicted (MB): 228, ratio 0.2, overhead (%): 14, heavy eviction counter: 52, current caching DataBlock (%): 18
evicted (MB): 228, ratio 0.18, overhead (%): 14, heavy eviction counter: 53, current caching DataBlock (%): 18
evicted (MB): 228, ratio 0.16, overhead (%): 14, heavy eviction counter: 54, current caching DataBlock (%): 18
evicted (MB): 228, ratio 0.14, overhead (%): 14, heavy eviction counter: 55, current caching DataBlock (%): 18
evicted (MB): 112, ratio 0.14, overhead (%): -44, heavy eviction counter: 55, current caching DataBlock (%): 23 < back pressure
evicted (MB): 456, ratio 0.26, overhead (%): 128, heavy eviction counter: 56, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.31, overhead (%): 71, heavy eviction counter: 57, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 58, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 59, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 60, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 61, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 62, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 63, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.32, overhead (%): 71, heavy eviction counter: 64, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 65, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 66, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.32, overhead (%): 71, heavy eviction counter: 67, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 68, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.32, overhead (%): 71, heavy eviction counter: 69, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.32, overhead (%): 71, heavy eviction counter: 70, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 71, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 72, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 73, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 74, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 75, current caching DataBlock (%): 22
evicted (MB): 342, ratio 0.33, overhead (%): 71, heavy eviction counter: 76, current caching DataBlock (%): 22
evicted (MB): 21, ratio 0.33, overhead (%): -90, heavy eviction counter: 76, current caching DataBlock (%): 32
evicted (MB): 0, ratio 0.0, overhead (%): -100, heavy eviction counter: 0, current caching DataBlock (%): 100
evicted (MB): 0, ratio 0.0, overhead (%): -100, heavy eviction counter: 0, current caching DataBlock (%): 100

The scans were needed to show the same process as a graph of the relationship between two sections of the cache: single (holding blocks that no one has requested yet) and multi (holding data requested at least once):


And finally, here is how the parameters work, shown as a graph. For comparison, the cache was completely turned off at the start; then HBase was launched with caching and with the start of the optimization delayed by 5 minutes (30 eviction cycles).

The full code can be found in pull request HBASE-23887 on GitHub.

However, 300 thousand reads per second is not all that can be squeezed out of this hardware under these conditions. The fact is that when data has to be accessed through HDFS, the ShortCircuitCache (hereinafter SSC) mechanism is used, which allows accessing the data directly, avoiding network interactions.

Profiling showed that although this mechanism gives a big gain, at some point it itself becomes a bottleneck, because almost all heavy operations occur inside a lock, and most of the time threads end up waiting on it.


Having realized this, we saw that the problem can be circumvented by creating an array of independent SSCs:

private final ShortCircuitCache[] shortCircuitCache;
...
shortCircuitCache = new ShortCircuitCache[this.clientShortCircuitNum];
for (int i = 0; i < this.clientShortCircuitNum; i++)
  this.shortCircuitCache[i] = new ShortCircuitCache(…);

And then work with them, sharding requests across the caches by the offset so that they do not intersect:

public ShortCircuitCache getShortCircuitCache(long idx) {
    return shortCircuitCache[(int) (idx % clientShortCircuitNum)];
}
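The effect of this sharding can be sketched as follows (illustrative code, not the patch itself): each stripe has its own lock, so threads working on different offsets no longer contend.

```java
// Distributing block offsets across independent caches, as in
// getShortCircuitCache above. With 64 KB blocks and 5 stripes, consecutive
// block offsets cycle through all stripes, spreading the lock contention.
public class StripedCaches {
    static int stripeFor(long offset, int numStripes) {
        return (int) (offset % numStripes);
    }

    public static void main(String[] args) {
        int[] hits = new int[5];
        for (long block = 0; block < 1000; block++) {
            hits[stripeFor(block * 65536, 5)]++;
        }
        // An even spread over the 5 stripes:
        System.out.println(java.util.Arrays.toString(hits)); // [200, 200, 200, 200, 200]
    }
}
```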

Now you can start testing. To do this, we will read files from HDFS with a simple multi-threaded application. We set the parameters:

conf.set("dfs.client.read.shortcircuit", "true");
conf.set("dfs.client.read.shortcircuit.buffer.size", "65536"); // default = 1 MB, which slows reading considerably, so it is better to match it to actual needs
conf.set("dfs.client.short.circuit.num", num); // from 1 to 10

And just read the files:

long position = 0;
byte[] byteBuffer = new byte[65536];
FSDataInputStream in = fileSystem.open(path);
for (int i = 0; i < count; i++) {
    position += 65536;
    if (position > 900000000)
        position = 0L;
    int res = in.read(position, byteBuffer, 0, 65536);
}

This code is executed in separate threads, and we increase the number of simultaneously read files (from 10 to 200, the horizontal axis) and the number of caches (from 1 to 10, the separate curves). The vertical axis shows the speedup that increasing the number of SSCs gives relative to the single-cache case.


How to read the graph: 100 thousand reads in 64 KB blocks take 78 seconds with a single cache, whereas with 5 caches they take 16 seconds, i.e. a speedup of ~5x. As the graph shows, the effect is barely noticeable with a small number of parallel reads; it starts to play a noticeable role when more than 50 threads are reading. It is also noticeable that increasing the number of SSCs beyond 6 gives a significantly smaller performance gain.

Note 1: since the test results are quite volatile (see below), 3 runs were carried out and the resulting values were averaged.

Note 2: the performance gain from configuring random access is the same, although the access itself is slightly slower.

However, it should be noted that, unlike the HBase case, this acceleration is not always free: here we "unlock" the CPU's ability to do more work instead of hanging on locks.


Here you can see that in general, an increase in the number of caches gives an approximately proportional increase in CPU utilization. However, there are several more winning combinations.

For example, let's take a closer look at SSC = 3. The performance gain across the range is about 3.3x. Below are the results of all three separate runs.


CPU consumption, meanwhile, grows by about 2.8x. The difference is not huge, but little Greta is already happy, and maybe she will even have time to attend school.

Thus, this will have a positive effect on any tool that makes bulk access to HDFS (for example Spark, etc.), provided that the application code is lightweight (i.e. the bottleneck is on the HDFS client side) and there is free CPU capacity. To check, let's test what effect the combined use of the BlockCache optimization and the SSC tuning gives when reading from HBase.


Here we see that under such conditions the effect is not as big as in the refined tests (reading without any processing), but an additional 80K is quite achievable. Together, both optimizations give a speedup of up to 4x.

A PR was also made for this optimization [HDFS-15202]; it has been merged, and this functionality will be available in future releases.

And finally, it was interesting to compare the read performance of HBase with a similar wide-column database, Cassandra.

To do this, we launched instances of the standard YCSB load testing utility from two hosts (800 threads in total). On the server side: 4 RegionServer instances and Cassandra on 4 hosts (different from those where the clients run, to avoid their influence). Reads came from tables of the following sizes:

HBase - 300 GB on HDFS (100 GB raw data)

Cassandra - 250 GB (replication factor = 3)

I.e. the volumes were about the same (slightly more in HBase).

HBase options:

dfs.client.short.circuit.num = 5 (HDFS client optimization)

hbase.lru.cache.heavy.eviction.count.limit = 30 - the patch starts working after 30 evictions (~5 minutes)

hbase.lru.cache.heavy.eviction.mb.size.limit = 300 - target volume of caching and eviction

The YCSB logs have been parsed and compiled into Excel charts:


As you can see, these optimizations make it possible to equalize the performance of the two databases under these conditions and achieve 450 thousand reads per second.

We hope this information can be useful to someone in the course of an exciting struggle for performance.

Source: habr.com
