TSDB analysis in Prometheus 2

The time series database (TSDB) in Prometheus 2 is a great piece of engineering, offering significant improvements over the storage engine in Prometheus 1 in terms of data ingestion speed, query execution speed, and resource efficiency. While we were integrating Prometheus 2 into Percona Monitoring and Management (PMM), I had the opportunity to look into the performance of the Prometheus 2 TSDB. In this article I will talk about the results of those observations.

Typical Prometheus workload

For those used to dealing with general-purpose databases, the typical Prometheus workload is quite interesting. The rate of data accumulation tends to be stable: the services you monitor usually send roughly the same number of metrics, and the infrastructure changes relatively slowly.
Read requests may come from various sources. Some of them, like alerts, also tend toward a stable and predictable rate. Others, such as user queries, may cause spikes, although this is not the case for most of the workload.

Load test

During testing, I focused on the ability to ingest data. I deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14) on a Linode instance using this StackScript. For the most realistic load generation, I used the same StackScript to run several MySQL nodes under a real workload (a Sysbench TPC-C test), each of which also emulated 10 Linux/MySQL nodes.
All of the following tests were performed on a Linode server with eight vCPUs and 32 GB of memory, running 20 load simulations monitoring 800 MySQL instances. Or, in Prometheus terms: 440 targets, 380 scrapes per second, about 1.7K samples ingested per second, and 1.7 million active time series.
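If you want to see the equivalent numbers for your own installation, Prometheus exposes them about itself. Below is a minimal sketch that queries the HTTP API for the current ingest rate and the number of active series in the head block; the localhost:9090 address is an assumption, and the raw JSON response is printed as-is rather than parsed.

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// query runs an instant query against the Prometheus HTTP API and
// returns the raw JSON response body.
func query(base, expr string) (string, error) {
	resp, err := http.Get(base + "/api/v1/query?query=" + url.QueryEscape(expr))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	base := "http://localhost:9090" // assumption: a local Prometheus server

	// Samples ingested per second, averaged over the last 5 minutes.
	ingest, err := query(base, "rate(prometheus_tsdb_head_samples_appended_total[5m])")
	if err != nil {
		panic(err)
	}
	fmt.Println("ingest rate:", ingest)

	// Number of active series currently in the head block.
	series, err := query(base, "prometheus_tsdb_head_series")
	if err != nil {
		panic(err)
	}
	fmt.Println("active series:", series)
}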

Design

The traditional database approach, including the one used by Prometheus 1.x, is to limit memory. If it is not enough to handle the load, you get high latencies and some queries fail. Memory usage in Prometheus 2 is configurable via the storage.tsdb.min-block-duration flag, which determines how long records are kept in memory before being flushed to disk (the default is 2 hours). The amount of memory needed depends on the number of time series and labels, and on the scrape frequency, on top of the raw incoming stream. In terms of disk space, Prometheus aims to use 3 bytes per record (sample). Memory requirements, on the other hand, are much higher.
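To put the three-bytes-per-sample figure into perspective, here is a back-of-envelope sketch; the one-million-samples-per-second ingest rate is the figure mentioned later for this test setup, not a general recommendation.

package main

import "fmt"

func main() {
	// Rough disk estimate from the ~3 bytes per sample figure above.
	const bytesPerSample = 3.0
	const samplesPerSec = 1_000_000.0 // assumption: ~1M samples/sec, as in this test

	perDay := bytesPerSample * samplesPerSec * 86400 // bytes written per day
	fmt.Printf("~%.0f GB per day before retention trims old blocks\n", perDay/1e9)
}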

While it's possible to configure the block duration, it's not recommended to set it manually, so you're left with the task of giving Prometheus as much memory as it needs for your workload.
If there is not enough memory to keep up with the incoming stream of metrics, Prometheus will crash with an out-of-memory error or be killed by the OOM killer.
Adding swap to delay the crash when Prometheus runs out of memory doesn't really help: using swap makes memory consumption explode. I suspect this has to do with Go, its garbage collector, and the way it interacts with swap.
Another interesting design decision is that head block flushes to disk are aligned to wall-clock time, rather than being counted from the moment the process started.

[Graph: head block flushes to disk every two hours]

As you can see from the graph, flushes to disk happen every two hours. If you change the min-block-duration parameter to one hour, these flushes will happen every hour, offset by half an hour.
If you want to use this and other graphs in your Prometheus installation, you can use this dashboard. It was designed for PMM, but with a few modifications it works with any Prometheus installation.
There is an active block, called the head block, which is kept in memory; blocks with older data are accessed via mmap(). This removes the need to configure a cache separately, but it also means that you need to leave enough room for the operating system cache if you want to query data older than the head block.
It also means that Prometheus's virtual memory consumption will look quite high, which is nothing to worry about.
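A quick way to see the gap between mapped and resident memory on Linux is to compare VmSize and VmRSS for the Prometheus process. The sketch below is a rough illustration and assumes you pass the PID on the command line, for example: go run vmcheck.go $(pidof prometheus).

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Prints virtual (VmSize) and resident (VmRSS) memory for a given PID.
// The difference is largely the mmap()ed blocks served from the OS page cache.
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: vmcheck <pid>")
		os.Exit(1)
	}
	f, err := os.Open("/proc/" + os.Args[1] + "/status")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "VmSize:") || strings.HasPrefix(line, "VmRSS:") {
			fmt.Println(line)
		}
	}
}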

[Graph: Prometheus memory usage]

Another interesting design point is the use of a write-ahead log (WAL). As you can see from the repository documentation, Prometheus uses the WAL to avoid losing data in a crash. The specific mechanisms for guaranteeing data durability are, unfortunately, not well documented. Prometheus 2.3.2 flushes the WAL to disk every 10 seconds, and this setting is not user-configurable.
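The WAL itself is just a directory of numbered segment files under the data directory (one of them, wal/007357, shows up in the recovery log later in this article). As a rough illustration, the sketch below sums up how much WAL is currently sitting on disk; the data directory path is taken from that log excerpt and should be adjusted to your storage.tsdb.path.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumption: adjust this to your storage.tsdb.path setting.
	walDir := "/opt/prometheus/data/.prom2-data/wal"

	entries, err := os.ReadDir(walDir)
	if err != nil {
		panic(err)
	}
	var total int64
	for _, e := range entries {
		if e.IsDir() { // skip any subdirectories
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		total += info.Size()
		fmt.Printf("%-12s %12d bytes\n", filepath.Join("wal", e.Name()), info.Size())
	}
	fmt.Printf("total WAL size: %.1f MB\n", float64(total)/1e6)
}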

Compaction

The Prometheus TSDB is designed along the lines of LSM (log-structured merge) storage: the head block is periodically flushed to disk, while a compaction mechanism merges multiple blocks together so that queries do not have to scan too many of them. Below you can see the number of blocks I observed on the test system after a day under load.

[Graph: number of TSDB blocks after a day under load]

If you want to learn more about the storage, you can examine each block's meta.json file, which describes the block and how it came to be.

{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}
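Since every block describes itself this way, it is easy to get an overview of the whole data directory. The sketch below reads each block's meta.json and prints its time range, compaction level, and series/sample counts; the data directory path is an assumption based on the log excerpt later in this article.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// blockMeta mirrors the meta.json fields used here.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // milliseconds since epoch
	MaxTime int64  `json:"maxTime"`
	Stats   struct {
		NumSamples uint64 `json:"numSamples"`
		NumSeries  uint64 `json:"numSeries"`
	} `json:"stats"`
	Compaction struct {
		Level int `json:"level"`
	} `json:"compaction"`
}

func main() {
	dataDir := "/opt/prometheus/data/.prom2-data" // assumption: your storage.tsdb.path

	matches, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
	if err != nil {
		panic(err)
	}
	for _, path := range matches {
		raw, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			continue
		}
		fmt.Printf("%s  level=%d  %s - %s  series=%d  samples=%d\n",
			m.ULID, m.Compaction.Level,
			time.UnixMilli(m.MinTime).Format("2006-01-02 15:04"),
			time.UnixMilli(m.MaxTime).Format("2006-01-02 15:04"),
			m.Stats.NumSeries, m.Stats.NumSamples)
	}
}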

Compactions in Prometheus are triggered when the head block is flushed to disk. Several of these operations may be carried out at that point.

[Graph: compaction activity over time]

It appears that compactions are not throttled in any way and can cause large disk I/O spikes while they run.

[Graph: disk I/O spikes during compaction]

CPU load spikes as well:

[Graph: CPU usage during compaction]

Of course, this has a rather negative effect on system performance, and it is also a long-standing challenge for LSM storage engines: how do you run compactions so that queries stay fast without causing too much overhead?
Memory usage during compaction also looks quite curious.

[Graph: memory usage during compaction]

We can see how, after compaction, most of the memory changes state from Cached to Free, which means that potentially valuable data has been evicted. I am curious whether fadvise() or some other minimization technique is used here, or whether the cache was simply emptied of blocks destroyed by the compaction.

Crash recovery

Crash recovery takes time, and for good reason. For an incoming stream of a million samples per second, I had to wait about 25 minutes while recovery ran on an SSD drive.

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."

The main problem with the recovery process is high memory consumption. Even though the server can run stably with the same amount of memory in a normal situation, after a crash it may not come back up because of OOM. The only solution I found was to disable data collection, bring the server up, let it recover, and then restart it with collection enabled.

Warm up

Another behavior to keep in mind is the combination of low performance and high resource consumption right after startup, while the server warms up. During some, but not all, starts I observed a serious load on CPU and memory.

[Graphs: CPU and memory usage during warm-up]

The gaps in memory usage indicate that Prometheus cannot serve all scrapes from the start, so some data is lost.
I have not pinned down the exact cause of the high CPU and memory usage, but I suspect it is due to new time series being created in the head block at a high rate.

CPU spikes

In addition to compactions, which create a fairly high I/O load, I noticed serious spikes in CPU load every two minutes. The bursts last longer when incoming traffic is high and appear to be caused by the Go garbage collector: at least some cores become fully loaded.

[Graphs: CPU usage spikes every two minutes]

These spikes are not insignificant. It seems that while they occur, Prometheus's internal endpoint and metrics become unavailable, which causes gaps in the data over the same intervals.

[Graph: gaps in Prometheus's internal metrics]

You may also notice that the Prometheus exporter goes down for about one second.

[Graph: exporter down for about one second]

We can see correlations with garbage collection (GC).

[Graph: correlation with Go garbage collection runs]
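To check this correlation on your own installation, you can look at the Go runtime metrics Prometheus exports about itself on its /metrics endpoint. The sketch below simply dumps the GC and heap lines so they can be compared against the CPU graphs; the localhost:9090 address is an assumption.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Dumps the Go GC metrics Prometheus exposes about itself
// (go_gc_duration_seconds and the heap-in-use gauge) for a quick
// comparison against the CPU spikes seen above.
func main() {
	resp, err := http.Get("http://localhost:9090/metrics") // assumption: local server
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "go_gc_duration_seconds") ||
			strings.HasPrefix(line, "go_memstats_heap_inuse_bytes") {
			fmt.Println(line)
		}
	}
}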

Conclusion

The TSDB in Prometheus 2 is fast, capable of handling millions of time series and, at the same time, hundreds of thousands of samples per second on rather modest hardware. CPU and disk I/O utilization are also impressive. In my case it handled up to 200K samples per second per CPU core used.

To plan capacity, you need to make sure there is enough memory, and it must be real memory. The memory usage I observed was about 5 GB per 100K samples per second of the incoming stream, which, together with the operating system cache, came to about 8 GB of occupied memory per 100K samples per second.

Of course, there is still a lot of work to be done to tame the CPU and disk I/O spikes, which is not surprising given how young the Prometheus 2 TSDB is compared to InnoDB, TokuDB, RocksDB, and WiredTiger; they all had similar problems early in their life cycles.

Source: habr.com
