High performance and native partitioning: Zabbix with TimescaleDB support

Zabbix is a monitoring system. Like any other, it faces the three main problems of all monitoring systems: collecting and processing data, storing history, and clearing out old data.

The stages of receiving, processing and recording data all take time. Not much, but for a large system this can add up to long delays. The storage problem is a matter of data access: history is used for reports, checks and triggers, and delays in accessing it also affect performance. When the database grows, irrelevant data has to be deleted, and deletion is a heavy operation that also eats up resources.


Collection and storage delays in Zabbix are solved by caching: several types of caches on the server plus caching in the database. Caching does not help with the third problem, so TimescaleDB was brought into Zabbix. Andrey Gushchin, a technical support engineer at Zabbix SIA, will tell you about it. Andrey has been supporting Zabbix for over 6 years and deals directly with performance.

How does TimescaleDB work, and what performance can it give compared to regular PostgreSQL? What role does TimescaleDB play for Zabbix? How do you start from scratch, how do you migrate from plain PostgreSQL, and which configuration performs better? All of this is under the cut.

Performance Challenges

Every monitoring system faces certain performance challenges. I will talk about three of them: data collection and processing, storage, and history cleaning.

Fast data collection and processing. A good monitoring system should quickly receive all the data and process it according to trigger expressions - according to its own criteria. After processing, the system must also quickly store this data in the database in order to use it later.

History storage. A good monitoring system should store history in a database and provide easy access to metrics. The history is needed for reports, graphs, triggers, thresholds and calculated items used for alerting.

Clearing history. Sometimes there comes a day when you no longer need to store metrics. Why do you need data collected 5 years ago, or even a month or two ago: some nodes have been removed, some hosts or metrics are no longer needed because they are outdated and no longer collected. A good monitoring system should store historical data and delete it from time to time so that the database does not grow.

Cleaning up stale data is a thorny issue that has a big impact on database performance.

Caching in Zabbix

In Zabbix, the first and second challenges are solved with caching. RAM is used for collecting and processing data, and for storage: history for triggers, graphs and calculated items. On the DB side, there is some caching for basic selections, for example for graphs.

The caching on the Zabbix server side itself consists of:

  • ConfigurationCache;
  • ValueCache;
  • HistoryCache;
  • TrendsCache.
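The sizes of these caches are set in zabbix_server.conf. A minimal sketch — the parameter names are the standard Zabbix ones, the values are purely illustrative and depend on the size of the installation:

# ConfigurationCache: hosts, items, triggers
CacheSize=512M
# HistoryCache: collected values waiting to be written to the database
HistoryCacheSize=256M
HistoryIndexCacheSize=64M
# TrendsCache: hourly aggregates calculated on the fly
TrendCacheSize=64M
# ValueCache: values kept in memory for trigger and calculated item evaluation
ValueCacheSize=512M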

Let's consider them in more detail.

ConfigurationCache

This is the main cache in which we store metrics, hosts, data items, triggers - everything that is needed for PreProcessing and for data collection.


All of this is stored in the ConfigurationCache so as not to send unnecessary queries to the database. After the server starts, we populate this cache and then update the configuration in it periodically.

Data collection

The scheme is quite large, but the main thing in it is the collectors. These are various "pollers" — collection processes. They are responsible for different types of collection: they gather data via SNMP, IPMI and so on, and pass it all on to PreProcessing.

The collectors ("pollers") are circled in orange.

Zabbix also has calculated and aggregate items that are needed to aggregate checks. For these, the data is taken directly from the ValueCache.

PreProcessing and HistoryCache

All collectors use the ConfigurationCache to receive jobs. Then they pass them on to PreProcessing.


PreProcessing uses ConfigurationCache to get PreProcessing steps. It processes this data in various ways.

After the data has gone through PreProcessing, we store it in the HistoryCache for further processing. This completes data collection, and we move on to the main process in Zabbix — the history syncer (since this is a monolithic architecture).

Note: PreProcessing is a fairly heavy operation. Since v4.2 it has also been moved to the proxy. If you have a very large Zabbix with a large number of items and a high collection frequency, this makes things a lot easier.

ValueCache, history & trends cache

History syncer is the main process that atomically processes each data element, that is, each value.

The history syncer takes values from the HistoryCache and checks the ConfigurationCache for triggers needed for calculations. If there are any, it calculates them.

The history syncer creates an event, escalates it to create alerts if the configuration requires it, and records the value. If there are triggers for further processing, it remembers this value in the ValueCache so as not to go back to the history table. This is how the ValueCache is filled with the data needed to calculate triggers and calculated items.

The history syncer writes all the data to the database, and the database writes it to disk. Processing ends here.


DB caching

On the DB side, there are various caches that matter when you want to look at graphs or event reports:

  • Innodb_buffer_pool on the MySQL side;
  • shared_buffers and effective_cache_size on the PostgreSQL side;
  • shared_pool on the Oracle side;
  • buffer pools on the DB2 side.

There are many other caches, but these are the main ones for all these databases. They allow data that queries frequently need to be kept in RAM, and each database has its own technology for this.
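On the PostgreSQL side, for example, this comes down to a couple of lines in postgresql.conf. The parameter names are real; the values are illustrative and depend on the RAM available to the database:

shared_buffers = 4GB              # pages PostgreSQL keeps in its own cache
effective_cache_size = 12GB       # planner hint: how much memory is available for caching overall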

Database performance is critical

The Zabbix server is constantly collecting data and writing it to the database. When it restarts, it also reads from the history to fill the ValueCache. Scripts and reports use the Zabbix API, which is built on top of the web interface. The Zabbix API queries the database and retrieves the data needed for graphs, reports, event lists and recent problems.


For visualization there is Grafana, a popular solution among our users. It can send requests both through the Zabbix API and directly to the database, which creates additional concurrency when fetching data. Therefore, finer and better database tuning is needed to keep up with the fast delivery of results.

Housekeeper

The third performance challenge in Zabbix is cleaning history with the Housekeeper. It respects all the settings: the data items specify how many days to keep the history and the dynamics of changes (trends).

The TrendsCache is calculated on the fly: when data comes in, we aggregate it over one hour and put it into the tables for the dynamics of trend changes.
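As an illustration of what ends up in the trend tables: the real aggregation happens in memory in the TrendsCache, not via SQL, but conceptually each hourly record looks like the result of this query (column names follow the Zabbix schema):

SELECT itemid,
       (clock / 3600) * 3600 AS clock,   -- start of the hour, epoch seconds
       count(*)              AS num,
       min(value)            AS value_min,
       avg(value)            AS value_avg,
       max(value)            AS value_max
  FROM history
 GROUP BY itemid, (clock / 3600) * 3600;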

The Housekeeper starts and removes information from the database with ordinary "selects". This is not always efficient, which can be seen from the performance graphs of the internal processes.


The red graph shows that the history syncer is constantly busy. The orange graph at the top is the Housekeeper, which runs constantly and waits for the database to delete all the rows it has specified.

When should you disable the Housekeeper? For example, there is an "Item ID" and you need to delete the last 5 thousand rows within a certain time. Of course, this is done via indexes, but the dataset is usually very large, and the database still has to read from disk and load it into the cache. This is always a very expensive operation for the database and, depending on the size of the database, can lead to performance problems.
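The statements the Housekeeper runs are roughly of this shape — a simplified sketch, since the real queries differ between DB backends and Zabbix versions, and the item id and retention period here are made up:

-- delete history of one item that is older than its retention period
DELETE FROM history
 WHERE itemid = 12345
   AND clock < extract(epoch FROM now() - interval '90 days')::integer;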


The Housekeeper is simple to disable. In the web interface, under "Administration → General", there are Housekeeper settings. Disable internal housekeeping for history and trends there, and the Housekeeper no longer manages them.

With the Housekeeper disabled, the graphs level off. What problems can arise in this case, and what can help solve the third performance challenge?

Partitioning

Partitioning is configured somewhat differently in each of the relational databases I have listed. Each has its own technology, but in general they are similar. Creating a new partition often brings certain problems.

Typically, partitions are sized depending on the "setup" — the amount of data created in one day. As a rule, history partitioning is set to one day; this is the minimum. For new trend partitions it is one month.

The values may change in the case of a very large "setup". A small "setup" handles up to 5 000 nvps (new values per second), while medium and large ones run into the tens of thousands of nvps. These are large and very large installations that require careful configuration of the database itself.

On very large installations, one day may not be optimal. I have seen MySQL partitions of 40 GB or more per day. This is a very large amount of data that can lead to problems and should be reduced.

What does partitioning give?

Partitioned tables. These are often separate files on disk, and the query planner chooses a single partition more optimally. Partitioning is usually done by range, and this is true for Zabbix as well: we use a "timestamp" — the time since the beginning of the epoch, which is a plain number. You set the beginning and the end of the day, and that is a partition.

Fast deletion instead of DELETE. A whole file/sub-table is removed, rather than a set of rows being selected for deletion.

Significantly faster data sampling with SELECT: a query uses one or a few partitions rather than the whole table. If you are accessing data that is two days old, the database fetches it faster, because only one file has to be loaded into the cache and returned, not a huge table.

Many databases also speed up INSERT, since rows go into a smaller child table.
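To make the range idea concrete, here is a sketch in plain PostgreSQL (declarative partitioning, available since PostgreSQL 10). The table layout is simplified and the day boundaries are illustrative epoch timestamps:

-- a history-like table partitioned by the integer "clock" column (epoch seconds)
CREATE TABLE history_demo (
    itemid bigint  NOT NULL,
    clock  integer NOT NULL,
    value  numeric NOT NULL
) PARTITION BY RANGE (clock);

-- one partition per day: [start of the day, start of the next day)
CREATE TABLE history_demo_2019_10_07 PARTITION OF history_demo
    FOR VALUES FROM (1570406400) TO (1570492800);

-- removing an old day later is a cheap metadata operation, unlike a row-by-row DELETE
DROP TABLE history_demo_2019_10_07;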

TimescaleDB

For v4.2 we turned our attention to TimescaleDB. It is a PostgreSQL extension with a native interface. The extension works efficiently with time-series data without losing the advantages of relational databases, and it also partitions automatically.

TimescaleDB has the concept of a hypertable, which you create. Inside it are chunks — partitions. Chunks are automatically managed fragments of the hypertable that do not affect the other fragments. Each chunk has its own time range.
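A quick way to see the chunks behind a hypertable — the function and view names below are from the TimescaleDB 1.x line that Zabbix 4.2 was built against, and they differ slightly in later versions:

-- list the chunks that back the "history" hypertable
SELECT show_chunks('history');

-- hypertable metadata is exposed through the informational views
SELECT * FROM timescaledb_information.hypertable;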


TimescaleDB vs PostgreSQL

TimescaleDB is really efficient. The makers of the extension claim that they use a more correct algorithm for processing queries, in particular inserts: as the dataset grows, insert performance stays nearly constant.


After 200 million rows, plain PostgreSQL usually starts to sag badly and lose performance. TimescaleDB lets you insert data efficiently at any data volume.

Installation

Installing TimescaleDB from packages is easy enough. Everything is described in detail in the documentation — it depends on the official PostgreSQL packages. TimescaleDB can also be built and compiled by hand.
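An illustrative sequence for a Debian-like system with PostgreSQL 10 — the package name depends on the distribution and PostgreSQL version, so treat this as a sketch:

sudo apt install timescaledb-postgresql-10
# the library must be preloaded; add this to postgresql.conf and restart:
#   shared_preload_libraries = 'timescaledb'
sudo systemctl restart postgresql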

For the Zabbix database, we simply activate the extension:

echo "CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;" | sudo -u postgres psql zabbix

This activates the extension and creates it for the Zabbix database. The last step is to create the hypertables.
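To check that the extension is actually in place, a standard PostgreSQL catalog query is enough:

echo "SELECT extname, extversion FROM pg_extension;" | sudo -u postgres psql zabbix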

Migrating history tables to TimescaleDB

There is a special function for this, create_hypertable:

SELECT create_hypertable('history', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('history_uint', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('history_log', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('history_text', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('history_str', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('trends', 'clock', chunk_time_interval => 86400, migrate_data => true);
SELECT create_hypertable('trends_uint', 'clock', chunk_time_interval => 86400, migrate_data => true);
UPDATE config SET db_extension='timescaledb', hk_history_global=1, hk_trends_global=1;

The function takes the table in the database for which to create a hypertable, and the field by which to partition it. chunk_time_interval is the interval of the partition chunks to use; in my case, the interval is one day — 86 400 seconds.

The next parameter is migrate_data: if it is set to true, all current data is transferred into newly created chunks. I have used migrate_data myself: I had about 1 TB, and it took over an hour. In some cases during testing, I deleted the historical data of character types, which does not have to be stored, so as not to transfer it.

The last step is the UPDATE: set db_extension to timescaledb so that Zabbix understands that the extension is there and correctly uses the syntax and queries specific to TimescaleDB when talking to the database.
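With db_extension set and the global override flags enabled, the Zabbix housekeeper removes old history by dropping whole chunks instead of deleting rows. The same can be done by hand with drop_chunks; the call below uses the TimescaleDB 1.x argument order that was current for Zabbix 4.2 (it changed in TimescaleDB 2.0), and the 7-day retention is only an example:

SELECT drop_chunks(extract(epoch FROM now() - interval '7 days')::integer, 'history');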

Hardware configuration

I used two servers. The first was a VMware machine — small enough: 20 Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz, 16 GB of RAM and a 200 GB SSD drive.

I installed PostgreSQL 10.8 on it (the Debian package 10.8-1.pgdg90+1) with an xfs file system. I configured everything minimally for this particular database, minus what Zabbix itself would use.

On the same machine there were the Zabbix server, PostgreSQL and the load agents. I had 50 active agents that used a LoadableModule to generate various results very quickly: numbers, strings. I filled the database with a lot of data.

Initially, the configuration contained 5 000 data items per host. Almost every item had a trigger to make it look like a real installation; in some cases there was more than one, and a single network node had several thousand triggers.

The item update interval was 4–7 seconds. I regulated the load not only by using 50 agents but by adding more; I also regulated the load dynamically with the data items and reduced the update interval down to 4 s.

PostgreSQL. 35 000 nvps

My first run on this hardware was on pure PostgreSQL - 35 thousand values ​​per second. As you can see, inserting data takes fractions of a second - everything is fine and fast. The only thing is that the 200 GB SSD drive fills up quickly.


This is a standard Zabbix server performance dashboard.


The first blue graph is the number of values per second. The second graph on the right is the loading of the collection processes. The third is the loading of the internal processes: history syncers and the Housekeeper, which has been running here for quite some time.

The fourth graph shows HistoryCache usage. This is a kind of buffer before inserting into the database. The green fifth graph shows ValueCache usage, that is, how many ValueCache hits there are for triggers — several thousand values per second.

PostgreSQL. 50 000 nvps

Then I increased the load to 50 thousand values ​​per second on the same hardware.


Under the load from the Housekeeper, inserting 10 thousand values took 2–3 seconds.

Housekeeper is already starting to get in the way.

The third graph shows that, in general, the loading of trappers and history syncers is still at around 60%. On the fourth graph, the HistoryCache starts filling up quite actively while the Housekeeper is running. It is 20% full — about 0.5 GB.

PostgreSQL. 80 000 nvps

Then I increased the load to 80 thousand values ​​per second. This is approximately 400 thousand data elements and 280 thousand triggers.

The insert load on the thirty history syncers is already quite high.

I also increased various parameters: history syncers, caches.


On my hardware, the loading of the history syncers went up to the maximum. The HistoryCache quickly filled up with data: the buffer was accumulating data waiting to be processed.

All this time, I watched how the processor, RAM and other system parameters were used, and found that disk utilization was maximum.


I had reached the maximum disk capacity on this hardware and this virtual machine. At that intensity, PostgreSQL began flushing data quite actively, and the disk could no longer keep up with writing and reading.

Second server

I took another server that had 48 processors and 128 GB of RAM. I tuned it — set 60 history syncers — and achieved acceptable performance.
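In zabbix_server.conf this is controlled by StartDBSyncers; the value below is the one from this test, not a general recommendation:

StartDBSyncers=60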


In fact, this is already a performance limit where something needs to be done.

TimescaleDB. 80 000 nvps

My main task was to test the capabilities of TimescaleDB under a Zabbix load. 80 thousand values per second is a lot: a high metric collection frequency (except for Yandex, of course) and a fairly large "setup".


There is a dip on every graph — that is the data migration itself. After the dips, the load profile of the history syncers in the Zabbix server changed a lot: it dropped roughly threefold.

TimescaleDB allows you to insert data almost 3 times faster and use less HistoryCache.

Accordingly, you will receive data in a timely manner.

TimescaleDB. 120 000 nvps

Then I increased the number of data items to 500 thousand. The main task was to check the capabilities of TimescaleDB - I got a calculated value of 125 thousand values ​​per second.


This is a working "setup" that can take a long time to work. But since my disk was only 1,5 TB, I filled it up in a couple of days.


Most importantly, new TimescaleDB partitions were being created at the same time.

For performance, this is completely unnoticeable. When partitions are created in MySQL, for example, things are different: it is usually done at night, because it blocks inserts and table manipulation in general and can degrade the service. With TimescaleDB this is not the case.

As an example, here is one graph from the set in the community. In the picture, TimescaleDB has been enabled, and thanks to that the io.weight load on the processor has dropped; the resource usage of the internal process elements has also decreased. Moreover, this is an ordinary virtual machine on ordinary spinning disks, not an SSD.


Conclusions

TimescaleDB is a good solution for small "setups" that are limited by disk performance. It will let you keep working well until the database can be migrated to faster hardware.

TimescaleDB is easy to set up, gives a performance boost, works well with Zabbix and has advantages over PostgreSQL.

If you use PostgreSQL and do not plan to change it, then I recommend using PostgreSQL with the TimescaleDB extension together with Zabbix. This solution works effectively up to a medium-sized "setup".

We say "high performance" - we mean HighLoad++. It won't be long before you get to know the technologies and practices that allow services to serve millions of users. List reports for November 7 and 8, we have already drawn up, but meetups more can be suggested.

Subscribe to our newsletter and Telegram channel, in which we reveal the features of the upcoming conference, and find out how to get the most out of it.

Source: habr.com
