HighLoad++, Mikhail Makurov, Maxim Chernetsov (Intersvyaz): Zabbix, 100kNVPS on one server

HighLoad++ Moscow 2018. Moscow Hall. November 9, 15:00. Abstracts and presentation.


* Monitoring - online and analytics.
* Main limitations of the ZABBIX platform.
* Solution for scaling analytics storage.
* Optimization of the ZABBIX server.
* UI optimization.
* Experience in operating the system under loads of more than 40k NVPS.
* Brief conclusions.

Mikhail Makurov (hereinafter - MM): - Hi all!

Maxim Chernetsov (hereinafter - MCH): - Good afternoon!

MM: Let me introduce Maxim. Max is a talented engineer, the best networker I know. Maxim deals with networks and services, their development and operation.


MCH: – And I would like to tell you about Michael. Michael is a C developer. He wrote several high-load traffic processing solutions for our company. We live and work in the Urals, in the city of tough men Chelyabinsk, in the Intersvyaz company. Our company is an Internet and cable TV service provider for one million people in 16 cities.

MM: - And it is worth saying that Intersvyaz is much more than just a provider, it is an IT company. Most of our solutions are made by our IT department.

…from servers processing traffic to a call center and a mobile application. The IT department now has about 80 people with very diverse competencies.

About Zabbix and its architecture

MCH: - And now I will try to set a personal record and explain in one minute what Zabbix is.

Zabbix positions itself as an out-of-the-box enterprise-level monitoring system. It has many features that make life easier: advanced escalation rules, an API for integration, grouping and auto-discovery of hosts and metrics. Zabbix also has so-called scaling tools: proxies. And Zabbix is an open-source system.

Briefly about architecture. We can say that it consists of three components:


  • Server. Written in C. With rather complex processing and transfer of information between threads. All processing takes place in it: from receiving to saving to the database.
  • All data is stored in the database. Zabbix supports MySQL, PostgreSQL and Oracle.
  • The web interface is written in PHP. In most systems, it comes with the Apache server, but it works more efficiently in combination with nginx + php.

Today we would like to tell one story related to Zabbix from the life of our company…

A story from the life of the Intersvyaz company. What do we have and what do we need?

5 or 6 months ago. One day after work...

MCH: - Misha, hello! I'm glad I managed to catch you - there is a conversation. We again had problems with monitoring. During a major accident, everything slowed down, and there was no information about the state of the network. Unfortunately, this is not the first time this has happened. I need your help. Let's make our monitoring work under any circumstances!

MM: But let's sync first. I haven't looked there in a couple of years. As far as I remember, we abandoned Nagios and switched to Zabbix about 8 years ago. And now we seem to have 6 powerful servers and about a dozen proxies. Am I confusing anything?

MCH: - Almost. 15 servers, some of which are virtual machines. Most importantly, it does not save us at the moment when we need it most. When there is an accident, the servers slow down and nothing is visible. We tried to optimize the configuration, but that no longer gives a meaningful performance boost.

MM: - It's clear. Did you look at something, have you already dug up something from the diagnostics?

MCH: - The first thing to deal with is the database. MySQL is constantly loaded just saving new metrics, and when Zabbix starts generating a bunch of events, the database stalls for literally hours. I already told you about the configuration optimization, but just this year we updated the hardware: the servers have more than a hundred gigabytes of memory and disk arrays on SSD RAIDs - there is no point in scaling further in that direction. What do we do?

MM: - It's clear. In general, MySQL is an OLTP database. Apparently, it is no longer suitable for storing an archive of metrics of our size. Let's figure it out.

MCH: - Let's!

Zabbix and Clickhouse integration as a result of the hackathon

After some time, we received interesting data:


Most of the space in our database was occupied by the archive of metrics and less than 1% was used for configuration, templates and settings. By that time, we had been operating a Big data solution based on Clickhouse for more than a year. The direction of movement for us was obvious. At our spring Hackathon, I wrote the integration of Zabbix with Clickhouse for the server and frontend. At that time, Zabbix already had support for ElasticSearch, and we decided to compare them.


Comparison of Clickhouse and Elasticsearch

MM: - For comparison, we generated the load the same as that provided by the Zabbix server and looked at how the systems would behave. We wrote data in batches of 1000 lines using CURL. We assumed in advance that Clickhouse would be more efficient for the load profile that Zabbix does. The results even exceeded our expectations:
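For illustration, here is a minimal sketch of that kind of load generator (the `history` table name and column layout are our assumptions for the example, not the exact schema): it batches rows into a single INSERT over Clickhouse's HTTP interface, the same kind of request one would issue with curl.

```python
import urllib.parse
import urllib.request

def build_batch(rows):
    """Serialize (itemid, clock, value) tuples into TSV for one INSERT."""
    return "".join(f"{itemid}\t{clock}\t{value}\n" for itemid, clock, value in rows)

def insert_batch(rows, url="http://localhost:8123/"):
    """POST one batch to Clickhouse over HTTP; table layout is illustrative."""
    query = "INSERT INTO history (itemid, clock, value) FORMAT TabSeparated"
    req = urllib.request.Request(
        url + "?query=" + urllib.parse.quote(query),
        data=build_batch(rows).encode(),
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Writing in batches of about 1000 rows per request keeps the per-insert overhead negligible, which is exactly the load profile Clickhouse is optimized for.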


Under the same test conditions, Clickhouse wrote three times as much data. Both systems consumed very few resources when reading data, but Elasticsearch required a lot of CPU when writing:


Overall, Clickhouse significantly outperformed Elasticsearch in CPU consumption and speed. Thanks to data compression, Clickhouse also uses 11 times less disk space and does about 30 times fewer disk operations:


MCH: - Yes, Clickhouse's work with the disk subsystem is implemented very efficiently. You can use huge SATA drives for databases and get write speeds of hundreds of thousands of lines per second. The system out of the box supports sharding, replication, and is very easy to set up. We are more than satisfied with its operation during the year.

To optimize resources, you can install Clickhouse next to the existing main database and thereby save a lot of CPU time and disk operations. We moved the archive of metrics to our existing Clickhouse clusters:


We unloaded the main MySQL database so much that we could put it on the same machine as the Zabbix server and abandon the dedicated MySQL server.

How does polling work in Zabbix?

4 months ago

MM: - Well, can we forget about the problems with the database now?

MCH: - That's for sure! Another challenge we need to solve is slow data collection. All of our 15 proxy servers are now overloaded with SNMP and polling processes. And there is no other way out than to install more and more servers.

MM: - Great. But first, how does polling work in Zabbix?

MCH: – In short, there are 20 types of metrics and a dozen ways to get them. Zabbix can collect data either in the "request-response" mode, or wait for new data through the "Trapper Interface".


It is worth noting that in the original Zabbix this method (Trapper) is the fastest.

There are proxy servers for load balancing:


Proxies can perform the same collection functions as the Zabbix server, receiving tasks from it and sending the collected metrics just through the Trapper interface. This is the officially recommended way to distribute the load. Proxies are also useful for monitoring a remote infrastructure running through NAT or a slow link:
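The trapper protocol mentioned above is simple: a "ZBXD" signature, a flags byte, a little-endian payload length, and a JSON body. A minimal sketch of framing a "sender data" request (the host and item key here are made up for the example):

```python
import json
import struct

def trapper_packet(host, items):
    """Frame metrics as a Zabbix 'sender data' request:
    b'ZBXD' + flags byte 0x01 + 8-byte little-endian payload length + JSON."""
    payload = json.dumps({
        "request": "sender data",
        "data": [{"host": host, "key": key, "value": str(value)}
                 for key, value in items],
    }).encode()
    return b"ZBXD\x01" + struct.pack("<Q", len(payload)) + payload

# Example frame for one metric of a hypothetical host:
packet = trapper_packet("web01", [("icmpping", 1)])
```

The resulting packet is sent over a plain TCP connection to the server's trapper port (10051 by default), which is why pushing pre-collected values this way is so cheap for the server.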


MM: Everything is clear with architecture. I need to look at the source...

A couple of days later

The story of how nmap beat fping

MM: - Looks like I dug up something.

MCH: - Tell me!

MM: - I found that when checking availability, Zabbix checks at most 128 hosts at a time. I tried raising this figure to 500 and removed the inter-packet interval in fping - this doubled the performance. But I would like bigger numbers.

MCH: - In my practice, I sometimes have to check the availability of thousands of hosts, and I have not seen anything faster than nmap for this. I'm sure this is the fastest way. Let's try it! You need to significantly increase the number of hosts in one iteration.

MM: – Check more than five hundred? 600?

MCH: - At least a couple of thousand.

MM: - OK. The most important thing I wanted to say is that I found that most of the polling in Zabbix is ​​done synchronously. We must change it to asynchronous mode. Then we can drastically increase the number of metrics collected by the pollers, especially if we increase the number of metrics per iteration.

MCH: - Great! And when?

MM: - As usual, yesterday.

MCH: – We compared both options, fping and nmap:


On a large number of hosts, nmap was expected to be up to five times more efficient. Since nmap only checks availability and response time, we moved the loss calculation to triggers and significantly reduced availability check intervals. We found the optimal number of hosts for nmap to be around 4 thousand per iteration. Nmap allowed us to reduce the CPU cost of availability checks by a factor of three and reduce the interval from 120 seconds to 10.
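Roughly, an nmap-based sweep can be driven like this (the flags and parsing below are our illustration of a typical ping scan, not the actual patch code): one nmap process checks thousands of hosts, and the greppable output is parsed back into an "alive" set.

```python
import subprocess

def parse_grepable(output):
    """Extract alive IPs from nmap's greppable (-oG -) output."""
    alive = set()
    for line in output.splitlines():
        if line.startswith("Host:") and "Status: Up" in line:
            alive.add(line.split()[1])
    return alive

def ping_sweep(hosts):
    """Run one ping scan over a whole batch of hosts.
    -sn: ping scan only, -n: skip DNS, -oG -: greppable output to stdout."""
    out = subprocess.run(["nmap", "-sn", "-n", "-oG", "-", *hosts],
                         capture_output=True, text=True).stdout
    return parse_grepable(out)
```

Since nmap internally paces its own probes, one invocation over a batch of ~4 thousand hosts replaces thousands of individual fping runs.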

Polling Optimization

MM: - Then we got to the pollers. We were mainly interested in SNMP and agent polling. In Zabbix, polling is done synchronously, and special measures were taken to increase the efficiency of the system. In synchronous mode, unavailable hosts cause significant polling degradation. There is a whole system of states, and there are special processes, the so-called unreachable pollers, which work only with unreachable hosts:


This is a comment from the source code that shows the state matrix, the complexity of the transition system required for the system to remain efficient. On top of that, synchronous polling itself is quite slow:


That is why thousands of poller threads on a dozen proxies could not collect the required amount of data for us. The asynchronous implementation solved not only the problem with the number of threads, but also greatly simplified the state system for unavailable hosts, because for any number of hosts checked in one polling iteration, the maximum wait time is one timeout:
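The effect of going asynchronous can be shown with a toy model (a pure simulation, not the patch code): every unreachable host costs one timeout, but because the checks run concurrently, the whole batch costs roughly one timeout of wall-clock time instead of one timeout per host.

```python
import asyncio
import time

async def check(host, timeout=0.1):
    """Simulated availability check: the host never answers,
    so the check fails after exactly one timeout."""
    try:
        await asyncio.wait_for(asyncio.sleep(timeout + 1), timeout)
        return True
    except asyncio.TimeoutError:
        return False

async def poll(hosts):
    # All checks run concurrently, so total wall time is ~one timeout,
    # not len(hosts) * timeout as with synchronous polling.
    return await asyncio.gather(*(check(h) for h in hosts))

start = time.monotonic()
results = asyncio.run(poll([f"10.0.0.{i}" for i in range(100)]))
elapsed = time.monotonic() - start
```

Here 100 unreachable hosts with a 0.1 s timeout finish in about 0.1 s total; a synchronous loop would need 10 s.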


Additionally, we modified and improved the polling system for SNMP requests. The fact is that most devices cannot respond to multiple SNMP requests at the same time. Therefore, we made a hybrid mode, where SNMP polling of the same host is done synchronously:


This is done for the entire batch of hosts at once. In the end, this mode is no slower than a fully asynchronous one, since polling a hundred and fifty SNMP values sequentially is still much faster than one timeout.

Our experiments showed that the optimal number of requests in one iteration is about 8 thousand for SNMP polling. In total, the transition to asynchronous mode allowed us to speed up polling performance by a factor of 200, in some cases several hundred times.

MCH: – The resulting polling optimizations have shown that we can not only get rid of all proxies, but also reduce the intervals for many checks, and proxies will no longer be needed as a way to share the load.

About three months ago

Change the architecture - increase the load!

MM: - Well, Max, time to go to production? I need a powerful server and a good engineer.

MCH: - Okay, let's plan. It is high time we moved off the plateau of 5 thousand metrics per second.

Morning after upgrade

MCH: – Misha, we upgraded, but rolled back in the morning… Guess what speed you managed to achieve?

MM: - 20 thousand maximum.

MCH: - Yeah, 25! Unfortunately, we are right where we started.

MM: - Why? Did you run any diagnostics?

MCH: - Yes, sure! Here, for example, is an interesting top:


MM: - Let's have a look. I see that we launched a huge number of polling threads:


But at the same time, they could not utilize the system even by half:


And the overall throughput is quite low, about 4 thousand metrics per second:


Is there anything else?

MCH: – Yes, the strace of one of the pollers:


MM: - Here you can clearly see that the polling process is waiting for "semaphores". These are the locks:


MCH: - I don't follow.

MM: – Look, it's like a situation where a bunch of threads are trying to work with a resource that only one of them can use at a time. Then all they can do is share this resource by time slices:


And the total performance of working with such a resource is limited by the speed of a single core:


There are two ways to solve this problem.

Upgrade machine hardware, switch to faster cores:


Or change the architecture and parallelize the load:


MCH: - By the way, on the test machine we will use fewer cores than on the production machine, but they are 1.5 times faster in per-core frequency!

MM: - See? We need to look at the server code.

Data path in Zabbix server

MCH: - To understand this, we began to analyze how data moves inside the Zabbix server:


Cool picture, right? Let's go through it step by step to more or less clarify. There are threads and services responsible for collecting data:


They pass the collected metrics through the socket to the Preprocessor manager, where they are stored in the queue:


The preprocessor manager passes the data to its workers, which execute the preprocessing instructions and return it back through the same socket:


The preprocessor manager then stores them in the history cache:


From there, they are picked up by history syncers, which perform quite a few functions: for example, calculating triggers, filling the value cache and, most importantly, saving metrics to the history store. In general, the process is complex and quite confusing.


MM: - The first thing we saw is that most threads compete for the so-called config cache (the memory area where the entire server configuration is stored). The threads responsible for collecting data take especially many locks:


…since the configuration stores not only metrics with their parameters, but also queues from which pollers take information about what to do next. When there are many pollers, and one blocks the configuration, the rest are waiting for requests:


Pollers should not conflict


Therefore, the first thing we did was to divide the queue into 4 parts and allow pollers to safely lock these parts simultaneously:

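The idea can be sketched as follows (a toy model, not the actual config-cache code): the queue is split into shards, each protected by its own lock, so pollers contend per shard instead of on one global lock.

```python
import threading
from collections import deque

NSHARDS = 4  # number of queue parts, as in the patch's split

class ShardedQueue:
    """Poll queue split into NSHARDS parts with independent locks,
    so up to NSHARDS pollers can push/pop without blocking each other."""
    def __init__(self):
        self.shards = [deque() for _ in range(NSHARDS)]
        self.locks = [threading.Lock() for _ in range(NSHARDS)]

    def shard_of(self, hostid):
        # Hosts are assigned to shards deterministically.
        return hostid % NSHARDS

    def push(self, hostid, item):
        s = self.shard_of(hostid)
        with self.locks[s]:
            self.shards[s].append(item)

    def pop(self, shard):
        # A poller drains one shard; only that shard's lock is taken.
        with self.locks[shard]:
            return self.shards[shard].popleft() if self.shards[shard] else None
```

With four independent locks, the worst case is four pollers waiting on four different shards instead of all of them waiting on one lock.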

This removed the competition for the configuration cache, and the speed of the pollers increased significantly. But then we were faced with the fact that the preprocessor manager began to accumulate a job queue:


Preprocessor manager should be able to prioritize

This happened when it lacked throughput. All it could do then was accumulate requests from the data collection processes and grow its buffer until it ate all the memory and crashed:


To solve this problem, we added a second socket that was dedicated specifically for workers:


Thus, the preprocessor manager got the opportunity to prioritize its work: when the buffer grows, it slows down taking in new tasks, giving the workers a chance to drain that buffer:
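The resulting logic can be modeled roughly like this (a toy sketch; the real manager works with sockets and IPC messages): results coming back from workers are always accepted, while new tasks from collectors are refused once a high-water mark is reached.

```python
HIGH_WATER = 1000  # illustrative backlog limit, not a real Zabbix constant

class PreprocManager:
    """Toy model of the two-socket scheme: the worker socket is always
    serviced, while the collector socket is ignored when backlogged."""
    def __init__(self):
        self.buffer = []

    def accept_new_task(self):
        # Backpressure: stop draining the collector socket when backlogged.
        return len(self.buffer) < HIGH_WATER

    def enqueue(self, task):
        """Called for the collector socket; may refuse under load."""
        if not self.accept_new_task():
            return False
        self.buffer.append(task)
        return True

    def worker_done(self, task):
        """Called for the dedicated worker socket; never refused."""
        self.buffer.remove(task)
```

The key point is that finishing work (worker results) can always make progress, so the buffer drains instead of growing without bound.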


Then we discovered that one of the reasons for the slowdown was the workers themselves, as they competed for a resource that was completely unimportant for their work. We submitted a bug fix for this problem, and it has already been resolved in newer versions of Zabbix:


We increase the number of sockets - we get the result

Next, the preprocessor manager itself became a bottleneck, since it is a single thread. It was limited by the speed of one core, giving a maximum of about 70 thousand metrics per second:


So we made four of them, each with its own socket set and workers:


And this allowed us to increase the speed to about 130 thousand metrics:


The non-linear growth is explained by competition for the history cache: the 4 preprocessor managers and the history syncers competed for it. By this point, we were getting about 130 thousand metrics per second on the test machine, at about 95% CPU utilization:


About 2,5 months ago

Dropping the snmp-community macro increased NVPS one and a half times

MM: - Max, I need a new test machine! We no longer fit into the current one.

MCH: – What do you have now?

MM: - Right now, 130k NVPS with the CPU maxed out.

MCH: - Wow! Cool! Wait, I have two questions. According to my calculations, our need is in the region of 15-20 thousand metrics per second. Why do we need more?

MM: - I want to finish the job. I want to see how much we can squeeze out of this system.

MCH: - But ...

MM: But it's useless for business.

MCH: - It's clear. And the second question: will we be able to support what we have now on our own, without the help of a developer?

MM: - I don't think so. Changing how the configuration cache works is the problem: it involves changes in most threads and will most likely be very difficult to maintain.

MCH: - Then we need some alternative.

MM: - There is an option. We can switch to fast cores while abandoning the new locking system. We will still get 60-80 thousand metrics per second, and we can keep the rest of the code: Clickhouse and asynchronous polling will keep working. And it will be easy to maintain.

MCH: - Amazing! I suggest stopping there.

After optimizing the server side, we were finally able to run the new code in production. We abandoned some of the changes in favor of moving to a machine with fast cores and minimizing the number of code changes. We also simplified the configuration and avoided macros in items where possible, as they are a source of additional locks.


For example, dropping the snmp-community macro, which often appears in the documentation and examples, in our case additionally sped up NVPS by about 1.5 times.

After two days in production

Remove incident history pop-ups

MCH: – Misha, we have been running the system for two days, and everything works. But only when everything works! We had planned maintenance involving the migration of a fairly large network segment, and we again checked by hand what came back up and what did not.

MM: - Can't be! We checked everything 10 times. The server handles even complete network unavailability instantly.

MCH: - Yes, I understand everything: server, database, top, iostat, logs - everything is fast... But we look at the web interface, and the CPU on the server is pegged, and we see this:


MM: - I see. Let's look at the web. We found that in situations with a large number of active incidents, most of the operational widgets started to work very slowly:


The reason was the generation of incident-history popups, which are generated for each item in the list. So we stopped generating these windows (commented out 5 lines in the code), and this solved our problem.

Widget loading time, even with the network completely unavailable, dropped from several minutes to 10-15 seconds, which is acceptable for us, and you can still view the history by clicking on the time:


After work. 2 months ago

MCH: Misha, are you leaving? We have to talk.

MM: - I didn't mean to. Something with Zabbix again?

MCH: - No, relax! I just wanted to say: everything works, thanks! Beer from me.

Zabbix is efficient

Zabbix is a fairly versatile and feature-rich system. It can be used for small installations out of the box, but as needs grow, it has to be optimized. To store a large archive of metrics, use suitable storage:

  • you can use the built-in tools in the form of integration with Elasticsearch or uploading history to text files (available from the fourth version);
  • you can use our experience and integration with Clickhouse.

To drastically increase the speed of collecting metrics, collect them using asynchronous methods and transfer them through the trapper interface to the Zabbix server; or you can use the patch for asynchronous pollers of Zabbix itself.

Zabbix is written in C and is quite efficient. Resolving a few architectural bottlenecks allows its performance to be increased further and, in our experience, to obtain more than 100 thousand metrics per second on a single-processor machine.


The same Zabbix patch

MM: – I want to add a couple of points. The entire current report, all tests, figures are given for the configuration that we use. We are now taking about 20 thousand metrics per second from it. If you are trying to understand if it will work for you - you can compare. What we talked about today is posted on GitHub as a patch: github.com/miklert/zabbix


The patch includes:

  • full integration with Clickhouse (both Zabbix server and frontend);
  • solving problems with the preprocessor manager;
  • asynchronous polling.

The patch is compatible with all 4.x versions, including LTS. Most likely, with minimal changes, it will also work on version 3.4.

Thank you for your attention.

Questions

Question from the audience (hereinafter - A): - Good afternoon! Please tell me, do you have plans for intensive interaction with the Zabbix team or do they have with you, so that this is not a patch, but the normal behavior of Zabbix?

MM: – Yes, we will definitely commit some of the changes. Something will be, something will remain in the patch.

A: – Thank you very much for the excellent report! Tell me, please, after applying the patch, support from Zabbix will remain and how to upgrade to higher versions? Will it be possible to update Zabbix after your patch to 4.2, 5.0?

MM: I can't speak for support. If I were Zabbix technical support, then, apparently, I would say no, because this is someone else's code. As for the 4.2 codebase, our position is: “We will go with time, and we will update ourselves on the next version.” Therefore, for some time we will upload a patch for updated versions. I already said in the report: the number of changes with versions is still quite small. I think the transition from 3.4 to 4 took us, it seems, about 15 minutes. Something has changed there, but not very important.

A: - So you plan to support your patch and you can safely put it into production, in the future receiving updates in some way?

MM: - We strongly recommend it. It solves a lot of problems for us.

MCH: - Once again, I would like to emphasize that the changes that do not concern the architecture and do not concern locks, queues - they are modular, they are in separate modules. Even on their own with minor changes, they can be maintained quite easily.

MM: - If you are interested in the details: for Clickhouse we use the so-called history library. It is decoupled, written as a copy of the Elasticsearch support, that is, it is configurable. The polling changes only touch the pollers. We believe this will work for a long time.

A: - Thanks a lot. And tell me, is there any documentation of the changes made?


MM: – Documentation is a patch. Obviously, with the introduction of Clickhouse, with the introduction of new types of pollers, new configuration options appear. The link from the last slide has a short description of how to use it.

About replacing fping with nmap

A: - How did you end up implementing it? Can you give specific examples: do you use trappers and an external script? What actually checks so many hosts so quickly? How do you feed these hosts to nmap - do you have to get them from somewhere, pass them in, run something?..

MM: - A very valid question! The point is this: we modified the ICMP ping library, an integral part of Zabbix. For ICMP checks that specify a packet count of one (1), the code tries to use nmap. That is, it remains internal Zabbix work; it became the internal work of the pinger. Accordingly, no synchronization or use of a trapper is required. This was done deliberately to keep the system intact and avoid having to synchronize two separate systems about what to check and whether our export is broken. It's much simpler.

A: – Does it work for proxies too?

MM: - Yes, but we haven't checked. The polling code is the same in both the proxy and the server, so it should work. I emphasize once again: the performance of the system is such that we do not need proxies.

MCH: - The correct answer to the question is: "Why do you need a proxy with such a system?" Only because of NAT, or to monitor something over a slow link…

A: - And you use Zabbix as an alerter, if I understand correctly. Or did the graphing (where the archive layer lives) move to another system, such as Grafana? Or do you not use that functionality?

MM: – I will emphasize once again: we have made full integration. We write history into Clickhouse, and at the same time we changed the PHP frontend. The PHP frontend goes to Clickhouse and draws all the graphs from there. To be honest, we also have a part that renders the same Zabbix data from the same Clickhouse in other graphic display systems.

MCH: - Including Grafana.

How was the decision to allocate resources made?

A: – Share a bit of the inner kitchen. How was the decision made that resources had to be allocated for a serious rework of the product? These are, after all, certain risks. And, in the context of the fact that you are going to support new versions: how is this decision justified from a management point of view?

MM: – Apparently, we did not convey the drama of the story very well. We found ourselves in a situation where something had to be done, and in fact we went with two parallel teams:

  • One was involved in launching a monitoring system based on new methods: monitoring as a service, a standard set of open source solutions that we combine and then try to change the business process in order to work with the new monitoring system.
  • In parallel, we had an enthusiastic programmer doing this (speaking of himself). It so happened that he won.

A: – And what is the size of the team?

MCH: - It is in front of you.

A: - That is, as always, a passionary is needed?

MM: – I don’t know what a passionary is.

A: In this case, it seems to be you. Thank you very much, you are awesome.

MM: - Thanks.

About patches for Zabbix

A: - For a system that uses a proxy (for example, in some distributed systems), is it possible to adapt and patch, say, pollers, proxies and partially the preprocessor of Zabbix itself; and their interaction? Is it possible to optimize existing developments for a system with multiple proxies?

MM: - I know that the Zabbix proxy is built from the same codebase as the server (it is compiled from the same code). We haven't tested this in production. I'm not sure, but I don't think the preprocessor manager is used in the proxy. The task of the proxy is to take a set of metrics from Zabbix, accumulate them (it also stores the configuration in a local database) and hand them back to the Zabbix server. The server itself will then do the preprocessing when it receives them.

The interest in proxies is understandable. We will check it out. This is an interesting topic.

A: – The idea was this: if you can patch pollers, you can patch them on a proxy and patch the interaction with the server, and adapt the preprocessor for these purposes only on the server.

MM: - I think it's even simpler. You take the code, apply the patch, then build it the way you need - build proxy servers (for example, with ODBC) and distribute the patched code among the systems. Where needed, build the proxy; where needed, the server.

A: - In addition, you most likely won't have to patch the proxy-to-server transfer?

MCH: No, it's standard.

MM: - In fact, one of the ideas did not sound. We have always kept a balance between the explosion of ideas and the number of changes, ease of support.


Source: habr.com
