How we built monitoring on Prometheus, ClickHouse and ELK

My name is Anton Baderin. I work in system administration at the Center for High Technologies. A month ago our corporate conference wrapped up, where we shared our experience with the IT community of our city. I gave a talk about monitoring web applications. The material was aimed at junior and middle-level engineers who have not built this process from scratch.


The cornerstone of any monitoring system is solving business problems. Monitoring for the sake of monitoring is of no interest to anyone. What does the business want? Everything should work quickly and without errors. The business wants proactivity: we should spot problems in the operation of the service ourselves and eliminate them as quickly as possible. These are precisely the tasks I have been solving for the past year on a project for one of our customers.

About the project

The project is one of the largest loyalty programs in the country. We help retail chains to increase the frequency of sales through various marketing tools such as bonus cards. In total, the project includes 14 applications that run on ten servers.

While conducting interviews, I have repeatedly noticed that admins do not always approach web application monitoring the right way: many still fixate on operating system metrics and only occasionally monitor services.

In our case, the customer's monitoring system was previously based on Icinga. It did not solve the problems described above in any way. Often the client himself was the one who told us about problems, and even then we simply did not have enough data to get to the bottom of the cause.

On top of that, it was clear that developing it any further was pointless. I think anyone familiar with Icinga will understand me. So we decided to completely redesign the web application monitoring system on the project.

Prometheus

We chose Prometheus based on three key factors:

  1. A huge number of available metrics. In our case there are 60 thousand of them. Admittedly, we do not use the vast majority of them (probably about 95%). On the other hand, they are all relatively cheap. For us this is the opposite extreme from the previously used Icinga, where adding metrics was a particular pain: the existing ones were expensive (just look at the source code of any plugin). Every plugin was a script in Bash or Python, and launching it is not cheap in terms of consumed resources. (For contrast, a sketch of how little a Prometheus metric costs follows this list.)
  2. This system consumes relatively few resources. 600 MB of RAM, 15% of one core and a couple of dozen IOPS are enough for all our metrics. You do have to run metric exporters, but they are all written in Go and are not gluttonous either. I do not think this is a problem in modern realities.
  3. It enables the transition to Kubernetes. Given the customer's plans, the choice is obvious.
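For contrast with Icinga plugins, here is a minimal sketch of how little it costs to expose a custom metric. It assumes the official prometheus_client Python library; the metric name and port are arbitrary examples, not something from our actual setup.

```python
# Minimal custom exporter: exposes one gauge on /metrics.
# The metric name and port are placeholders, not from the original setup.
import random
import time

from prometheus_client import Gauge, start_http_server

# A gauge is just an in-memory float; updating it costs almost nothing,
# unlike forking a Bash/Python check script on every poll.
queue_depth = Gauge("app_background_queue_depth", "Jobs waiting in the background queue")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    while True:
        queue_depth.set(random.randint(0, 100))  # stand-in for a real reading
        time.sleep(15)
```

Prometheus then simply scrapes the /metrics endpoint on its own schedule; no process is forked per check.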

ELK

Previously, we did not collect or process logs at all. The drawbacks of that are clear to everyone. We chose ELK because we already had experience with this system. We store only application logs there. The main selection criteria were full-text search and its speed.

ClickHouse

Initially the choice fell on InfluxDB. We realized that we needed to collect Nginx logs and statistics from pg_stat_statements, and to store Prometheus historical data. We did not like Influx: it would periodically start consuming large amounts of memory and crash. In addition, I wanted to group requests by remote_addr, and grouping in that DBMS works only by tags. Tags are expensive (they cost memory), and their number is effectively limited.

We started the search again. We needed an analytical database with minimal resource consumption, preferably with data compression on disk.

ClickHouse satisfies all these criteria, and we have never regretted the choice. We do not write any remarkable amount of data into it (only about five thousand inserts per minute).
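To illustrate the remote_addr grouping that Influx could not give us cheaply, here is roughly what such a query looks like from Python via the clickhouse-driver library. The table and column names (nginx_access, remote_addr, request_time, time) are assumptions for the sketch, not our real schema.

```python
# Sketch: top client IPs by request count and p95 latency over the last hour.
# Table and column names are hypothetical; clickhouse-driver is the client library.
from clickhouse_driver import Client

client = Client(host="localhost")

rows = client.execute(
    """
    SELECT
        remote_addr,
        count() AS requests,
        quantile(0.95)(request_time) AS p95_request_time
    FROM nginx_access
    WHERE time >= now() - INTERVAL 1 HOUR
    GROUP BY remote_addr
    ORDER BY requests DESC
    LIMIT 20
    """
)

for remote_addr, requests, p95 in rows:
    print(f"{remote_addr}\t{requests}\t{p95:.3f}s")
```

In ClickHouse such a grouping is just a column in a GROUP BY, with no tag-cardinality penalty.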

NewRelic

NewRelic has been with us historically, since it was the customer's choice. We use it as an APM tool.

Zabbix

We use Zabbix exclusively for black-box monitoring of various APIs.

Determining the monitoring approach

We wanted to decompose the task and thereby systematize the approach to monitoring.

To do this, I divided our system into the following levels:

  • hardware and VMs;
  • operating system;
  • system services, software stack;
  • application;
  • business logic.

What is the benefit of this approach:

  • we know who is responsible for each level and, based on that, can send alerts to the right people;
  • we can use this structure when suppressing alerts: it would be strange to send an alert about database unavailability when the whole virtual machine is unavailable.

Since our task is to detect failures in the system, at each level we need to pick out the set of metrics worth paying attention to when writing alerting rules. Next, let's go through the levels "Virtual machines", "Operating system" and "System services, software stack".

Virtual machines

The hosting provider gives us a processor, disk, memory and network. And we have had problems with the first two. So, the metrics:

CPU steal time: when you buy a virtual machine on Amazon (a t2.micro, for example), you should understand that you are not allocated an entire processor core, only a quota of its time. And when you exhaust it, the processor is taken away from you.

This metric lets you track such moments and make decisions: for example, is it time to move to a beefier plan, or should background-task processing and API requests be spread across different servers?

IOPS + CPU iowait time: for some reason, many cloud hosting providers sin by not delivering the promised IOPS. What is more, a graph with low IOPS alone is not an argument for them. That is why it is worth collecting CPU iowait as well. With this pair of graphs, low IOPS and high I/O wait time, you can already talk to the hosting provider and get the problem solved.
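As a hedged sketch of how these two signals can be pulled out for an alert or a quick check, here is a query against the Prometheus HTTP API using node_exporter's standard node_cpu_seconds_total metric; the Prometheus address is a placeholder.

```python
# Sketch: fetch per-instance CPU steal and iowait rates from Prometheus.
# Assumes node_exporter's node_cpu_seconds_total metric; the URL is a placeholder.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"

def instant_query(promql: str):
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

for mode in ("steal", "iowait"):
    promql = f'avg by (instance) (rate(node_cpu_seconds_total{{mode="{mode}"}}[5m])) * 100'
    for sample in instant_query(promql):
        instance = sample["metric"]["instance"]
        percent = float(sample["value"][1])
        print(f"{instance} cpu {mode}: {percent:.1f}%")
```

The same PromQL expressions can go straight into alerting rules once you settle on thresholds.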

Operating system

Operating system metrics:

  • amount of available memory, as a percentage;
  • swap usage activity: vmstat swapin, swapout;
  • number of available inodes and free space on the file system, as a percentage;
  • load average;
  • number of connections in the TIME_WAIT state;
  • conntrack table fullness;
  • network quality, which can be monitored with the ss utility from the iproute2 package: take the RTT of connections from its output and group by destination port (a parsing sketch follows this list).
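Here is a rough, best-effort sketch of that ss-based RTT measurement. The output of ss differs slightly between iproute2 versions, so treat the parsing as illustrative rather than definitive.

```python
# Sketch: group TCP round-trip times reported by `ss -nti` by destination port.
# Parsing is best-effort; ss output format varies between iproute2 versions.
import re
import subprocess
from collections import defaultdict
from statistics import mean

out = subprocess.run(["ss", "-nti"], capture_output=True, text=True, check=True).stdout

rtt_by_port = defaultdict(list)
current_port = None

for line in out.splitlines()[1:]:              # skip the header line
    if not line.startswith((" ", "\t")):       # connection line: "ESTAB 0 0 src:port dst:port"
        fields = line.split()
        current_port = None
        if fields and fields[0] == "ESTAB" and len(fields) >= 5:
            current_port = fields[4].rsplit(":", 1)[-1]
        continue
    match = re.search(r"rtt:([\d.]+)/", line)  # detail line: "... rtt:0.219/0.063 ..." (ms)
    if match and current_port:
        rtt_by_port[current_port].append(float(match.group(1)))

for port, rtts in sorted(rtt_by_port.items(), key=lambda kv: -len(kv[1])):
    print(f"dport {port}: {len(rtts)} conns, avg rtt {mean(rtts):.2f} ms")
```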

Also at the operating system level we have such an entity as processes. It is important to single out the set of processes that play an important role in the system's operation. If, for example, you have several pgpool instances, you need to collect information on each of them.

The set of metrics is as follows:

  • CPU;
  • memory, first of all resident;
  • IO, preferably in IOPS;
  • FileFd: open and limit;
  • major page faults, so you can tell which process is being swapped (see the sketch after this list).
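Just to show where these per-process numbers come from, here is a rough sketch that reads them straight from /proc for a single PID; field offsets follow proc(5). In practice the exporters mentioned below collect this for us.

```python
# Sketch: resident memory, major page faults and open file descriptors
# for one process, read straight from /proc. Field positions follow proc(5).
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())

with open(f"/proc/{pid}/stat") as f:
    raw = f.read()

# comm (field 2) is in parentheses and may contain spaces; split after it,
# so stat[0] is field 3. majflt is field 12 (index 9), rss is field 24 (index 21).
stat = raw[raw.rindex(")") + 2:].split()
major_faults = int(stat[9])
rss_bytes = int(stat[21]) * os.sysconf("SC_PAGE_SIZE")

open_fds = len(os.listdir(f"/proc/{pid}/fd"))
with open(f"/proc/{pid}/limits") as f:
    fd_limit = next(int(line.split()[3]) for line in f if line.startswith("Max open files"))

print(f"pid {pid}: rss {rss_bytes // 1024 ** 2} MiB, "
      f"majflt {major_faults}, fds {open_fds}/{fd_limit}")
```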

All of our monitoring is deployed in Docker, and we use cAdvisor to collect the metrics. On the other machines we use process-exporter.

System services, software stack

Each application has its own specifics, so it is difficult to single out one set of metrics that fits them all.

The universal set is:

  • request rate;
  • number of errors;
  • latency;
  • saturation.

The most striking examples of this level of monitoring are Nginx and PostgreSQL.

The most loaded service in our system is the database. In the past, we often had problems finding out what the database was doing.

We saw a high load on the disks, but the slow query logs did not really show anything. We solved this problem with pg_stat_statements, a view that collects query statistics.

That's all the admin needs.
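A sketch of the kind of query we can run against pg_stat_statements from Python. Note that the timing column is called total_exec_time on PostgreSQL 13 and later (total_time on older versions), and the connection string here is a placeholder.

```python
# Sketch: top queries by total execution time from pg_stat_statements.
# Column is total_exec_time on PostgreSQL 13+ (total_time earlier); DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=app user=monitoring host=db.example.internal")

with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT calls,
               round(total_exec_time::numeric, 1)           AS total_ms,
               round((total_exec_time / calls)::numeric, 2) AS avg_ms,
               left(query, 80)                              AS query
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
        """
    )
    for calls, total_ms, avg_ms, query in cur.fetchall():
        print(f"{total_ms:>12} ms  {calls:>8} calls  {avg_ms:>8} ms/call  {query}")
```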

We build activity graphs for read and write queries:

[Graphs: read and write query activity]

Everything is simple and clear: each query has its own color.

An equally striking example is Nginx logs. It is no surprise that few people parse them or mention them in must-have lists: the standard format is not very informative and needs to be extended.

In my case, I added request_time, upstream_response_time, body_bytes_sent, request_length and request_id. From these we plot the response time and the number of errors:

[Graphs: Nginx response time and error count]

Remember the business objectives I mentioned? Quickly and without errors? These two charts already cover those questions, and based on them it is already possible to call in the on-duty administrators.
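Since the standard format gets extended by hand, the exact line layout is up to you. Purely as an illustration, here is a parser for one possible extended layout; the field order and the rt=/urt=/rl=/rid= labels are assumptions for the sketch, not our exact log_format.

```python
# Sketch: parse an nginx access-log line extended with request_time,
# upstream_response_time, request_length and request_id.
# The field order and labels are assumptions; adjust the pattern to your log_format.
import re

LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) - \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)" '
    r'rt=(?P<request_time>\S+) urt=(?P<upstream_response_time>\S+) '
    r'rl=(?P<request_length>\d+) rid=(?P<request_id>\S+)'
)

sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0300] "GET /api/v1/bonus HTTP/1.1" '
    '200 612 "-" "curl/7.68.0" rt=0.128 urt=0.125 rl=452 rid=9f86d081884c7d65'
)

match = LINE_RE.match(sample)
if match:
    row = match.groupdict()
    # upstream_response_time can be "-" for requests served without an upstream
    print(row["status"], row["request_time"], row["upstream_response_time"], row["request_id"])
```

Rows parsed this way are what end up in ClickHouse and on the two charts above.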

But one more problem remained: ensuring that the causes of an incident are eliminated quickly.

Incident resolution

The whole process, from identifying a problem to resolving it, can be broken down into a number of steps:

  • identification of the problem;
  • notification of the administrator on duty;
  • reaction to the incident;
  • elimination of causes.

It is important that we go through all of this as quickly as possible. And while at the stages of identifying the problem and sending the notification there is not much time to gain (they will take about two minutes no matter what), the subsequent stages are a wide-open field for improvement.

Just imagine that the on-duty admin's phone rings. What will he do? Look for answers to the questions: what broke, where did it break, how should he react? Here is how we answer these questions:

[Screenshot: an example alert notification]

We simply include all of this information in the notification text, along with a link to a wiki page that describes how to respond to the problem, how to fix it and how to escalate it.

So far I have not said anything about the application and business-logic levels. Unfortunately, our applications do not yet implement metrics collection. The only source of at least some information from these levels is the logs.

A couple of points.

First, write structured logs. Do not put the context into the body of the message: that makes the logs difficult to group and analyze, and Logstash takes a long time to normalize it all.
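A minimal sketch of what this can look like in Python, using only the standard library: the context travels in separate fields instead of being glued into the message string.

```python
# Sketch: JSON-structured logging with the standard library only.
# Context goes into separate fields, not into the message text.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Anything passed via `extra=` shows up as attributes on the record.
        for key in ("user_id", "order_id", "request_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, ensure_ascii=False)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Bad:  logger.info(f"bonus not applied for user {user_id}, order {order_id}")
logger.info("bonus not applied", extra={"user_id": 42, "order_id": "A-1001"})
```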

Second, use severity levels correctly. Each language has its own standard. Personally, I distinguish four levels:

  1. no error;
  2. an error on the client side;
  3. an error on our side, we do not lose money and bear no risk;
  4. an error on our side, we lose money.
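These four levels do not exist out of the box in any logging library, so, purely as one possible convention, here is how they can be mapped onto Python's standard levels.

```python
# Sketch: one possible mapping of the four severity levels above onto
# Python's standard logging levels. The mapping is a convention we choose,
# not something any library dictates.
import logging

SEVERITY = {
    1: logging.INFO,      # no error
    2: logging.WARNING,   # error on the client side
    3: logging.ERROR,     # error on our side, no money lost, no risk
    4: logging.CRITICAL,  # error on our side, we are losing money
}

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("app")

logger.log(SEVERITY[2], "client sent an expired bonus card")
logger.log(SEVERITY[4], "payment gateway rejected a charge we already confirmed")
```

Alerting rules can then key off CRITICAL entries without parsing message text.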

To sum up: try to build monitoring starting from the business logic. Try to monitor the application itself and operate with metrics such as the number of sales, the number of new user registrations, the number of currently active users, and so on.

If your entire business is one button in the browser, you need to monitor whether it gets clicked and whether it works properly. Everything else does not matter.

If you do not have such metrics, you can try to recover them from the application logs, Nginx logs and so on, as we did. You should be as close to the application as possible.

Operating system metrics are certainly important, but the business is not interested in them; we are not paid for them.

Source: habr.com
