VictoriaMetrics and private cloud monitoring. Pavel Kolobaev

VictoriaMetrics is a fast and scalable DBMS for storing and processing time-series data (each record is a timestamp and a set of values corresponding to it, for example, obtained by periodically polling sensors or collecting metrics).


My name is Pavel Kolobaev. DevOps, SRE, LeroyMerlin, everything as code: that's all about us, about me and about the other LeroyMerlin employees.

https://bit.ly/3jf1fIK

We have a cloud based on OpenStack. Above is a short link to our technical radar.

It is built on Kubernetes running on bare metal, plus all the services related to OpenStack, and logging.

This is the scheme we had in development. When we built all this, we had the Prometheus operator, which stored data inside the K8s cluster itself. It automatically discovers what needs to be scraped and, roughly speaking, keeps the data right next to itself.

We need to move all the data outside the Kubernetes cluster, because if something happens to it, we must still be able to understand what happened and where.

The first solution is federation: a separate Prometheus outside the cluster pulls data from the Kubernetes cluster through the federation mechanism.

But there are problems here. In our case, the problems started when we passed 250 000 metrics; at that point we realized that we could not keep working like that. We had already increased scrape_timeout to 400 seconds.

Why did we have to do this? Prometheus counts the timeout from the moment the scrape starts, regardless of the fact that data is still coming in. If the data has not finished transferring within that time and the HTTP session has not been closed, the scrape is considered failed and the data does not make it into Prometheus at all.
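For reference, that timeout is set per scrape job (and may not exceed the scrape interval). A sketch of the federation job with the increased timeout might look like this; the target address and the match[] selector are placeholders:

```yaml
# prometheus.yml on the external Prometheus (sketch)
scrape_configs:
  - job_name: "federate-k8s"
    scrape_interval: 400s   # the timeout may not exceed the interval
    scrape_timeout: 400s    # counted from the moment the scrape starts
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]": ['{job=~".+"}']
    static_configs:
      - targets: ["k8s-prometheus.example.local:9090"]
```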

Everyone knows the graphs we get when part of the data is missing. The graphs have gaps, and we are not happy with that.

The next option is sharding across two different Prometheus instances through the same federation mechanism.

For example, we could simply shard them by metric name. This would also work, but we decided to move on.
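Sharding by name through federation can be sketched as two Prometheus instances with disjoint match[] selectors; the split regex and addresses below are illustrative, not the configuration from the talk:

```yaml
# Shard 1: federates only metrics whose names start with a-m.
# A second Prometheus would carry the complementary selector {__name__=~"[n-z].*"}.
scrape_configs:
  - job_name: "federate-shard-1"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]": ['{__name__=~"[a-m].*"}']
    static_configs:
      - targets: ["k8s-prometheus.example.local:9090"]
```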

We would then have to process these shards somehow. We could take promxy, which fans out to both shards and merges the data, presenting the two shards as a single entry point. This can be done, but it is too complicated for now.

The first point: we want to abandon the federation mechanism, because it is very slow.

The Prometheus developers explicitly say: "Folks, use other TSDBs, because we will not support long-term storage of metrics." That is not their task.

So we write down on a piece of paper that we still need to ship the data outside, so as not to store everything in one place.

The second disadvantage is memory consumption. Yes, I understand that many will say that in 2020 a couple of gigabytes of memory costs pennies, but still.

Right now we have a dev and a prod environment. In dev it is about 9 gigabytes for 350 000 metrics. In prod it is 14 gigabytes and change for a little over 780 000 metrics. And at the same time our retention time is only a matter of minutes. This is bad, and now I will explain why.

We do the math: at one and a half million metrics, which we are already approaching, the design-stage estimate comes to 35-37 gigabytes of memory. By 4 million metrics, about 90 gigabytes of memory are already required. This was calculated with the formula provided by the Prometheus developers. We looked at the correlation and realized that we did not want to pay a couple of million for a server just for monitoring.
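The exact formula is not quoted in the talk, but a back-of-envelope sketch consistent with its figures is easy to write down. The per-series constant below is an assumption derived from the talk's own numbers (1.5M series to roughly 36 GB), not an official Prometheus figure:

```python
# Rough RAM estimate for Prometheus, extrapolated linearly from the
# ~36 GiB at 1.5M active series mentioned in the talk (assumption).
BYTES_PER_SERIES = 36 * 2**30 / 1_500_000  # ~25 KiB per active series

def estimate_ram_gib(active_series: int) -> float:
    """Rough RAM estimate in GiB for a given number of active series."""
    return active_series * BYTES_PER_SERIES / 2**30

print(round(estimate_ram_gib(1_500_000)))  # 36
print(round(estimate_ram_gib(4_000_000)))  # 96, close to the ~90 GB quoted
```

A linear extrapolation slightly overshoots the ~90 GB from the talk, which is expected: the point is the order of magnitude, not precision.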

And we will not just add machines: we also monitor the virtual machines themselves. The more virtual machines, the more metrics of all kinds, and so on. Our cluster will grow especially fast in terms of metrics.

With disk space things are not so sad, but I would still like to improve them. We got 120 gigabytes in total over 15 days, of which 100 are compressed data and 20 are uncompressed, and you always want less.

Accordingly, we write down one more point: high resource consumption that we want to cut, because we do not want our monitoring cluster to eat more resources than the cluster that manages OpenStack itself.

There is one more disadvantage of Prometheus that we identified for ourselves: the lack of any real memory limiting. Here everything is much worse, because Prometheus has no such knobs at all. Using a Docker limit is not an option either: if your Prometheus suddenly crashes with a 20-30 gigabyte WAL on disk, it will take a very long time to come back up.

This is another reason why Prometheus is not suitable for us, i.e. we cannot limit memory consumption.

We could come up with a scheme like this. We need it to build an HA cluster: we want our metrics to be available anytime, anywhere, even if the server storing them crashes. So we would have to build such a scheme.

This scheme means duplication of shards and, accordingly, duplication of the resource costs. It can be scaled almost horizontally, but the resource consumption will still be infernal.

Disadvantages in order, in the form in which we wrote them out for ourselves:

  • Requires uploading metrics to the outside.
  • High consumption of resources.
  • You cannot limit memory consumption.
  • Complicated and resource-intensive implementation of HA.

For ourselves, we decided that we are moving away from Prometheus as a repository.

We also identified the additional requirements we need:

  • PromQL support, because a lot has already been written for Prometheus: queries, alerts.
  • Grafana, which already works with Prometheus as a backend; we do not want to rewrite the dashboards.
  • We want to build a proper HA architecture.
  • We want to reduce resource consumption of every kind.
  • And there is one more small nuance: we cannot use any cloud-hosted systems for collecting metrics. We do not yet know what data will end up in these metrics, and since anything could, we have to restrict ourselves to on-premises deployment.

The choice was not large. We gathered everything we had experience with, looked at the integrations section on the Prometheus site, read a bunch of articles, and surveyed what was available. In the end we chose VictoriaMetrics as our Prometheus replacement.

Why? Because:

  • It supports PromQL.
  • It has a modular architecture.
  • It requires no changes to Grafana.
  • And most importantly, we will probably offer metrics storage as a service within our company, so we are looking in advance at various kinds of restrictions, so that users can consume cluster resources only in a limited way, because there is a chance it will be multi-tenant.

We run the first comparison. We keep the same Prometheus inside the cluster, with the external Prometheus federating from it, and add VictoriaMetrics via remote_write.
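On the external Prometheus side the hookup is a single remote_write entry. A sketch, where the host is a placeholder for wherever the single-node VictoriaMetrics listens (by default it accepts the Prometheus remote_write protocol on port 8428):

```yaml
# prometheus.yml on the external Prometheus (sketch; host is a placeholder)
remote_write:
  - url: "http://victoriametrics.example.local:8428/api/v1/write"
    queue_config:
      max_samples_per_send: 10000  # batching knob worth tuning under load
```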

I will note right away that here we did catch a slight increase in CPU consumption from VictoriaMetrics. The VictoriaMetrics wiki suggests which parameters fit best; we tried them, and they reduced the CPU consumption very nicely.

In our case, the memory consumption of Prometheus, which is located in a Kubernetes cluster, did not increase significantly.

We compare two data sources for the same data. In Prometheus we see the same missing data as before. In VictoriaMetrics everything is fine.

Results of the disk-space tests. With Prometheus we got 120 gigabytes in total. With VictoriaMetrics we are already down to 4 gigabytes per day. The mechanism is slightly different from what you are used to in Prometheus: the data is already well compressed within a day, even within half an hour, even though it will be merged further later. As a result, we saved on disk space.

We also save on memory consumption. At the time of the tests we had Prometheus deployed on a virtual machine with 8 cores and 24 gigabytes. Prometheus ate almost all of it and was killed by the OOM killer. And only about 900 000 active metrics were being pushed into it, roughly 25 000 metrics per second.

VictoriaMetrics was running on a dual-core virtual machine with 8 gigabytes of RAM. After tweaking a few things, we got VictoriaMetrics working well on that 8 GB machine and stayed within 7 gigabytes. At the same time the delivery speed of the content, i.e. the metrics, was even higher than Prometheus's.

CPU is much better than with Prometheus: Prometheus consumes 2.5 cores, while VictoriaMetrics consumes only 0.25 cores; at startup, 0.5 cores. During merges it can reach one core, but that is extremely rare.

In our case, the choice fell on VictoriaMetrics for obvious reasons: we wanted to save money, and we did.

We cross out two points right away: shipping metrics outside and high resource consumption. Two points remain for us to solve.

A caveat right away: we treat VictoriaMetrics as a metrics store. But since we will most likely offer VictoriaMetrics as storage for all of Leroy, we need to limit those who will use the cluster, so that they do not take it down for us.

There is a wonderful set of parameters that lets you limit queries by time range, by the amount of data, and by execution time.

And there is also an excellent option that limits memory consumption, so we can find the balance that gives us decent speed with adequate resource consumption.
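As a sketch of the limits in question (the flag names below are from the VictoriaMetrics documentation, but the values are illustrative and worth double-checking against the docs for your version):

```sh
# vmselect / single-node query-path limits (illustrative values)
vmselect \
  -search.maxQueryDuration=30s \        # kill queries running longer than this
  -search.maxUniqueTimeseries=300000 \  # cap the series a single query may touch
  -memory.allowedPercent=60             # cap the share of RAM the process uses
```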

Minus one more point: we cross out "you cannot limit memory consumption".

In the first iterations we tested VictoriaMetrics Single Node. Then we moved on to the VictoriaMetrics cluster version.

It gives us a free hand in separating the different VictoriaMetrics services, depending on where they run and what resources they consume. It is a very flexible and convenient solution, and we used it ourselves.

The main component of the VictoriaMetrics cluster version is vmstorage. There can be N of them; in our case there are 2.

And there is vminsert, a proxy server that shards the incoming data across all the storages we told it about, and also supports replication, so you get both sharding and replicas.

vminsert supports the OpenTSDB, Graphite and InfluxDB protocols, as well as remote_write from Prometheus.

There is also vmselect. Its main task is to go to the vmstorage nodes, fetch the data from them, deduplicate it and return it to the client.
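A minimal wiring of the three components might look like this; hostnames are placeholders, while the flags and default ports (8400 for the insert path, 8401 for the select path) follow the cluster documentation and should be verified for your version:

```sh
# Each storage node persists data locally (retention in months here)
vmstorage -retentionPeriod=12 -storageDataPath=/var/lib/vmstorage

# vminsert shards (and, with -replicationFactor, replicates) across both nodes
vminsert -storageNode=vmstorage-1:8400 -storageNode=vmstorage-2:8400 \
         -replicationFactor=2

# vmselect queries both nodes and deduplicates replicated samples
vmselect -storageNode=vmstorage-1:8401 -storageNode=vmstorage-2:8401 \
         -dedup.minScrapeInterval=30s
```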

There is a wonderful thing called vmagent. We really like it. It is configured just like Prometheus and does everything Prometheus does: it scrapes metrics from various entities and services and sends them to vminsert. What happens next is up to you.
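A sketch of how vmagent reuses a standard Prometheus scrape config and forwards samples to the cluster write endpoint (URL and tenant "0" are placeholders):

```sh
# vmagent scrapes with a Prometheus-compatible config and ships the
# samples to vminsert's remote_write endpoint (tenant 0 in the path)
vmagent \
  -promscrape.config=/etc/prometheus/prometheus.yml \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus
```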

Another great service is vmalert, which lets you use VictoriaMetrics as a backend: it evaluates alerting and recording rules, reading data via vmselect and writing the results back via vminsert. In the case of alerts, notification goes out through alertmanager.
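The same loop expressed as a command-line sketch (hostnames are placeholders; rule files use the usual Prometheus rule format):

```sh
# vmalert: read from the select path, write rule results to the insert path,
# deliver firing alerts to alertmanager
vmalert \
  -rule=/etc/vmalert/rules/*.yml \
  -datasource.url=http://vmselect:8481/select/0/prometheus \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus \
  -notifier.url=http://alertmanager:9093
```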

There is a vmauth component. We may or may not use it (we have not decided yet) as the authorization layer for the multi-tenant version of the cluster. It supports remote_write from Prometheus and can authorize based on the URL, or rather its path part, deciding where one can or cannot write.

There are also vmbackup and vmrestore for backing up and restoring all the data. They can work with S3, GCS and the local file system.
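A backup-and-restore sketch against S3 (bucket and paths are placeholders; 8482 is vmstorage's default HTTP port for the snapshot API, worth verifying for your version):

```sh
# Snapshot the vmstorage data directory and push it to S3
vmbackup -storageDataPath=/var/lib/vmstorage \
         -snapshot.createURL=http://localhost:8482/snapshot/create \
         -dst=s3://backups/vmstorage-1

# Restore the same backup into an empty data directory
vmrestore -src=s3://backups/vmstorage-1 -storageDataPath=/var/lib/vmstorage
```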

The first iteration of our cluster was built during quarantine. At that time there was no replication yet, so that iteration consisted of two separate, independent clusters that received data via remote_write.

I will note that when we switched from VictoriaMetrics Single Node to the cluster version, we stayed within the same resource consumption, memory being the main one. This is roughly how our resource consumption was distributed.

Here a replica has already been added. We combined everything into one relatively large cluster, with all data both sharded and replicated.

The whole cluster has N entry points: Prometheus can push data via HAProxy. Here is our entry point, and through the same entry point Grafana can query.

In our case, HAProxy is the single port that proxies select, insert and the other services into the cluster. We could not make do with a single address and had to create several entry points, because the virtual machines running the VictoriaMetrics cluster are located in different zones of the same cloud provider, i.e. not inside our cloud but outside it.
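A sketch of such an entry point, fanning the write path out to vminsert and the read path out to vmselect; all hostnames and ports are placeholders, not the configuration from the talk:

```
# haproxy.cfg sketch: one address for writes, one for reads
frontend vm_write
    bind *:8480
    default_backend vminsert_nodes

backend vminsert_nodes
    balance roundrobin
    server vminsert-1 vminsert-1:8480 check
    server vminsert-2 vminsert-2:8480 check

frontend vm_read
    bind *:8481
    default_backend vmselect_nodes

backend vmselect_nodes
    balance roundrobin
    server vmselect-1 vmselect-1:8481 check
    server vmselect-2 vmselect-2:8481 check
```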

We have alerting, and we use it with alertmanager from Prometheus. As delivery channels we use Opsgenie and Telegram. Telegram gets the dev alerts, and maybe something from prod, but mostly statistical things that engineers need. Opsgenie is for critical alerts: calls, incident management.

The age-old question: "Who monitors the monitoring?" In our case, the monitoring monitors itself, because we run vmagent on every node. And since our nodes sit in different data centers of the same provider, each data center has its own channel; they are independent, and even in a split-brain situation we will still receive alerts. Yes, there will be more of them, but more alerts are better than none.

We end our list with the implementation of HA.

I would also like to note the experience of interacting with the VictoriaMetrics community. It turned out to be very positive. The folks are responsive and try to dig into every case brought to them.

I created issues on GitHub and they were resolved very quickly. There are a couple more issues that are not completely closed, but I can already see from the code that work is underway in that direction.

The main pain during our iterations was that if I took a node down, for the first 30 seconds vminsert could not understand that the backend was gone. This has now been solved: within a second or two the data is taken from the remaining nodes, and the query stops waiting for the missing node.

At some point we wanted a VictoriaMetrics operator from the VictoriaMetrics team, and we waited for it. We are now actively building tooling around the VictoriaMetrics operator to carry over all the recording rules and so on from Prometheus, because we make heavy use of the rules that ship with the Prometheus operator.

There are suggestions to improve the cluster implementation. I have outlined them above.

And I also really want downsampling. In our case, downsampling is needed purely for viewing trends. Roughly speaking, one data point per day is enough for me. Those trends are needed over a year, three, five, ten years, and one metric value per interval is enough.

  • We experienced real pain while using Prometheus, as have some of our colleagues.
  • We chose VictoriaMetrics for ourselves.
  • It scales quite well both vertically and horizontally.
  • We can distribute different components to a different number of nodes in the cluster, limit them in terms of memory, add memory, etc.

We will keep using VictoriaMetrics at home, because we really liked it. That is what we started with and what we arrived at.

https://t.me/VictoriaMetrics_ru1

A couple of QR codes: the VictoriaMetrics chat, my contacts, the LeroyMerlin technical radar.

Source: habr.com
