Monitoring as a Service: A Modular System for Microservice Architecture

Today, in addition to the monolithic code, dozens of microservices run in our project, and each of them needs monitoring. Doing this at such a scale with DevOps engineers alone is impractical. So we built a monitoring system that works as a service for developers: they can independently write metrics to it, use them, build dashboards on top of them, and attach alerts that fire when thresholds are reached. The DevOps engineers are left with only the infrastructure and the documentation.

This post is a transcript of my talk from our section at RIT++. Many people asked us to publish text versions of the talks from there. If you attended the conference or watched the video, you won't find anything new here. Everyone else - welcome under the cut. I'll tell you how we arrived at such a system, how it works, and how we plan to develop it.

Past: schemes and plans

How did we arrive at the current monitoring system? To answer this question, we need to go back to 2015. Here's what it looked like then:

[Diagram: the monitoring setup as of 2015]

We had about 24 nodes responsible for monitoring: a whole pile of crons, scripts and daemons that each monitored something somewhere, sent messages and performed various functions. We realized that the further we went, the less viable such a system would be. There was no point in developing it: it was too cumbersome.
We decided to sort out which monitoring elements we would keep and develop, and which we would abandon. There were 19 of them in total. Only Graphite, the aggregators and Grafana as a dashboard remained. But what would the new system look like? Like this:

[Diagram: the planned new monitoring system]

We have a metric store: Graphite instances backed by fast SSD drives, plus aggregators for the metrics. Next comes Grafana for displaying dashboards and Moira for alerting. We also wanted to develop a system for anomaly detection.

Standard: Monitoring 2.0

This is what the plans looked like in 2015. But we had to prepare not only the infrastructure and the service itself, but also the documentation for it. We developed a corporate standard for ourselves, which we called Monitoring 2.0. What were the requirements for the system?

  • constant availability;
  • metric storage interval = 10 seconds;
  • structured storage of metrics and dashboards;
  • SLA > 99.99%;
  • collection of event metrics via UDP (!).

We needed UDP because we have a lot of traffic and a lot of events that generate metrics. If they were all written to Graphite directly, the storage would collapse. We also chose first-level prefixes for all metrics.

[Diagram: the first-level metric prefixes]

Each prefix has its own meaning: there are metrics for servers, networks, containers, resources, applications, and so on. We implemented clear, strict, typed filtering: we accept metrics with these first-level prefixes and simply drop the rest. This is how we planned the system back in 2015. What about the present?

Present: the scheme of interaction of monitoring components

First of all, we monitor applications: our PHP code, applications and microservices - in short, everything our developers write. All applications send metrics via UDP to the Brubeck aggregator (statsd rewritten in C), which turned out to be the fastest in our synthetic tests. Brubeck then sends the already aggregated metrics to Graphite via TCP.

It supports a metric type called timers, which is very handy. For example, for each user connection to a service you send a response-time metric to Brubeck. A million responses come in, and the aggregator emits only about 10 metrics: the number of requests, the maximum, minimum and average response times, the median and 4 percentiles. The data is then transferred to Graphite and we see it all live.
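
To give a sense of what that looks like from the application side, here is a minimal sketch of sending such a timer over UDP in the statsd line format that Brubeck understands. The host, port and metric name are placeholders, not our real configuration.

    import socket
    import time

    # Placeholder address of the Brubeck aggregator (statsd usually listens on UDP 8125).
    BRUBECK_ADDR = ("brubeck.local", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send_timer(name, value_ms):
        # statsd timer line format: "<metric>:<value>|ms"
        sock.sendto(f"{name}:{value_ms:.2f}|ms".encode(), BRUBECK_ADDR)

    def handle_request():
        time.sleep(0.05)  # stand-in for real application work

    # Time a request and report its duration; the metric name follows a
    # hypothetical "applications.<service>.<metric>" first-level prefix scheme.
    start = time.monotonic()
    handle_request()
    send_timer("applications.my_service.response_time", (time.monotonic() - start) * 1000)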

We also aggregate hardware, software and system metrics, as well as metrics from our old Munin monitoring system (it served us until 2015). We collect all of this with the C daemon CollectD: it comes with a whole bunch of plug-ins, can query all the resources of the host it is installed on, and you just specify in its configuration where to write the data - and it writes it to Graphite. It also supports Python plugins and shell scripts, so you can build your own custom solutions: CollectD will collect the data from a local or remote host (say, via curl) and send it to Graphite.
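
For illustration, a custom CollectD Python plugin can be as small as the sketch below; the plugin name and the gauge it reports are made up for the example, and the dispatched values travel to Graphite through CollectD's usual write path.

    # Minimal sketch of a CollectD Python plugin (loaded through CollectD's "python" plugin).
    # The plugin name and the gauge it reports are hypothetical examples.
    import collectd

    def get_queue_depth():
        return 42  # stand-in for real collection logic (an HTTP call, a file read, ...)

    def read_callback():
        v = collectd.Values(type="gauge")
        v.plugin = "my_custom_plugin"     # becomes part of the metric path
        v.type_instance = "queue_depth"
        v.values = [get_queue_depth()]
        v.dispatch()                      # hands the value to CollectD's write plugins

    collectd.register_read(read_callback)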

Next, all the metrics we collect are sent to Carbon-c-relay, a C reimplementation of Graphite's Carbon relay. It is a router that gathers all the metrics coming from our aggregators and routes them across the storage nodes. At the routing stage it also validates the metrics: first, they must match the prefix scheme I showed earlier, and second, they must be valid for Graphite. Otherwise they are dropped.
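
In production this filtering lives in carbon-c-relay's own configuration, but conceptually it boils down to something like this sketch; the whitelist of first-level prefixes is assumed from the categories listed above.

    import re

    # First-level prefixes assumed from the categories mentioned above.
    ALLOWED_PREFIXES = {"servers", "networks", "containers", "resources", "applications"}

    # A Graphite-friendly metric path: dot-separated segments of safe characters.
    VALID_METRIC = re.compile(r"^[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)+$")

    def accept(metric_name):
        """Return True if the metric passes both checks the relay performs."""
        if not VALID_METRIC.match(metric_name):
            return False  # not a valid Graphite path
        return metric_name.split(".", 1)[0] in ALLOWED_PREFIXES

    print(accept("applications.my_service.response_time.median"))  # True
    print(accept("junk metric with spaces"))                       # False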

Carbon-c-relay then sends the metrics on to the Graphite cluster. As the main metric storage we use go-carbon, a carbon-cache rewritten in Go; thanks to its multithreading it far outperforms the original carbon-cache. It receives the data and writes it to disk in the whisper format (the standard one, implemented in Python). To read data from our storage we use Graphite API, which is much faster than the standard Graphite-web. What happens to the data next?
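
Reading goes through the standard Graphite render endpoint, so any client can pull datapoints with a plain HTTP request. A minimal sketch (the host and the target are placeholders):

    import json
    import urllib.request

    GRAPHITE = "http://graphite-api.local"  # placeholder Graphite API host
    target = "applications.my_service.response_time.median"

    url = f"{GRAPHITE}/render?target={target}&from=-1h&format=json"
    with urllib.request.urlopen(url) as resp:
        series = json.load(resp)

    # Each series looks like {"target": ..., "datapoints": [[value, timestamp], ...]}.
    for s in series:
        points = [v for v, _ts in s["datapoints"] if v is not None]
        if points:
            print(s["target"], "average over the last hour:", sum(points) / len(points))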

It goes to Grafana. We use our Graphite clusters as the main data source, and Grafana serves as the web interface for displaying metrics and building dashboards. Developers create a dashboard for each of their services and build graphs on it that display the metrics their applications write. Besides Grafana we also have SLAM, a Python daemon that calculates SLA based on data from Graphite. As I said, we have several dozen microservices, each with its own requirements. SLAM takes the requirements from the documentation, compares them with what is in Graphite, and shows how they correspond to the actual availability of our services.
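
SLAM itself is an internal tool, but the idea can be sketched in a few lines: take a service's availability datapoints from Graphite (for example, via a render request like the one above) and turn them into an SLA percentage to compare with the documented target. The metric semantics, the threshold and the target below are hypothetical.

    def sla_percentage(datapoints, threshold=1.0):
        """Share of datapoints counted as 'available'.

        datapoints: [[value, timestamp], ...] as returned by the Graphite render API;
        a point counts as available if its value is at or above the threshold.
        """
        known = [v for v, _ts in datapoints if v is not None]
        if not known:
            return 0.0
        ok = sum(1 for v in known if v >= threshold)
        return 100.0 * ok / len(known)

    # Hypothetical usage: compare the measured SLA against the documented requirement.
    example = [[1, 1700000000], [1, 1700000030], [None, 1700000060], [0, 1700000090]]
    measured = sla_percentage(example)
    required = 99.99
    print(f"measured {measured:.2f}% vs required {required}% ->",
          "OK" if measured >= required else "violated")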

Moving on: alerting. It is built on a solid system - Moira. It is independent because it has its own Graphite under the hood. Developed by the folks at SKB Kontur, written in Python and Go, and fully open source. Moira receives the same stream of metrics that goes into the Graphite instances, so if your storage dies for some reason, your alerting keeps working.

We deployed Moira in Kubernetes; it uses a cluster of Redis servers as its main database, which gives us a fault-tolerant system. Moira compares the incoming stream of metrics against the list of triggers: if a metric is not mentioned in any trigger, it is dropped. This is how it manages to digest gigabytes of metrics per minute.

We also hooked it up to our corporate LDAP, so every user of the corporate system can create notifications for themselves on existing (or newly created) triggers. Since Moira contains Graphite, it supports all of its features: you first take a query line and copy it into Grafana to see how the data looks on a chart, then take the same line, copy it into Moira, hang thresholds on it, and get an alert at the output. None of this requires any special knowledge. Moira can alert via SMS, email, Jira, Slack… It also supports custom scripts: when a trigger fires and it is subscribed to a custom script or binary, Moira launches it and passes the event as JSON to stdin. Your program has to parse it, and what you do with that JSON is up to you - send it to Telegram, open a task in Jira, whatever you want.
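
A custom notification script is therefore just a program that reads JSON from stdin and does something with it. A minimal sketch is below; the event field names and the Telegram forwarding are assumptions for the example, since the exact event schema is not described here.

    # Sketch of a custom Moira notification script: read the event JSON from stdin.
    import json
    import sys
    import urllib.parse
    import urllib.request

    event = json.load(sys.stdin)

    # The field names below are assumptions; inspect a real event to adapt them.
    trigger = event.get("trigger", {}).get("name", "unknown trigger")
    state = event.get("state", "UNKNOWN")
    text = f"Moira: {trigger} is now {state}"

    # Example action: forward to a Telegram chat via the Bot API (token and chat id are placeholders).
    token, chat_id = "<bot-token>", "<chat-id>"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    urllib.request.urlopen(f"https://api.telegram.org/bot{token}/sendMessage", data=data)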

For alerting we also use a development of our own - Imagotag. We adapted a panel normally used for electronic price tags in stores to our needs and wired Moira triggers into it: it shows what state they are in and when they fired. Some of the developers have abandoned Slack and email notifications in favor of this panel.

[Image: the Imagotag panel displaying Moira triggers]

And since we are a progressive company, we also monitor Kubernetes with this system. We plugged it in using Heapster, which we installed in the cluster; it collects the data and sends it to Graphite. As a result, the scheme looks like this:

[Diagram: the current scheme of interaction of the monitoring components]

Monitoring Components

Here is a list of links to the components we used for this task. All of them are open source.

  • Graphite: graphiteapp.org
  • Carbon-c-relay: github.com/grobian/carbon-c-relay
  • Brubeck: github.com/github/brubeck
  • CollectD: collectd.org
  • Moira: github.com/moira-alert
  • Grafana: grafana.com
  • Heapster: github.com/kubernetes/heapster

Statistics

And here are some numbers about how the system works for us.

Aggregator (brubeck)

Number of metrics: ~300 000 / sec
Graphite Metrics Sending Interval: 30 sec
Server resource utilization: ~6% CPU (we are talking about full-fledged servers); ~1 GB RAM; ~3 Mbps LAN

Graphite (go-carbon)

Number of metrics: ~1 600 000 / min
Metric update interval: 30 sec
Metric storage scheme: 30 sec for 35 days, 5 min for 90 days, 10 min for 365 days (this gives a view of what happens to a service over a long period; see the whisper sketch below)
Server resource usage: ~10% CPU; ~ 20Gb RAM; ~ 30 Mbps LAN
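
For reference, that retention scheme maps directly onto whisper archives. Here is a small sketch using the whisper Python package mentioned earlier (the file path is a placeholder); in practice go-carbon creates the files itself according to its storage-schemas configuration.

    import whisper  # the same whisper format go-carbon writes on disk

    # Archives as (seconds_per_point, number_of_points):
    # 30 s for 35 days, 5 min for 90 days, 10 min for 365 days.
    archives = [
        (30, 35 * 24 * 3600 // 30),     # 100800 points
        (300, 90 * 24 * 3600 // 300),   # 25920 points
        (600, 365 * 24 * 3600 // 600),  # 52560 points
    ]

    path = "response_time.wsp"  # placeholder path
    whisper.create(path, archives, xFilesFactor=0.5, aggregationMethod="average")
    print(whisper.info(path)["archives"])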

Flexibility

We at Avito really value the flexibility of our monitoring service. Why did it turn out this way? First, its constituent parts are interchangeable: both the components themselves and their versions. Second, maintainability: since the whole project is built on open source, you can edit the code yourself, make changes, and implement features that are not available out of the box. Fairly common stacks are used, mainly Go and Python, so this is quite easy to do.

Here is an example of a real problem. A metric in Graphite is a file; it has a name, the file name is the metric name, and there is a path leading to it. Filenames in Linux are limited to 255 characters. And we have, as “internal customers”, the guys from the database department. They told us: “We want to monitor our SQL queries. They are not 255 characters long - they are 8 MB each. We want to display them in Grafana, see the parameters of each query, and better yet, see the top of such queries. It would be great if this were shown in real time. And it would be really cool to wire them into alerts.”

[Image: an example of a long SQL query]
The SQL query in this example is taken from postgrespro.ru

We set up a Redis server and CollectD plugins of our own that go to Postgres, pull all the data from there and send metrics to Graphite - but we replace each metric name with a hash. The same hash is simultaneously written to Redis as a key, with the entire SQL query as the value. All that remains is to make Grafana able to go to Redis and fetch this information. We open up the Graphite API, since it is the main interface through which all monitoring components talk to Graphite, and add a new function there called aliasByHash(): it takes the metric name coming from Grafana, uses it as a key in a request to Redis, and gets back the key's value, which is our SQL query. This way we got Grafana to display an SQL query that, in theory, could not be displayed there, along with its statistics (calls, rows, total_time, …).
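
The writing side of that scheme can be sketched roughly as follows; the hashing, the Redis layout and the metric prefix are illustrative assumptions rather than our exact production code.

    import hashlib
    import socket

    import redis  # pip install redis

    r = redis.Redis(host="redis.local", port=6379)               # placeholder Redis host
    graphite = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # metrics go out as usual
    AGGREGATOR_ADDR = ("brubeck.local", 8125)                    # placeholder aggregator address

    def report_query(sql_text, total_time_ms):
        # A short, filename-safe hash stands in for the multi-megabyte query text.
        h = hashlib.sha1(sql_text.encode()).hexdigest()[:16]
        # hash -> full SQL query, so an aliasByHash()-style lookup can resolve it later.
        r.set(h, sql_text)
        # The metric itself is named by the hash (the prefix is a made-up example).
        line = f"resources.postgres.queries.{h}.total_time:{total_time_ms}|ms"
        graphite.sendto(line.encode(), AGGREGATOR_ADDR)

    report_query("SELECT * FROM items WHERE ...", total_time_ms=12.7)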

Results

Availability. Our monitoring service is available 24/7 from any application and any code. If you have access to the storage, you can write data to the service. The language doesn't matter, nor do your technical decisions. All you need to know is how to open a socket, throw a metric into it, and close the socket.

Reliability. All components are fault tolerant and handle our workloads well.

Low entry threshold. To use this system you don't need to learn new programming languages or query languages in Grafana. Just open your application, add a socket to it that sends metrics to Graphite, close it, open Grafana, create dashboards there, watch how your metrics behave, and receive notifications through Moira.

Independence. You can do all this yourself, without the help of DevOps engineers. And that is a killer feature, because you can start monitoring your project right now without asking anyone for anything - neither to get started nor to make changes.

What are we striving for?

Everything listed below is not just abstract thoughts, but something towards which at least the first steps have been taken.

  1. Anomaly detector. We want to create a service that goes through our Graphite storage and checks each metric with various algorithms. There are already algorithms we want to try, we have the data, and we know how to work with it.
  2. Metadata. We have many services, and they change over time, as do the people who work with them. Keeping records manually is not an option, so metadata is now embedded into our microservices. It states who developed the service, which languages it interacts with, its SLA requirements, and where and to whom notifications should be sent. When a service is deployed, all of its entities are created automatically, and you get two links as a result - one to the triggers, the other to the dashboards in Grafana.
  3. Monitoring in every home. We believe all developers should use such a system: then you always understand where your traffic goes, what happens to it, where it drops off, and where its weak points are. If, say, something comes along and crashes your service, you learn about it from an alert rather than from a call from your manager, and you can immediately open the fresh logs and see what happened.
  4. High performance. Our project keeps growing, and today it processes about 2 000 000 metric values per minute. A year ago that figure was 500 000. The growth continues, which means that after a while Graphite (whisper) will start loading the disk subsystem very heavily. As I said, this monitoring system is quite versatile thanks to the interchangeability of its components. Some people keep maintaining and expanding their infrastructure specifically for Graphite, but we decided to go another way and use ClickHouse as the storage for our metrics. This transition is almost complete, and very soon I will tell you in more detail how it was done: what the difficulties were and how we overcame them, how the migration went, and I will describe the components chosen as the glue and their configurations.

Thank you for your attention! Ask your questions on the topic - I will try to answer them here or in the following posts. Perhaps someone has experience building a similar monitoring system or switching to ClickHouse in a similar situation - share it in the comments.

Source: habr.com
