Monitoring is dead? Long live monitoring

Since 2008, our company has mainly been engaged in infrastructure management and round-the-clock technical support for web projects: we have more than 400 clients, about 15% of Russian e-commerce, so we support a very diverse set of architectures. If something goes down, we are obliged to fix it within 15 minutes. But to know that an incident has occurred, you need to monitor the project and respond to incidents. How do you do that?

I think there is a real problem in organizing a proper monitoring system. If there were no problem, this talk would consist of a single thesis: "Please install Prometheus + Grafana and plugins 1, 2, 3." Unfortunately, it no longer works that way. And the main problem is that everyone keeps believing in an approach that worked in 2008, in terms of software components.

Regarding the organization of monitoring systems, I would venture to say that... there are no projects with competent monitoring. And the situation is so bad that if something goes down, there is a risk it will go unnoticed - after all, everyone is sure that "everything is being monitored."
Perhaps everything is being monitored. But how?

We have all come across a story like the following: there is a DevOps engineer, there is an admin, and the development team comes to them and says: "We have released, now monitor it." Monitor what? How does it work?

OK, you monitor it the old-fashioned way. But the project keeps changing, and it turns out you were monitoring service A, which became service B, which now interacts with service C. Meanwhile the development team tells you: "Install the software, it should monitor everything!"

So what has changed? Everything!

2008: Everything is fine

There are a couple of developers, one application server, one database server. That's where it all starts. We install Zabbix, Nagios, or Cacti, and then we set up understandable alerts for CPU, disk I/O, and disk space. We also write a couple of manual checks: that the site responds, and that orders are arriving in the database. And that's it: we are more or less protected.

If we look at the work the administrator did back then to provide monitoring, 98% of it was automatic: the person doing the monitoring had to understand how to install Zabbix, configure it, and set up alerts. The remaining 2% went to external checks: that the site responds, and that a query to the database shows new orders arriving.


2010: The load is growing

We start scaling the web tier and add a search engine. We want to make sure that the product catalog contains all the products, that product search works, that the database is up, that orders are being placed, that the site responds externally from both servers, and that the user is not kicked off the site while being rebalanced to another server, and so on. There are more entities.

Moreover, the infrastructure-related entities, as before, loom largest in the manager's head. The idea persists that the person responsible for monitoring is the person who will install Zabbix and be able to configure it.

But at the same time, new work appears: running external checks, creating a set of scripts that query the search indexer, a set of scripts that verify the search index changes during indexing, a set of scripts that check that goods are handed over to the delivery service, and so on.


Note that I wrote "a set of scripts" three times. The person responsible for monitoring is no longer someone who simply installs Zabbix; this is a person who starts coding. But nothing has changed in the minds of the team.

But the world keeps getting more complicated. A virtualization layer and several new systems are added, and they begin to interact with each other. Who said "smells like microservices"? Individually, each service still looks like a website: we can query it and see that it returns the right data and works on its own. And if you are an administrator who has been involved in a project for 5-7-10 years as it developed, you accumulate this knowledge: a new layer appears and you absorb it, another layer appears and you absorb that too...


But rarely does anyone stay with a project for 10 years.

A monitoring engineer's resume

Suppose you join a new startup that immediately hired 20 developers, wrote 15 microservices, and you are the admin who is told: "Build CI/CD, please." You build CI/CD and suddenly hear: "It is difficult for us to work with production in the 'cube' (Kubernetes) without understanding how the application will behave in it. Make us a sandbox in the same cube."
You make a sandbox in that cube. Then they tell you: "We want a staging database that is refreshed from production every day, so that we can verify things against real data without messing up the production database."

You live with all of this. Two weeks before the release, they tell you: "Now we would like to monitor all of this..." That is, monitor the cluster infrastructure, monitor the microservice architecture, monitor the interaction with external services...

And colleagues pull the familiar scheme out of their heads and say: "It's all clear! Install a program that will monitor everything. Yes, yes: Prometheus + Grafana + plugins."
And they add: "You have two weeks; make sure everything is reliable."

In many of the projects we see, a single person is allocated for monitoring. Imagine we want to hire someone who will set up monitoring within two weeks, and we are writing the job description. What skills should this person have, given everything said above?

  • He must understand monitoring and the specifics of operating the hardware infrastructure.
  • He must understand the specifics of Kubernetes monitoring (and everyone wants to be in the "cube", because you can abstract away from everything and hide behind it, since the admin will deal with the rest): both the cluster infrastructure itself and how to monitor the applications inside it.
  • He must understand that services communicate with each other in particular ways and know the specifics of those interactions. It is quite common to see a project where some services communicate synchronously because there is no other way: for example, the backend calls the catalog service via REST or gRPC, receives a list of products, and returns it; there is no waiting around here. With other services it works asynchronously: hand the order to the delivery service, send an email, and so on.
    Feeling overwhelmed by all this yet? The administrator who has to monitor it is even more overwhelmed.
  • He must be able to plan, and plan correctly, because there is more and more work.
  • He must therefore build a monitoring strategy for the service that has been created, to understand exactly how to monitor it. He needs an understanding of the project architecture and its evolution, plus an understanding of the technologies used in development.

Consider an absolutely typical case: some services are in PHP, some in Go, some in JS, and they somehow work with each other. This is where the term "microservice" comes from: there are so many separate systems that the developers cannot understand the project as a whole. One part of the team writes services in JS that work on their own and do not know how the rest of the system works. Another part writes services in Python and does not look into how other services work; they are isolated in their own area. A third writes services in PHP or something else.
All these 20 people are divided across 15 services, and there is only one admin who has to understand it all. Stop! We just split the system into 15 microservices precisely because 20 people cannot understand the whole system.

But it needs to be fixed somehow...

What is the result? One person has to fit into his head everything that the whole development team cannot grasp, and on top of that he still needs to know and be able to do everything listed above: the hardware infrastructure, the Kubernetes infrastructure, and so on.

What can I say… Houston, we have a problem.

Monitoring a modern software project is a software project in itself

From the false belief that monitoring is just software comes a faith in miracles. But miracles, alas, do not happen. You cannot install Zabbix and expect everything to work. It makes no sense to install Grafana and hope everything will be fine. Most of the time will be spent organizing checks of how services work, how they interact with each other, and how external systems behave. In fact, 90% of the time will be spent not on scripting but on software development. And it should be done by a team that understands how the project works.
If in this situation one person is sent off to do the monitoring, trouble will follow. Which is exactly what happens everywhere.

For example, there are several services that communicate via Kafka. An order comes in, and we publish a message about it to Kafka. One service listens for order messages and ships the goods. Another service listens for order messages and emails the user. Then there are a bunch of other services, and we start to get confused.
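The fan-out described here can be sketched with a toy in-memory bus. The topic and handler names below are illustrative assumptions, and a real system would use a Kafka client instead:

```python
# Toy sketch of event fan-out: one "order" event, several independent
# consumers. Topic and handler names are illustrative assumptions;
# a real system would publish to Kafka rather than this in-memory bus.

shipped, emailed = [], []

def warehouse_handler(order):
    shipped.append(order["id"])   # the warehouse service ships the goods

def email_handler(order):
    emailed.append(order["id"])   # the mail service notifies the user

SUBSCRIBERS = {"orders": [warehouse_handler, email_handler]}

def publish(topic, event):
    """Deliver the event to every service subscribed to the topic."""
    for handler in SUBSCRIBERS.get(topic, []):
        handler(event)

publish("orders", {"id": 42, "items": ["book"]})
print(shipped, emailed)  # both consumers saw the same order independently
```

The point is that no single consumer knows about the others, which is exactly why poking each service individually tells you so little about the whole.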

And if you hand this to the admin and the developers when there is little time left before the release, someone will need to understand this entire protocol. A project of this magnitude takes a significant amount of time, and it should be built into the development of the system.
But very often, especially in startups racing to ship, we see monitoring postponed until later: "First we'll build a proof of concept and launch with it; if it falls over, we are ready to accept that. Then we'll monitor it all." When (or if) the project starts to bring in money, the business wants to cut even more features - because it has started to work, so you need to push it further! And you are at the point where you first need to monitor everything built so far, which takes not 1% of the time but much more. Besides, you will need developers for monitoring, and it is easier to put them on new features. As a result, new features keep getting written, everything keeps spinning up, and you are in an endless deadlock.

So how do you monitor a project from the very beginning, and what do you do if you inherit a project that needs monitoring and you don't know where to start?

First, you need to plan.

A lyrical digression: very often people start with infrastructure monitoring. For example, we have Kubernetes. Let's start by installing Prometheus with Grafana and adding plugins to monitor the "cube". Not only developers but also admins have this unfortunate habit: "We'll install this plugin, and the plugin probably knows what to do." People like to start with simple and clear actions rather than important ones. And infrastructure monitoring is easy.

First decide what and how you want to monitor, and only then choose a tool, because other people cannot think for you. And should they? Other people were thinking about their own needs, or about a universal system, or not thinking at all, when that plugin was written. The fact that a plugin has 5,000 users does not mean it is of any use to you. Perhaps you will become the 5,001st simply because 5,000 people were already there.

If you start with infrastructure monitoring and your app's backend stops responding, all users will lose their connection to the mobile app and see an error. They will come to you and say: "The application is not working, what are you doing here?" "We are monitoring." "How are you monitoring if you can't see that the application is not working?!"

  1. I believe monitoring must start from the user's entry point. If the user cannot see that the application is working, that's it: it's a failure, and the monitoring system should warn about it first of all.
  2. Only then can we monitor the infrastructure, or do it in parallel. The infrastructure is simpler: here we can finally just install Zabbix.
  3. And then you need to go down to the roots of the application to understand where exactly something is broken.

My main point is that monitoring should go in parallel with the development process. If you pull the monitoring team off onto other tasks (building CI/CD, sandboxes, reorganizing the infrastructure), monitoring will start to lag behind and you may never catch up with development (or sooner or later you will have to stop development).

All by levels

This is how I see the organization of the monitoring system.

1) Application level:

  • monitoring the business logic of the application;
  • monitoring service health metrics;
  • integration monitoring.

2) Infrastructure level:

  • orchestration level monitoring;
  • system software monitoring;
  • hardware-level monitoring.

3) The application level again, but now as an engineering product:

  • collecting and monitoring application logs;
  • APM;
  • tracing.

4) Alerting:

  • organization of the warning system;
  • organization of the system of duty;
  • organization of a "knowledge base" and workflow for processing incidents.

An important point: we get to alerting not last, but right away! There is no need to start monitoring and figure out "somehow later" who will receive the alerts. After all, the task of monitoring is to understand where something in the system is going wrong and let the right people know about it. If this is left until the end, the right people will find out that something is wrong only from a phone call: "nothing works for us."

Application Layer - Business Logic Monitoring

Here we are talking about checking the very fact that the application works for the user.

This level should be built during the development phase. For example, we have a conditional Prometheus: it goes to the server responsible for checks, pulls an endpoint, and the endpoint in turn goes and checks the API.

When asked to monitor the main page to make sure the site works, programmers often provide a handle to pull whenever you need to verify that the API works. And at that moment the programmers just go and write /api/test/helloworld.
Is that enough to make sure everything works? No!

  • Creating such checks is, in fact, the developers' job. Unit tests should be written by the programmers who write the code. Because if you dump it on the admin - "Dude, here's a list of API protocols for all 25 functions, please monitor everything!" - nothing will work.
  • If you just print "hello world", no one will ever know whether the API actually works. Every change to the API must lead to a change in the checks.
  • If you already have this problem, stop feature work and assign developers to write these checks, or accept the losses: accept that nothing is checked and things will fall over.

Technical Tips:

  • Be sure to set up an external server for running checks: you must be sure that your project is reachable from the outside world.
  • Organize checks for the entire API protocol, not just for individual endpoints.
  • Create a Prometheus endpoint with the results of the checks.
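As a sketch of the last tip, an external check script might render its results in the Prometheus text exposition format like this. The check names, URLs, and the `api_check_up` metric name are illustrative assumptions, not from the original:

```python
# Sketch of an external checker that exposes results as Prometheus metrics.
# The URLs and metric name are hypothetical; real checks should cover the
# whole API protocol, not just two endpoints.
import urllib.request

CHECKS = {
    "catalog_api": "https://shop.example.com/api/catalog/products",
    "order_api": "https://shop.example.com/api/orders/recent",
}

def run_check(url, timeout=5):
    """Return 1 if the endpoint answers with HTTP 2xx, else 0."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1 if 200 <= resp.status < 300 else 0
    except OSError:
        return 0

def render_metrics(results):
    """Render check results in the Prometheus text exposition format."""
    lines = [
        "# HELP api_check_up Whether the API check passed (1) or failed (0).",
        "# TYPE api_check_up gauge",
    ]
    for name, ok in sorted(results.items()):
        lines.append(f'api_check_up{{check="{name}"}} {ok}')
    return "\n".join(lines) + "\n"

# Demonstration with stubbed results (run_check would hit the real URLs):
print(render_metrics({"catalog_api": 1, "order_api": 0}))
```

Prometheus then scrapes this output from the external server, so the alert can fire even when the whole production site is unreachable.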

Application layer - monitoring health metrics

Now we are talking about the external health metrics of services.

We decided to monitor all the application's "handles" via external checks called from an external monitoring system. But those are exactly the "handles" the user "sees". We also want to be sure the services themselves are working. Here the story is better: K8s has health checks, so that at least the "cube" itself can verify that a service is running. But half the checks I've seen are the same print "hello world": the cube pulls the endpoint once after deployment, it answers that everything is fine, and that's it. Meanwhile a service that exposes its API as REST has a huge number of entry points to that same API, and these also need to be monitored, because we want to know that they work. And we monitor them from the inside.

How to implement this correctly from a technical standpoint: each service exposes an endpoint reporting its current health, and in Grafana dashboards (or any other tool) we see the status of all services.

  • Every change to the API must lead to a change in the checks.
  • Create every new service with health metrics from the start.
  • The admin can come to the developers and ask: "add a couple of features so I can understand everything and feed the information into my monitoring system." But developers usually answer: "We won't add anything two weeks before the release."
    Let the development managers know that there will be losses like these, and let their superiors know too. Because when everything goes down, someone will still call and demand that you monitor the "constantly falling service" (c)
  • By the way, assign developers to write plugins for Grafana: this will be a great help for admins.
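A minimal sketch of such a health endpoint, using only the Python standard library; the specific metrics in `collect_health` are illustrative stubs, not from the original:

```python
# Sketch of a service /health endpoint. The metrics are illustrative
# stubs; a real service would report actual internal checks.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def collect_health():
    """Gather internal health metrics (stubbed here for demonstration)."""
    return {
        "status": "ok",
        "db_connection": True,      # e.g. a SELECT 1 against the database
        "queue_lag_seconds": 0.4,   # e.g. consumer lag on the bus
        "api_handlers": 25,         # all API entry points registered
    }

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        body = json.dumps(collect_health()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass

# One-shot demonstration: serve a single request and read it back.
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.handle_request, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
    payload = json.loads(resp.read())
server.server_close()
print(payload["status"])
```

Grafana (or any other tool) can then scrape and graph these values per service.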

Application Layer - Integration Monitoring

Integration monitoring focuses on monitoring communication between business-critical systems.

For example, there are 15 services that communicate with each other. They are no longer separate sites, i.e. we can no longer pull a service on its own, get /helloworld, and conclude that the service is running. The checkout web service must send order information to the bus; the warehouse service must receive that message from the bus and act on it; the email service must then process it further, and so on.

Accordingly, we cannot understand that the whole thing works by poking each individual service, because there is a bus through which everything communicates and interacts.
Therefore, this stage should mark the stage of testing services for interaction with other services. You cannot monitor the communication just by monitoring the message broker. If there is a service that emits data and a service that receives it, then by monitoring the broker we will only see data flying back and forth. Even if we somehow managed to monitor that interaction from the inside - that a certain producer posts data, someone reads it, the stream keeps flowing through Kafka - it still tells us nothing if one service sent a message in one version and the other service did not expect that version and skipped it. We will not know about it, since both services will tell us that everything is working.

How I recommend doing:

  • For synchronous communication: an endpoint that makes requests to related services. That is, we pull this endpoint, it runs a script inside the service that goes to every dependency and reports "I can reach this one, and this one, and that one..."
  • For asynchronous communication: incoming messages - the endpoint checks the bus for test messages and issues a processing status.
  • For asynchronous communication: outgoing messages - the endpoint sends test messages to the bus.

How it usually works: we have a service that pushes data to the bus. We come to this service and ask it to report on its integration health. If the service needs to produce a message somewhere further down the line (say, for the WebApp), it produces this test message. And if we pull the service on the OrderProcessing side, it first posts what it can post independently; if there are dependent steps, it reads a set of test messages from the bus, verifies that it can process them, reports on it, and, if necessary, posts them further. Then it says: everything is OK, I'm alive.

Very often we hear the question: "how can we test this on production data?" For example, take the same order service: it sends messages to the warehouse where the goods are written off, and we can't test that on production data because "my goods will be written off!" The way out is to plan this whole test at the initial stage. You already have unit tests that use mocks; do the same at a deeper level, with a dedicated communication channel that doesn't hurt the business.
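The sync and async checks above can be sketched like this. The `InMemoryBus` is a stand-in for a real broker, and the service names and marker protocol are illustrative assumptions:

```python
# Sketch of an integration-health endpoint. InMemoryBus stands in for a
# real broker (Kafka etc.); service names and the test-marker protocol
# are illustrative assumptions.

class InMemoryBus:
    def __init__(self):
        self.processed = set()

    def publish(self, topic, message):
        # In this sketch the "consumer" handles test messages instantly;
        # in reality the consuming service processes them and reports back.
        if message.get("type") == "test":
            self.processed.add((topic, message["marker"]))

    def test_marker_processed(self, topic, marker):
        return (topic, marker) in self.processed

def check_sync_dependency(call):
    """Probe a synchronous dependency (a REST/gRPC call in real life)."""
    try:
        call()
        return True
    except Exception:
        return False

def check_async_dependency(bus, topic):
    """Send a marked test message and confirm it was processed."""
    marker = "integration-test-0001"
    bus.publish(topic, {"type": "test", "marker": marker})
    return bus.test_marker_processed(topic, marker)

def integration_health(sync_calls, bus, topics):
    """Aggregate the status this service's integration endpoint exposes."""
    report = {name: check_sync_dependency(fn) for name, fn in sync_calls.items()}
    report.update({t: check_async_dependency(bus, t) for t in topics})
    report["ok"] = all(report.values())
    return report

bus = InMemoryBus()
report = integration_health({"catalog": lambda: None}, bus, ["orders", "emails"])
print(report["ok"])
```

The key design choice is that test messages are explicitly marked, so the warehouse service can recognize them and skip the real write-off.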

Infrastructure layer

Infrastructure monitoring is what was long considered monitoring itself.

  • Infrastructure monitoring can and should be run as a separate process.
  • You should not start with infrastructure monitoring on a running project, even if you really want to. This is a sore spot for every DevOps engineer: "First I'll monitor the cluster, I'll monitor the infrastructure." That is, first he monitors what lies underneath, but does not get into the application, because the application is an incomprehensible thing to him. It was handed over to him, and he does not understand how it works; the infrastructure he does understand, so he starts there. But no: you always need to monitor the application first.
  • Don't go overboard with the number of notifications. Given the complexity of modern systems, alerts fly in all the time, and you have to live with this pile of alerts somehow. After looking at a hundred routine alerts, the on-call person decides "I don't want to think about this." Alerts should notify only about critical things.

Application layer as an engineering product

Key points:

  • ELK. This is the industry standard. If for some reason you are not yet aggregating logs, start immediately.
  • APM. External APMs as a way to quickly close the gap in application monitoring (New Relic, Blackfire, Datadog). You can put such a tool in temporarily to get at least some idea of what is going on.
  • Tracing. With dozens of microservices you have to trace everything, because a request no longer lives on its own. Tracing is very difficult to bolt on later, so it is better to plan it into development from the start: this is the developers' work, and it helps them too. If you haven't implemented it yet, implement it! See Jaeger/Zipkin.
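To illustrate why tracing has to be designed in, here is a bare-bones sketch of trace-context propagation in pure Python. Real projects would use OpenTelemetry with a Jaeger or Zipkin backend; the header name and span fields here are simplified assumptions:

```python
# Bare-bones trace propagation: every request carries a trace id, and
# each service records its own span against it. The "x-trace-id" header
# and span structure are simplified stand-ins for a real tracing stack.
import time
import uuid

SPANS = []  # a real system exports spans to the tracing backend

def start_span(name, headers):
    """Join the incoming trace (or start a new one) and record a span."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    SPANS.append({"trace_id": trace_id, "span_id": uuid.uuid4().hex,
                  "name": name, "start": time.time()})
    # Propagate the same trace id to any downstream call.
    return {"x-trace-id": trace_id}

def backend(headers):
    out_headers = start_span("backend", headers)
    catalog(out_headers)          # the downstream call carries the context

def catalog(headers):
    start_span("catalog", headers)

backend({})  # an incoming request with no trace context yet
# Both spans now share one trace id, so the request can be followed
# across services:
print(SPANS[0]["trace_id"] == SPANS[1]["trace_id"])
```

Because every service must cooperate in forwarding the context, this is exactly the kind of thing that is cheap at design time and painful to retrofit.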

Alerting

  • Organization of the alert system: with a pile of things under monitoring, there should be a single system for sending alerts. It can be done in Grafana; everyone in the West uses PagerDuty. Alerts should be clear (e.g. where they came from...), and it is desirable to verify that notifications actually arrive at all.
  • Organization of the on-call rotation: alerts should not go to everyone (either everyone will react in a crowd, or no one will). Developers also need to be on call: be sure to define areas of responsibility, and write clear instructions that spell out exactly who to call on Monday and Wednesday and who on Tuesday and Friday (otherwise people will not call anyone even in a real emergency: they will be afraid to wake someone up or disturb them; people generally don't like waking other people, especially at night). And explain that asking for help is not a sign of incompetence ("if I ask for help, I'm a bad worker"): encourage requests for help.
  • Organization of a "knowledge base" and an incident-handling workflow: a post-mortem should be scheduled for every serious incident, and as an interim measure the actions that resolved the incident should be written down. Make it a practice that repeated alerts are a sin: they need to be fixed in the code or the infrastructure.

Technology stack

Let's imagine that we have the following stack:

  • data collection - Prometheus + Grafana;
  • log analysis - ELK;
  • for APM or Tracing - Jaeger (Zipkin).


The specific choice is not critical: if you began by understanding how to monitor the system and drew up a plan, then you are already choosing tools to fit your requirements. The real question is what you chose to monitor in the first place, because the tool you picked at the start may not fit your requirements at all.

A few technical things I've been seeing everywhere lately:

Prometheus gets shoved inside Kubernetes - who came up with this?! If your cluster dies, what will you do? If you have a complex cluster, some monitoring should run inside the cluster and some outside, collecting data from inside it.

Inside the cluster we collect logs and everything else, but the monitoring system itself must be outside. Very often, in a cluster with a Prometheus installed inside, there are also the systems that perform the external checks of the site. But what if your links to the outside world drop and the application is unreachable? Then everything looks fine from the inside, which does not make it any easier for the users.

Conclusions

  • Monitoring development is not the installation of utilities but the development of a software product. 98% of today's monitoring is coding: coding in services, coding external checks, checking external services, all of it.
  • Do not begrudge developers' time for monitoring: it can take up to 30% of their work, but it's worth it.
  • DevOps engineers, don't worry about not being able to monitor something: some things simply require a different mindset. You weren't the programmer, and monitoring the application logic is the developers' job.
  • If the project is already running and not monitored (and you are a manager), allocate resources for monitoring.
  • If the product is already in production and you are the DevOps engineer who was told to "set up monitoring", try explaining to management everything I wrote here.

This is an extended version of the report at the Saint Highload++ conference.

If you are interested in my ideas and thoughts on this and related topics, here you can read the channel : )

Source: habr.com
