Okerr hybrid monitoring system overview

Two years ago I published a post about okerr, "Simple website failover". The project has developed since then, and I have also published the okerr server-side source code under an open license, so I decided to write this short review on Habr.

Who might be interested

This may be of interest to you if you work in a small team or even alone. You have no monitoring and are not sure you really need it. Or you tried some popular, serious monitoring system "for big boys", but somehow it "didn't take off" for you, or it runs in an almost default configuration and has not changed your life much. It is also for you if you definitely do not plan to dedicate an entire employee (or even a department) to configuring monitoring or watching its dashboard for at least a couple of hours a day.

What is unusual about okerr

Next, I will show the interesting features of okerr that distinguish it from other monitoring systems.

Okerr is hybrid monitoring

With internal monitoring, an "agent" runs on the monitored machines and sends data (for example, free disk space) to the monitoring server. With external monitoring, the server itself performs checks over the network (for example, ping or website availability). Each approach has its own limitations, so okerr uses both. Checks inside the servers are performed by a very light (30 KB) agent or by your own scripts and applications, and network checks are done by okerr sensors located in different countries.

Okerr is not just software, but also a service

The server part of any monitoring system is big and complex: it is difficult to install and configure, and it requires resources. With okerr you can install your own monitoring server (it is free and open source), or you can install only the client side and use our server as a service. That is also free.

If monitoring is meant to compensate for the limited reliability of servers and applications, a philosophical question arises: who guards the guard? How will monitoring tell us about a problem if it has "died" itself, on its own or together with your other resources (for example, the link to the data center went down)? With the external okerr service this problem is solved: you will receive an alert even if the entire data center with your servers loses power or is attacked by zombies.

Of course, there is a risk that the okerr server itself will be unavailable (as you know, 90% reliability comes easily and "for free", 99% takes minimal effort, and each further nine is exponentially harder). But, firstly, the chances of this are lower, and secondly, a problem can go unnoticed only if it coincides in time with a problem on our servers. If we have 99.9% reliability and you have 99.9% (not very high numbers), then, assuming the failures are independent, the chance of an unnoticed failure is 0.1% of 0.1% = 0.0001%. Adding three nines to your reliability with almost no effort and at no cost is very good!
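The arithmetic above can be checked in a couple of lines. This is only a worked illustration: the function name is made up, and it assumes that your failures and the failures of the monitoring service are independent events.

```python
def unnoticed_failure_probability(service_availability, monitoring_availability):
    """Probability that the service is down at the same moment monitoring is down.

    Assumes the two failures are independent, which is a simplification.
    """
    return (1 - service_availability) * (1 - monitoring_availability)


p = unnoticed_failure_probability(0.999, 0.999)
print("{:.4f}%".format(p * 100))  # 0.1% of 0.1%, expressed as a percentage
```

If both sides share a failure cause (for example, the same data center), the independence assumption breaks and the real number will be worse.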

Another advantage of monitoring as a service is that a hosting provider or web studio can install an okerr server and provide access to customers as a paid or free additional service. Your competitors just have hosting and websites - and you have reliable hosting with monitoring.

Okerr is about indicators

An indicator is a light bulb. It has two main states: green (OK) or red (ERR). A project has many indicators, grouped (for example, by server). On the main page of the project you immediately see that either everything is green (and you can close the page), or something is lit red and needs fixing. A notification is sent on every transition between these states. Once a day, at a time you configure, a summary of the project is sent.

Each okerr indicator has built-in conditions by which it changes state (in Zabbix this is called a trigger). For example, load average should be no more than 2 (of course, this is configurable). And for each internal check (load average, disk free, ...) there is a watchdog. If for some reason we did not receive a successful confirmation at the appointed time, an error is logged and an alert is sent.
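The combination of a threshold and a watchdog can be pictured with a toy model. This is only a sketch to illustrate the idea: the Indicator class, its fields and the 20-minute period are hypothetical and say nothing about how okerr is implemented internally.

```python
import time


class Indicator:
    """Toy numeric indicator with an upper limit and a watchdog (hypothetical)."""

    def __init__(self, name, maxlim, period, now=None):
        self.name = name
        self.maxlim = maxlim    # threshold, e.g. load average must not exceed 2
        self.period = period    # seconds within which a fresh update is expected
        self.value = None
        self.updated = time.time() if now is None else now

    def update(self, value, now=None):
        """Store a new measurement and remember when it arrived."""
        self.value = value
        self.updated = time.time() if now is None else now

    def status(self, now=None):
        """Return 'OK' or 'ERR' for the given moment."""
        now = time.time() if now is None else now
        if now - self.updated > self.period:
            return "ERR"        # watchdog: no confirmation arrived in time
        if self.value is None or self.value > self.maxlim:
            return "ERR"        # threshold violated, or no data yet
        return "OK"


la = Indicator("srv:loadavg", maxlim=2, period=1200, now=0)
la.update(0.7, now=60)
print(la.status(now=120))       # a fresh value below the limit
```

The same indicator turns red in two different ways: a bad value arrives, or no value arrives at all.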

Our usual routine is to check mail in the morning and, among other letters, look at the summary (we scheduled it for the start of the working day). If everything in it is OK, we get on with other important things (though, to be sure, we can glance at the okerr dashboard and confirm that everything is green right now). If an alert comes, we react.

Of course, you can simply keep "informational" indicators (to see the picture of your network in the monitoring), but everything is designed so that you can simply, easily and quickly create indicators specifically for automatic monitoring and alerting.

The whole point of setting up okerr is alerts: you create an indicator in a minute, it can "sleep" for a year, just accepting updates, and when something breaks a year later, it lights up and sends an alert. The minute you once spent creating the indicator has paid off: you learned about the problem right away, before anyone else, and perhaps fixed it before anyone noticed. What gets back up quickly does not count as having fallen!

Security

It would be a shame to set up monitoring to increase reliability and then be attacked over the network through it. Quite a lot of network vulnerabilities have been found in various monitoring tools (Zabbix, Nagios).

The agent running on the system (okerrmod from the okerrupdate package) is not a network server but a client. Therefore there are no additional open ports on the monitored server, the client works easily behind a firewall or NAT, and it is very difficult (I would say "impossible") to hack it over the network, since it simply does not listen on any network socket.

Full monitoring coverage

We now have a rule: we learn about all technical problems from okerr. If the rule is ever violated (okerr did not warn us that a problem was coming, where that was possible, or that it had already occurred), we add checks to okerr.

External checks

Pretty typical set:

  • ping
  • http-status
  • checking the validity and freshness of the SSL certificate (will warn you if the term is about to expire)
  • open TCP port and a banner on it
  • http grep (the page must [or must not] contain specific text)
  • sha1 hash to catch page changes
  • DNS (DNS entry must have a specific value)
  • WHOIS (will warn you if the domain is about to expire)
  • Antispam DNSBL (host check against 50+ anti-spam blacklists at once)
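The sha1 check from the list above boils down to comparing a stored fingerprint with the fingerprint of the page as it is now. Here is a minimal sketch of that idea using only the Python standard library; the function names are made up for illustration, and fetching the page over HTTP is left out so that the example stays self-contained.

```python
import hashlib


def page_fingerprint(body: bytes) -> str:
    """SHA1 hex digest of the page content."""
    return hashlib.sha1(body).hexdigest()


def page_changed(body: bytes, expected_sha1: str) -> bool:
    """True if the content no longer matches the stored fingerprint."""
    return page_fingerprint(body) != expected_sha1


stored = page_fingerprint(b"<html>status: vulnerable</html>")
print(page_changed(b"<html>status: vulnerable</html>", stored))  # same content
print(page_changed(b"<html>status: fixed</html>", stored))       # content differs
```

Such a check says only that something changed, not what changed, so pages with dynamic parts (timestamps, counters) will trigger it constantly.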

Internal checks

Also, a fairly typical set (but easily extensible).

  • df (free disk space)
  • load average
  • opentcp (open listening TCP sockets - will notify if something starts or crashes)
  • uptime - server uptime. Will notify if it has decreased (i.e. the server has rebooted)
  • client_ip
  • dirsize - we use it to track when our rootfs of virtual machines go beyond the allowed size without introducing hard limits, and for the sizes of user home directories
  • empty and nonempty - keep track of files that should be empty (or non-empty). For example, the error log of the okerr server itself should be empty, and if there is even one line in it, I receive a notification and go check. But mail.log on the mail server must NOT be empty (N minutes after rotation). And it has sometimes been empty for us after a system update, when logrotate could not properly restart rsyslog.
  • linecount - the number of lines in a file (like wc -l). We use it as a softer replacement for empty, when the error log may still grow, but only slowly (for example, we have googlebot hammering on some closed pages). The limit is 2 lines in 20 minutes; if the growth is higher, there will be an alert.
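The linecount idea from the list above (a log may grow, but only slowly) can be sketched as follows. This is an illustration under assumptions, not the module's actual code: the function names and the way rotation is handled are hypothetical, and only the limit of 2 lines per 20 minutes comes from the text.

```python
def count_lines(path):
    """Count lines in a file, roughly like `wc -l`."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def grows_too_fast(previous, current, elapsed_minutes,
                   max_lines=2, window_minutes=20):
    """True if the file grew faster than max_lines per window_minutes."""
    if current < previous:
        # The file shrank, so it was probably rotated; start counting afresh.
        return False
    allowed = max_lines * elapsed_minutes / window_minutes
    return (current - previous) > allowed


print(grows_too_fast(previous=10, current=12, elapsed_minutes=20))  # within the limit
print(grows_too_fast(previous=10, current=30, elapsed_minutes=20))  # growing too fast
```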

Interesting internal checks

If you have only been skimming up to this point, the next part deserves a more careful read.

Backups

Keeps track of backups in a directory. Our backup files have names like "ServerName-20200530.tar.gz". For each server, an indicator named ServerName-DATE.tar.gz is created in okerr (the actual date is replaced with the string "DATE"). Both the presence of a fresh backup and its size are monitored (for example, it cannot be less than 90% of the size of the previous backup).

What needs to be done to start tracking a new backup after we start creating it and putting it in this directory? Nothing! This is a very handy approach when you need to do "nothing" because:

  • Doing “nothing” is pretty quick, it saves time
  • It's hard to forget to do "nothing"
  • It's hard to do "nothing" wrong, with a mistake. Nothing is the most reliable method

If fresh backup files suddenly stop appearing, there will be an alert. If, for example, you have disabled one of the servers, and there should be no more backups, you will need to remove the indicator (via the web interface or from the shell via the API).
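The backup logic described in this section (replace the date with "DATE", require a fresh file, compare its size with the previous one) can be sketched like this. It is a simplified illustration, not the real module: the function names, the 8-digit date pattern and the rule "fresh means dated today" are assumptions made for the example.

```python
import re

DATE_RE = re.compile(r"\d{8}")  # assumed date format: YYYYMMDD


def indicator_name(filename):
    """ServerName-20200530.tar.gz -> ServerName-DATE.tar.gz"""
    return DATE_RE.sub("DATE", filename, count=1)


def check_backups(files, today, min_ratio=0.9):
    """files maps filename to size in bytes; today is a 'YYYYMMDD' string."""
    groups = {}
    for filename, size in files.items():
        match = DATE_RE.search(filename)
        if match is None:
            continue  # not a dated backup file, ignore it
        groups.setdefault(indicator_name(filename), []).append((match.group(), size))

    result = {}
    for name, backups in groups.items():
        backups.sort()  # YYYYMMDD strings sort chronologically
        latest_date, latest_size = backups[-1]
        fresh = latest_date == today
        # With a single backup there is nothing to compare the size with.
        big_enough = len(backups) < 2 or latest_size >= min_ratio * backups[-2][1]
        result[name] = "OK" if fresh and big_enough else "ERR"
    return result


files = {
    "alpha-20200529.tar.gz": 1000, "alpha-20200530.tar.gz": 950,
    "beta-20200529.tar.gz": 500, "beta-20200530.tar.gz": 100,
}
print(check_backups(files, today="20200530"))
```

Note how a new server needs no configuration in this sketch: as soon as its files appear in the directory, a new name appears in the result.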

maxfilesz

Keeps track of the size of the largest files (usually /var/log/*). This lets you catch unpredictable problems, such as password brute-forcing or spam being sent through the server.

runstatus/runline

These are two important proxy modules for running other programs on the server. Runstatus reports the exit code of a program to an indicator. For example, okerr does not have (and does not need) a module to check that systemd services are running; this is done via runstatus (see below). Runline reports to the server the line that a program prints. For example, temp_RUN="cat /sys/class/thermal/thermal_zone0/temp" in the runline config on our server creates an indicator servername:temp with the processor temperature.

sql

Executes a MySQL query that returns a number and reports the result to an indicator. In the simplest case you can run, for example, "SELECT 1", which checks that the DBMS is working at all.

But a much more interesting application is, for example, tracking the number of orders in an online store. If you know that you have at least 100 orders per hour, you can set the minimum limit to 100 or 80. Then if your sales suddenly drop, you will receive an alert and you can figure it out.

Note - no matter for what unpredictable reason this happened:

  • The server is simply unavailable (lost power or network), and the alert came because the indicator went stale.
  • The server is loaded with something, it works slowly or packets are lost, users are uncomfortable and leave without buying.
  • The server ended up in spam lists and mail from it is not accepted, so users cannot register.
  • The budget of the advertising campaign has run out and the banners are no longer shown.

There can be any number of reasons; you cannot foresee all of them in advance, and each is technically difficult to track. But you can conveniently monitor the final parameter (orders) and conclude from it that the situation is suspicious and deserves a closer look.
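The idea of watching the final business number can be shown with a small sketch. To keep it self-contained and runnable it uses an in-memory SQLite database as a stand-in for MySQL, and the orders table, the function names and the limits are invented for the example; the real sql module is configured, not coded like this.

```python
import sqlite3


def numeric_query(conn, query):
    """Run a query that returns a single number and return it as a float."""
    row = conn.execute(query).fetchone()
    return float(row[0])


def status_for(value, minlim):
    """OK if the value reached the minimum limit, ERR otherwise."""
    return "OK" if value >= minlim else "ERR"


# Hypothetical data: 85 orders registered during the last hour.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany("INSERT INTO orders (created) VALUES (?)",
                 [("2020-05-30 12:00:00",)] * 85)

orders = numeric_query(conn, "SELECT COUNT(*) FROM orders")
print(orders, status_for(orders, minlim=80), status_for(orders, minlim=100))
```

With a minimum of 80 the indicator stays green, with a minimum of 100 the same 85 orders would already raise an alert, which is why the limit is worth choosing with some margin.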

Logic indicators

Logic indicators let you use boolean expressions (in Python syntax) via the evalidate module (there is an article about it on Habr). The data of the project and its indicators is available to the expression. For example, in the section about the SQL check above you may have noticed a weak point: during the day we may have 100 or more sales per hour, but at night only 20, and that is normal, not a problem. What to do? After all, the indicator will constantly panic at night.

You can create two indicators, day and night. Make both "quiet" (they will not send alerts). And create a logical indicator that requires before 20:00 for the day indicator to be OK, and after 20:00 it is enough for the night indicator to be OK.
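The day/night rule is just a boolean expression. Here it is written as plain Python rather than in okerr's own expression syntax; the function and variable names are hypothetical, and the single 20:00 switch point mirrors the simplified description above.

```python
def sales_ok(hour, day_ok, night_ok, switch_hour=20):
    """Before switch_hour the day indicator decides, after it the night one."""
    return (hour < switch_hour and day_ok) or (hour >= switch_hour and night_ok)


# At 23:00 the strict daytime indicator is red, but the night one is green.
print(sales_ok(hour=23, day_ok=False, night_ok=True))
# At 14:00 only the daytime indicator matters.
print(sales_ok(hour=14, day_ok=False, night_ok=True))
```

A real rule would also need a morning switch back to the day indicator; it is omitted here to stay close to the example in the text.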

Another example of using a logical indicator is escalation. For example, a project manager unsubscribes from alerts (he doesn't need to, admins should respond to common problems), but subscribes to a logical indicator that turns red if any indicator in the project is not fixed within the allotted time.

It is also possible to define an allowed maintenance window, for example from 3 to 5 in the morning. We do not care if servers and sites are down during this time, but at 5:00 they have to work, and if they do not work at any other time, there is an alert. The logical indicator also lets you take server redundancy into account. If you have 5 web servers, admins can shut down 1-2 of them at any time. But if fewer than 3 of the 5 servers are in service, there will be an alert.
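Both rules from this paragraph, the maintenance window and the redundancy threshold, reduce to very small expressions. The sketch below is plain Python with hypothetical names, shown only to make the logic concrete; it is not okerr configuration.

```python
def enough_servers(alive, minimum=3):
    """alive is a list of booleans, one per web server."""
    return sum(alive) >= minimum


def status_with_window(is_up, hour, window_start=3, window_end=5):
    """Downtime is tolerated only inside [window_start, window_end)."""
    in_window = window_start <= hour < window_end
    return "OK" if (is_up or in_window) else "ERR"


print(enough_servers([True, True, True, False, False]))   # 3 of 5 are in service
print(enough_servers([True, True, False, False, False]))  # only 2 of 5 remain
print(status_with_window(is_up=False, hour=4))            # down inside the window
print(status_with_window(is_up=False, hour=5))            # down after the window
```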

The examples above are not okerr functions, not features that need to be activated and configured. Okerr has none of these functions built in, but it has a logical module that lets you implement them (much as in a programming language: since we have arithmetic operators, the language needs no special function for calculating 20% VAT; you can always write it yourself to suit your needs).

The logical indicator is probably one of the few relatively complex topics in okerr, but the good news is that you do not have to master it until you need it. When you do, it greatly expands the possibilities while keeping the system itself quite simple.

Adding your checks

I would really like to convey the idea that okerr is not a set of thousands of ready-made checks for all occasions, but rather - first of all - a simple engine with a simple ability to create your own checks. Creating your own checks in okerr is not a task for hackers, co-developers of the system, or at least advanced users of okerr, but a feasible task for any admin who installed linux for the first time a month ago.

Minimal checks are done through the module runstatus:

This line in the runstatus config will notify you if /bin/true suddenly fails to start or returns a non-zero exit code:

true_OK=/bin/true

Just one line, and we have already extended okerr's functionality a little.

Even such a check already has its value: if your server suddenly goes down, the corresponding indicator on the okerr server will not be updated in a timely manner, and after the time has elapsed, an alert will appear.

This check will notify that the apache2 server has crashed (well, you never know ...):

apache_OK="systemctl is-active --quiet apache2"

So if you know any programming language, or can at least write shell scripts, you can already add your own checks.

More difficult - you can write (in any language) your own module for okerrmod. In the simplest case, it looks like this:

#!/usr/bin/python3

print("STATUS: OK")

Not very difficult, is it? The module should perform the check itself and print the results to STDOUT. A more complex module prints, for example, this:

$ okerrmod --dump df
NAME: pi:df-/
TAGS: df
METHOD: numerical|maxlim=90
DETAILS: 49.52%, 13.9G/28.2G used, 13.0G free
STATUS: 49.52

NAME: pi:df-/boot
TAGS: df
METHOD: numerical|maxlim=90
DETAILS: 84.32%, 53.1M/62.9M used, 9.9M free
STATUS: 84.32

It updates several indicators at once (separated by an empty line), creates them if necessary, specifies the details of the check and a tag by which it is easy to find the necessary indicators in the dashboard.
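A module producing output in the format shown above could look roughly like this. It is a sketch, not the real df module: the field names are copied from the dump above, while the function name, the DETAILS wording and the way the percentage is calculated are assumptions.

```python
#!/usr/bin/python3
import shutil
import socket


def df_report(path, host, maxlim=90):
    """Build one indicator record for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100
    return "\n".join([
        "NAME: {}:df-{}".format(host, path),
        "TAGS: df",
        "METHOD: numerical|maxlim={}".format(maxlim),
        "DETAILS: {:.2f}% used, {} bytes free".format(percent, usage.free),
        "STATUS: {:.2f}".format(percent),
    ])


if __name__ == "__main__":
    # Records for several indicators would be separated by an empty line.
    print(df_report("/", socket.gethostname()))
```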

Telegram

There is a Telegram bot, @OkerrBot. You do not need to litter your phone with separate applications (I myself do not like it that Pyaterochka needs one application with a card, Lenta another, MTS a third, and so on for every company). Telegram alone is enough. Through Telegram you can receive alerts immediately, check the status of the project and give a command to recheck all problematic indicators. You come out of a theater or off a plane, having not kept your finger on the pulse for two hours, turn on your phone, press one button in the chat bot and make sure that everything is in order.

Status Pages

Nowadays, status pages are almost a must-have for any business that has IT, is responsible for reliability, and treats its customers/users with respect.

Imagine the situation: a user wants to do something, view information or place an order, and something does not work. He does not know what is wrong, on whose side the problem is, or when it will be solved. Maybe your company simply has a broken website? Maybe it broke six months ago and will be fixed in two years? But he needs to buy a refrigerator right now, it is already in the basket ... It is a completely different matter when the person sees that something is wrong on your side (at least it is clear that the problem is not on his side), that the problem has been noticed, that you are already working on it, and maybe even that you have posted an estimated fix time. The user can subscribe and receive a notification by mail when the problem is fixed and he can do what he wanted (buy the refrigerator).

Problems, downtime - everyone has them. But users and partners trust more those who are more transparent and responsible in this.

Here is an overview of 10 other projects that let you build status pages. Examples of what such pages look like: the status pages of Python and Dropbox, and the okerr status page.

Failover

To keep this article from getting even longer, I will once again refer to my previous article, "Simple website failover". If you can set up a duplicate server, then with failover you will, in principle, never have long downtime: as soon as a problem is detected, users are automatically redirected to a working backup server. It seems to me that this is a very interesting, bright feature that is not available anywhere else.

Low system requirements

For okerr servers we use machines with 2 GB of RAM or more. For network sensors even 512 MB is enough. The client part needs almost nothing at all (the okerrupdate package weighs 26 KB, but requires Python 3 and its standard library). The client is run from a cron script, so it has zero persistent memory consumption. Among the monitored machines we have sensors (super-cheap VPSes with 512 MB of RAM) and a Raspberry Pi. You can even send updates without the client side, via curl! (see below)

With this in mind, okerr is probably the most free of the available monitoring systems: even to use another free open-source system such as Zabbix or Nagios you need to allocate resources (a server) to it, and that already costs money. Some server maintenance is also required. With okerr this part can be dropped. Or you can keep it and run your own server, whichever you prefer.

API and integration into own software

The architecture is simple and open. Okerr has a pretty simple API which is easy to work with. Need to create 1000 indicators? One shell script of 3-4 lines will do it. Need to reconfigure 1000 indicators? That is also very easy. For example, suppose we want to re-check all our HTTPS certificates from a Russian sensor:

#!/bin/sh

# Re-test every sslcert indicator from the Russian sensor
for indicator in $(okerrclient --api-filter sslcert)
do
    echo "set location for $indicator"
    okerrclient --api-set location=ru retest=1 --name "$indicator"
done

You can update an indicator either with our client module or without it, just through curl.

# short and nice (using okerrupdate and config file)
$ okerrupdate MyIndicator OK

# only curl is enough!
$ curl -d 'textid=MyProject&name=MyIndicator&secret=MySecret&status=OK' https://bravo.okerr.com/

You can update indicators directly from your own program, for example by sending heartbeat signals so that okerr knows the program is running and raises an alarm if it crashes or freezes. By the way, okerr components do just that: okerr monitors itself, and problems in almost any module will be detected and generate an alert. (And to cover that "almost", the components are cross-checked from another server.)

Here is the code (simplified) in our telegram bot:

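import os

from okerrupdate import OkerrProject, OkerrExc

# hostname, uptime, commands_cnt and dhms() are defined elsewhere in the bot
op = OkerrProject()
uptimei = op.indicator("{}:telebot_uptime".format(hostname))
...
# heartbeat: report OK together with a few details
uptimei.update('OK', 'pid: {} Uptime: {} cmds: {}'.format(
        os.getpid(), dhms(uptime), commands_cnt))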

To update indicators from Python programs there is the okerrupdate library. For other languages there are no libraries, but you can either call the okerrupdate script or make an HTTP request to the okerr server.
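For completeness, here is what the plain HTTP route might look like without any okerr library, using only the Python standard library. The form fields and the URL are copied from the curl example above; the function names are made up, and whether the server expects anything beyond these four fields is not covered here, so treat it as a sketch.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_request(url, textid, name, secret, status):
    """Prepare the same form-encoded POST that the curl example sends."""
    body = urlencode({
        "textid": textid,   # project text id
        "name": name,       # indicator name
        "secret": secret,   # project secret
        "status": status,   # for example "OK"
    }).encode("ascii")
    return Request(url, data=body, method="POST")


def send_update(url, textid, name, secret, status, timeout=10):
    """Send the update and return the HTTP status code of the response."""
    request = build_update_request(url, textid, name, secret, status)
    with urlopen(request, timeout=timeout) as response:
        return response.status


request = build_update_request("https://bravo.okerr.com/",
                               "MyProject", "MyIndicator", "MySecret", "OK")
print(request.get_method(), request.data.decode("ascii"))
```

Building the request separately from sending it makes the code easy to check without touching the network; send_update is the only part that actually contacts the server.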

How okerr helps us

Okerr has changed our lives. Really. Perhaps another monitoring system could have done that too, but working with okerr is easy and simple for us, and it has all the functions we need (whatever was missing, we added). By the way, if a feature you need is missing, ask and I will add it (I do not promise, but I want okerr to be the best monitoring system for small and medium projects). Or better yet, add it yourself: it is easy.

We manage to live by the principle "learn about all problems from okerr". If a problem ever comes up that we did not learn about from okerr, we add a check to okerr (here "we" means us as users of the system, not as its developers). At first this happened often, but now it has become very rare.

Monitoring

Through okerr we keep track of log sizes on all servers. Attentively reading every line of every log is, of course, impossible, but just monitoring the growth rate already gives a lot. This is how we detected spamming, password brute-forcing, and applications that "go crazy": something does not work out for them and they retry again and again (each time adding a couple of lines to the log).

SSL certificates. Almost immediately after the launch of Let's Encrypt, our customer started providing free SSL certificates to his clients (about a thousand of them). And it turned out to be hell to administer! The sites are "live": clients periodically ask for changes, and programmers make them. They can quite freely move a site to another DocumentRoot, for example, or add an unconditional Rewrite to the virtual host config. Naturally, after that the automatic renewal of certificates breaks. Now all our SSL hosts are added to okerr automatically by another of our useful utilities, from the a2conf package. Just run a2okerr.py, and if several new sites have appeared on the server, they will automatically appear in okerr. If for some reason a certificate is not renewed, we know about it three weeks before it expires and can work out why it is not being renewed. a2certbot.py from the same package helps a lot here (it immediately checks the most likely problems and reports what passed and where the problem most likely is).

We monitor the expiration of all our domains. All our mail servers that send mail are also checked against 50+ different blacklists (and sometimes they do end up in them). By the way, did you know that Google mail servers get blacklisted too? Just for self-testing we added mail-wr1-f54.google.com to the monitored servers, and it is on the SORBS blacklist! (So much for the value of such "anti-spam" lists.)

Backups: I already wrote above how easy it is to keep track of them with okerr. We monitor fresh backups on our server and also (with a separate utility that uses okerr) the backups that we upload to Amazon Glacier. And yes, problems do happen from time to time, so it was worth monitoring them.

We use an escalation indicator. It shows when a problem has not been fixed for a long time. When I solve problems myself, I can sometimes forget about some of them, so escalation is a good reminder even if you are only supervising yourself.

In general, I believe that the quality of our work has improved by an order of magnitude. There is almost no downtime (well, or the client does not have time to notice it. Only shhh!), while the amount of work has decreased and the working conditions are calmer. We have moved from emergency work, patching holes with adhesive tape, to calm and measured work, where many problems are predicted in advance and there is time to prevent them. Even the problems that do happen are easier to fix: firstly, we find out about them before customers panic, and secondly, the problem is often related to recent work (while doing one thing, we broke another), so it is easier to track down while the trail is still hot.

And there was another case...

Did you know that in the popular Debian 9 (Stretch), such a popular package as phpmyadmin has remained vulnerable for many months (CVE-2019-6798)? When the vulnerability was published, we quickly closed it in various ways. But I also set okerr to track the security tracker page (via the SHA1 sum of its content) in order to know when a "proper" fix is released. Several times the indicator alerted me that the page had changed, but so far (since January 2019!) the page has not shown the problem as solved. By the way, perhaps someone knows why such an important package has stayed vulnerable for more than a year?

Another time, in a similar situation: after a vulnerability in SSH it was necessary to update all servers. And when you assign a task, you need to control its execution (subordinates tend to misunderstand, forget, get confused, make mistakes). So we first added an SSH version check for all servers to okerr, and through okerr we made sure that the updates were rolled out everywhere. (Convenient! You select this type of indicator and immediately see which server has which version.) When we were sure the task was completed on all servers, we removed the indicators.

A couple of times we had a problem that arises and then goes away by itself (probably everyone knows the kind). By the time you notice and start checking, there is nothing left to check: everything works fine again. And then it breaks again. This happened to us, for example, with products that we uploaded to the Amazon Marketplace (MWS). At some point the uploaded inventory was incorrect (wrong quantities and wrong prices). We figured it out, but to do so it was important to learn about the problem right away. Unfortunately MWS, like all Amazon services, is a little slow, so there was always a lag, but we still managed to at least roughly trace the connection between the problem and the scripts that caused it (we wrote a check, hooked it up to okerr, and investigated immediately on receiving an alert).

An interesting case was recently added to our collection by a large and expensive European hosting provider used by our customer. All of a sudden, ALL of our servers disappeared from the radar! First the customer himself noticed "by hand" (faster than okerr!) that the site he was working with did not open, and filed a ticket about it. But it was not one site that went down, it was everything! (Natasha, we dropped everything!) Then okerr began to send long lists of all the indicators that had lit up. Panic, panic, running in circles (what else is there to do?). Then everything came back up. It turned out that the data center was carrying out scheduled maintenance (once in many years) and, of course, we should have been warned. But something went wrong on their side and they did not warn us. Well, one heart attack more, one less. But after everything is restored, you need to double-check everything! I cannot imagine how I would do that by hand. Okerr tested everything in a few minutes. It turned out that most of the servers had simply been temporarily unreachable but kept working. Some were overloaded, but also came back up properly. Our only losses were two backups which, according to cron, should have been created and uploaded while all this mess was going on. I did not even create them by hand: a day later alerts arrived that everything was OK, the backups had appeared. I really like this example, because okerr turned out to be very useful in a situation that we had not even thought about in advance, and that is exactly the task of monitoring: to withstand the unpredictable.

For okerr sensors we use the cheapest possible hosting (where quality and reliability do not matter, because the sensors back each other up). Recently we found a very fast and super-cheap hosting provider with stunning benchmarks. But ... sometimes outgoing connections from the virtual machine turn out to be made from another (neighboring) IP. Miracles. The client_ip module, which queries https://diagnostic.opendns.com/myip, gets the wrong IP, and the server-side log of the indicator shows that the update also came from this neighboring IP. We are now sorting it out with their support. It is good that we noticed this in peacetime. Access is often granted by an IP whitelist, and if a server occasionally flips its address like this for a short time, you can spend a very long time trying to catch the problem.

One more thing, since we are talking about VPS hosting: we always use inexpensive providers (hetzner, ovh, scaleway), and in terms of benchmarks and stability I really like them. We also use the much more expensive Amazon EC2 for other projects. So, thanks to okerr, we have an informed opinion. Both go down. And I would not say that over our long period of observation cheap hosting like hetzner turned out to be noticeably less stable than EC2. So if you are not tied to other Amazon features, why pay more? 🙂

What's next?

If I have not scared you away from okerr by now, then try it! You can go straight to the okerr demo account via the link (click now!). But keep in mind that there is only one demo account for everyone, so while you are doing something, someone else in the same account may interfere with you. Or (better) register via the link on the official okerr site: everything is simple, no SMS required. If you do not want to use your real email, you can use a disposable one, like mailinator (I recommend getnada.com). Such accounts may be deleted over time, but for a test they will do.

After registration you will be offered a training (several not very difficult tasks). The initial limits are very small, but they are enough for the training or for one server. After you complete the training, the limits (for example, the maximum number of indicators) will be increased.

As for documentation, start with the wiki for the server side and for the client (the okerrupdate wiki). If something is not clear, write to support (at) okerr.com or open a ticket, and we will try to resolve everything quickly.

If you use okerr seriously and these increased limits are not enough, write to support as well, and we will raise them (for free).

Would you like to install the okerr server on your own machine? It is in the okerr-dev repository. We recommend installing on a clean virtual machine; then the installation script makes it easy. On your own virtual machine there are no restrictions :-). And again, if anything goes wrong, we will always try to help.

We want this project to take off, so that the world becomes more reliable thanks to us. Thanks to free software and services, the world has become friendlier and develops more dynamically. Source code can be stored on free GitHub, and for mail you can use free Gmail. We use free Freshworks for support. You do not need to pay for servers for any of this, and you do not need to download, configure and troubleshoot anything. Every new project and every team immediately has mail, repositories and a CRM, all of it high quality, free of charge and available at once. We want it to be the same for monitoring: small companies and projects could use okerr for free and, even at the stage of birth and growth, have the reliability of serious grown-up projects.

Source: habr.com