I made a post two years ago
Who might be interested
This may interest you if you work in a small team or alone. Perhaps you have no monitoring and are not sure you really need it. Or you tried some popular, serious monitoring system “for the big boys”, but it somehow never took off for you, or it runs in a near-default configuration and has not changed your life much. It also applies if you definitely do not plan to dedicate an entire employee (let alone a department) to configuring monitoring or watching a dashboard for a couple of hours a day.
What is unusual about okerr
Below I describe the features that distinguish okerr from other monitoring systems.
Okerr is hybrid monitoring
With internal monitoring, an "agent" runs on the monitored machines and sends data (for example, free disk space) to the monitoring server. With external monitoring, the server performs checks over the network (for example, ping or website availability). Each approach has its own limitations, so okerr uses both. Checks inside servers are performed by a very light (30 KB) agent or by your own scripts and applications, and network checks are run from okerr sensors in different countries.
okerr is not just software, but also a service
The server side of any monitoring system is big and complex: it is hard to install and configure, and it needs resources. With okerr you can install your own monitoring server (it is free and open source), or you can install only the client side and use our server as a service. That is free too.
If monitoring compensates for the unreliability of servers and applications, a philosophical question arises: who guards the guard? How will monitoring tell us about a problem if it has "died" itself, on its own or together with your other resources (for example, the link to the data center went down)? With the external okerr service this problem is solved: you will receive an alert even if the entire data center with your servers loses power or is attacked by zombies.
Of course, there is a risk that the okerr server itself will be unavailable. That is true (as you know, 90% reliability comes easily and “for free”, 99% with minimal effort, and each further nine is exponentially harder). But, first, the chances of this are lower, and second, a problem can go unnoticed only if it coincides in time with problems on our servers. If we have 99.9% reliability and you have 99.9% (not very high numbers), and the failures are independent, the chance of an unnoticed failure is 0.1% of 0.1% = 0.0001%. Gaining three extra nines of reliability with almost no effort and at no cost is very good!
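The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes that outages of the monitoring service and of your servers are statistically independent, and the function name is made up for illustration.

```python
# Assumption: the two outages are independent events.
monitoring_availability = 0.999  # okerr service, figure from the text
your_availability = 0.999        # your own servers, figure from the text


def unnoticed_failure_probability(a_monitor, a_yours):
    """Probability that your servers and the monitoring are down at the same time."""
    return (1 - a_monitor) * (1 - a_yours)


p = unnoticed_failure_probability(monitoring_availability, your_availability)
print("{:.4f}%".format(p * 100))  # prints 0.0001%
```

If the failures are correlated (for example, both live in the same data center), the real figure will be worse, which is the argument for using an external service.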
Another advantage of monitoring as a service is that a hosting provider or web studio can install an okerr server and provide access to customers as a paid or free additional service. Your competitors just have hosting and websites - and you have reliable hosting with monitoring.
Okerr is about indicators
An indicator is a light bulb. It has two main states: green (OK) or red (ERR). A project has many indicators, grouped (for example, by server). On the project's main page you see at once that either everything is green (and you can close the page) or something is red and needs fixing. A notification is sent on every transition between these states. Once a day, at a time you configure, a summary of the project is sent.
Each okerr indicator has built-in conditions by which it changes state (in Zabbix this is called a trigger). For example, load average should be no more than 2 (of course, this is configurable). And for each internal check (load average, disk free, ...) there is a watchdog. If for some reason we did not receive a successful confirmation at the appointed time, an error is logged and an alert is sent.
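To make the idea concrete, here is a toy model of an indicator with a threshold and a watchdog. It is not okerr's code: the class, its fields (maxlim, period) and the indicator name are assumptions made only for this illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Indicator:
    name: str
    maxlim: float   # a value above this limit means ERR
    period: float   # expected number of seconds between updates
    value: float = 0.0
    updated: float = field(default_factory=time.time)

    def update(self, value, now=None):
        """Store a fresh value and remember when it arrived."""
        self.value = value
        self.updated = time.time() if now is None else now

    def status(self, now=None):
        now = time.time() if now is None else now
        if now - self.updated > self.period:
            return "ERR"  # watchdog: the update is overdue
        return "OK" if self.value <= self.maxlim else "ERR"


la = Indicator("srv1:loadavg", maxlim=2, period=600)
la.update(0.7, now=1000)
print(la.status(now=1100))  # OK: fresh value, below the limit
print(la.status(now=2000))  # ERR: no update for more than 600 seconds
```

The point of the watchdog branch is that silence is treated as a failure: a crashed server cannot report that it has crashed.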
Our usual routine is to check mail in the morning and, among other messages, look at the summary (we schedule it for the start of the working day). If everything in it is OK, we get on with other important things (although, to be sure, we can glance at the okerr dashboard and confirm that everything is green right now). If an alert arrives, we react.
Of course, you can simply keep "informational" indicators (to see a picture of the network from the monitoring side), but everything is designed so that indicators for automatic monitoring and alerting are simple, easy and quick to create.
The whole point of setting up okerr is alerts: you create an indicator in a minute, it can “sleep” for a year, just accepting updates, and when something breaks a year later, it lights up and sends an alert. The minute you once spent creating the indicator has paid off: you learned about the problem right away, before anyone else. It may have been fixed before anyone noticed. What is picked up quickly does not count as fallen!
Security
It would be a shame if you set up monitoring for the sake of increasing reliability, and as a result, you are attacked through it over the network, and there are quite a lot of network vulnerabilities in different monitoring tools (
Agent (okerrmod from package
Full monitoring coverage
Our rule now is that we learn about all technical problems from okerr. If the rule is ever violated (okerr did not warn that a problem was coming, where that was possible, or that it had already happened), we add a check to okerr.
External checks
Pretty typical set:
- ping
- http-status
- checking the validity and freshness of the SSL certificate (will warn you if the term is about to expire)
- open TCP port and a banner on it
- http grep (page [should not] contain specific text)
- sha1 hash to catch the page change.
- DNS (DNS entry must have a specific value)
- WHOIS (will warn you if the domain registration is about to expire)
- Antispam DNSBL (host check against 50+ anti-spam blacklists at once)
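Two of the checks in this list, the sha1 hash and the http grep, are easy to picture in code. The sketch below works on a page body that has already been downloaded; the function names are invented for illustration and this is not how okerr sensors are implemented.

```python
import hashlib


def page_changed(body, expected_sha1):
    """True if the page no longer matches the stored sha1 hash."""
    return hashlib.sha1(body).hexdigest() != expected_sha1


def grep_ok(body, text, must_contain=True):
    """Check that the page does (or does not) contain the given text."""
    found = text.encode() in body
    return found == must_contain


body = b"<html><body>Welcome to our shop</body></html>"
reference = hashlib.sha1(body).hexdigest()

print(page_changed(body, reference))                      # False
print(grep_ok(body, "Welcome"))                           # True
print(grep_ok(body, "Fatal error", must_contain=False))   # True
```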
Internal checks
Also, a fairly typical set (but easily extensible).
- df (free disk space)
- load average
- opentcp (open listening TCP sockets - will notify if something starts or crashes)
- uptime - server uptime. Will notify you if it has decreased (i.e. the server has rebooted)
- client_ip
- dirsize - we use it to track when our rootfs of virtual machines go beyond the allowed size without introducing hard limits, and for the sizes of user home directories
- empty and nonempty - keep track of files that should be empty (or non-empty). For example, the error log of the okerr server itself should be empty; if even one line appears in it, I receive a notification and check it. But mail.log on the mail server must NOT be empty (N minutes after rotation). It has sometimes been empty for us after a system update, when logrotate could not restart rsyslog properly.
- linecount - the number of lines in a file (like wc -l). We use it as a softer replacement for empty, when the error log is allowed to grow, but only slowly (for example, we have googlebot hammering on some closed pages). The limit is 2 lines in 20 minutes. If growth is faster, there will be an alert.
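The linecount idea (a log may grow, but only slowly) can be sketched as follows. The helper names and the way the previous count is passed in are assumptions for this example; the real module may work differently.

```python
def count_lines(path):
    """Count lines in a file, like wc -l."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def growth_ok(previous_count, current_count, max_new_lines=2):
    """OK if the file grew by no more than max_new_lines since the last check.

    A smaller count (for example after log rotation) is also treated as OK.
    """
    return current_count - previous_count <= max_new_lines


# Example: the check runs every 20 minutes and remembers the last count.
print(growth_ok(previous_count=40, current_count=41))  # True
print(growth_ok(previous_count=40, current_count=55))  # False
```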
Interesting internal checks
If you have been skimming up to this point, the next part deserves a closer read.
Backups
Keeps track of backups in a directory. Our backup files have names like "ServerName-20200530.tar.gz". For each server, an indicator named ServerName-DATE.tar.gz is created in okerr (the actual date is replaced by the string "DATE"). Both the presence of a fresh backup and its size are monitored (for example, it must not be smaller than 90% of the previous backup).
What needs to be done to start tracking a new backup after we start creating it and putting it in this directory? Nothing! This is a very handy approach when you need to do "nothing" because:
- Doing “nothing” is pretty quick, it saves time
- It's hard to forget to do "nothing"
- It's hard to do "nothing" wrong, with a mistake. Nothing is the most reliable method
If fresh backup files suddenly stop appearing, there will be an alert. If, for example, you have disabled one of the servers, and there should be no more backups, you will need to remove the indicator (via the web interface or from the shell via the API).
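The naming trick described above is what makes "doing nothing" possible: every new file maps to an indicator name automatically. A rough sketch, assuming that dates in file names are always eight digits (the helper names are hypothetical):

```python
import re

# Assumption: the date in a backup file name is 8 digits, e.g. 20200530.
DATE_RE = re.compile(r"\d{8}")


def indicator_name(filename):
    """Replace the date in a backup file name with the literal string DATE."""
    return DATE_RE.sub("DATE", filename, count=1)


def backup_size_ok(new_size, previous_size, min_ratio=0.9):
    """A fresh backup must not be smaller than 90% of the previous one."""
    return new_size >= previous_size * min_ratio


print(indicator_name("ServerName-20200530.tar.gz"))  # ServerName-DATE.tar.gz
print(backup_size_ok(new_size=950, previous_size=1000))  # True
print(backup_size_ok(new_size=100, previous_size=1000))  # False
```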
maxfilesz
Keeps track of the size of the largest files (usually in /var/log/*). This lets you catch unpredictable problems, such as password brute-forcing or spam being sent through the server.
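As an illustration only (this is not the module's source, and the function names and the limit are invented), the core of such a check could look like this:

```python
import glob
import os


def largest_file(pattern):
    """Return (path, size) of the largest regular file matching the glob pattern."""
    files = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    if not files:
        return None
    biggest = max(files, key=os.path.getsize)
    return biggest, os.path.getsize(biggest)


def size_status(size, maxlim):
    """OK while the largest file stays within the limit."""
    return "OK" if size <= maxlim else "ERR"


result = largest_file("/var/log/*")
if result is not None:
    path, size = result
    # Hypothetical limit of 100 MB
    print(path, size, size_status(size, maxlim=100 * 1024 * 1024))
```

Watching only the single largest file is a cheap proxy: whichever log suddenly explodes, it will become the largest one sooner or later.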
runstatus/runline
These are two important proxy modules for running other programs on the server. Runstatus reports the program's exit code to an indicator. For example, okerr does not have (and does not need) a dedicated module to check that systemd services are running; this is done via runstatus (see below). Runline reports to the server the line that the program prints. For example, the line
temp_RUN="cat /sys/class/thermal/thermal_zone0/temp"
in the Runline config on our server creates an indicator servername:temp with the processor temperature.
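The two proxy ideas are simple enough to sketch in a few lines of Python. These functions only mimic the behavior described above (exit code to status, first output line to value); they are not the okerrmod implementation, and the names are reused here purely for readability.

```python
import subprocess
import sys


def runstatus(command):
    """Run a command and map its exit code to an indicator status."""
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return "OK" if result.returncode == 0 else "ERR"


def runline(command):
    """Run a command and return the first line it prints."""
    result = subprocess.run(command, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0].strip() if lines else ""


# The Python interpreter itself is used as the test program to keep this portable.
print(runstatus([sys.executable, "-c", "raise SystemExit(0)"]))  # OK
print(runstatus([sys.executable, "-c", "raise SystemExit(3)"]))  # ERR
print(runline([sys.executable, "-c", "print(42000)"]))           # 42000
```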
sql
Runs a MySQL query that returns a number and reports the result to an indicator. In the simplest case you can run, for example, "SELECT 1", which checks that the DBMS is working at all.
But a much more interesting application is, for example, tracking the number of orders in an online store. If you know that you have at least 100 orders per hour, you can set the minimum limit to 100 or 80. Then if your sales suddenly drop, you will receive an alert and you can figure it out.
Note - no matter for what unpredictable reason this happened:
- The server is simply unavailable (without power or network), and the alert came because the indicator went stale.
- The server is loaded with something, it works slowly or packets are lost, users are uncomfortable and they leave without purchases
- The server was included in the spam lists and mail from it is not accepted, users cannot register
- The advertising campaign budget has run out and the banners are no longer shown.
There can be any number of causes; you cannot foresee them all in advance, and tracking each one is technically difficult. But you can conveniently monitor the final parameter (orders) and tell from it that the situation is suspicious and worth investigating.
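Here is a small self-contained sketch of the orders check. To keep it runnable without a database server it uses Python's built-in sqlite3 instead of MySQL, and the table name, column names and the limit of 80 are assumptions made up for the example.

```python
import sqlite3


def orders_last_hour(conn, now):
    """Count orders created during the last 3600 seconds."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE created_at >= ?", (now - 3600,)
    ).fetchone()
    return row[0]


def orders_status(count, minlim=80):
    """ERR if sales dropped below the expected minimum."""
    return "OK" if count >= minlim else "ERR"


# Demo data: an in-memory database with 100 orders, one every 30 seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at INTEGER)")
now = 1_000_000
conn.executemany(
    "INSERT INTO orders (created_at) VALUES (?)",
    [(now - i * 30,) for i in range(100)],
)

count = orders_last_hour(conn, now)
print(count, orders_status(count))  # 100 OK
```

The check knows nothing about why orders stopped; it only notices that they did, which is exactly the property the list above relies on.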
Logic indicators
Allows you to use boolean expressions (Python syntax) via a module
You can create two indicators, day and night. Make both "quiet" (they will not send alerts). And create a logical indicator that requires before 20:00 for the day indicator to be OK, and after 20:00 it is enough for the night indicator to be OK.
Another use of a logical indicator is escalation. For example, a project manager unsubscribes from alerts (he does not need them, admins should respond to ordinary problems) but subscribes to a logical indicator that turns red if any indicator in the project is not fixed within the allotted time.
You can also define an allowed maintenance window, for example from 3 to 5 in the morning. We do not care if servers and sites are down during that time, but at 5:00 they have to work. If they do not work at any other time, there is an alert. The logical indicator also lets you take server redundancy into account. If you have 5 web servers, admins can shut down 1-2 of them at any time. But if fewer than 3 of the 5 servers are in service, there will be an alert.
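The rules in these examples are ordinary boolean logic. The sketch below writes them as plain Python functions to show how little is involved; it does not use the actual syntax or variable names of okerr's logic module, which are not shown here, and all function names are hypothetical.

```python
def day_night_ok(hour, day_ok, night_ok):
    """Before 20:00 the day indicator must be OK, after that the night one."""
    return day_ok if hour < 20 else night_ok


def maintenance_ok(hour, site_ok):
    """Downtime is tolerated only between 3:00 and 5:00."""
    return site_ok or 3 <= hour < 5


def redundancy_ok(server_states, min_alive=3):
    """At least min_alive servers (out of, say, 5) must be up."""
    return sum(server_states) >= min_alive


def escalation_ok(err_since, now, max_age):
    """err_since: times at which the currently red indicators turned red."""
    return all(now - t <= max_age for t in err_since)


print(day_night_ok(21, day_ok=False, night_ok=True))      # True
print(maintenance_ok(4, site_ok=False))                   # True
print(redundancy_ok([True, True, False, False, False]))   # False
print(escalation_ok([900], now=1000, max_age=3600))       # True
```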
The examples above are not built-in okerr functions, not features that need to be activated and configured. okerr has none of these functions; it has a logical module that lets you implement them (much as in a programming language: since we have arithmetic operators, the language does not need a special function for calculating 20% VAT, you can always write it yourself to suit your needs).
The logical indicator is probably one of the few relatively complex topics in okerr, but the good news is that you do not have to master it until you need it. At the same time, it greatly expands what is possible while keeping the system itself quite simple.
Adding your checks
I would really like to get across that okerr is not a set of thousands of ready-made checks for all occasions but, first of all, a simple engine with a simple way to create your own checks. Creating your own checks in okerr is not a task for hackers, co-developers of the system, or even advanced okerr users; it is feasible for any admin who installed Linux for the first time a month ago.
Minimal checks are done through the module
This line in the config
true_OK=/bin/true
Just one line, and we have already extended okerr's functionality a little.
Even such a check already has its value: if your server suddenly goes down, the corresponding indicator on the okerr server will not be updated in a timely manner, and after the time has elapsed, an alert will appear.
This check will notify that the apache2 server has crashed (well, you never know ...):
apache_OK="systemctl is-active --quiet apache2"
So if you know any programming language, or can at least write shell scripts, you can already add your own checks.
More difficult - you can write (in any language) your own module for okerrmod. In the simplest case, it looks like this:
#!/usr/bin/python3
print("STATUS: OK")
Not very difficult, is it? The module performs the check itself and prints the results to STDOUT. A more complex module outputs, for example, this:
$ okerrmod --dump df
NAME: pi:df-/
TAGS: df
METHOD: numerical|maxlim=90
DETAILS: 49.52%, 13.9G/28.2G used, 13.0G free
STATUS: 49.52
NAME: pi:df-/boot
TAGS: df
METHOD: numerical|maxlim=90
DETAILS: 84.32%, 53.1M/62.9M used, 9.9M free
STATUS: 84.32
It updates several indicators at once (separated by an empty line), creates them if necessary, specifies the details of the check and a tag by which it is easy to find the necessary indicators in the dashboard.
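A module producing output in this shape could be sketched as follows. The field names (NAME, TAGS, METHOD, DETAILS, STATUS) and the blank-line separator come from the dump above; everything else, including the function name and the simplified percentage calculation, is an assumption and will not match the real df module exactly.

```python
#!/usr/bin/python3
import shutil
import socket


def df_report(path, host, maxlim=90):
    """Build one indicator block for the given mount point."""
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total
    return "\n".join([
        "NAME: {}:df-{}".format(host, path),
        "TAGS: df",
        "METHOD: numerical|maxlim={}".format(maxlim),
        "DETAILS: {:.2f}% used".format(percent),
        "STATUS: {:.2f}".format(percent),
    ])


if __name__ == "__main__":
    host = socket.gethostname()
    # One block per indicator, blocks separated by an empty line
    print("\n\n".join(df_report(p, host) for p in ["/"]))
```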
Telegram
There is a Telegram bot
Status Pages
Nowadays, status pages are almost a must-have for any business that has IT, is responsible for reliability, and treats its customers/users with respect.
Imagine the situation: a user wants to do something, view information or place an order, and something does not work. He does not know what is wrong, whose side the problem is on, or when it will be solved. Maybe your company's website simply never works? Or it broke six months ago and will be fixed in two years? But he needs to buy a refrigerator right now, it is already in the basket... It is a completely different matter when the person sees that something is wrong on your side (so at least it is clear the problem is not on his), that the problem has been detected, that you are already working on it, and maybe even an estimated fix time. The user can subscribe and receive an email notification when the problem is fixed and he can do what he wanted (buy the refrigerator).
Everyone has problems and downtime. But users and partners place more trust in those who are transparent and responsible about them.
Here
Failover
In order not to make this article even longer, I will once again refer to my previous article −
Low system requirements
For okerr servers we use machines with 2 GB of RAM or more. For network sensors even 512 MB is enough. The client side needs almost nothing. (Package
With this in mind, okerr is probably the "most free" of the available monitoring systems, because even to use another free open-source system such as Zabbix or Nagios you need to allocate resources (a server) to it, and that already costs money. Some server maintenance is also required. With okerr that part can be dropped. Or you can keep it and run your own server, whichever you prefer.
API and integration into own software
Simple and open architecture. okerr has a pretty simple one
#!/bin/sh
# Set the location and request a retest for every indicator returned by the sslcert filter
for indicator in $(okerrclient --api-filter sslcert)
do
    echo "set location for $indicator"
    okerrclient --api-set location=ru retest=1 --name "$indicator"
done
You can update an indicator either with our client module or without it, using just curl.
# short and nice (using okerrupdate and config file)
$ okerrupdate MyIndicator OK
# only curl is enough!
$ curl -d 'textid=MyProject&name=MyIndicator&secret=MySecret&status=OK' https://bravo.okerr.com/
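The same request can be built from Python's standard library. This sketch reuses the URL and the form fields from the curl line above (textid, name, secret, status); the helper name is hypothetical, and the actual network call is left commented out so the example runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_request(url, textid, name, secret, status):
    """Build the same form-encoded POST that the curl example sends."""
    data = urlencode({
        "textid": textid,
        "name": name,
        "secret": secret,
        "status": status,
    }).encode("ascii")
    # Passing data makes urllib send a POST request
    return Request(url, data=data)


req = build_update_request(
    "https://bravo.okerr.com/", "MyProject", "MyIndicator", "MySecret", "OK"
)
print(req.get_method(), req.data.decode())
# To actually send it:
# with urlopen(req, timeout=10) as response:
#     print(response.status)
```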
You can update indicators directly from your own program, for example by sending heartbeat signals so that okerr knows the program is running and raises an alarm if it crashes or freezes. By the way, okerr components do just that: okerr monitors itself, and a problem in almost any module will be detected and will generate an alert. (And to cover that "almost", the components are cross-checked from another server.)
Here is the code (simplified) in our telegram bot:
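import os
from okerrupdate import OkerrProject, OkerrExc

# hostname, uptime, commands_cnt and dhms() are defined elsewhere in the bot
op = OkerrProject()
uptimei = op.indicator("{}:telebot_uptime".format(hostname))
...
# heartbeat: status OK plus a details string
uptimei.update('OK', 'pid: {} Uptime: {} cmds: {}'.format(
    os.getpid(), dhms(uptime), commands_cnt))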
To update indicators from Python programs - there is a library
How okerr helps us
Okerr has changed our lives. Really. Perhaps another monitoring system could have done so too, but okerr is easy and simple for us to work with, and it has all the functions we need (whatever was missing, we added). By the way, if a feature you need is missing, ask and I will add it (I do not promise, but I want okerr to be the best monitoring system for small and medium projects). Or better yet, add it yourself - it's easy.
We have managed to live by the principle “learn about all problems from okerr”. If a problem occurs that we did not learn about from okerr, we add a check to okerr (here "we" means us as users of the system, not as its developers). At first this happened often; now it is very rare.
Monitoring
Through okerr we keep track of log sizes on all servers. Reading every line of a log attentively is, of course, impossible, but simply monitoring the growth rate already tells you a lot. This is how we have detected spamming, password brute-forcing, and applications that “go crazy”: something fails and they retry again and again (adding a couple of lines to the log each time).
SSL certificates. Almost immediately after launch a2okerr.py
- and if several new sites appear on the server, they automatically appear in okerr. If for some reason a certificate is not renewed, we know about it three weeks before it expires and can work out why the stubborn thing is not renewing. a2certbot.py
from the same package helps a lot with this (it immediately checks the most likely problems and reports what passed and where the problem most likely is).
We monitor the expiration of all our domains. All our servers that send mail are also checked against 50+ different blacklists (and sometimes they do end up on them). By the way, did you know that Google mail servers get blacklisted too? Just as a self-test, we added mail-wr1-f54.google.com to the monitored servers, and it is on the SORBS blacklist! (This says something about the value of the "anti-spammers".)
Backups: I already described above how easy it is to keep track of them with okerr. We monitor fresh backups on our own server and also (using a separate utility that uses okerr) the backups we upload to Amazon Glacier. And yes, problems do happen from time to time, so the monitoring has not been in vain.
We use the escalation indicator. It shows when a problem has gone unfixed for a long time. Even I sometimes forget about a problem while I am working on others. Escalation is a good reminder, even when the only person you supervise is yourself.
Overall, I believe the quality of our work has improved by an order of magnitude. There is almost no downtime (or at least the client does not have time to notice it. Just shhh!), while the amount of work has gone down and working conditions are calmer. We have moved from firefighting and patching holes with duct tape to calm, measured work, where many problems are predicted in advance and there is time to prevent them. Even problems that have already happened are easier to fix: first, we find out about them before customers panic, and second, the problem is often related to recent work (while doing one thing, we broke another), so it is easier to track down while the trail is still hot.
And there was another case...
Did you know that in the popular Debian 9 (Stretch), a package as popular as phpmyadmin has remained vulnerable for many months? (
Another time, in a similar situation, all servers had to be updated after a vulnerability in SSH. And when you assign a task, you need to verify that it gets done (subordinates tend to misunderstand, forget, get confused and make mistakes). So first we added an SSH version check for every server to okerr, and through okerr we made sure that the updates were rolled out everywhere. (Convenient! Select this type of indicator and you immediately see which server has which version.) Once we were sure the task was completed on all servers, we removed the indicators.
A couple of times we had a problem that appears and then goes away by itself (probably everyone knows this one?). By the time you notice and go to check, there is nothing left to check: everything works again. But then it breaks again. This happened to us, for example, with products that we uploaded to the Amazon Marketplace (MWS). At some point the uploaded inventory was wrong (wrong quantities and wrong prices). We figured it out, but to do that it was important to learn about the problem right away. Unfortunately MWS, like all Amazon services, is a little slow, so there was always a lag, but we still managed to at least roughly catch the connection between the problem and the scripts that caused it (we wrote a check, hooked it up to okerr, and investigated as soon as an alert arrived).
A recent interesting case was contributed by a large and expensive European hosting provider used by one of our customers. All of a sudden, ALL of our servers disappeared from the radar! First the customer himself noticed “by hand” (faster than okerr!) that the site he was working with did not open, and filed a ticket. But it was not one site that went down, it was everything! (Natasha, we dropped everything!) Then okerr began sending long lists of all the indicators that had lit up. Panic, panic, running in circles (what else is there to do?). Then everything came back up. It turned out that there had been scheduled maintenance in the data center (the kind that happens once in many years) and, of course, we should have been warned. But something went wrong on their side and nobody warned us. Well, one heart attack more, one less. But after everything is restored, everything needs to be double-checked! I cannot imagine doing that by hand. Okerr tested everything in a few minutes. It turned out that most servers had simply been temporarily unreachable but kept working. Some were overloaded but also recovered properly. Our only losses were two backups that, according to cron, should have been created and uploaded while this mess was going on. I did not even recreate them; a day later alerts arrived saying that everything was OK again, the next backups had appeared. I really like this example because okerr proved very useful in a situation we had not even thought about in advance, and that is exactly the job of monitoring: to withstand the unpredictable.
For okerr sensors we use the cheapest hosting we can find (quality and reliability are not important there, the sensors back each other up). Recently we found a very fast and super cheap hosting provider with drop-dead benchmarks. But... sometimes it turns out that outgoing connections from a virtual machine are made from another (neighboring) IP. Miracles. client_ip module with
One more thing, since we are talking about VPS hosting: we always use inexpensive providers (hetzner, ovh, scaleway), and I really like them in terms of benchmarks and stability. For other projects we also use the much more expensive Amazon EC2. So, thanks to okerr, we have an informed opinion. Both go down. And over our long period of observation I would not say that cheap hosting like hetzner has been noticeably less stable than EC2. So if you are not tied to other Amazon features, why pay more? 🙂
What's next?
If at this stage I have not scared you away from Okerr, then try it! You can go directly to this link
After registration you will be offered a short training (a few fairly easy tasks). The initial limits are very small, but they are enough for the training or for one server. After you complete the training, the limits (for example, the maximum number of indicators) will be raised.
From the documentation - first of all
If you use okerr seriously and these increased limits are not enough, write to support and we will raise them (for free).
Would you like to put okerr server on your server? Here
We want this project to take off, so that the world becomes more reliable thanks to us. Thanks to free software and services, the world has become friendlier and develops more dynamically. Source codes can be stored in free github, for mail use free gmail. We use free
Source: habr.com