Making complex alerts easy, or the story of Balerter
Everyone loves alerts.
Of course, it's much better to be notified when something breaks (or recovers) than to sit staring at charts, hunting for anomalies.
And there are many tools for this: Alertmanager from the Prometheus ecosystem, vmalert from the VictoriaMetrics product family, Zabbix notifications, alerts in Grafana, hand-written bash scripts and Telegram bots that periodically poll some URL and report if something is wrong. A lot of everything.
In our company, we also used various solutions, until we ran into the complexity, or rather the impossibility, of creating complex, composite alerts. What we wanted and what we ended up building is below. TL;DR: this is how the open source project Balerter was born.
For quite a long time, we lived happily with alerts configured in Grafana. Yes, it's not the best way; it is generally recommended to use a specialized solution like Alertmanager, and we considered migrating more than once. And then, little by little, we wanted more.
Say, alert when a certain metric has dropped or risen by XX% and stayed there for N minutes, compared to the previous period of M hours? It seems you could try to implement this with Grafana or Alertmanager, but it is rather difficult. (Or maybe not; I won't claim that now.)
Things get even more complicated when the decision to alert has to be made based on data from different sources. Live example:
We check data from two ClickHouse databases, then compare it with some data from Postgres, and decide whether to alert: fire or cancel.
We accumulated enough of these wishes to start thinking about our own solution. We then drafted the first list of requirements/capabilities for this service that did not yet exist:
access different data sources, such as Prometheus, ClickHouse, Postgres
send alerts to various channels: Telegram, Slack, etc.
while thinking it over, it became clear that we wanted not a declarative description but the ability to write scripts
run scripts on a schedule
update scripts easily, without restarting the service
extend the functionality without rebuilding the service from source
This list is approximate and probably not very accurate. Some items changed, some were dropped. Business as usual.
Actually, this is how the history of Balerter began.
I will try to briefly describe what we ended up with and how it works. (This is, of course, not final; there are many plans for the product's development. I will just focus on what exists today.)
How does it work?
You write a Lua script in which you explicitly send queries (to Prometheus, ClickHouse, etc.), then receive, process, and compare the responses, and switch some alert on or off. Balerter automatically sends a notification to the channels you have configured (email, Telegram, Slack, etc.). The script is executed at the specified interval. And... in general, that's all.)
It's best to show with an example:
-- @interval 10s
-- @name script1

local minRequestsRPS = 100

local log = require("log")
local ch1 = require("datasource.clickhouse.ch1")

local res, err = ch1.query("SELECT sum(requests) AS rps FROM some_table WHERE date = now()")
if err ~= nil then
    log.error("clickhouse 'ch1' query error: " .. err)
    return
end

local resultRPS = res[1].rps

if resultRPS < minRequestsRPS then
    alert.error("rps-min-limit", "Requests RPS are very small: " .. tostring(resultRPS))
else
    alert.success("rps-min-limit", "Requests RPS ok")
end
What's going on here:
specify that the script should run every 10 seconds
specify the script's name (used by the API, in logs, and in tests)
connect the logging module
connect the module for accessing the ClickHouse connection named ch1 (the connection itself is configured in the config)
send a query to ClickHouse
on error, log a message and exit
compare the query result with a constant (in a real case, we could fetch this value from, say, a Postgres database)
enable or disable the alert with ID rps-min-limit
a notification is sent if the alert's status has changed
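The hardcoded threshold from the example can instead be fetched from another data source, as in the live example above. Here is a minimal sketch under the assumption that a Postgres connection named pg1 is configured; the table name, column, and connection name are hypothetical, so adjust them to your setup:

```lua
-- @interval 10s
-- @name script1-pg-threshold

local log = require("log")
local ch1 = require("datasource.clickhouse.ch1")
local pg1 = require("datasource.postgres.pg1") -- hypothetical Postgres connection name

-- fetch the threshold from Postgres instead of hardcoding it
-- (the alert_limits table is an assumption for this sketch)
local limits, err = pg1.query("SELECT min_rps FROM alert_limits WHERE service = 'api'")
if err ~= nil then
    log.error("postgres 'pg1' query error: " .. err)
    return
end
local minRequestsRPS = limits[1].min_rps

local res, err = ch1.query("SELECT sum(requests) AS rps FROM some_table WHERE date = now()")
if err ~= nil then
    log.error("clickhouse 'ch1' query error: " .. err)
    return
end

if res[1].rps < minRequestsRPS then
    alert.error("rps-min-limit", "Requests RPS are very small: " .. tostring(res[1].rps))
else
    alert.success("rps-min-limit", "Requests RPS ok")
end
```

The same pattern scales to the composite case: query several sources in one script, combine the results however you like, and only then decide on the alert.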
The example is quite simple and clear. In real life, however, scripts can grow sprawling and complex, and it is easy to get confused and make mistakes.
So a natural desire matured: the ability to write tests for your scripts. It appeared in version v0.4.0.
Script testing
An example test for our script from the example above:
-- @test script1
-- @name script1-test
local test = require("test")

local resp = {
    {
        rps = 10
    }
}
test.datasource('clickhouse.ch1').on('query', 'SELECT sum(requests) AS rps FROM some_table WHERE date = now()').response(resp)
test.alert().assertCalled('error', 'rps-min-limit', 'Requests RPS are very small: 10')
test.alert().assertNotCalled('success', 'rps-min-limit', 'Requests RPS ok')
In steps:
specify the name of the script for which the test is written
test name (for logs)
connect the test module
specify what result should be returned for a specific query to the ClickHouse connection ch1
check that an alert (error) rps-min-limit was called with the specified message
check that the rps-min-limit alert was not switched to success
What else can Balerter do?
I will try to cover Balerter's most important capabilities, as I see them. You can find all the details on the official website: https://balerter.com
receive data from
ClickHouse
Postgres
MySQL
Prometheus
Loki
send notifications to channels
Slack
Telegram
syslog
notify (desktop UI notifications on your computer)
email
Discord
build charts from your data, upload the image to an S3-compatible storage, and attach it to notifications (example with pictures)
exchange data between scripts via a global Key/Value storage
write your own libraries in Lua and use them in scripts (Lua libraries for working with json and csv are bundled by default)
send HTTP requests from your scripts (and receive responses, of course)
provides an API (not yet as functional as we would like)
exports metrics in Prometheus format
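The Key/Value storage and HTTP client can be combined in the same script style as the examples above. A short sketch, assuming the modules are named kv and http as in the Balerter documentation; the URL is hypothetical, and the exact response field names may differ between versions, so check them against the docs for your release:

```lua
-- @interval 1m
-- @name http-kv-example

local log = require("log")
local kv = require("kv")     -- global Key/Value storage shared between scripts
local http = require("http") -- module for sending HTTP requests

-- store a value so that other scripts can read it later with kv.get
local err = kv.put("last-check", "ok")
if err ~= nil then
    log.error("kv error: " .. err)
end

-- send an HTTP request from the script (the URL is hypothetical)
local resp, err = http.get("http://example.com/health")
if err ~= nil then
    log.error("http error: " .. err)
    return
end

-- the response field name is an assumption; verify it in the docs
log.info("health endpoint returned status " .. tostring(resp.status_code))
```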
What else do we want?
It is already clear that users, and we ourselves, want the ability to schedule script runs using cron syntax. This will be done before version v1.0.0.
We would like to support more data sources and notification channels. For example, someone will surely miss MongoDB, someone else Elasticsearch, or sending SMS and/or making calls to mobile phones. We want to be able to load scripts not only from files but also, for example, from a database. And finally, we want a more convenient project site and better documentation.
Someone is always missing something.) Here we count on requests from the community to prioritize properly, and on the community's help to implement it all.
In conclusion
We have been using Balerter ourselves for quite some time now. Dozens of scripts guard our peace of mind. I hope this work proves useful to someone else as well.