Easy work with complex alerts. Or the history of Balerter

Everyone loves alerts.

Of course, it's much better to be notified when something breaks (or gets fixed) than to sit and stare at charts, hunting for anomalies.

And there are many tools for this: Alertmanager from the Prometheus ecosystem, vmalert from the VictoriaMetrics product family, Zabbix notifications, alerts in Grafana, self-written bash scripts, and Telegram bots that periodically poll some URL and report if something is wrong. Plenty of everything.

We also used various solutions in our company, until we ran into the complexity, or rather the impossibility, of creating complex, composite alerts. What we wanted and what we ended up building is described below. TL;DR: this is how the open-source project Balerter was born.

For quite a long time, we lived happily with alerts configured in Grafana. Yes, it's not the best way; specialized solutions like Alertmanager are always recommended, and we considered migrating more than once. And then, gradually, we wanted more.

Say, alert when a metric has fallen or risen by XX% and stayed there for N minutes, compared to the previous period of M hours? It seems possible to implement this with Grafana or Alertmanager, but it is rather difficult. (Or maybe not; I can't say for sure.)

Things get even more complicated when the decision to alert has to be made based on data from different sources. A real example:

We check data from two ClickHouse databases, then compare it with some data from Postgres, and decide whether to fire or resolve an alert.

We accumulated enough of these wishes to start thinking about our own solution. Then we compiled a first list of requirements/capabilities for this yet-to-be-created service:

  • access different data sources, e.g. Prometheus, ClickHouse, Postgres

  • send alerts to various channels: Telegram, Slack, etc.

  • while thinking it over, it became clear that we wanted not a declarative description but the ability to write scripts

  • run scripts on a schedule

  • update scripts easily, without restarting the service

  • extend the functionality without rebuilding the service from source

This list is approximate and probably not entirely accurate. Some items changed, some were dropped. Business as usual.

Actually, this is how the history of Balerter began.

I will try to describe briefly what we ended up with and how it works. (This is, of course, not final; there are lots of plans for the product. I'll just focus on today.)

How does it work?

You write a Lua script in which you explicitly send requests (to Prometheus, ClickHouse, etc.), receive the responses, process and compare them, and then raise or resolve an alert. Balerter automatically sends a notification to the channels you configured (email, Telegram, Slack, etc.). The script is executed at the specified frequency. And... that's basically it.

It's best to show with an example:

-- @interval 10s
-- @name script1

local minRequestsRPS = 100

local log = require("log")
local alert = require("alert")
local ch1 = require("datasource.clickhouse.ch1")

local res, err = ch1.query("SELECT sum(requests) AS rps FROM some_table WHERE date = now()")
if err ~= nil then
    log.error("clickhouse 'ch1' query error: " .. err)
    return
end

local resultRPS = res[1].rps

if resultRPS < minRequestsRPS then
    alert.error("rps-min-limit", "Requests RPS are very small: " .. tostring(resultRPS))
else
    alert.success("rps-min-limit", "Requests RPS ok")
end 

What's going on here:

  • specify that the script should be executed every 10 seconds

  • specify the script name (for the API, for display in logs, for use in tests)

  • require the logging module

  • require the module for accessing the ClickHouse datasource named ch1 (the connection itself is configured in the config file)

  • send a query to ClickHouse

  • on error, log a message and exit

  • compare the query result with a constant (in a real-life setup, this value could come from, say, a Postgres database)

  • raise or resolve the alert with the ID rps-min-limit

  • a notification is sent if the alert status has changed
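For completeness: the ch1 datasource referenced in the script, as well as the notification channels, are declared in Balerter's config file. A rough sketch of what it might look like (field names here are approximate and may differ from the current format; see balerter.com for the reference):

datasources:
  clickhouse:
    - name: ch1
      host: localhost
      port: 9000
      username: default
      password: secret

channels:
  telegram:
    - name: tg-ops
      token: <bot-token>
      chatID: -1001234567

Balerter reads this config at startup, which is how `require("datasource.clickhouse.ch1")` in the script resolves to a ready-to-use connection.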

The example is quite simple and clear. In real life, however, scripts can grow sprawling and complex, and it is easy to get confused and make mistakes.

Hence a natural desire emerged: the ability to write tests for your scripts. It appeared in version v0.4.0.

Script testing

An example test for the script above:

-- @test script1
-- @name script1-test

local test = require('test')

local resp = {
    {
        rps = 10
    }
} 

test.datasource('clickhouse.ch1').on('query', 'SELECT sum(requests) AS rps FROM some_table WHERE date = now()').response(resp)

test.alert().assertCalled('error', 'rps-min-limit', 'Requests RPS are very small: 10')
test.alert().assertNotCalled('success', 'rps-min-limit', 'Requests RPS ok')

In steps:

  • specify the name of the script the test is written for

  • specify the test name (for logs)

  • require the test module

  • define the result to be returned for a specific query to the ClickHouse datasource ch1

  • check that the error alert rps-min-limit was raised with the expected message

  • check that the success state was not set for rps-min-limit

What else can Balerter do?

I will try to cover what are, in my opinion, the most important capabilities of Balerter. You can see everything in detail on the official website: https://balerter.com

  • receive data from

    • ClickHouse

    • Postgres

    • MySQL

    • Prometheus

    • Loki

  • send notifications to channels

    • Slack

    • Telegram

    • syslog

    • notify (UI notifications on your computer)

    • email

    • Discord

  • build charts from your data, upload the image to S3-compatible storage, and attach it to notifications (example with pictures)

  • exchange data between scripts via a global key/value storage

  • write your own libraries in Lua and use them in scripts (Lua libraries for working with JSON and CSV are bundled by default)

  • send HTTP requests from your scripts (and receive responses, of course)

  • provides an API (not yet as functional as we would like)

  • exports metrics in Prometheus format
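As an illustration of the key/value storage: it makes it easy to carry a value over between runs of a script, for example to compare the current result with the previous one. A minimal sketch (the kv module and its get/put signatures are taken from the Balerter documentation as I remember it; treat the exact API as an assumption):

-- @interval 1m
-- @name kv-example

local log = require("log")
local kv = require("kv")

-- value stored by the previous run (if any)
local prev, err = kv.get("last-rps")
if err ~= nil then
    log.info("no previous value stored yet")
end

-- store the current value for the next run
err = kv.put("last-rps", "42")
if err ~= nil then
    log.error("kv.put error: " .. err)
end

The storage is global, so other scripts can read the same key, which is how data is exchanged between scripts.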

What else do we want?

It is already clear that users (and we ourselves) want the ability to control script scheduling using cron syntax. This will be done before version v1.0.0.

We would like to support more data sources and notification channels. For example, someone will definitely miss MongoDB, someone else Elasticsearch, or sending SMS and/or making phone calls. We also want to be able to load scripts not only from files but also, for example, from a database. And finally, we want a more convenient website for the project and better documentation.

Someone is always missing something. Here we are counting on feedback from the community to set priorities properly, and on the community's help to implement it all.

In conclusion

We have been using Balerter ourselves for quite some time now. Dozens of scripts guard our peace of mind. I hope this work will be useful to someone else.

Issues and PRs are welcome.

Source: habr.com
