PagerDuty, or Why the Operations Department Can't Sleep at Night

The more complex a system becomes, the more alerts it accumulates. Those alerts have to be answered, aggregated, and visualized. I think many readers know this situation well enough to have developed a nervous tic.

The solution discussed here is hardly surprising, but a search did not turn up a full-fledged article on the topic.

So I decided to share FunCorp's experience: how our on-call process is organized, who gets called and why, and how you can keep an eye on it all.


What is PagerDuty?

To solve all these problems, we started looking for a convenient tool. After a short search we settled on PagerDuty (PD). It struck us as a fairly complete yet concise solution with a large number of integrations and settings. So what is it?

In short, PagerDuty is an incident management platform. It receives incoming incidents through various integrations, lets you set up an on-call rotation, and alerts the on-call engineer according to the incident's urgency: a phone call for high urgency, a push notification from the app or an SMS for low urgency.

Who is on duty?

The on-call rotation is probably the first thing to set up in PD.

FunCorp, like many other companies, has the honorable post of on-call engineer. It passes from engineer to engineer once a day. There are a so-called first and second line of response to alerts from PagerDuty. Suppose a high-priority alert comes in. If the first-line engineer has not reacted within 10 minutes of the call (that is, the incident has not been moved to the acknowledged or resolved status), the call goes to the second-line engineer. This is configured in PagerDuty itself via Escalation Policies.


If the second-line engineer does not answer either, the notification goes back to the first-line engineer.

This way, no incoming high-priority alert can go unhandled.
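
The escalation loop described above can be illustrated with a small simulation. This is only a sketch of the logic, not PagerDuty's implementation: the function name, the `Notification` type and the `max_steps` cap are made up for the example, and only the 10-minute timeout comes from the text.

```python
from dataclasses import dataclass
from itertools import cycle, islice

ESCALATION_TIMEOUT_MIN = 10  # from the text: 10 minutes without a reaction


@dataclass
class Notification:
    minute: int    # minutes since the alert fired
    engineer: str  # who gets called at that moment


def escalation_timeline(lines, acknowledged_at=None, max_steps=6):
    """Return who is called and when, until someone reacts.

    lines: on-call engineers in escalation order (first line, second line).
    acknowledged_at: minute at which the incident is acknowledged or
        resolved, or None if nobody ever reacts.
    max_steps: hypothetical safety cap so the illustration terminates.
    """
    timeline = []
    # cycle() models the notification going back to the first line
    # after the last line has been tried.
    for step, engineer in enumerate(islice(cycle(lines), max_steps)):
        minute = step * ESCALATION_TIMEOUT_MIN
        if acknowledged_at is not None and acknowledged_at <= minute:
            break  # someone reacted before this call was due
        timeline.append(Notification(minute, engineer))
    return timeline
```

For example, `escalation_timeline(["first", "second"], acknowledged_at=25)` calls the first line at minute 0, the second line at minute 10, and the first line again at minute 20.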

Now let's see where incidents can come from.

What integrations do we use?

Incidents from many different services pour into PD. We currently have about 25 such services, and we handle them with a handful of ready-made integrations.

  • Prometheus

Our main metrics collection system is Prometheus. Plenty has already been written about it on Habr, so I will only say that we run several instances for different environments: one collects metrics from virtual machines and Docker containers, another from Amazon services, and a third from bare-metal servers. Telegraf is our main metrics exporter.

  • Email

Here, too, the name says it all. We use this integration to send notifications from some cron scripts. PD gives you an address to which you send emails. When creating a service with this integration, you can set priorities, the order in which incoming incidents are processed, and how exactly alerts are created (for every incoming email, for an incoming email that matches some rule, and so on).
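
As an illustration, a cron script could build its notification email roughly like this. It is a minimal sketch under assumptions: the integration address, the sender and the subject format are hypothetical (PD issues the real address when the service is created), and the function name is made up for the example.

```python
from email.message import EmailMessage

# Hypothetical address; PagerDuty provides the real one for each service.
PD_INTEGRATION_ADDRESS = "cron-alerts@example.pagerduty.com"


def build_alert_email(job_name, exit_code, output, sender="cron@example.com"):
    """Build the email that a failed cron job would send to PagerDuty."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = PD_INTEGRATION_ADDRESS
    # The subject format is an assumption; a PD email rule could match on it.
    msg["Subject"] = f"[cron] {job_name} failed with exit code {exit_code}"
    # Keep only the tail of the output so the email stays small.
    msg.set_content(output[-2000:] or "(no output)")
    return msg
```

Sending is then a matter of handing the message to a mail server, for example `smtplib.SMTP("localhost").send_message(msg)`, assuming a local relay is available.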


  • Slack

In my opinion, a very interesting integration. Sometimes something happens that no alert covers, so we added a Slack integration for creating incidents. In the corporate Slack you can write something like "/callofduty everything is slowing down and will break soon", and PD will process it and route the incident to the on-call engineer.

We do: [screenshot: the /callofduty command typed in Slack]

We see: [screenshot: the resulting incident]
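
How the /callofduty command is wired to PD is not shown in the article. As a rough sketch, assuming a custom handler behind a Slack slash command (Slack delivers these as a form-encoded POST with fields such as `command`, `text` and `user_name`), the incoming request could be turned into an incident description like this. The function name and the shape of the returned dict are made up for the example.

```python
from urllib.parse import parse_qs


def incident_from_slash_command(form_body):
    """Turn the form-encoded body of a Slack slash command into an incident."""
    # parse_qs returns lists of values; take the first value of each field.
    fields = {key: values[0] for key, values in parse_qs(form_body).items()}
    text = fields.get("text", "").strip()
    if not text:
        # Nothing to report; a real handler would reply with usage help.
        return None
    return {
        "summary": text,
        "reported_by": fields.get("user_name", "unknown"),
        "source": "slack:" + fields.get("command", ""),
    }
```

The returned dict would then be passed on to whatever creates the incident in PD.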

  • API

An HTTP integration. There is not much to it: just a POST request with a JSON body. One interesting use: we rely on it for external monitoring with https://www.statuscake.com/. This service checks the availability of our sites from different parts of the world. When a check gets an unacceptable response code (for example, 502), an incident is created and then follows the chain described above. StatusCake can also monitor internal URLs and the expiration of SSL certificates and domains.
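
For illustration, here is roughly what such a POST request looks like. The sketch assumes the PagerDuty Events API v2 format (a `routing_key`, an `event_action` and a `payload` with `summary`, `source` and `severity`); the article does not say which API version is used. The helper names and the rule "any 5xx code is unacceptable" are assumptions made for the example.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_request(routing_key, summary, source, severity="critical"):
    """Build (but do not send) a request that opens an incident."""
    body = {
        "routing_key": routing_key,  # the integration key of the PD service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    return urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_for_status(routing_key, url, status_code):
    """Return a trigger request for unacceptable codes, otherwise None."""
    if status_code < 500:  # assumption: only 5xx responses count as failures
        return None
    return build_trigger_request(
        routing_key, f"{url} returned HTTP {status_code}", url
    )
```

Passing the returned request to `urllib.request.urlopen` would actually send it; the sketch stops short of that so it can be run without a real integration key.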

  • LibreNMS

This is another monitoring system; you can read more about it on its website, https://www.librenms.org/. We use it to monitor network interfaces and the servers' iDRAC.


We also used to have other integrations, such as Datadog and CloudWatch. You can read about what happened to them here.

Visualization

Our main incident notification channel is Slack. Every incident that reaches PD is posted to a dedicated channel, and any status change shows up there as well.


When we got the chance to show useful data on the monitors hanging from the ceiling, we suddenly realized that we (the DevOps department) had nothing to put on them. There is the wonderful Grafana, but it does not cover everything, and people react to alerts, not charts.

After a thorough but unsuccessful search on GitHub for a concise and informative board for PD, we decided to write our own, with only what we need. At first we considered showing the PD interface itself, but that looked even less convenient.

To write it, all you need is a PD API key with read-only rights.
And here is what we got:


The screen shows the currently open incidents, the name of the current on-call engineer from the selected schedule, and the time elapsed without a high-priority incident (a panel with a high-priority incident is highlighted in red).
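
The logic behind such a screen can be sketched in a few lines. This is not the code of our board, only an illustration under assumptions: it expects incidents shaped like the objects returned by the PagerDuty REST API (with `status`, `urgency` and `created_at` fields), it counts the quiet time from the creation of the most recent high-urgency incident, and the names `parse_ts` and `board_state` are made up for the example.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as "2020-01-15T10:00:00Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def board_state(incidents, on_call_name, now):
    """Compute what the board shows from a list of incident dicts."""
    # Open means not yet resolved: triggered or acknowledged.
    open_incidents = [
        i for i in incidents if i["status"] in ("triggered", "acknowledged")
    ]
    high_times = [
        parse_ts(i["created_at"]) for i in incidents if i["urgency"] == "high"
    ]
    return {
        "on_call": on_call_name,
        "open": open_incidents,
        # The board turns red while a high-urgency incident is open.
        "alarm": any(i["urgency"] == "high" for i in open_incidents),
        # Time since the latest high-urgency incident, None if there was none.
        "quiet_for": (now - max(high_times)) if high_times else None,
    }
```

In a real board, the incident list and the on-call name would come from the PD API using the read-only key, and `now` would be `datetime.now(timezone.utc)`.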

See the source code for this implementation here.

As a result, we got a convenient dashboard for viewing all our incidents. I would be glad if our experience proves useful to some of you.

Source: habr.com
