Grafana Open-Sources the OnCall Incident Response System

Grafana Labs, which develops the Grafana data visualization platform and contributes to the Prometheus monitoring system, has released the source code of OnCall, an incident response system designed to help teams collaborate on resolving and analyzing incidents. OnCall was previously shipped as a proprietary product; Grafana Labs obtained it through its acquisition of Amixr Inc. last year. The code is written in Python and released under the AGPLv3 license.

The system collects information about anomalies and events from various monitoring systems, automatically groups the data, notifies the responsible teams, and tracks the status of problem resolution. Integrations with the Grafana, Prometheus, Alertmanager, and Zabbix monitoring systems are supported. Minor events are filtered out of the incoming data, duplicates are aggregated, and problems that can be resolved without human intervention are excluded.
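The filtering and grouping described above can be pictured with a short Python sketch. It is only an illustration of the idea, not OnCall's actual code: the `Alert` record, its fields, the `triage` function, and the severity scale are all hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical alert record; the fields are illustrative, not OnCall's schema.
@dataclass(frozen=True)
class Alert:
    source: str                  # e.g. "prometheus" or "zabbix"
    name: str                    # alert name reported by the monitoring system
    severity: int                # assumed scale: 0 = informational ... 3 = critical
    auto_resolved: bool = False  # True if no human action is needed


def triage(alerts, min_severity=1):
    """Drop minor and self-resolving alerts, then group duplicates by (source, name)."""
    groups = defaultdict(list)
    for alert in alerts:
        if alert.severity < min_severity or alert.auto_resolved:
            continue  # treated as noise: too minor, or resolved without a human
        groups[(alert.source, alert.name)].append(alert)
    return dict(groups)


alerts = [
    Alert("prometheus", "HighCPU", 2),
    Alert("prometheus", "HighCPU", 2),  # duplicate of the previous alert
    Alert("zabbix", "DiskInfo", 0),     # too minor to notify anyone
    Alert("zabbix", "ServiceRestarted", 2, auto_resolved=True),
]
grouped = triage(alerts)
```

Here the two identical `HighCPU` alerts end up in a single group, while the informational and self-resolving alerts are discarded before anyone is notified.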

Significant events, once stripped of this noise, are passed to the notification subsystem, which identifies the employees responsible for the affected categories of problems and notifies them, taking into account their work schedules and current workload (based on data from the scheduler). The system supports rotating incident assignments among employees and escalating particularly important or unresolved problems to other team members or to higher-level staff.
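A minimal Python sketch of schedule-aware assignment with escalation might look like the following. It rests on several assumptions that do not come from OnCall itself: the `Responder` class, the `pick_responder` function, the use of open incident counts as a workload measure, and shifts that do not cross midnight are all simplifications made for illustration.

```python
from dataclasses import dataclass
from datetime import time


# Hypothetical responder record; not part of OnCall's data model.
@dataclass
class Responder:
    name: str
    shift_start: time
    shift_end: time
    open_incidents: int = 0  # assumed stand-in for current workload

    def on_shift(self, now: time) -> bool:
        # Simplification: shifts are assumed not to cross midnight.
        return self.shift_start <= now < self.shift_end


def pick_responder(chain, now, level=0):
    """Return the least-loaded on-shift responder, starting at the given
    escalation level and moving up the chain if nobody is available."""
    for tier in chain[level:]:
        available = [r for r in tier if r.on_shift(now)]
        if available:
            return min(available, key=lambda r: r.open_incidents)
    return None


chain = [
    [  # level 0: first-line engineers
        Responder("alice", time(9), time(17), open_incidents=2),
        Responder("bob", time(9), time(17), open_incidents=0),
    ],
    [  # level 1: escalation contact
        Responder("carol", time(0), time(23, 59)),
    ],
]

first = pick_responder(chain, time(10))               # working hours
night = pick_responder(chain, time(20))               # nobody on level 0 is on shift
escalated = pick_responder(chain, time(10), level=1)  # unresolved, so escalate
```

During working hours the incident goes to the first-line engineer with the fewest open incidents; outside those hours, or when the incident is explicitly escalated, the search moves to the next level of the chain.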

Depending on the severity of the incident, notifications can be delivered by phone call, SMS, email, calendar events in the scheduler, or the Slack and Telegram messengers. In Slack, the system can also automatically create channels for discussing an incident and add individual employees or entire teams to them.
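Choosing delivery channels by severity can be sketched as a simple lookup in Python. The severity labels, the channel identifiers, and the particular mapping below are assumptions made for this example and do not reflect OnCall's real configuration format.

```python
# Hypothetical mapping from severity label to delivery channels.
CHANNELS_BY_SEVERITY = {
    "low": ["email"],
    "medium": ["slack", "telegram"],
    "high": ["sms", "slack"],
    "critical": ["phone_call", "sms", "slack"],
}


def channels_for(severity):
    """Return the delivery channels for a severity label,
    falling back to email for labels that are not configured."""
    return CHANNELS_BY_SEVERITY.get(severity, ["email"])


critical_channels = channels_for("critical")
unknown_channels = channels_for("unclassified")
```

The more intrusive channels, such as phone calls, are reserved for the most severe incidents, while anything without an explicit rule falls back to email.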

The system offers flexible extension and customization options; for example, you can adjust how events are grouped and routed and define the rules and channels for delivering notifications. An API and Terraform support are provided for integration with external systems. The system is managed through a web interface.


Source: opennet.ru
