How I spent a week as a trainee SRE. On-call duty through the eyes of a software engineer

First, let me introduce myself: I'm @tristan.read, a front-end engineer in GitLab's Monitor::Health group. Last week I had the privilege of shadowing one of our on-call SRE engineers. The goal was to observe, day by day, how the on-call engineer responds to incidents and to get some real on-call experience. We want our engineers to better understand the needs of the users of Monitor::Health features.

I followed the SRE around for the whole week: I attended the shift handovers, watched the same alert channels, and responded to incidents if and when they occurred.

Incidents

There were two incidents during the week.

1. Cryptominer

On Wednesday, GitLab.com saw a spike in GitLab Runner usage caused by attempts to spend Runner minutes on cryptocurrency mining. The incident was handled with an in-house mitigation tool that stops the Runner's jobs and deletes the associated project and account.

Had this event gone unnoticed, an automated tool would have caught it, but in this case the on-call SRE spotted the abuse first. An incident issue was created, but the information in it is confidential.

2. Performance degradation of Canary and Main applications

The incident was caused by slowdowns and increased error rates in the canary and main web applications on GitLab.com. Several Apdex thresholds were violated.

Public incident issue: https://gitlab.com/gitlab-com/gl-infra/production/issues/1442
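
For context, Apdex scores a service by the share of requests that complete within a target response time: requests under the target T count as satisfied, requests under 4T count as tolerating (at half weight), and the rest as frustrated. A minimal sketch of the calculation follows; the target and SLO values here are illustrative, not GitLab's actual thresholds.

```python
def apdex(latencies_ms, target_ms=250):
    """Compute an Apdex score for a list of request latencies.

    Requests under `target_ms` count as satisfied, requests under
    4 * `target_ms` count as tolerating (half weight), the rest as
    frustrated. The target value here is illustrative only.
    """
    satisfied = sum(1 for t in latencies_ms if t <= target_ms)
    tolerating = sum(1 for t in latencies_ms if target_ms < t <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Example: an Apdex alert could fire when the score drops below an SLO target.
score = apdex([120, 180, 900, 2400, 95, 310])
if score < 0.95:  # hypothetical SLO threshold
    print(f"Apdex degraded: {score:.2f}")
```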

Key findings

Here are a few things I learned during the week on call.

1. Alerts are most useful when they detect deviations from the norm

Alerts can be divided into several types:

  • Alerts based on an absolute threshold, such as "10 5xx errors per second."
  • Alerts with a relative threshold, such as "5xx errors exceed 10% of total requests at a given time."
  • Alerts based on a historical baseline, such as "5xx errors above the 90th percentile of the historical rate."

Generally speaking, types 2 and 3 are more useful to on-call SREs, because they surface deviations from normal behavior.
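
To make the distinction concrete, here is a small sketch that evaluates all three alert types over the same window of 5xx counts. This is not how GitLab's actual Prometheus/AlertManager rules are written; the thresholds and data shapes are made up for illustration.

```python
import statistics

def check_alerts(error_counts, request_counts, history):
    """Illustrative comparison of the three alert types above.

    error_counts / request_counts are per-second samples for the current
    window; history is a list of past 5xx counts used as a baseline.
    All thresholds are made-up examples, not real GitLab alerting rules.
    """
    alerts = []

    # Type 1: absolute threshold ("10 5xx errors per second")
    if max(error_counts) >= 10:
        alerts.append("5xx errors exceeded 10 per second")

    # Type 2: relative threshold (5xx errors as a share of all requests)
    error_rate = sum(error_counts) / max(sum(request_counts), 1)
    if error_rate >= 0.10:
        alerts.append(f"5xx errors are {error_rate:.0%} of requests")

    # Type 3: deviation from a historical baseline (90th percentile)
    p90 = statistics.quantiles(history, n=10)[8]
    if max(error_counts) > p90:
        alerts.append(f"5xx count above the historical 90th percentile ({p90:.1f})")

    return alerts

# Example call with made-up numbers.
print(check_alerts([2, 4, 12], [100, 110, 95], history=[1, 2, 2, 3, 3, 4, 5, 5, 6, 8]))
```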

2. Many alerts never escalate to incidents

SRE engineers deal with a constant stream of alerts, many of which are not actually critical.

So why not limit alerts to only the really important ones? Because with that approach you can miss the early symptoms of something that will snowball into a real problem and cause major damage.

The on-call SRE's job is to determine which alerts actually indicate something serious, and whether they need to be escalated and investigated. I suspect part of the problem is the inflexibility of the alerts themselves: it would be better to have several severity levels, or "smarter" ways to tune alerts for the situation described above.

Feature suggestion: https://gitlab.com/gitlab-org/gitlab/issues/42633
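
To illustrate what several alert levels could look like, here is a rough sketch of routing alerts by severity so that only the critical ones page the on-call engineer. The levels, alert names, and routing table are hypothetical; this is not an existing GitLab or AlertManager feature.

```python
from enum import Enum

class Severity(Enum):
    INFO = 1      # record only, review during working hours
    WARNING = 2   # post to the on-call Slack channel
    CRITICAL = 3  # page the on-call SRE via PagerDuty

# Hypothetical routing table mapping alert names to severity levels.
ROUTES = {
    "single_5xx_spike": Severity.INFO,
    "error_rate_above_10_percent": Severity.WARNING,
    "apdex_slo_violation": Severity.CRITICAL,
}

def route(alert_name):
    """Decide what to do with an alert instead of paging for everything."""
    severity = ROUTES.get(alert_name, Severity.WARNING)
    if severity is Severity.CRITICAL:
        return "page the on-call SRE via PagerDuty"
    if severity is Severity.WARNING:
        return "post to the alerts Slack channel"
    return "record for later review"

print(route("apdex_slo_violation"))   # -> page the on-call SRE via PagerDuty
print(route("single_5xx_spike"))      # -> record for later review
```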

3. Our SREs use a lot of tools

Internal:

  • GitLab infra project: runbooks, shift/week handovers, and incident response issues live here.
  • GitLab issues: investigations, post-mortems, and maintenance work are also tracked in issues.
  • GitLab labels: automation is triggered by specific labels, which bots use to track issue activity.

External:

  • PagerDuty: alerts.
  • Slack: this is where the PagerDuty/AlertManager alert stream goes. Slash-command integrations handle a variety of tasks, such as closing an alert or escalating it to an incident (see the sketch after this list).
  • Grafana: metric visualization with a focus on long-term trends.
  • Kibana: log visualization and search, with the ability to dig deeper into specific events.
  • Zoom: a permanently open "breakout room" lets SREs jump straight into a call and discuss events without wasting precious time creating a meeting and inviting participants.

And many, many others.
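
As an example of the kind of slash-command integration mentioned above, here is a rough sketch of a Slack command handler that escalates an alert into an incident issue. The command name, the project ID, and the GitLab host are placeholders, and this is not GitLab's actual bot.

```python
# Hypothetical Flask handler for a Slack slash command such as
# "/incident declare <alert-id>". The command name, the GitLab host,
# and the project ID are made up for illustration.
import os
import requests
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/slack/commands", methods=["POST"])
def handle_command():
    text = request.form.get("text", "")          # e.g. "declare ALERT-123"
    action, _, alert_id = text.partition(" ")

    if action == "declare":
        # Create an incident issue in a (hypothetical) tracker project.
        requests.post(
            "https://gitlab.example.com/api/v4/projects/42/issues",
            headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
            json={"title": f"Incident: {alert_id}", "labels": "incident"},
            timeout=10,
        )
        return jsonify(response_type="in_channel",
                       text=f"Incident declared for {alert_id}")

    return jsonify(text=f"Unknown action: {action}")
```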

4. Monitoring GitLab.com with GitLab is a single point of failure

If GitLab.com suffers a major outage, we don't want that to affect our ability to resolve the issue. This risk is avoided by running a second GitLab instance that manages GitLab.com. In fact, we already do: https://ops.gitlab.net/.

5. A few features to consider adding to GitLab

  • Multi-user issue editing, similar to Google Docs. This would help with incident issues during an event, as well as with post-mortem issues; in both cases, several participants may need to add information in real time.
  • More webhooks for issues. The ability to trigger various GitLab workflow steps from within an issue would reduce the dependency on Slack integrations, for example, triggering a PagerDuty alert via a slash command in a GitLab issue (a rough sketch follows below).
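
To make the webhook suggestion concrete, here is a rough sketch of a receiver that listens for GitLab comment webhooks and triggers a PagerDuty incident when a comment starts with a hypothetical /page quick action. The /page command does not exist in GitLab; the routing key and wiring are placeholders.

```python
# Sketch of a GitLab "note" (comment) webhook receiver that triggers a
# PagerDuty incident when a comment starts with a hypothetical /page
# quick action. The /page command does not exist in GitLab today.
import os
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/gitlab/webhook", methods=["POST"])
def gitlab_webhook():
    event = request.get_json(force=True)
    if event.get("object_kind") != "note":
        return "", 204

    comment = event["object_attributes"]["note"]
    if not comment.startswith("/page"):
        return "", 204

    issue_title = event.get("issue", {}).get("title", "unknown issue")
    # PagerDuty Events API v2; the routing key comes from the service config.
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
            "event_action": "trigger",
            "payload": {
                "summary": f"Paged from GitLab issue: {issue_title}",
                "source": "gitlab-webhook-sketch",
                "severity": "critical",
            },
        },
        timeout=10,
    )
    return "", 204
```
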
Conclusion

On-call SRE engineers deal with a lot of complexity. It would be great to see more of the GitLab product address these issues. We are already working on some additions to the product that will make the workflows mentioned above easier. Some of them are described in the Ops Product Vision section.

In 2020 we are expanding the team to build all of these great features. If you are interested, please check out the open vacancies, and feel free to reach out to anyone on our team with questions.

Source: habr.com
