What to think about when implementing on-call shifts

Effective DevOps author Ryn Daniels shares strategies anyone can use to create better, less frustrating, and more sustainable on-call rotations.

With the rise of DevOps, many engineers these days take part in on-call rotations in one form or another, something that was once the sole responsibility of sysadmins or operations engineers. Being on call, especially outside working hours, is not a task that most people enjoy. On-call duty can disrupt our sleep, interfere with the normal work we are trying to do during the day, and interfere with our lives in general. As more and more teams take part in on-call, it is worth asking: “What can we as individuals, teams, and organizations do to make on-call more humane and sustainable?”

Save your sleep

Often the first thing people think of when they think about being on call is that it will negatively affect their sleep; no one wants an alert to wake them up in the middle of the night. If your organization or team is large enough, you can use "follow-the-sun" rotations, where teams in multiple time zones share the same rotation with shorter shifts, so that each time zone is only on call during its business (or at least waking) hours. Establishing such a rotation can do wonders to reduce the night-time load that the on-call person takes on.
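A follow-the-sun rotation can be pictured as a lookup from the current UTC hour to the region that is awake. The sketch below is only an illustration: the three region names and their eight-hour windows are assumptions made up for the example, not a recommendation for any particular team.

```python
from datetime import datetime, timezone

# Hypothetical rotation: (region, start hour, end hour), all in UTC.
# Names and windows are illustrative assumptions only.
ROTATION = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("Americas", 16, 24),
]


def region_on_call(moment: datetime) -> str:
    """Return the region covering the given timezone-aware moment."""
    hour = moment.astimezone(timezone.utc).hour
    for region, start, end in ROTATION:
        if start <= hour < end:
            return region
    raise ValueError(f"rotation does not cover hour {hour}")


if __name__ == "__main__":
    print(region_on_call(datetime.now(timezone.utc)))
```

With windows like these, no region is paged between its own midnight and early morning, at the cost of needing enough engineers in each region to staff the shorter shifts.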

If you don't have enough engineers or the geographic distribution to support a follow-the-sun rotation, there are still things you can do to reduce the likelihood of people being woken up unnecessarily in the middle of the night. After all, it's one thing to get out of bed at 4 a.m. to solve a pressing, customer-facing problem; it's quite another to wake up only to find that you're dealing with a false alarm. It can help to review all the alerts you've set up and ask your team which ones really need to wake someone up after hours and which can wait until the morning. It can be difficult to get people to agree to turn off some after-hours alerts, especially if missed issues have caused problems in the past, but it's important to remember that a sleep-deprived engineer is not the most effective engineer. Have the less urgent alerts fire during business hours, and keep after-hours paging for the ones that really matter. Most alerting tools these days let you set up different rules for after-hours notifications, whether through Nagios notification periods or different schedules in PagerDuty.
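The rule described above can be sketched in a few lines. This is not how Nagios or PagerDuty implement it; the `Alert` class, the `wake_worthy` flag, and the 09:00 to 18:00 weekday window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    name: str
    wake_worthy: bool  # hypothetical flag: the team agreed this may wake someone


# Assumed business hours; adjust to whatever your team actually works.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Decide what to do with an alert that fires at local time `now`."""
    if alert.wake_worthy:
        return "page"      # always reaches the on-call person
    if is_business_hours(now):
        return "notify"    # non-urgent, but someone is awake to look at it
    return "hold"          # keep it until business hours resume


if __name__ == "__main__":
    print(route(Alert("disk_70_percent", wake_worthy=False), datetime.now()))
```

The useful part is less the code than the conversation it forces: someone has to decide, per alert, whether `wake_worthy` is true.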

Sleep, on-call, and team culture

Other solutions to sleep disruption involve larger cultural changes. One way to approach the problem is to track alerts, paying particular attention to when they arrive and whether they are actionable. Opsweekly is a tool created and published by Etsy that allows teams to track and categorize the alerts they receive. It can generate graphs showing how many alerts woke people up (using sleep data from fitness trackers), as well as how many alerts actually required human action. Using tools like this, you can track the effectiveness of your on-call rotation and its impact on sleep over time.
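To make the idea of tracking concrete, here is a minimal sketch of the kind of summary such a tool produces. It does not reflect Opsweekly's actual data model; the `AlertRecord` fields and the `summarize` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertRecord:
    name: str
    actionable: bool     # did a human actually have to do something?
    woke_someone: bool   # did it arrive while the on-call person was asleep?


def summarize(records: List[AlertRecord]) -> Dict[str, float]:
    """Summarize one on-call shift: volume, share of actionable alerts, wake-ups."""
    total = len(records)
    if total == 0:
        return {"total": 0, "actionable_pct": 0.0, "woke_someone": 0}
    actionable = sum(r.actionable for r in records)
    return {
        "total": total,
        "actionable_pct": 100.0 * actionable / total,
        "woke_someone": sum(r.woke_someone for r in records),
    }


if __name__ == "__main__":
    shift = [
        AlertRecord("disk_full", actionable=True, woke_someone=True),
        AlertRecord("flapping_check", actionable=False, woke_someone=True),
    ]
    print(summarize(shift))
```

Comparing these numbers from one shift to the next is what shows whether the rotation is getting better or worse.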

The team can play a role in ensuring that every person on call gets enough rest. Create a culture that encourages people to take care of themselves: if you lost sleep because you were paged at night, you can sleep a little longer in the morning to try to make up for it. Team members can look out for each other: when teams share their sleep data with each other through something like Opsweekly, they can go to their colleagues on call and say, “Hey, looks like you had a rough night with PagerDuty last night. Would you like me to cover for you tonight so you can get some rest?” Encourage people to support each other in this way and discourage a “hero culture” where people push themselves to the limit and avoid asking for help.

Reducing the impact of on-call on work

When engineers are tired because they were woken up while on call, they will obviously not work at 100% capacity that day, but even setting sleep deprivation aside, being on call affects work in other ways. One of the most significant losses during on-call comes from interruptions: a single interruption can cost at least 20 minutes through lost focus and context switching. It's likely that your team will have other sources of interruption as well, such as tickets generated by other teams or requests and questions coming through chat and email. Depending on the volume of these other interruptions, you may consider adding them to the existing on-call rotation or setting up a second rotation just to handle these other requests.
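The 20-minute figure above makes it easy to put a rough lower bound on what a day of interruptions costs. The function name and its default are illustrative assumptions; the only number taken from the text is the 20 minutes per interruption.

```python
def focus_lost_hours(interruptions: int, minutes_per_interruption: float = 20.0) -> float:
    """Lower-bound estimate of hours of focus lost to interruptions."""
    if interruptions < 0:
        raise ValueError("interruptions must be non-negative")
    return interruptions * minutes_per_interruption / 60.0


if __name__ == "__main__":
    # Six pages, tickets, or chat requests in one day.
    print(focus_lost_hours(6))
```

By this estimate, six interruptions already cost two hours, a quarter of an eight-hour day, before any time spent on the alerts themselves.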

It is important to take this into account when planning the team's work, both long term and short term. If your team tends to have fairly intense on-call shifts, this needs to be factored into long-term planning, as you may effectively have the equivalent of one full-time person doing nothing but on-call at any given time, rather than other work. In short-term planning, you may find that the on-call person is unable to meet deadlines because of their on-call responsibilities. This should be expected, and the rest of the team should be willing to accommodate it and help make sure the work gets done and the on-call person is supported. Regardless of whether the on-call person is actually paged, the shift will affect their ability to do other work. Don't expect the on-call person to work nights to complete scheduled projects in addition to being on call after hours.

Teams will have to find a way to cope with the extra work generated by on-call. This could be real work to fix real problems detected by monitoring and alerting systems, or it could be work to fix the monitoring and alerts themselves to reduce the number of false positives. Whatever the nature of the work, it is important to distribute it fairly and sustainably across the team. Not all on-call shifts are created equal, and some are harder than others, so deciding that the person who receives an alert is responsible for dealing with all of its consequences can lead to an uneven distribution of work. It may make more sense for the on-call person to be responsible for scheduling or delegating the work, with the expectation that the rest of the team will be willing to help complete it.

Creating and maintaining work-life balance

Think about the impact being on call has on your life outside of work. When you are on call, you are likely to feel tied to your mobile phone and laptop, which means that you either carry a laptop and a mobile router (USB modem) with you everywhere or simply do not leave your home or office. Being on call usually means giving up things like seeing friends or family during your shift. This means that both the length of each shift, which depends on the number of people on your team, and the frequency of shifts can put an undue burden on people. You may need to experiment with the length and timing of your shifts to find a schedule that works for at least the majority of people involved, as different teams and people will have different priorities and preferences.

It is vital to recognize the impact that being on call will have on people's lives, both at a management level and at an individual level. It should be noted that the impact will be felt disproportionately by people with less privilege. For example, if you have to spend time caring for children or other family members, or if most of the housework falls on your shoulders, you already have less time and energy than someone who doesn't have those responsibilities. This kind of “second shift” or “third shift” work affects some people far more than others, and if you establish on-call rotations with a schedule or intensity that assumes participants have no life outside of the office, you are limiting who can participate on your team.

Encourage people to try to keep as much of their regular schedule as possible. Consider providing the team with mobile routers (USB modems) so that people can leave the house with their laptop and still have some semblance of a life. Encourage people to trade on-call hours with each other for short periods when necessary, so that people can go to the gym or see a doctor while on call. Don't create a culture where being on call means engineers literally do nothing but be on call. Work-life balance is an important part of any job, and it matters even more when after-hours on-call is involved. More senior members of your team should set an example for others by maintaining as much work-life balance as possible while on call.

On an individual level, don't forget to explain what being on call means to your friends, family, partners, pets, and so on (your cats probably won't care, since they're already up at 4 a.m. when you get the alert, although they will have no interest in helping you resolve it). Make sure you make up for lost time after your shift ends, whether that means seeing friends or family or catching up on sleep. If you can, consider setting up a silent alert (on a smartwatch, for example) that wakes you by buzzing your wrist so you don't wake up anyone around you. Find ways to take care of yourself both in the middle of your on-call shift and when it's over. You might want to put together an “on-call survival kit” that helps you relax: a playlist of your favorite music, a favorite book, or time to play with your pet. Managers should encourage self-care by giving people a day off after a week on call and making sure people ask for (and get) help when they need it.

Improving the on-call experience

Overall, being on call shouldn't just be seen as a terrible job: as the person on call, you have both the opportunity and the responsibility to actively make it better for the people who will be on call in the future, which means that they will receive fewer alerts and those alerts will be more accurate. Again, tracking the value of your alerts using something like Opsweekly can help you figure out what is making your on-call painful and fix it. For non-actionable alerts, ask yourself whether there are ways to get rid of them. Perhaps this means they only go off during business hours, because there are some things you just don't need to respond to in the middle of the night. Don't be afraid to delete alerts, change them, or change the delivery method from "send to phone and email" to "email only." Experimentation and iteration are the key to improving on-call over time.

For alerts that are actually actionable, consider how easy it is for an engineer to take the necessary action. Every actionable alert should have a runbook that goes with it; consider using a tool like nagios-herald to add runbook links to your alerts. If the alert is simple enough that it doesn't need a runbook, it's probably simple enough that you can automate the response using something like Nagios event handlers, which saves people from being woken up or interrupted for easily automated tasks. Both runbooks and nagios-herald can help you add valuable context to your alerts, which helps people respond to them more effectively. See if you can answer common questions such as: When was the last time this alert went off? Who responded last time, and what action did they take (if any)? What other alerts fire at the same time as this one, and are they related? This kind of contextual information often lives only in people's heads, so encouraging a culture of documenting and sharing it can reduce the overhead of responding to alerts.
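The context questions above can be answered from a simple history of past firings. The sketch below is not how nagios-herald works; the `Firing` record, the two helper functions, and the ten-minute window are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Firing:
    alert: str
    fired_at: datetime
    responder: str
    action: str  # empty string if no action was taken


def last_firing(history: List[Firing], alert: str) -> Optional[Firing]:
    """Most recent firing of `alert`: when it fired, who responded, what they did."""
    matches = [f for f in history if f.alert == alert]
    return max(matches, key=lambda f: f.fired_at) if matches else None


def related_alerts(
    history: List[Firing],
    firing: Firing,
    window: timedelta = timedelta(minutes=10),
) -> List[str]:
    """Names of other alerts that fired within `window` of the given firing."""
    return sorted(
        {
            f.alert
            for f in history
            if f.alert != firing.alert and abs(f.fired_at - firing.fired_at) <= window
        }
    )


if __name__ == "__main__":
    history = [
        Firing("disk_full", datetime(2024, 1, 8, 3, 0), "bob", "rotated logs"),
        Firing("io_wait", datetime(2024, 1, 8, 3, 5), "bob", ""),
    ]
    latest = last_firing(history, "disk_full")
    print(latest, related_alerts(history, latest))
```

Attaching even this much to an alert means the next responder does not have to reconstruct it from memory at 4 a.m.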

A big part of the fatigue that comes from on-call is that it never ends: if your team has an on-call rotation, it's unlikely to go away at any point in the foreseeable future. The shifts never end, and we may feel like they will always be terrible. This lack of hope is a serious psychological burden that can contribute to stress and exhaustion, so addressing the perception (in addition to the reality) that on-call will always be terrible is a good place to start when thinking about your on-call in the long term.

To give people hope that the on-call situation will improve, you need observability into it (the same tracking and categorization of alerts mentioned earlier). Keep track of how many alerts you have, what percentage of them require action from the on-call person, and how many of them wake people up, and then work to create a culture that encourages people to make things better. If you have a large team, it can be tempting, as soon as your shift comes to an end, to throw up your hands and say "that's a problem for the next person on call" rather than dig in and fix something. Who wants to put more effort into on-call than is required of them? This is where a culture of empathy can make a big difference, because you're looking out not only for your own well-being on call, but also for that of your colleagues.

It's all about empathy

Empathy is an important part of what drives the work that improves the on-call experience. As a manager or team member, you can recognize or even reward people for behavior that makes on-call better. Operations is one of those areas where engineers often feel that people only pay attention to them when something goes wrong: people will be there to yell at them when the site goes down, but they rarely hear about the behind-the-scenes effort that operations engineers put into keeping the site running the rest of the time. Recognizing that work can go a long way, whether it's thanking someone in a meeting or in a team-wide email for improving a specific alert or a technical aspect of on-call, or giving someone time to cover part of a shift for another engineer.

Encourage people to spend time and effort on improving their on-call situation in the long term. If your team has an on-call rotation, you should plan and prioritize this work the same way you would any other work on your roadmap. On-call is 90% entropy, and unless you actively work to improve it, it will get worse and worse over time. Work with your team to figure out what best motivates and rewards people, and then use that to encourage them to reduce alert noise, write runbooks, and build tools that solve their on-call problems. Whatever you do, don't accept terrible on-call as a permanent state of affairs.

Source: habr.com
