How We Changed the Always Connected State to Prevent Burnout

This translation was prepared for students of the course "DevOps practices and tools".

Intercom's mission is to make online business personal. But a product cannot feel personal when it is not working. Uptime is critical to the success of our business, not only because our customers pay us, but also because we use our own product. If our service is down, we literally feel our customers' pain.

Uptime depends on many factors, such as software architecture and the quality of day-to-day work. Quite often, however, it comes down to the person on call answering pages from PagerDuty. This kind of on-call support can be a powerful customer-focused tool that connects the work of engineers with what customers actually experience when they use the product. It is also a great opportunity for learning and growth, because failures and mistakes are good ground for practicing skills and understanding how complex systems work.

Staying "always in touch" outside of working hours is detrimental to your life.

But at the same time, being "always connected" can have a detrimental effect on your life. You must be ready to respond quickly and competently to a notification that something is broken. Even if you are not being paged at that particular moment, the "always on" state creates a sense of unease, and I know this from personal experience. Sleep quality suffers in particular. Regularly being reachable at any time of day can lead to burnout, apathy, or even a desire never to see a computer again.

History of the "always connected" state at Intercom

In the very early days of Intercom, our CTO Ciaran was single-handedly the entire 24/7 on-call support team, both in and out of the office. As Intercom grew, a task force was created to assist Ciaran. Shortly thereafter, new development teams began building many new features and services, and those teams took over all on-call responsibilities.

At any moment there were too many people "on the line".

At the time, this approach seemed like a no-brainer: it was an easy way to scale our on-call support, it was in line with our values, and it reinforced our sense of ownership. As a result, without ever planning it, we ended up with four or five teams that were regularly on call outside working hours. The other development teams did not own many components likely to fail, so they were rarely, if ever, paged.

We realized that we had on-call practices to be proud of, but also a number of serious problems that we wanted to address:

  • At any given time, too many people were on call. Our infrastructure was not large enough to require at least five engineers to be on call without proper days off.
  • The quality of our alerts and paging procedures was inconsistent across teams, and we reviewed new and existing alerts on an ad hoc basis. Runbook instructions (the steps to follow when an alert fires) were mostly conspicuous by their absence.
  • Engineers faced different expectations depending on which team they worked on. For example, only the very first on-call support team received any compensation for on-call shifts and disrupted days off.
  • There turned out to be a general tolerance for unnecessary pages at odd hours.
  • Finally, this type of work is not for everyone. Personal circumstances sometimes mean that on-call shifts take a real toll on people.

Finding the right state of "always connected"

We decided to create a new virtual team that would take over each team's on-call support outside working hours. The team would be made up of volunteers from any team in the organization, not conscripts. Engineers would rotate through the virtual team roughly every six months, taking week-long shifts on call. Luckily, we had no trouble finding enough volunteers to put the virtual team together.

As a result, the number of engineers on call dropped from 30 to just 6 or 7.

The team then agreed on what alerts and runbooks should look like and defined the process for handing alerts over to the new on-call support team. They defined all alerts as code using a Terraform module and started peer reviewing every change. We introduced compensation for each week-long shift at a level that suited the on-call engineers. We also created a second-level escalation team consisting only of managers. This team is the single point of escalation for the on-call engineers.
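Part of such a review can be automated. The sketch below is not the Terraform module mentioned above; it is a minimal Python illustration in which the field names, the example URL, and the rule that every out-of-hours alert must link to a runbook are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """One alert definition. All field names are hypothetical."""
    name: str
    owning_team: str
    runbook_url: str          # where the on-call engineer finds the steps to follow
    pages_out_of_hours: bool  # True if the alert may page the volunteer team


def review_problems(alert: Alert) -> list[str]:
    """Return the reasons an alert definition would fail review (empty if none)."""
    problems = []
    if not alert.name.strip():
        problems.append("alert has no name")
    if not alert.owning_team.strip():
        problems.append("alert has no owning team")
    # Assumed rule: nobody should be woken up by an alert that has no runbook.
    if alert.pages_out_of_hours and not alert.runbook_url.startswith("https://"):
        problems.append("out-of-hours alert has no runbook link")
    return problems


good = Alert("queue-depth-high", "messaging",
             "https://runbooks.example.com/queue-depth-high", True)
bad = Alert("disk-full", "storage", "", True)

print(review_problems(good))  # []
print(review_problems(bad))   # ['out-of-hours alert has no runbook link']
```

A check like this only catches missing fields. Whether an alert is worth waking someone up for still needs a human reviewer.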

It took several months of hard work to establish this process. As a result, only 6 or 7 engineers are now on call, rather than 30 as before. During working hours, which is when most failures occur, teams deal with problems in their own features and services themselves. The rest of the time, on-call support is handled by the volunteers.
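The split between working hours and the rest of the time can be pictured as a simple routing rule. This is a minimal sketch under stated assumptions: the office hours, the Monday to Friday working week, the "volunteer-on-call" label, and the function names are all hypothetical and are not taken from any real paging configuration.

```python
from datetime import datetime, time

# Assumed office hours; real teams in several offices would need time zones too.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def is_working_hours(now: datetime) -> bool:
    """True on Monday to Friday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def page_target(owning_team: str, now: datetime) -> str:
    """Pick who is paged for an alert that belongs to owning_team."""
    if is_working_hours(now):
        return owning_team          # the team handles its own services
    return "volunteer-on-call"      # hypothetical name for the virtual team


print(page_target("messaging", datetime(2024, 1, 10, 11, 0)))  # messaging
print(page_target("messaging", datetime(2024, 1, 10, 23, 0)))  # volunteer-on-call
print(page_target("messaging", datetime(2024, 1, 13, 11, 0)))  # volunteer-on-call
```

The first call falls on a Wednesday morning, the second on the same Wednesday late in the evening, and the third on a Saturday.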

What we have learned

After we launched our virtual on-call support team, we expected a flood of follow-up work, such as investigating root causes or pulling everyone together to fix a single problem that had caused an outage. However, our development teams took full responsibility for the causes of outages, and follow-up action was usually immediate. We also needed to avoid handing on-call duty back to the team an issue came from, so that those engineers would not have to be reachable after hours.

Out-of-hours pages have been reduced to fewer than 10 per month.

Our formal escalation process was rarely used. More often, the on-call engineer was helped informally by whichever team was online at the time, especially our colleagues in the San Francisco office. Many issues were fixed or mitigated through teamwork and on-the-spot resolution.

Engineers in our San Francisco office join the team as full members and go beyond regular on-call support. There was some overhead involved, but spreading the on-call team across multiple locations has worked in our favor: it has proven to be a good way to build and strengthen relationships and to learn more about the technology stack we all work with.

Across our teams, the work of Intercom developers has become more consistent, and on our Careers page we can confidently describe the benefits of a systems engineer position, stating that there is no need to be on call unless you want to be.

Along with fundamental work on stabilizing and scaling our data stores, the constant focus on fixing problems has reduced out-of-hours pages to fewer than 10 per month. We are very proud of this number.

We continue to maintain and improve our on-call support team, and as Intercom grows we may need to rethink our decisions, because what works today may not work the next time our headcount doubles. However, this experience has been extremely positive for our organization, greatly improving the quality of life of our engineers, the quality of our incident response, and, most of all, the experience of our customers.

Source: habr.com
