How we evacuated the Yandex duty shift

When your work fits on a single laptop and you can do it independently of other people, moving to remote work is no problem: you just stay home in the morning. But not everyone is so lucky.

The duty shift is a team of service availability specialists (SREs). It includes duty administrators, developers and managers, and it shares a common "dashboard" of 26 LCD panels, 55 inches each. The stability of the company's services and the speed at which problems get solved depend on the duty shift's work.

Today Dmitry Melikov (tal10n), the head of the duty shift, will explain how in a matter of days the team moved the equipment to their homes and set up new work processes. I give him the floor.

- When you have an unlimited amount of time, you can comfortably move anything anywhere. But the rapid spread of the coronavirus put us in completely different conditions. Yandex employees were among the first to switch to remote work, even before the self-isolation regime was introduced. It happened like this. On Thursday, March 12, I was asked to assess whether the team's work could be moved home. On Friday the 13th, the recommendation to switch to remote work came out. By the night of Tuesday, March 17, everything was ready: the duty administrators were working from home, the equipment had been moved, the missing software had been written, and the processes had been reconfigured. Now I'll tell you how we did it. But first, a reminder of the tasks the duty shift handles.

Who we are

Yandex is a big company with hundreds of services. The stability of search, the voice assistant and all other products depends not only on developers. The power supply to a data center may be interrupted. A worker replacing asphalt may accidentally damage an optical cable. Or there may be a surge in user activity that requires an urgent reallocation of capacity. Moreover, all our services share one large, complex infrastructure, and a release of one product can accidentally degrade another.

The 26 panels in our open space show one and a half thousand alerts and more than a hundred charts and dashboards for our services. In effect, this is one huge diagnostic panel. An experienced duty administrator can glance at it, quickly understand the status of important nodes and set the direction for investigating a technical problem. This does not mean that a person has to watch all the instruments constantly: the automation draws attention by sending a notification to the duty administrator's dedicated interface. But without the visual panel, solving the problem may take longer.

When problems occur, the duty administrator first evaluates their priority and then isolates the problem or minimizes its impact on users.

There are several standard ways to isolate a problem. One of them is service degradation, when the duty administrator disables some of the functions that users are least likely to notice. This temporarily reduces the load and buys time to figure out what happened. If there is a problem with a data center, the duty administrator contacts the operations team, finds out what went wrong, monitors how long the fix is taking and, if necessary, brings in the relevant teams.

When the duty administrator cannot isolate a problem caused by a release, he reports it to the service team, and the developers look for errors in the new code. If they cannot figure it out, the administrator brings in developers from other products or service availability engineers.

I could talk for a long time about how everything is arranged here, but I think I have conveyed the essence. The duty shift coordinates the work of all services and keeps global problems under control. It is important for the duty administrator to have the diagnostic panel in front of his eyes. That is why, when switching to remote work, you can't simply hand everyone a laptop: the graphs and alerts will not fit on its screen. What to do?

Idea

In the office, all ten duty administrators work in shifts at the same dashboard, which includes 26 monitors, two computers, four NVIDIA Quadro NVS 810 video cards, two rack-mounted uninterruptible power supplies and several independent network connections. We needed to give every one of them the ability to work from home. Assembling a wall like that in an apartment is simply not possible (my wife would be especially happy about it), so we decided to create a portable version that could be brought home and assembled there.

We started experimenting with the configuration. We needed to fit all the instruments onto fewer displays, so the main requirement for the monitor was a high pixel density. Of the 4K monitors available to us, we chose the Lenovo P27u-10 for testing.
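To make the pixel-density requirement concrete, here is a minimal sketch of the usual pixels-per-inch calculation. The 27-inch, 3840×2160 figures are the P27u-10's published specifications; the resolution of the office wall panels is not stated in the article, so the Full HD value below is only an assumption for comparison.

```typescript
// Pixel density (pixels per inch) from the resolution and the diagonal size.
function pixelsPerInch(widthPx: number, heightPx: number, diagonalIn: number): number {
  return Math.hypot(widthPx, heightPx) / diagonalIn;
}

// Lenovo P27u-10: a 27-inch panel at 3840x2160.
const desktop4k = pixelsPerInch(3840, 2160, 27);

// ASSUMPTION: the article does not give the wall panels' resolution;
// Full HD on a 55-inch panel is used here purely for illustration.
const wallPanel = pixelsPerInch(1920, 1080, 55);

console.log(desktop4k.toFixed(0), wallPanel.toFixed(0)); // prints "163 40"
```

Under that assumption, a desk monitor packs roughly four times as many pixels into each inch as a wall panel, which is why far fewer screens can carry the same charts when viewed from desk distance.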

For the laptop, we took a 16-inch MacBook Pro. It has a fairly powerful graphics subsystem, which is necessary for rendering images on several 4K displays, and four universal Type-C connectors. You may ask: why not a desktop? Replacing a laptop with an identical one from the warehouse is much easier and faster than assembling and configuring an identical system unit. And yes, it weighs less.

Now we had to find out how many monitors we could really connect to one laptop. The limit here is not the number of connectors, and we could find it only by testing the system as a whole.

Testing

We comfortably placed all the charts and alerts on four monitors and even connected them to the laptop, but we ran into a problem. Rendering 4×4K worth of pixels on the connected monitors loaded the video card so heavily that the laptop's battery drained even while it was plugged in. Fortunately, the problem was solved with the Lenovo ThinkPad Thunderbolt 3 Dock Gen 2 docking station. We managed to connect a monitor, power, and even a favorite mouse and keyboard to the dock.

But another problem surfaced immediately: the GPU worked so hard that the laptop overheated, which means the battery overheated too, went into protective mode and stopped taking charge. In general, this is a very useful mode that protects against dangerous situations. In some cases the problem was solved with a high-tech device: a ballpoint pen placed under the laptop to improve ventilation. But this did not help everyone, so we also turned up the speed of the built-in fans.

There was one more unpleasant issue. Every chart and alert must sit in a strictly defined place. Imagine that you are bringing a plane in to land, and the speed indicators, altimeters, variometers, artificial horizons, compasses and position indicators start changing size and jumping around. So we decided to build an application to handle this. We wrote it in one evening on Electron, using its ready-made API for creating and managing windows. We added a configuration handler, periodic refreshing of the configurations, and support for a limited number of monitors. A little later, we added support for different setups.
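The article does not show the application's code, so the following is only a sketch of the idea: a pure function that turns a layout config into fixed window rectangles. The `Tile` and `Layout` shapes, the grid approach and the example.com URLs are all hypothetical. In an Electron app the `Display` objects would come from `screen.getAllDisplays()` and each rectangle would be passed to a `BrowserWindow` (for example via `setBounds`), but that wiring is left out so the sketch runs on its own.

```typescript
interface Rect { x: number; y: number; width: number; height: number; }
interface Display { id: number; bounds: Rect; }

// Hypothetical config format: each tile names a page and its grid cell on a monitor.
interface Tile { url: string; monitor: number; col: number; row: number; }
interface Layout { cols: number; rows: number; tiles: Tile[]; }

interface PlannedWindow { url: string; bounds: Rect; }

function planWindows(displays: Display[], layout: Layout): PlannedWindow[] {
  // Order monitors left to right, then top to bottom, so that "monitor 0"
  // always means the same physical screen regardless of enumeration order.
  const ordered = [...displays].sort(
    (a, b) => a.bounds.x - b.bounds.x || a.bounds.y - b.bounds.y
  );
  const planned: PlannedWindow[] = [];
  for (const tile of layout.tiles) {
    const display = ordered[tile.monitor];
    if (!display) continue; // fewer monitors than the config expects: skip the tile
    const cellW = Math.floor(display.bounds.width / layout.cols);
    const cellH = Math.floor(display.bounds.height / layout.rows);
    planned.push({
      url: tile.url,
      bounds: {
        x: display.bounds.x + tile.col * cellW,
        y: display.bounds.y + tile.row * cellH,
        width: cellW,
        height: cellH,
      },
    });
  }
  return planned;
}

// Example: two side-by-side 4K monitors, reported in reverse order.
const examplePlan = planWindows(
  [
    { id: 2, bounds: { x: 3840, y: 0, width: 3840, height: 2160 } },
    { id: 1, bounds: { x: 0, y: 0, width: 3840, height: 2160 } },
  ],
  {
    cols: 2,
    rows: 2,
    tiles: [
      { url: "https://example.com/alerts", monitor: 0, col: 0, row: 0 },
      { url: "https://example.com/charts", monitor: 1, col: 1, row: 1 },
      { url: "https://example.com/extra", monitor: 5, col: 0, row: 0 },
    ],
  }
);
console.log(examplePlan);
```

Re-running a function like this on a timer, and whenever the config is refreshed, would be one way to snap windows back into place after they move. Whether the real application worked this way is not stated in the article.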

Assembly and delivery

By Monday, the wizards from the helpdesk had obtained 40 monitors, ten laptops and the same number of docking stations for us. I don't know how they did it, but thank you very much.

All that remained was to deliver everything to the duty administrators' apartments. That meant ten addresses in different parts of Moscow: south, east and center, plus Balashikha, which is 45 kilometers from the office (an intern from Serpukhov was also added later, by the way). We had to distribute all of this among people somehow and work out the logistics.

I entered all the addresses into our Maps, which can also optimize a route between multiple points (I used the free beta version of the tool for couriers). We split the team into four independent crews of two people, and each got its own route. My car turned out to be the roomiest, so I took the equipment for four employees at once.

The whole delivery took a record three hours. We left the office at 10 p.m. on Monday, and by one o'clock in the morning I was already at home. That same night we went on duty with the new equipment.

The result

Instead of one large diagnostic console, we assembled ten relatively portable ones, one in each duty administrator's apartment. Of course, a few things still had to be ironed out. For example, we used to have a single physical duty phone for notifications. Under the new conditions this did not work, so we came up with “virtual phones” for those on duty (in practice, channels in the messenger). There were other changes as well. But the main thing is that in record time we moved not just the people home, reducing their risk of infection, but all of our work, without harming processes or product stability. We have been working this way for a month now.

Below you will find photos of our duty administrators' real workstations.

Source: habr.com