Rollout story that touched everything

[Cover image: "Enemies of Reality" by 12f-2]

At the end of April, while the White Walkers were besieging Winterfell, something more interesting happened to us: we made an unusual rollout. We constantly roll new features into production, like everyone else, but this one was not like the others. Its scale was such that any mistake we might make would affect all of our services and users. In the end, we rolled everything out according to plan, within the planned and announced downtime window, without consequences for production. This article is about how we achieved that and how anyone who wishes can repeat it at home.

I will not describe here the architectural and technical decisions we made, or explain how it all works. These are rather notes in the margins about one of the most difficult rollouts I have observed and been directly involved in. I do not claim completeness or technical detail; perhaps those will appear in another article.

Background: what this functionality is

We are building the cloud platform Mail.ru Cloud Solutions (MCS), where I work as CTO. And now it was time to attach IAM (Identity and Access Management) to our platform: unified management of all accounts, users, passwords, roles, services and more. Why a cloud needs it is an obvious question: all user information is stored in it.

Usually such things are built at the very start of a project, but MCS has historically developed a little differently. MCS was built from two parts:

  • Openstack with its own Keystone authorization module,
  • Hotbox (S3 storage) based on the Cloud Mail.ru project,

around which new services then appeared.

In fact, these were two different types of authorization. On top of that, we used some separate Mail.ru developments, for example the shared Mail.ru password storage, as well as a self-written OpenID connector that provided SSO (single sign-on) in Horizon, the virtual machine panel (the native OpenStack UI).

Building IAM for us meant combining all of this into a single system that is entirely our own, without losing any functionality along the way, and creating a foundation for the future that would let us refine it transparently, without refactoring, and scale its functionality. On top of that, right from the start users got a role model for access to services (central RBAC, role-based access control) and a few other small things.

The task turned out to be non-trivial: Python and Perl, several backends, independently written services, several development teams and admins. And most importantly, thousands of live users on the production system. All of this had to be written and, above all, rolled out without casualties.

What are we going to roll out

Very roughly, over about 4 months we prepared the following:

  • We made several new daemons that aggregated functions which previously lived in different parts of the infrastructure. The remaining services were pointed at these daemons as their new backend.
  • We wrote our own central storage for passwords and keys, available to all our services, which we can freely modify as we need.
  • We wrote 4 new backends for Keystone from scratch (users, projects, roles, role assignments), which effectively replaced its database, so it now acts as a single repository for our users' passwords.
  • We taught all our OpenStack services to fetch their policies from a separate policy service instead of reading them locally from each server (yes, that is how OpenStack works by default!); a rough sketch of the idea follows below.
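
The article does not show the implementation, so here is only a minimal sketch of the idea, assuming a hypothetical policy endpoint, cache path and rule format (none of these names come from MCS): each service pulls its rules from the central policy service and falls back to a locally cached copy if that service is unreachable.

```python
# Minimal sketch, not the MCS implementation: URL, cache path and rule format
# are assumptions for illustration only.
import json
import pathlib
import requests

POLICY_SERVICE_URL = "https://policy.example.internal/v1/policies/compute"  # hypothetical
CACHE_PATH = pathlib.Path("/var/cache/policy/compute.json")                 # hypothetical

def load_policies() -> dict:
    """Fetch policy rules from the central policy service.

    Falls back to the last cached copy if the service is unreachable,
    so a policy-service outage does not take the API service down with it.
    """
    try:
        resp = requests.get(POLICY_SERVICE_URL, timeout=2)
        resp.raise_for_status()
        policies = resp.json()
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(policies))
        return policies
    except requests.RequestException:
        # Central service unavailable: reuse the locally cached rules.
        return json.loads(CACHE_PATH.read_text())

def is_allowed(policies: dict, action: str, roles: set) -> bool:
    """Allow the action if the caller has at least one of the roles mapped to it."""
    return bool(set(policies.get(action, [])) & roles)
```

The payoff of a scheme like this is that changing a rule in one place changes behavior on every node, instead of editing policy files on each server by hand.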

Such a large rework requires big, complex and, most importantly, synchronized changes across several systems written by different development teams. Once assembled, the whole thing has to work.

How to roll out such changes and not screw it up? First, we decided to take a look into the future.

Rollout strategy

  • We could have rolled out in several stages, but that would have roughly tripled development time. Besides, for some time the data in the databases would have been completely desynchronized: we would have had to write our own synchronization tools and live with several data stores for a long time, which creates a wide variety of risks.
  • Everything that could be prepared transparently for the user was done in advance. It took 2 months.
  • We allowed ourselves several hours of downtime, but only for user operations that create and modify resources.
  • Downtime was not acceptable for already created resources: we planned that during the rollout they would keep working, with no downtime or impact for clients.
  • To reduce the impact on our customers if something goes wrong, we decided to roll out on a Sunday evening. Fewer customers manage virtual machines at night.
  • We warned all our customers that service management would not be available during the rollout window.

A digression: what is a rollout?

<caution, philosophy>

Any IT specialist can easily answer what a rollout is: you set up CI/CD, and everything is automatically delivered to production. 🙂

Of course, that is true. But the trouble is that with modern tools for automating code delivery, the understanding of the rollout itself gets lost, just as you forget the epic of inventing the wheel when looking at modern transport. Everything is so automated that rollouts are often carried out without grasping the whole picture.

And the whole picture is like this. Roll-out consists of four major aspects:

  1. Code delivery, including data modifications, for example data migrations.
  2. Code rollback: the ability to return to the previous state if something goes wrong, for example via backups made in advance.
  3. The time of each rollout/rollback operation: you need to know the timing of every operation from the first two points.
  4. Affected functionality: you need to evaluate both the expected positive effect and the possible negative ones.

All of these aspects must be taken into account for a successful rollout. Usually only the first, at best the second, is evaluated, and then the rollout is considered a success. But the third and fourth are even more important. What user would be happy if a rollout took 3 hours instead of a minute? Or if something unrelated got affected during the rollout? Or if downtime of one service led to unpredictable consequences?
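
To make this concrete, here is an illustrative sketch (not our actual tooling; the step names, commands and timings are made up) of a plan format that forces every step to answer all four questions:

```python
# Illustrative only: step names, commands and timings are made up.
from dataclasses import dataclass

@dataclass
class RolloutStep:
    name: str
    deliver_cmd: str        # 1. how the code/data change is delivered
    rollback_cmd: str       # 2. how the change is undone
    deliver_minutes: int    # 3. expected time to roll out...
    rollback_minutes: int   #    ...and to roll back
    affected: list[str]     # 4. functionality that may be impacted

PLAN = [
    RolloutStep(
        name="migrate user accounts",
        deliver_cmd="python migrate_accounts.py --apply",
        rollback_cmd="python migrate_accounts.py --restore-from-backup",
        deliver_minutes=40,
        rollback_minutes=60,
        affected=["login", "API tokens"],
    ),
]

# Quick sanity check of the budget before committing to a downtime window.
total_deliver = sum(s.deliver_minutes for s in PLAN)
total_rollback = sum(s.rollback_minutes for s in PLAN)
print(f"rollout ≈ {total_deliver} min, full rollback ≈ {total_rollback} min")
```

Summing the timings per step is also what makes it possible to commit to a specific downtime window in advance.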

Act 1..n, preparation for release

At first I thought I would briefly describe our meetings: the whole team, parts of it, heaps of discussions at the coffee points, arguments, tests, brainstorms. Then I decided it would be redundant. Four months of development always consist of this, especially when you are writing not something that can be delivered continuously, but one big feature for a live system: one that affects all services, while nothing should change for users except "one button in the web interface".

Our understanding of how to roll out changed with each new meeting, and quite significantly. For example, we were going to update our entire billing database, but once we measured the time we realized it was impossible to do within a reasonable rollout window. We took almost an extra week to shard and archive the billing database. And when the expected rollout speed still did not satisfy us after that, we ordered additional, more powerful hardware and moved the entire database there. It is not that we did not want to do this earlier; the imminent rollout simply left us no other option.

When one of us suspected that the rollout might affect the availability of our virtual machines, we spent a week testing, experimenting and analyzing the code, and came to a clear understanding that this would not happen on our production; even the most skeptical among us agreed.

Meanwhile, the folks from tech support ran their own independent experiments in order to write client instructions for the connection methods that were going to change after the rollout. They worked on the UX, prepared instructions and gave personal advice.

We automated every rollout operation we could. Every operation, even the simplest, was scripted, and tests were run constantly. We argued about the best way to take a service offline: stop the daemon or block access to it with a firewall. We kept a checklist of commands for each stage of the rollout, constantly updating it. We drew, and constantly updated, a Gantt chart of all the rollout work, with timings.
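
For the daemon-versus-firewall argument, here is a minimal sketch of both options, assuming a hypothetical systemd unit name and port; one argument for the firewall variant is that the daemon keeps its state and can be reopened instantly.

```python
# Hedged sketch of the two options we argued about; the unit name and port
# are hypothetical, not the real MCS services.
import subprocess

SERVICE_UNIT = "mcs-public-api.service"  # hypothetical systemd unit
SERVICE_PORT = "8774"                    # hypothetical API port

def stop_daemon():
    """Option 1: stop the daemon itself."""
    subprocess.run(["systemctl", "stop", SERVICE_UNIT], check=True)

def block_with_firewall():
    """Option 2: keep the daemon running, but reject incoming connections."""
    subprocess.run(
        ["iptables", "-I", "INPUT", "-p", "tcp",
         "--dport", SERVICE_PORT, "-j", "REJECT"],
        check=True,
    )

def unblock_firewall():
    """Rollback for option 2: remove the reject rule."""
    subprocess.run(
        ["iptables", "-D", "INPUT", "-p", "tcp",
         "--dport", SERVICE_PORT, "-j", "REJECT"],
        check=True,
    )
```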

And so…

Final act, before rolling out

… it is time to roll out.

As they say, a work of art can never be completed, only stopped being worked on. You have to make an effort of will, accepting that you will not find everything, but trusting that you made all reasonable assumptions, foresaw all the likely cases, closed all critical bugs, and that every participant did their best. The more code you roll out, the harder it is to convince yourself of this (and everyone understands that it is impossible to foresee everything anyway).

We decided we were ready to roll out once we were sure we had done everything possible to close all the risks for our users associated with unexpected impact and downtime. That is, anything was allowed to go wrong except:

  1. Any impact on the (sacred and most precious to us) user infrastructure,
  2. Functionality: using our service after the rollout had to be the same as before it.

Rollout

[Image: "Two roll, eight do not interfere"]

We take a 7-hour downtime window for all user requests. Within this window we have both a rollout plan and a rollback plan.

  • The rollout itself takes about 3 hours.
  • 2 hours for testing.
  • 2 hours is a margin for a possible rollback of changes.

A Gantt chart lays out every action: how long it takes, what runs sequentially and what runs in parallel.

[Image: a piece of the rollout Gantt chart, one of the early versions (without parallel execution). The most valuable synchronization tool.]
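
Given a fixed window like this, the "point of no return" that comes up later in the chronicle falls out of simple arithmetic: the last moment at which a full rollback still fits inside the window. A minimal sketch with this plan's numbers (the calendar date is only illustrative):

```python
# Minimal sketch using the numbers above: a 7-hour window starting at 00:00
# and a 2-hour rollback margin give the point of no return.
from datetime import datetime, timedelta

window_start = datetime(2019, 4, 29, 0, 0)   # the date itself is illustrative
window = timedelta(hours=7)
rollback_margin = timedelta(hours=2)

point_of_no_return = window_start + window - rollback_margin
print(point_of_no_return.strftime("%H:%M"))  # 05:00: after this, rolling back
                                             # no longer fits in the window
```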

Every participant has a defined role in the rollout: which tasks they handle and what they are responsible for. We try to bring each stage to the point of automatism: roll it out, roll it back, collect feedback, roll it out again.

Chronicle of events

So, 15 people came to work on Sunday, April 29, at 10 pm. Besides the key participants, some came simply to support the team, and special thanks to them for that.

Separately, it is worth mentioning that our key tester is on vacation. Rolling out without testing is impossible, so we work through the options. A colleague agrees to test for us from her vacation, for which the whole team is immensely grateful to her.

00:00. Stop
We stop user requests and put up a notice saying that maintenance is in progress. Monitoring screams, but that is expected. We check that nothing has gone down except what was supposed to, and we begin the migration work.

Everyone has a printed, point-by-point rollout plan; everyone knows who does what and at which moment. After each action we check the timings to make sure we stay within them, and everything goes according to plan. Those not directly involved in the current stage keep themselves busy with an online game (Xonotic, a Quake 3-style shooter) so as not to disturb their colleagues. 🙂

02:00. Rolled out
A pleasant surprise: we finish the rollout an hour early, thanks to the optimization of our databases and migration scripts. A universal cry of "Rolled out!" All the new features are in production, but only we can see them in the interface. Everyone switches to testing mode, splits into groups and starts looking at what we ended up with.

It did not turn out all that well; we realize this after 10 minutes, when nothing connects or works in the team members' own projects. A quick sync: we voice the problems, set priorities, split into teams and start debugging.

02:30. Two big problems vs four eyes
We find two big problems: customers would not see some of their connected services, and there would be issues with partner accounts. Both stem from the migration scripts not handling certain edge cases. We have to fix this now.

We write queries that fix this, with at least four eyes on each of them. We roll them out to pre-production to make sure they work and do not break anything; then we can roll further. In parallel, our usual integration tests run and uncover a few more problems. All of them are small, but they also need to be fixed.

03:00. -2 problems +2 problems
The two previous big problems are fixed, and almost all of the small ones too. Everyone not busy with fixes is actively working in their own accounts and reporting what they find. We prioritize, distribute the issues across teams, and leave the non-critical ones for the morning.

We run the tests again, and they reveal two new big problems. Not all service policies arrived correctly, so some user requests fail authorization. Plus a new problem with partner accounts. We rush to investigate.

03:20. Emergency sync
One of the new issues is fixed. For the second, we hold an emergency sync. We understand what is happening: the previous fix solved one problem but created another. We take a pause to figure out how to do it properly and without consequences.

03:30. Six eyes
We work out what the final state of the database should be for everything to be fine for all partners. We write the query with six eyes on it, roll it out to pre-production, test it, and roll it out to production.

04:00. Everything is working
All tests pass and no critical problems are visible. From time to time something does not work for someone on the team, and we respond quickly. Most of the time it is a false alarm, but sometimes something did not make it across, or an individual page does not work somewhere. We sit and fix, fix, fix. A separate team is launching the last big feature: billing.

04:30. Point of no return
The point of no return is approaching: the moment after which, if we started rolling back, we would no longer fit into the downtime window we were given. There are problems with billing, which sees and records everything but stubbornly refuses to charge customers. There are several bugs on individual pages, actions and statuses. The main functionality works and all tests pass. We decide that the rollout has happened and we will not roll back.

06:00. Open to everyone in the UI
The bugs are fixed; some that do not affect users are left for later. We open the interface to everyone. We keep working our magic on billing, waiting for user feedback and monitoring results.

07:00. API load issues
It becomes clear that we somewhat misjudged the load on our API, and our load testing was not able to expose the problem. As a result, ≈5% of requests fail. We mobilize and look for the cause.

Billing is persistent and still refuses to work. We decide to postpone it so we can make the changes calmly later: it records all the resource usage, but the charges to customers do not go through. Of course this is a problem, but compared to the rollout as a whole it does not seem that important.

08:00. Fix API
We roll out a fix for the load, and the failures are gone. We start heading home.

10:00. Done
Everything is fixed. Monitoring is quiet, customers are quiet, and the team gradually heads off to sleep. Billing remains; we will restore it tomorrow.

Later that day there were further rollouts that fixed logs, notifications, return codes and customizations for some of our clients.

So, the rollout was a success! It could, of course, have gone better, but we drew conclusions about what we lacked to reach perfection.

Total

Over the 2 months of active preparation, 43 tasks were completed before the rollout, each lasting from a couple of hours to several days.

During rollout:

  • new and changed daemons: 5 of them, replacing 2 monoliths;
  • database changes: all 6 of our databases with user data were affected, with data exported from three old databases into one new one;
  • completely redone frontend;
  • amount of rolled-out code: 33 thousand lines of new code, ≈3 thousand lines of test code, ≈5 thousand lines of migration code;
  • all data is intact, not a single virtual machine of the customer was affected. πŸ™‚

Good practices for a good rollout

These practices guided us through this difficult situation, but generally speaking it is useful to follow them during any rollout. The more difficult the rollout, the bigger the role they play.

  1. The first thing to do is to understand how the rollout can or will affect users. Will there be downtime? If so, downtime of what? How will it affect users? What are the best and worst possible scenarios? And cover the risks.
  2. Plan everything. At each stage, you need to understand all aspects of rolling out:
    • code delivery;
    • code rollback;
    • the time of each operation;
    • affected functionality.
  3. Play out the scenarios until all stages of the rollout, and the risks at each of them, are clear. If there is any doubt, take a pause and examine the doubtful stage separately.
  4. Each stage can and should be improved if it helps your users, for example by reducing downtime or removing some risks.
  5. Rollback testing is far more important than code delivery testing. You must check that a rollback returns the system to its original state, and confirm this with tests.
  6. Everything that can be automated should be automated. Everything that cannot be automated should be written on a cheat sheet in advance.
  7. Fix the success criteria in advance: what functionality should be available, and by what time? If it is not, run the rollback plan (see the sketch after this list).
  8. And most importantly, people. Everyone should understand what they are doing, why, and what depends on their actions during the rollout.
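
For point 7, one possible shape of such a gate, as a sketch with made-up placeholder checks rather than our real tests: a list of success criteria and a deadline, after which the answer is "rollback".

```python
# Sketch only: the checks and the deadline are illustrative placeholders.
import time

def api_reachable() -> bool:
    return True   # placeholder: replace with a real health-check request

def login_works() -> bool:
    return True   # placeholder: replace with a scripted test login

def existing_vms_unaffected() -> bool:
    return True   # placeholder: replace with a check of running customer VMs

SUCCESS_CRITERIA = [api_reachable, login_works, existing_vms_unaffected]
DEADLINE = time.time() + 2 * 60 * 60   # e.g. a two-hour testing budget

def gate() -> str:
    """Return 'go' if all criteria pass before the deadline, otherwise 'rollback'."""
    while time.time() < DEADLINE:
        if all(check() for check in SUCCESS_CRITERIA):
            return "go"
        time.sleep(60)   # re-run the smoke checks every minute
    return "rollback"
```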

And if it all had to fit in one phrase: with good planning and preparation, you can roll out anything you want without consequences for production, even something that affects every service you run in production.

Source: habr.com
