Failover: perfectionism and… laziness are ruining us

In the summer, both purchasing activity and the pace of infrastructure changes in web projects traditionally slow down, Captain Obvious tells us, simply because even IT people go on vacation sometimes. CTOs too. It's harder for those who stay in the office, but that's not the point right now: perhaps this is exactly why summer is the best time to take an unhurried look at your existing redundancy scheme and make a plan to improve it. Here you can benefit from the experience of Yegor Andreev from AdminDivision, which he shared at the Uptime Day conference.

When building standby sites and setting up redundancy, there are several traps you can fall into, and you really don't want to hit them. What ruins us here, as in many other things, is perfectionism and... laziness. We try to make absolutely everything perfect, but perfection isn't what's needed! We only need to do certain things, but do them properly and see them through so that they actually work.

Failover is not some fun "nice to have" thing; it should do exactly one job: reduce downtime so that the service, and the company, lose less money. So for every redundancy approach I suggest thinking in the following context: where is the money?


The first trap: we assume that when we build large, reliable systems and add redundancy, we reduce the number of incidents. That is a terrible delusion. When we add redundancy, we most likely increase the number of incidents. If we do everything right, we reduce total downtime: there will be more incidents, but each will cost less. After all, what is redundancy? It is a complication of the system. Any complication is bad: we get more screws, more gears - in a word, more parts - and therefore a higher chance of something breaking. And things really do break, and they will break more often. A simple example: say we have a site running PHP and MySQL, and it urgently needs a standby.
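A rough, purely illustrative calculation (the numbers are invented, not from the talk): a single server that fails twice a year with four hours of recovery each time gives eight hours of downtime per year; a redundant pair might fail six times a year, but if each failover takes half an hour, total downtime drops to three hours. More incidents, less money lost.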

Welp. We take a second site and build an identical system there... The complexity doubles: now there are two of everything. On top of that we layer some logic for moving data from one site to the other: database replication, copying static files, and so on. That replication logic is usually very complex, so the total complexity of the system grows not 2x but 3x, 5x, 10x. A minimal sketch of that extra layer is shown below.
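Here is a toy job that pushes the database dump and static files from the primary to the standby (hostnames, paths and the database name are invented); every real-world concern it ignores - locking, monitoring, alerting, partial failures - is exactly the added complexity being described:

```python
#!/usr/bin/env python3
"""Toy primary-to-standby sync job (illustrative only)."""
import subprocess

STANDBY = "standby.example.com"   # hypothetical standby host

def sync_static():
    # Push uploaded images and other static files to the standby.
    subprocess.run(
        ["rsync", "-az", "--delete",
         "/var/www/site/static/", f"{STANDBY}:/var/www/site/static/"],
        check=True,
    )

def ship_database_dump():
    # Dump the database locally, then copy the dump to the standby.
    with open("/var/backups/site.sql", "wb") as out:
        subprocess.run(["mysqldump", "--single-transaction", "site"],
                       stdout=out, check=True)
    subprocess.run(["scp", "/var/backups/site.sql",
                    f"{STANDBY}:/var/backups/site.sql"], check=True)

if __name__ == "__main__":
    sync_static()
    ship_database_dump()
```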

The second trap: when we build really big, complex systems, we fantasize about what we want to get in the end. Voila: we want a super-reliable system that works with no downtime at all and switches over in half a second (or better, instantly), and we start making the dream come true. But there is a nuance here too: the shorter the desired switchover time, the more complex the system logic has to be, and the more complex that logic, the more often the system breaks. You can end up in a very unpleasant situation: while trying as hard as we can to reduce downtime, we actually complicate everything, and when something does go wrong, the downtime ends up being longer. This is where you often catch yourself thinking: it would have been better not to build redundancy at all; better a single server and a predictable downtime.

How do we fight this? We need to stop lying to ourselves, stop flattering ourselves that we are about to build a spaceship, and honestly assess how long the project can afford to be down. Then, within that maximum time, we choose which methods we will actually use to increase the reliability of our system.


Now it's time for some stories from... from life, of course.

Example number one

Imagine a business-card website for Pipe Rolling Plant No. 1 of the city of N. In huge letters it says PIPE ROLLING PLANT No. 1. A little lower is the slogan: "Our pipes are the roundest pipes in N." And at the bottom, the CEO's phone number and name. We understand that redundancy is needed - this is a very important thing! Let's figure out what the site consists of. Static HTML - that is, a couple of pictures of the CEO at a table in the banya, discussing yet another deal with his partner. We start thinking about downtime. The first thought: it must not be down for more than five minutes. Then the question: how many sales did this site of ours ever bring in? How many? What do you mean, "zero"? Exactly that: all four deals over the past year were made by the CEO at that very table, with the same people he sits at the table and goes to the banya with. And we realize that even if the site is down for a whole day, nothing terrible will happen.

Given that input, we have a whole day to bring this thing back up. We start thinking about the redundancy scheme and choose the most ideal scheme for this example: no redundancy at all. Any admin can rebuild the whole thing in half an hour, smoke breaks included: set up a web server, put the files in place - done. It will work. There is nothing to monitor, nothing to pay special attention to. The conclusion from example number one is pretty obvious: services that do not need redundancy do not need redundancy.


Example number two

A company blog: specially trained people post news there - we took part in such-and-such an exhibition, we released yet another new product, and so on. Say it's standard PHP with WordPress, a small database, and a bit of static content. Of course, the first thought again is that it must not go down under any circumstances - "no more than five minutes!" - and that's it. But let's think further. What does this blog actually do? People come to it from Yandex and Google on some queries, from organic search. Great. Does it have anything to do with sales? Insight: not really. Advertising traffic goes to the main site, which lives on a different machine. Let's start thinking about the blog's redundancy scheme. Realistically, it needs to come back up within a couple of hours, and it would be nice to prepare for that. It would be reasonable to take a machine in another data center, roll the environment onto it - web server, PHP, WordPress, MySQL - and leave it switched off. The moment we realize everything is broken, we need to do two things: restore the MySQL dump, which is about 50 MB and will fly over in a minute, and restore some number of pictures from the backup - also not God knows how much. So the whole thing comes back up in half an hour. No replication, no - God forbid - automatic failover. Conclusion: anything we can quickly roll out from a backup does not need redundancy.
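A minimal sketch of that manual restore, assuming invented hostnames, paths and a database called blog (a real runbook would at least verify exit codes and dump sizes):

```python
#!/usr/bin/env python3
"""Toy restore script for the standby blog machine.

Paths, hostnames and database names are made up for illustration;
the point is that the whole 'failover' is two copy operations.
"""
import subprocess

BACKUP_HOST = "backup.example.com"   # hypothetical backup storage

def restore_database():
    # Fetch the latest dump (~50 MB) and load it into the local MySQL.
    subprocess.run(["scp", f"{BACKUP_HOST}:/backups/blog/latest.sql",
                    "/tmp/latest.sql"], check=True)
    with open("/tmp/latest.sql", "rb") as dump:
        subprocess.run(["mysql", "blog"], stdin=dump, check=True)

def restore_uploads():
    # Pull the images from the backup into the WordPress uploads dir.
    subprocess.run(["rsync", "-az",
                    f"{BACKUP_HOST}:/backups/blog/uploads/",
                    "/var/www/blog/wp-content/uploads/"], check=True)

if __name__ == "__main__":
    restore_database()
    restore_uploads()
    print("Blog restored; point DNS at this machine.")
```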


Example number three, more difficult

An online store. Slightly customized PHP (OpenCart), MySQL with a sizable database, quite a lot of static content (an online store has beautiful HD pictures and all that), Redis for sessions and Elasticsearch for search. We start thinking about downtime. Here it is obvious that an online store cannot painlessly lie down for a day: the longer it is down, the more money we lose. It's worth speeding things up. By how much? I'd say that if we are down for an hour, nobody will go crazy. Yes, we will lose something, but if we get overzealous, things will only get worse. So we design the scheme around one hour of allowed downtime.

How can this be reserved? A standby machine is needed in any case: an hour is not much time. MySQL: here we already need replication, live replication, because a 100 GB dump will most likely not load in an hour. Static files, the pictures: again, 500 GB may not make it across in an hour, so it's better to copy the pictures continuously, ahead of time. Redis is where things get interesting. Sessions live in Redis, so we can't just take it and bury it: all users would be logged out, shopping carts would be emptied, and so on. People would be forced to re-enter their usernames and passwords, many might drop off and never complete the purchase, and conversion would fall again. On the other hand, a perfectly up-to-date Redis with the very last logged-in users probably isn't needed either. A good compromise is to restore Redis from a backup: yesterday's, or, if you take one every hour, from an hour ago. Fortunately, restoring it from a backup means copying a single file. And the most interesting story is Elasticsearch. Who here has ever set up MySQL replication? And who has ever set up Elasticsearch replication? And for whom did it keep working normally afterwards? My point is: we see a certain entity in our system. It seems useful, but it is complicated - complicated in the sense that our fellow engineers have no experience with it, or a bad experience, or we understand that it is still a fairly new technology with nuances, or still raw. We think: damn, Elastic is also huge, restoring it from backup also takes a long time - what do we do? We realize that in our case Elastic is used for search. So how does our online store actually sell? We go to the marketers and ask where people come from. They answer: "90% come from Yandex Market straight to the product card." And they either buy or they don't. So search is needed by 10% of users. And maintaining Elastic replication, especially between data centers in different zones, really does have a lot of nuances. So what's the way out? We put Elastic on the standby site and simply do nothing with it. If the outage drags on, we may bring it up at some point later, but that's not certain. The conclusion is, give or take, the same: services that do not affect the money we, again, do not reserve - to keep the scheme simple.
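A minimal sketch of the Redis compromise, assuming the default RDB snapshot location and an invented standby host: an hourly job that asks Redis for a fresh snapshot and ships it to the standby, so that after a failover sessions are at most an hour old:

```python
#!/usr/bin/env python3
"""Hourly job: ship the Redis RDB snapshot to the standby site.

The snapshot path and standby host are placeholders. Restoring on the
standby is just stopping Redis, putting this file in place as its
dump.rdb, and starting Redis again.
"""
import subprocess
import time

STANDBY = "standby.example.com"        # hypothetical standby host
RDB_PATH = "/var/lib/redis/dump.rdb"   # default RDB snapshot location

def last_save() -> int:
    # UNIX timestamp of the last completed snapshot.
    out = subprocess.run(["redis-cli", "LASTSAVE"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip().split()[-1])

def ship_snapshot():
    before = last_save()
    subprocess.run(["redis-cli", "BGSAVE"], check=True)  # request a fresh snapshot
    while last_save() == before:                         # wait until it completes
        time.sleep(1)
    subprocess.run(["scp", RDB_PATH, f"{STANDBY}:{RDB_PATH}"], check=True)

if __name__ == "__main__":
    ship_snapshot()
```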


Example number four, even harder

An aggregator: selling flowers, calling taxis, selling goods - anything, really. A serious thing that works 24/7 for a large number of users, with a full-fledged, interesting stack: interesting databases, solutions, high load, and, most importantly, it hurts to be down for more than 5 minutes. Not only, and not so much, because people won't buy, but because people will see that the thing doesn't work, get upset, and may never come back at all.

OK. Five minutes. What do we do with that? In this case we do it like adults: we build a real standby site with serious money behind it, with replication of absolutely everything, and perhaps even automate the switchover to that site as much as possible. And on top of that, we must remember to do one important thing: actually write the switchover runbook. The runbook, even if absolutely everything is automated, can be very simple - along the lines of "run such-and-such Ansible playbook", "check such-and-such a box in Route 53", and so on - but it must be an exact list of actions.

And everything seems clear: switching over replication is a trivial task, or it will switch over by itself; rewriting the domain name in DNS is more of the same. The trouble is that when such a project goes down, panic sets in, and even the strongest, most bearded admins are not immune to it. Without a clear instruction - "open a terminal, go here, our server's address is such-and-such" - it is hard to fit into the 5 minutes allotted for resuscitation. Plus, when we actually use this runbook, it is easy to notice changes in the infrastructure and update the runbook accordingly.
And if the redundancy system is very complicated and at some point we make a mistake, then we can take down our standby site and, on top of that, turn the data into a pumpkin on both sites - which will be very sad.
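A minimal sketch of such a runbook turned into a script, with an invented hosted zone ID, domain, standby IP and playbook name; the Route 53 call goes through boto3. The point is that every step a panicking admin would otherwise do from memory is written down and executable:

```python
#!/usr/bin/env python3
"""Toy switchover runbook as a script (all identifiers are invented)."""
import subprocess
import boto3

HOSTED_ZONE_ID = "Z123EXAMPLE"   # hypothetical Route 53 zone
DOMAIN = "shop.example.com."
STANDBY_IP = "203.0.113.10"      # documentation-range address

def promote_standby():
    # Step 1: run the prepared Ansible playbook that promotes the replica
    # and starts the application on the standby site.
    subprocess.run(["ansible-playbook", "promote-standby.yml"], check=True)

def switch_dns():
    # Step 2: point the domain at the standby site with a short TTL.
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DOMAIN, "Type": "A", "TTL": 60,
                "ResourceRecords": [{"Value": STANDBY_IP}],
            },
        }]},
    )

if __name__ == "__main__":
    promote_standby()
    switch_dns()
    print("Switched over; verify the site and notify the team.")
```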


Example number five, full hardcore

An international service with hundreds of millions of users worldwide. Every timezone that exists, high load at full throttle, you cannot go down at all. One minute down - and it's already sad. What to do? Again, build redundancy in full. We do everything mentioned in the previous example, and a little more. An ideal world: our infrastructure follows all the DevOps notions of Infrastructure as Code. That is, absolutely everything is in git, and you just press the button.

What is missing? One thing: drills. You cannot do without them. Everything seems perfect, we have everything under control: we press the button and it all happens. Even if that were true - and we understand it isn't - our system interacts with other systems: DNS in Route 53, S3 storage, integrations with some APIs. We cannot foresee everything in that thought experiment, and until we actually pull the switch, we won't know whether it works or not.


That's probably all. Don't be lazy and don't overdo it. And may uptime be with you!

Source: habr.com
