"Bitrix24": "Quickly raised is not considered to be fallen"

Today the Bitrix24 service does not push hundreds of gigabits of traffic, and there is no enormous fleet of servers (although, of course, there are quite a few). But for many clients it is the main tool for working in their company, a genuinely business-critical application. So going down is simply not an option. But what if a failure does happen, yet the service "resurrects" so quickly that nobody notices anything? And how do you implement failover without losing quality of work or clients? Alexander Demidov, director of cloud services at Bitrix24, told our blog how the redundancy scheme has evolved over the 7 years of the product's existence.

"Bitrix24": "Quickly raised is not considered to be fallen"

“We launched Bitrix24 as SaaS 7 years ago. The main difficulty was probably this: before the public launch as SaaS, the product existed simply as a boxed solution. Clients bought it from us, hosted it on their own servers and ran a corporate portal - a general-purpose tool for employee communication, file storage, task management, CRM, and so on. By 2012 we had decided that we wanted to launch it as SaaS, administering it ourselves and providing fault tolerance and reliability. We gained the experience along the way, because until then we simply did not have it - we had only been software makers, not service providers.

When launching the service, we understood that the most important thing is to ensure fault tolerance, reliability and constant availability, because if you have an ordinary website, an online store, say, and it crashes and stays down for an hour, only you suffer: you lose orders and customers, but for your customer it is not all that critical. They get upset, of course, but go and buy somewhere else. Whereas if this is an application that all the work inside a company is tied to - communications, decisions - then the most important thing is to earn users' trust, that is, not to let them down and not to go down. Because all their work can grind to a halt if something inside stops working.

Bitrix24 as SaaS

We assembled the first prototype a year before the public launch, in 2011. We put it together in about a week, looked at it, played with it - and it even worked. You could go to a form, enter a portal name, a new portal would be deployed and a user database created. We looked at it, evaluated the product in principle, shut it down and kept refining it for a whole year. We had a big constraint: we did not want two different code bases, we did not want to support a separate boxed product and separate cloud solutions - we wanted to do it all within a single code base.

"Bitrix24": "Quickly raised is not considered to be fallen"

A typical web application of that time was a single server running some PHP code and a MySQL database, with uploaded files, documents and pictures going into an upload folder - and it all just works. Alas, you cannot run a critically stable web service on that: there is no distributed cache and no database replication.

We formulated the requirements: the ability to run in different locations, support for replication, ideally placement in different geographically distributed data centers; separation of the product logic from the data storage itself; the ability to scale dynamically with the load; and serving static content separately. From these considerations the requirements for the product took shape, and we spent that year implementing them. In the platform, which ended up unified - for boxed installations and for our own service - we added support for the things we needed. Support for MySQL replication at the level of the product itself: the developer writing the code does not think about how their queries will be distributed; they use our API, and we correctly distribute write and read queries between the masters and the slaves.
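
Purely as an illustration of the idea (the real platform API is PHP and is not shown here), a minimal Python sketch of product-level read/write splitting; the class and the routing rule are hypothetical:

```python
import random

# Hypothetical data-access layer: one master plus several read replicas.
# Application code calls execute() and never chooses a server itself.
class ReplicatedDB:
    def __init__(self, master, replicas):
        self.master = master        # DB-API connection to the master
        self.replicas = replicas    # list of DB-API connections to replicas

    def execute(self, sql, params=()):
        # Very rough classification: SELECTs go to a random replica,
        # everything else goes to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            conn = random.choice(self.replicas) if self.replicas else self.master
        else:
            conn = self.master
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur
```

A real implementation also has to account for replication lag, for example by keeping a client's reads on the master for a short time after that client has written.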

We also added product-level support for various cloud object storages: Google Storage and Amazon S3, plus support for OpenStack Swift. This was convenient both for us as a service and for developers working with the boxed solution: if they just use our API, they do not think about where the file will eventually be saved, locally on the file system or in object storage.
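
Again as a sketch rather than the real API: an abstraction of the kind described, where calling code saves files through one interface and does not care whether they land on the local disk or in an S3-compatible object store. The class names and the use of boto3 are assumptions for illustration.

```python
import shutil
import boto3

class FileStorage:
    """Common interface: the application only ever calls save()."""
    def save(self, local_path, key):
        raise NotImplementedError

class LocalStorage(FileStorage):
    def __init__(self, root):
        self.root = root

    def save(self, local_path, key):
        # Plain copy onto the local file system.
        shutil.copy(local_path, f"{self.root}/{key}")

class ObjectStorage(FileStorage):
    def __init__(self, bucket, endpoint_url=None):
        # endpoint_url lets the same code talk to any S3-compatible service.
        self.client = boto3.client("s3", endpoint_url=endpoint_url)
        self.bucket = bucket

    def save(self, local_path, key):
        self.client.upload_file(local_path, self.bucket, key)
```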

As a result, we decided right away that we would provide redundancy at the level of an entire data center. In 2012 we launched entirely on Amazon AWS, because we already had experience with that platform: our own website was hosted there. We were attracted by the fact that each Amazon region has several availability zones - in their terminology, several data centers that are more or less independent of one another and allow redundancy at the level of a whole data center: if one suddenly fails, the databases are replicated master-master, the web application servers are redundant, and the static content has been moved to S3 object storage. The load was balanced - at that time by Amazon's ELB, but a little later we moved to our own balancers, because we needed more complex logic.

What we wanted is what we got...

All the basic things we wanted to provide - fault tolerance of the servers themselves, of the web applications and of the databases - worked well. The simplest scenario: if one of the web application servers fails, it is simply removed from balancing.

"Bitrix24": "Quickly raised is not considered to be fallen"

The balancer (at the time Amazon's ELB) itself marked failed machines as unhealthy and stopped sending them load. Amazon autoscaling worked: when the load grew, new machines were added to the autoscaling group and the load was spread across them - everything was fine. With our own balancers the logic is roughly the same: if something happens to an application server, we stop sending it requests, throw those machines away, start new ones and keep working. The scheme has changed a little over the years, but it still works: it is simple and understandable, and there are no difficulties with it.
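
A deliberately simplified sketch of that eviction logic; the health endpoints, thresholds and polling approach are made up for illustration, and the real balancers are, of course, more involved:

```python
import urllib.request

# Hypothetical list of backend health endpoints and a failure threshold.
BACKENDS = ["http://app1.internal/healthz", "http://app2.internal/healthz"]
FAIL_LIMIT = 3

failures = {backend: 0 for backend in BACKENDS}

def is_alive(url, timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def healthy_backends():
    """Return the backends that should stay in rotation on this pass."""
    alive = []
    for backend in BACKENDS:
        if is_alive(backend):
            failures[backend] = 0
            alive.append(backend)
        else:
            failures[backend] += 1
            if failures[backend] < FAIL_LIMIT:
                # Not enough consecutive failures yet: keep it in rotation.
                alive.append(backend)
    return alive
```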

We work all over the world, customer load peaks vary widely, and, ideally, we should be able to carry out maintenance on any component of the system at any time, imperceptibly to customers. So we have the ability to take a database out of service by shifting its load to the second data center.

How does it all work? We switch traffic to a working data center: if it is a data center outage, we switch completely; if it is planned work on one particular database, we switch the part of the traffic serving those clients to the second data center and suspend replication. If new web application machines are needed because the load on the second data center has grown, they start automatically. We finish the work, replication is restored, and we move the whole load back. If we need to mirror some work in the second DC, for example to install system updates or change settings on the second database, we repeat the same thing, just in the other direction. And if it is an outage, we do it in the most straightforward way: we use the event-handler mechanism in the monitoring system. If several checks fire and the status goes critical, this handler runs and can execute whatever logic is needed. For each database we have recorded which server is its failover and where to switch traffic if it becomes unavailable. Historically we use Nagios or one of its forks in some form. Similar mechanisms exist in almost any monitoring system; we do not use anything more sophisticated yet, but perhaps someday we will. For now, monitoring is triggered by unavailability and is able to switch something over.
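
Nagios-style event handlers are just executables, so, purely as a hedged illustration, a handler of this kind might look roughly like the sketch below; the failover map, the argument order and the switch_traffic() implementation are hypothetical:

```python
#!/usr/bin/env python3
# Sketch of a monitoring event handler: called by the monitoring system with
# the host, its state and the state type; acts only on a confirmed failure.
import sys

# For each database host: where its traffic should go if it is unavailable.
FAILOVER_MAP = {
    "db1.dc-a.internal": "db1.dc-b.internal",
    "db2.dc-a.internal": "db2.dc-b.internal",
}

def switch_traffic(failed_host, standby_host):
    # In reality this would update the balancer or DNS configuration.
    print(f"switching traffic: {failed_host} -> {standby_host}")

def main():
    host, state, state_type = sys.argv[1:4]
    # React only to a "HARD" critical state, i.e. after several retries,
    # to reduce the chance of acting on a single flapping check.
    if state == "CRITICAL" and state_type == "HARD":
        standby = FAILOVER_MAP.get(host)
        if standby:
            switch_traffic(host, standby)

if __name__ == "__main__":
    main()
```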

Have we made everything redundant?

We have many customers in the USA, many in Europe, many who are closer to the East - Japan, Singapore and so on - and, of course, a huge share of customers in Russia. In other words, we are far from working in a single region. Users want a fast response, there are requirements to comply with various local laws, and within each region we maintain two data centers for redundancy. On top of that there are auxiliary services that are conveniently placed within a single region, for the customers working in that region: REST handlers, authorization servers. They are less critical for the client's day-to-day work and can be switched over with a small acceptable delay, but we did not want to reinvent the wheel for monitoring them and switching them over. So we try to use existing solutions as much as possible rather than build up expertise in yet more products.

In some places we simply switch at the DNS level, and the liveness of the service is also determined via DNS. Amazon has the Route 53 service, and it is not just DNS where you create records and that is it: it is much more flexible and convenient. It lets you build geo-distributed services with geolocation routing, determining where a client came from and giving them specific records, and you can use it to build failover architectures. Health checks are configured in Route 53 itself: you set the endpoints to be monitored, the metrics, the protocols used to determine whether the service is alive (TCP, HTTP, HTTPS), and the frequency of the checks. In the DNS itself you specify what is primary, what is secondary, and where to switch if a health check fires inside Route 53. All of this can be done with other tools, but what is convenient is that we set it up once and then do not think about how the checks are done or how the switching happens: everything works by itself.
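
As a sketch of that Route 53 setup (the zone ID, domain and addresses are placeholders, and this is not necessarily how Bitrix24 configures it), the health check plus PRIMARY/SECONDARY failover records could be created with boto3 roughly like this:

```python
import boto3

r53 = boto3.client("route53")

# 1. A health check that polls the primary endpoint over HTTPS.
health = r53.create_health_check(
    CallerReference="primary-balancer-check-1",
    HealthCheckConfig={
        "IPAddress": "198.51.100.10",
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,     # seconds between checks
        "FailureThreshold": 3,     # failed checks before "unhealthy"
    },
)

# 2. Failover records: while the health check passes, DNS answers with the
# primary address; when it fails, Route 53 starts serving the secondary.
r53.change_resource_record_sets(
    HostedZoneId="Z_EXAMPLE",
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "A", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": health["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "198.51.100.10"}],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "app.example.com", "Type": "A", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "203.0.113.20"}],
        }},
    ]},
)
```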

The first "but": how and how to reserve route 53 itself? You never know, suddenly something happens to him? Fortunately, we have never stepped on this rake, but again, I will have a story ahead of me why we thought that we still need to reserve. Here we make straw for ourselves in advance. Several times a day, we do a complete unloading of all zones that we have on route 53. Amazon's API allows you to easily send them to JSON, and we have several backup servers where we convert it, upload it in the form of configs and, roughly speaking, have a backup configuration. In which case we can quickly deploy it manually without losing dns settings data.

Second "but": What is not yet reserved in this picture? The Balancer! Our distribution of customers by region is very simple. We have bitrix24.ru, bitrix24.com, .de domains - now there are 13 different ones, which work in a variety of zones. We came to the following: each region has its own balancers. So it is more convenient to distribute by region, depending on where the peak load on the network is. If this is a failure at the level of any one balancer, then it is simply decommissioned and removed from dns. If there is some problem with a group of balancers, then they are reserved at other sites, and switching between them is done using the same route53, because due to a short ttl, switching occurs within a maximum of 2, 3, 5 minutes.

Third "but": what is not reserved yet? S3 is correct. We, placing the files that we store with users in s3, sincerely believed that it was armor-piercing and there was no need to reserve anything there. But history shows that things are different. In general, Amazon describes S3 as a fundamental service, because Amazon itself uses S3 to store machine images, configs, AMI images, snapshots ... And if s3 falls, as it once happened in these 7 years, how much bitrix24 we have been operating, it will fan out pulls a bunch of everything - the inaccessibility of the start of virtual machines, the failure of the api, and so on.

So S3 can go down - it has happened once. We therefore arrived at the following scheme. A few years ago there were no serious public object storages in Russia, and we considered building something of our own... Fortunately, we did not start, because we would have dug into expertise we do not have and probably would have made a mess of it. Now Mail.ru has S3-compatible storage, Yandex has one, and a number of other providers do too. We eventually concluded that we wanted, firstly, redundancy and, secondly, the ability to work with local copies. For the Russian region specifically we use the Mail.ru Hotbox service, which is API-compatible with S3. We did not need any serious changes to the code inside the application, and we built the following mechanism: S3 has triggers that fire when objects are created or deleted, and Amazon has a service called Lambda - serverless code execution that runs exactly when those triggers fire.
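
For illustration, wiring such a trigger to a Lambda function might look like the sketch below; the bucket name and function ARN are placeholders, and the permission that lets S3 invoke the function is omitted:

```python
import boto3

s3 = boto3.client("s3")

# Ask the bucket to invoke the Lambda function on object creation and deletion.
s3.put_bucket_notification_configuration(
    Bucket="example-user-files",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn":
                "arn:aws:lambda:eu-west-1:123456789012:function:replicate-to-hotbox",
            "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
        }]
    },
)
```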

"Bitrix24": "Quickly raised is not considered to be fallen"

We did it very simply: when a trigger fires, we run code that copies the object to the Mail.ru storage. To fully enable working with local copies of the data, we also need reverse synchronization, so that customers in the Russian segment can work with the storage that is closer to them. Mail.ru is about to finish triggers in its storage - then reverse synchronization will be possible at the infrastructure level, but for now we do it at the level of our own code. If we see that a client has uploaded a file, we put an event in a queue at the code level, process it and do the reverse replication. Why this is bad: if something works with our objects outside of our product, that is, by some external means, we will not catch it. So we are waiting for triggers to appear at the storage level, so that no matter where the code was executed from, an object that reaches us gets copied to the other side.
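
A hedged sketch of what such a copy-on-trigger Lambda handler could look like; the Hotbox endpoint URL, the credentials and the bucket names are assumptions, and a production version would stream large objects rather than read them into memory:

```python
import boto3

s3 = boto3.client("s3")
# Any S3-compatible storage can be addressed by overriding endpoint_url;
# the endpoint and credentials below are placeholders.
hotbox = boto3.client(
    "s3",
    endpoint_url="https://hotbox.example-s3-compatible.net",
    aws_access_key_id="HOTBOX_KEY",
    aws_secret_access_key="HOTBOX_SECRET",
)

def handler(event, context):
    # Standard S3 event structure: one or more records per invocation.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]   # may be URL-encoded in real events
        obj = s3.get_object(Bucket=bucket, Key=key)
        hotbox.put_object(
            Bucket="example-backup-bucket",
            Key=key,
            Body=obj["Body"].read(),
        )
```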

At the code level we keep both storages for each client: one is considered primary, the other backup. If everything is fine, we work with the storage that is closer to us: customers who are in Amazon work with S3, and those who work in Russia work with Hotbox. If the flag is raised, failover kicks in and we switch clients to the other storage. We can set this flag independently per region and switch them back and forth. In practice this has not been used yet, but the mechanism is in place, and we think that someday this very switchover will be needed and will come in handy. In fact, such a situation has already occurred once.
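
As a sketch of that flag logic (the names and structure are invented and not the actual Bitrix24 code):

```python
# Per-region failover flags: flipping a flag sends that region's clients
# to their backup storage instead of the primary one.
FAILOVER_FLAGS = {"ru": False, "us": False}

def storage_for(client):
    """Pick the storage a given client should work with right now."""
    if FAILOVER_FLAGS.get(client.region, False):
        return client.backup_storage    # failover target for this client
    return client.primary_storage       # the storage closest to the client
```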

Oh, and your Amazon ran away...

This April marked the anniversary of the start of Telegram blocking in Russia. The provider hit hardest by it was Amazon. And, unfortunately, Russian companies that served the whole world suffered more than anyone.

If the company is global and Russia is a very small segment for it, 3-5% - well, one way or another, that segment can be sacrificed.

If it is a purely Russian company - I am sure it should be hosted locally - well, it is simply more comfortable and convenient for the users themselves, and there are fewer risks.

But what if it is a company that operates globally and has roughly equal numbers of clients from Russia and from the rest of the world? Connectivity between the segments matters, and they have to work with each other one way or another.

At the end of March 2018, Roskomnadzor sent a letter to the largest operators saying that it planned to block several million Amazon IPs in order to block... the Zello messenger. Thanks to those very providers, the letter successfully leaked to everyone, and it became clear that connectivity with Amazon could fall apart. It was a Friday; we ran in a panic to our colleagues at servers.ru, saying: "Friends, we need several servers that are not in Russia and not in Amazon, but, for example, somewhere in Amsterdam," so that we could at least put our own VPN and proxy there for the endpoints we cannot influence in any way, for example the endpoints of that same S3 - you cannot just stand up a new service and get a different IP, you still have to reach it. Within a few days we set up those servers, brought them up and, in general, were prepared by the time the blocking began. Curiously, RKN, looking at the hype and panic, said: "No, we are not going to block anything now." (That lasted exactly until the moment they started blocking Telegram.) Having set up the bypass options and seeing that the blocking had not been introduced, we nevertheless did not dismantle the whole thing. Just in case.

"Bitrix24": "Quickly raised is not considered to be fallen"

And in 2019 we are still living with the blocking. I looked last night: about a million IPs are still blocked. True, Amazon has been almost completely unblocked; at the peak it reached 20 million blocked addresses... In general, the reality is that connectivity, good connectivity, may suddenly be gone. It may not even be for technical reasons - fires, excavators, all of that. Or, as we have seen, for not entirely technical ones. So someone really big, with their own ASes, can probably handle this in other ways - direct connect and other things at the L2 level. But in the simple case, like ours or even smaller, you can, just in case, have redundancy at the level of servers stood up somewhere else, with VPN and proxy configured in advance, and the ability to quickly switch your configuration over to them in the segments where connectivity is critical for you. This came in handy for us more than once when the Amazon blocking began: in the worst case we let S3 traffic through those servers, and gradually it all blew over.
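
As an illustration of that bypass idea, and assuming a hypothetical proxy address, an S3 client can be pointed at a pre-configured proxy through botocore's Config, so that only the network path changes:

```python
import boto3
from botocore.config import Config

# Route S3 API calls through a proxy hosted outside the affected networks.
# The proxy address is a placeholder.
bypass = Config(proxies={"https": "http://proxy.ams.example.internal:3128"})
s3_via_proxy = boto3.client("s3", config=bypass)

# Used exactly like a normal client; only the network path differs.
s3_via_proxy.list_buckets()
```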

And how do you make... an entire provider redundant?

Right now we do not have a scenario for the failure of all of Amazon. We do have a similar scenario for Russia. In Russia we were hosted by a single provider, which we chose because it had several sites. And a year ago we ran into a problem: even though these were two data centers, a problem at the level of the provider's network configuration could still affect both of them, and we could end up unavailable at both sites. Which, of course, is what happened. In the end we revised the internal architecture. It has not changed much, but for Russia we now have two sites that belong not to one provider but to two different ones. If one fails, we can switch to the other.

Hypothetically, for Amazon we are considering redundancy at the level of another provider; maybe Google, maybe someone else... But so far we have observed in practice that while Amazon does have outages at the level of a single availability zone, outages at the level of an entire region are quite rare. So while we theoretically entertain the idea of Amazon-versus-not-Amazon redundancy, in practice it does not exist yet.

A few words about automation

Is automation always needed? Here it is appropriate to recall the Dunning-Kruger effect. On the x-axis is the knowledge and experience we gain, and on the y-axis is our confidence in our actions. At first we know nothing and are not at all confident. Then we know a little and become mega-confident - the so-called "peak of stupidity", well illustrated by the picture "dementia and courage". Then we learn a bit more and are ready to go into battle. Then we step on some mega-serious rake and fall into the valley of despair, when we seem to know something but in fact know very little. Then, as we gain experience, we become more confident.

"Bitrix24": "Quickly raised is not considered to be fallen"

Our logic about automatically switching over on certain failures is described very well by this graph. We started out not knowing how, and almost all the work was done manually. Then we realized we could hang automation on everything and, supposedly, sleep peacefully. And suddenly we step on a mega-rake: a false positive fires, and we switch traffic back and forth when, by rights, we should not have. As a result replication breaks, or something else goes wrong - this is that very valley of despair. And then we come to the understanding that everything must be approached wisely. That is, it makes sense to rely on automation while allowing for the possibility of false positives. But! If the consequences can be devastating, it is better to leave the decision to the on-duty shift, the on-duty engineers, who will make sure it really is an outage and will perform the necessary actions manually...
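
A sketch of that compromise: the automation acts only on a sustained failure, and potentially destructive switchovers are handed to the on-call engineer (the thresholds and the callback functions are hypothetical):

```python
import time

CONFIRMATIONS_NEEDED = 3   # consecutive failed checks before acting
CHECK_INTERVAL = 30        # seconds between re-checks

def handle_incident(check, switch_traffic, page_on_call, destructive=False):
    """Re-check several times; automate only the cheap-to-undo switchovers."""
    failures = 0
    while failures < CONFIRMATIONS_NEEDED:
        if check():
            return                  # recovered on its own: do nothing
        failures += 1
        time.sleep(CHECK_INTERVAL)
    if destructive:
        # Consequences of a wrong switch are expensive: wake up a human.
        page_on_call("confirmed outage, manual switchover required")
    else:
        switch_traffic()            # safe to automate: easy to roll back
```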

Conclusion

Over these 7 years we have gone from panicking whenever something went down to the understanding that problems do not exist, only tasks, and they must - and can - be solved. When you build a service, look at it from above and assess all the risks that can materialize. If you see them right away, plan the redundancy and the possibility of building a fault-tolerant infrastructure in advance, because any point that can fail and take the service down will definitely do so at some point. Even if it seems to you that some elements of the infrastructure will surely never fail - like that same S3 - keep in mind that they can. At least in theory, have an idea of what you will do if something does happen. Have a risk management plan. When you are deciding whether to do everything automatically or manually, assess the risks: what happens if the automation starts switching everything over - will that not lead to an even worse picture than the outage itself? Perhaps somewhere you need a reasonable compromise between automation and the reaction of the engineer on duty, who will evaluate the real picture and decide whether something has to be switched right now or "yes, but not just yet".

And a reasonable compromise between perfectionism and the real effort, time and money you can spend on the scheme you will end up with.

This text is a supplemented and expanded version of Alexander Demidov's talk at the Uptime day 4 conference.

Source: habr.com
