
Hey Habr! I'm Artem Karamyshev, head of the system administration team. Over the past year we have launched many new products. We wanted our API services to be easily scalable, fault-tolerant, and ready for rapid growth in user load. Our platform is built on OpenStack, and I want to talk about the component fault-tolerance problems we had to close in order to end up with a fault-tolerant system. I think it will be interesting to those who also build products on OpenStack.
The overall fault tolerance of a platform is the sum of the stability of its components, so we will gradually go through all the levels where we found risks and closed them.
This story is based on a talk at the Uptime Day 4 conference; a video version of it is also available.
Physical Architecture Fault Tolerance
The public part of the MCS cloud currently runs in two Tier III data centers, connected by our own dark fiber with a bandwidth of 200 Gbit/s, redundant at the physical level over different routes. Tier III provides the required level of resiliency for the physical infrastructure.
The dark fiber is redundant at both the physical and logical levels. Making the channel redundant was an iterative process: problems arose, and we keep improving the connectivity between the data centers.
For example, not long ago, during work in a utility vault near one of the data centers, an excavator pierced a conduit that contained both the primary and the backup optical cables. Our fault-tolerant channel to the data center turned out to be vulnerable at a single point, that vault. Accordingly, we lost part of the infrastructure. We drew conclusions and took a number of measures, including laying additional fiber through a neighboring vault.
The data centers host points of presence of several network providers, to which we announce our prefixes via BGP. For each network direction the best metric is chosen, which gives different clients the best connection quality. If connectivity through one provider is lost, we rebuild our routing through the available ones.
If a provider fails, we automatically switch to the next one. If one of the data centers fails, we have a mirror copy of our services in the second data center, which takes over the entire load.

Physical Infrastructure Resiliency
What we use for application-level fault tolerance
Our service is built on a number of open-source components.
ExaBGP - a service that implements a number of functions on top of the BGP dynamic routing protocol. We use it extensively to announce the public IP addresses through which users access the API.
HAProxy - a high-performance load balancer with very flexible traffic-balancing rules at different levels of the OSI model. We use it in front of all our services: databases, message brokers, API services, web services, our internal projects - everything sits behind HAProxy.
API application - a web application written in Python, through which users manage their infrastructure and services.
Worker application (hereinafter simply worker) - in OpenStack services, an infrastructure daemon that translates API commands into actions on the infrastructure. For example, a disk is actually created in the worker, while the creation request is handled by the API application.
Standard OpenStack Application Architecture
Most services developed for OpenStack try to follow a single paradigm. A service usually consists of two parts: the API and the workers (backend executors). As a rule, the API is a Python WSGI application that runs either as a standalone process (daemon) or behind an off-the-shelf web server such as Nginx or Apache. The API processes the user's request and passes instructions on to the worker application for execution. The hand-off goes through a message broker, usually RabbitMQ; other brokers are poorly supported. When messages reach the broker, workers process them and, if necessary, return a response.
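The API/worker flow described above can be sketched as a tiny in-process model, with a stdlib queue standing in for RabbitMQ (all names here are hypothetical illustrations, not actual OpenStack code):

```python
import queue
import threading

# Stand-in for the message broker (RabbitMQ in a real OpenStack service).
broker = queue.Queue()

def api_create_disk(name, size_gb):
    """API side: validate the request and hand it off to a worker via the broker."""
    if size_gb <= 0:
        raise ValueError("disk size must be positive")
    broker.put({"action": "create_disk", "name": name, "size_gb": size_gb})
    return {"status": "accepted"}

created = []

def worker_loop():
    """Worker side: consume messages and perform the actual infrastructure work."""
    while True:
        msg = broker.get()
        if msg is None:                  # shutdown sentinel
            break
        if msg["action"] == "create_disk":
            created.append(msg["name"])  # real code would call the storage backend
        broker.task_done()

t = threading.Thread(target=worker_loop)
t.start()
api_create_disk("disk-1", 10)   # the API returns immediately; the worker does the job
broker.put(None)
t.join()
print(created)                  # -> ['disk-1']
```

The point of the hand-off through a broker is exactly this decoupling: the API answers quickly, while the slow infrastructure work happens asynchronously in the worker.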
This paradigm implies shared, though isolated, points of failure: RabbitMQ and the database. RabbitMQ, however, is isolated within a single service and can in theory be separate for each one. So at MCS we separate these services as much as possible: for each individual project we create a separate database and a separate RabbitMQ. This approach is good because when one of these vulnerable points fails, not the whole service breaks, but only part of it.
There is no limit on the number of worker applications, so the API can easily be scaled horizontally behind balancers to increase performance and fault tolerance.
Some services need coordination within the service - when complex sequential operations happen between the API and the workers. In that case a single coordination store is used, a cluster system such as Redis, Memcached, or etcd, which lets one worker tell the others that a task is assigned to it ("please don't take it"). We use etcd. Workers also communicate actively with the database, writing and reading information there. As the database we use MariaDB in a multi-master cluster.
Such a classic single service is organized in the way generally accepted for OpenStack. It can be viewed as a closed system for which the ways of scaling and achieving fault tolerance are quite obvious. For example, for API fault tolerance it is enough to put a balancer in front of it; worker scaling is achieved by increasing their number.
The weak points of the whole scheme are RabbitMQ and MariaDB. Their architecture deserves a separate article; in this one I want to focus on API fault tolerance.

Openstack Application Architecture. Balancing and fault tolerance of the cloud platform
Making the HAProxy Balancer Failover with ExaBGP
To make our APIs scalable, fast, and fault-tolerant, we put a balancer in front of them. We chose HAProxy. In my opinion it has everything our task requires: balancing at several OSI levels, a management interface, flexibility and scalability, a large number of balancing methods, and support for session tables.
The first problem to solve was the fault tolerance of the balancer itself. Simply installing a balancer creates a point of failure of its own: if the balancer breaks, the service goes down. To prevent that, we used HAProxy together with ExaBGP.
ExaBGP lets you implement a health-checking mechanism for a service. We used this mechanism to check the health of HAProxy and, in case of problems, withdraw the failed HAProxy from BGP.
ExaBGP+HAProxy Schema
- We install the necessary software, ExaBGP and HAProxy, on three servers.
- On each server we create a loopback interface.
- On all three servers we assign the same public IP address to this interface.
- This public IP address is announced to the Internet via ExaBGP.
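The announcement itself takes only a few lines of ExaBGP configuration. A minimal sketch in ExaBGP 3.x syntax; all addresses and AS numbers below are placeholders:

```
neighbor 192.0.2.1 {                  # the upstream router
    router-id 192.0.2.10;
    local-address 192.0.2.10;         # this balancer's own address
    local-as 64512;
    peer-as 64512;

    static {
        # the shared public /32 configured on the loopback interface
        route 203.0.113.1/32 next-hop 192.0.2.10;
    }
}
```

Each of the three servers runs the same configuration with its own local address, so the router learns the same prefix from three next hops.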
Fault tolerance is achieved by announcing the same IP address from all three servers. From the network's point of view, the same address is reachable via three different next hops. The router sees three identical routes, picks the highest-priority one according to its own metrics (usually the same one every time), and traffic goes to just one of the servers.
If HAProxy has problems or the server fails, ExaBGP stops announcing the route, and traffic seamlessly switches to another server.
Thus, we achieved fault tolerance for the balancer.

Fault tolerance of HAProxy balancers
The scheme turned out to be imperfect: we had learned how to make HAProxy redundant, but not how to distribute load across the services. So we extended the scheme a little and moved to balancing across several public IP addresses.
Balancing based on DNS plus BGP
The question of load balancing in front of our HAProxy instances remained open. It can, however, be solved quite simply, as we did in our case.
To balance three servers, you need three public IP addresses and good old DNS. Each of these addresses is configured on the loopback interface of every HAProxy server and announced to the Internet.
OpenStack uses a service catalog to manage resources, in which the API endpoint of each service is registered. In this catalog we register the domain name public.infra.mail.ru, which resolves via DNS to three different IP addresses. As a result, we get load balancing across three addresses through DNS.
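In zone-file terms, the catalog entry relies on nothing more exotic than three A records (the addresses below are placeholders):

```
; public.infra.mail.ru resolves to all three balancers
public.infra.mail.ru.  300  IN  A  203.0.113.1
public.infra.mail.ru.  300  IN  A  203.0.113.2
public.infra.mail.ru.  300  IN  A  203.0.113.3
```

Resolvers rotate through the returned addresses, which spreads clients roughly evenly across the three entry points.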
But announcing the public IP addresses by itself does not let us control server selection priorities, so this is not balancing yet. Typically, only one server would be selected according to route priority, while the other two would sit idle, because no metrics are specified in BGP.
So we started serving routes through ExaBGP with different metrics. Each balancer announces all three public IP addresses, but one of them, the primary one for that balancer, is announced with the lowest metric. So while all three balancers are up, requests to the first IP address go to the first balancer, requests to the second go to the second, and requests to the third go to the third.
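In ExaBGP this can be expressed with the MED attribute, where a lower value is preferred. A sketch of the static section for the first balancer (addresses and metric values are placeholders):

```
static {
    # primary address for this balancer: lowest MED, so it attracts the traffic
    route 203.0.113.1/32 next-hop 192.0.2.10 med 100;
    # backup announcements for the other two addresses: higher MED
    route 203.0.113.2/32 next-hop 192.0.2.10 med 200;
    route 203.0.113.3/32 next-hop 192.0.2.10 med 300;
}
```

The second and third balancers mirror this with the low metric on their own primary address.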
What happens when one of the balancers fails? Its primary address is still announced by the other two, and traffic is redistributed between them. So via DNS we hand the user several IP addresses at once. Through DNS balancing and different metrics, we get an even load distribution across all three balancers, and at the same time we do not lose fault tolerance.

Balancing HAProxy based on DNS + BGP
Interaction between ExaBGP and HAProxy
So, we implemented failover for the case when a server goes away, based on withdrawing route announcements. But HAProxy can also go down for reasons other than a server failure: administration errors, failures inside the service itself. We want to take a broken balancer out of rotation in these cases too, and for that we need a different mechanism.
Therefore, extending the previous scheme, we implemented a heartbeat between ExaBGP and HAProxy: a software-level integration in which ExaBGP uses custom scripts to check the application's status.
To do this, in the ExaBGP config you configure a health checker that can verify the status of HAProxy. In our case we set up a health backend in HAProxy, and from the ExaBGP side we check it with a simple GET request. If the check fails, HAProxy is most likely down, and the route should not be announced.
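The check script might look roughly like the sketch below. ExaBGP runs such scripts via its `process` mechanism and reads announce/withdraw commands from the script's stdout; the URL, prefix, and next hop here are placeholders:

```python
import urllib.request
import urllib.error

PREFIX = "203.0.113.1/32"                    # placeholder public prefix
NEXT_HOP = "192.0.2.10"                      # placeholder next hop
CHECK_URL = "http://127.0.0.1:9999/health"   # hypothetical HAProxy health backend

def haproxy_healthy(url, timeout=2):
    """Return True if the HAProxy health endpoint answers 200 OK."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def bgp_command(healthy, prefix=PREFIX, next_hop=NEXT_HOP):
    """Build the text command that ExaBGP reads from the script's stdout."""
    verb = "announce" if healthy else "withdraw"
    return f"{verb} route {prefix} next-hop {next_hop}"

# The real script loops forever, re-checking and printing the state
# every few seconds, e.g.:
#     while True:
#         print(bgp_command(haproxy_healthy(CHECK_URL)), flush=True)
#         time.sleep(5)
```

When the GET fails, the script emits a `withdraw`, the route disappears from BGP, and the router sends traffic to the remaining balancers.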

HAProxy Health Check
HAProxy Peers: session synchronization
The next task was session synchronization. When working through distributed balancers, it is hard to preserve information about client sessions. But HAProxy is one of the few balancers that can do this, thanks to its peers functionality: the ability to transfer session tables between different HAProxy processes.
There are different balancing methods: simple ones, and extended ones, where the client's session is remembered and each request lands on the same server as before. We wanted to implement the second option.
HAProxy implements this mechanism with stick tables. They store the client's source IP address, the selected target address (backend), and some service information. Typically, stick tables store the source IP + destination IP pair, which is especially useful for applications that cannot carry the user's session context over to another balancer, for example in RoundRobin balancing mode.
If stick tables are taught to move between the different HAProxy processes (between which the balancing happens), our balancers can work with one shared pool of stick tables. This makes it possible to switch the client's traffic seamlessly when one of the balancers fails, and work with client sessions continues on the same backends that were selected before.
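A peers setup might be configured roughly as follows (hypothetical names and addresses; note that each peer name must match the local host name, or the value passed to HAProxy with `-L`):

```
peers api_peers
    peer lb1 192.0.2.10:1024
    peer lb2 192.0.2.11:1024
    peer lb3 192.0.2.12:1024

backend api_back
    mode http
    # the stick table is replicated to the other balancers via the peers section
    stick-table type ip size 1m expire 30m peers api_peers
    stick on src
    server api1 10.0.0.11:8774 check
    server api2 10.0.0.12:8774 check
```

With this, each balancer learns the client-to-backend mappings created on the other two.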
For this to work correctly, the problem of the balancer's source IP address, the one the session was established from, must be solved. In our case, this is a dynamic address on the loopback interface.
Peers work correctly only under certain conditions: TCP timeouts must be long enough, or the switchover fast enough, that the TCP session does not have time to terminate. Still, this allows seamless switching.
We have a service in our IaaS built on the same technology, Octavia. It is based on two HAProxy processes and supports peers natively; they perform well in that service.
The picture below schematically shows the movement of peer tables between three HAProxy instances, together with a sample configuration showing how this can be set up:

HAProxy Peers (session synchronization)
If you implement the same scheme, test it carefully: it is not guaranteed to work unchanged in 100% of cases. But at the very least, you will not lose stick tables when you need to remember the client's source IP.
Limiting the number of simultaneous requests from the same client
Any publicly available service, including our APIs, can face an avalanche of requests. The reasons can be very different, from user errors to targeted attacks. We are periodically DDoSed from various IP addresses. Clients also often make mistakes in their scripts and give us mini-DDoSes.
One way or another, additional protection is needed. The obvious solution is to limit the number of requests to the API and not waste CPU time processing malicious ones.
To implement such limits, we use rate limits built on top of HAProxy, again using stick tables. The limits are easy to configure and let you cap the number of API requests from a user. The algorithm remembers the source IP the requests come from and limits the number of simultaneous requests from one user. Of course, we calculated the average API load profile for each service and set the limit at roughly 10 times that value. We still monitor the situation closely, keeping a finger on the pulse.
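Such a limit fits in a few HAProxy lines. A sketch with placeholder names and a made-up threshold; the real threshold is derived from the load profile as described above:

```
frontend api_front
    bind :443
    # remember each source IP and its request rate over the last 10 seconds
    stick-table type ip size 1m expire 1m store http_req_rate(10s)
    http-request track-sc0 src
    # reject clients exceeding the threshold (~10x the normal load profile)
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
    default_backend api_back
```

Legitimate clients never come close to the threshold, while a runaway script gets HTTP 429 instead of consuming API workers.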
What does this look like in practice? We have clients who constantly use our autoscaling APIs. They create two or three hundred virtual machines late in the morning and delete them at the end of the day. Creating a virtual machine in OpenStack, especially with PaaS services involved, takes at least 1,000 API requests, since services also interact with each other through the API.
Such job flows create a fairly large load. We measured this load, collected the daily peaks, increased them tenfold, and that became our rate limit. We keep a finger on the pulse. We often see bots and scanners probing whether we have any CGI scripts they could run, and we actively cut them off.
How to update the codebase unnoticed by users
We implement fault tolerance at the level of the code-deployment process as well. Failures do happen during rollouts, but their impact on service availability can be minimized.
We are constantly updating our services and must make sure that updating the codebase has no effect on users. We solved this problem by using HAProxy's management capabilities and implementing graceful shutdown in our services.
Solving this problem required control over the balancer and a "correct" shutdown of the services:
- In the case of HAProxy, the balancer is managed through the stats socket, which is defined in the HAProxy config; commands can be sent to it directly. But our main configuration-management tool is Ansible, which has a built-in module for managing HAProxy that we actively use.
- Most of our API and engine services support graceful shutdown: when shutting down, they wait for the current task to finish, whether it is an HTTP request or some service task. The same goes for a worker: it knows all the tasks it is running and exits when they have all completed successfully.
Thanks to these two points, the safe algorithm of our deployment is as follows.
- The developer builds a new code package (in our case an RPM), tests it in the dev environment, tests it in stage, and leaves it in the stage repository.
- The developer files a deployment task with the most detailed description of the "artifacts": the version of the new package, a description of the new functionality, and other deployment details as needed.
- The system administrator starts the update. He runs an Ansible playbook, which in turn does the following:
- Takes the package from the stage repository and uses it to update the package version in the production repository.
- Lists the backends of the service being updated.
- Disables the first backend to be updated in HAProxy and waits for its processes to finish. Thanks to graceful shutdown, we can be sure that all current client requests will complete.
- After the API and workers have fully stopped and the backend is disabled in HAProxy, the code is updated.
- Ansible starts the services.
- For each service, it pulls special "handles" that run a number of predefined key tests - a basic check of the new code.
- If the previous step found no errors, the backend is re-enabled.
- We move on to the next backend.
- After all backends are updated, functional tests are run. If they do not pass, the developer investigates the new functionality.
This completes the deployment.
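The per-backend part of this cycle could be sketched as the following Ansible tasks. The modules and their parameters are real (`community.general.haproxy`, etc.), but the service name, backend name, socket path, and health URL are placeholders:

```yaml
# One iteration of the per-backend update loop (hypothetical names and paths).
- name: Disable the backend server in HAProxy and drain it
  community.general.haproxy:
    state: disabled
    host: "{{ inventory_hostname }}"
    backend: api_back
    socket: /var/run/haproxy.sock
    wait: true              # wait until HAProxy reports the server as down

- name: Update the service package from the production repository
  ansible.builtin.yum:
    name: my-api-service
    state: latest

- name: Restart the service
  ansible.builtin.service:
    name: my-api-service
    state: restarted

- name: Run the built-in smoke tests
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:8774/healthcheck"
    status_code: 200

- name: Re-enable the backend server in HAProxy
  community.general.haproxy:
    state: enabled
    host: "{{ inventory_hostname }}"
    backend: api_back
    socket: /var/run/haproxy.sock
```

Combined with graceful shutdown in the services themselves, the `disabled`/`wait` step is what guarantees that no in-flight client request is cut off.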

Service update cycle
This scheme would not work without one rule: we support the old and the new versions at the same time. It is established in advance, at development time, that even changes to the service's database must not break the previous code. The result is a gradual update of the codebase.
Conclusion
Sharing my own thoughts on fault-tolerant web architecture, I want to highlight its key points once again:
- physical fault tolerance;
- network fault tolerance (balancers, BGP);
- fault tolerance of used and developed software.
Stable uptime to everyone!
Source: habr.com
