One-cloud - data center-level OS in Odnoklassniki

Aloha, people! My name is Oleg Anastasiev, and I work at Odnoklassniki on the Platform team. Besides me, a lot of hardware works at Odnoklassniki: we have four data centers with about 500 racks holding more than 8,000 servers. At a certain point we realized that introducing a new management system would let us load the equipment more efficiently, simplify access management, automate the (re)distribution of computing resources, speed up the launch of new services, and speed up responses to large-scale outages.

What came of it?

Besides me and a pile of hardware, there are also people who work with this hardware: engineers located directly in the data centers; network engineers who set up the networking; administrators, or SREs, who keep the infrastructure fault-tolerant; and development teams, each responsible for part of the portal's functionality. The software they create works something like this:

[diagram: user requests flowing from the fronts through application servers to specialized microservices]

User requests arrive both at the fronts of the main portal www.ok.ru and at others, for example the music API fronts. To process the business logic they call the application server, which, while handling a request, calls the required specialized microservices: one-graph (the graph of social connections), user-cache (the cache of user profiles), and so on.

Each of these services is deployed on many machines, and each has developers responsible for the functioning of its modules, their operation, and their technological development. All these services run on physical servers, and until recently we ran exactly one task per server, i.e. each server was specialized for a specific task.

Why is that? This approach had several advantages:

  • Easier mass configuration. Suppose a task requires certain libraries and settings. The server is then assigned to exactly one specific group, the cfengine policy for this group is described (or already exists), and this configuration is centrally and automatically rolled out to all servers of the group.
  • Simpler diagnostics. Say you look at the increased load on the CPU and realize that only the task running on this piece of hardware could have generated that load. The search for the culprit ends very quickly.
  • Simpler monitoring. If something is wrong with the server, monitoring reports it, and you know exactly who is to blame.

A service consisting of multiple replicas is allocated multiple servers, one per replica. Computing resources for a service are then allocated very simply: however many servers the service has is the maximum it can consume. "Simply" here does not mean that it is easy to use, but that the allocation of resources is done manually.

This approach also allowed us to specialize the hardware configuration for the task running on a given server. If a task stores large amounts of data, we use a 4U server with a 38-disk chassis. If the task is purely computational, we can buy a cheaper 1U server. This is efficient in terms of computing resources. Among other things, this approach lets us get by with four times fewer machines than a friendly social network under a comparable load.

Such efficient use of computing resources should also mean economic efficiency, if we start from the premise that servers are the most expensive item. For a long time hardware was indeed the biggest cost, and we put a lot of effort into reducing its price, inventing fault-tolerance algorithms that relax hardware reliability requirements. Today we have reached the point where the price of the server is no longer decisive. Leaving aside the newest exotic hardware, the specific configuration of the servers in a rack no longer matters much. Now we have another problem: the price of the space a server occupies in the data center, i.e. rack space.

Realizing that this was the case, we decided to calculate how effectively we were using the racks.
We took the price of the most powerful economically viable server, calculated how many such servers we could fit into our racks, how many tasks we would run on them under the old "one server = one task" model, and how well those tasks could utilize the hardware. We counted and wept: it turned out that our rack utilization was about 11%. The conclusion is obvious: we need to use the data centers more efficiently. The solution seems obvious too: run several tasks at once on one server. But this is where the difficulties begin.

Mass configuration becomes much more complicated: it is no longer possible to assign a single group to a server, since several tasks from different teams can now run on it, and their configurations may conflict. Diagnostics also gets harder: if you see increased CPU or disk consumption on a server, you do not know which task is causing the trouble.

But the main thing is that there is no isolation between tasks running on the same machine. Here, for example, is a graph of the average response time of a server task before and after another computation application, completely unrelated to the first, was launched on the same server: the response time of the main task increased dramatically.

[graph: average response time of the server task before and after an unrelated computation task was started on the same server]

Obviously, we need to run tasks either in containers or in virtual machines. Since almost all of our tasks run under a single OS (Linux) or are adapted to it, we do not need to support many different operating systems. Accordingly, virtualization is not needed: because of the additional overhead it would be less efficient than containerization.

As a container implementation for running tasks directly on servers, Docker is a good candidate: filesystem images deal well with conflicting configurations. The fact that images can be composed of several layers allows us to significantly reduce the amount of data needed to deploy them across the infrastructure by separating the common parts into base layers. Then the base (and most voluminous) layers get cached fairly quickly throughout the entire infrastructure, and to deliver many different kinds of applications and versions, only small layers need to be transferred.

Plus, a ready-made registry and image tagging in Docker give us ready-made primitives for versioning and delivering code to production.

Docker, like any similar technology, gives us some level of container isolation out of the box. For example, memory isolation: each container is given a limit on machine memory beyond which it cannot consume more. Containers can also be isolated by CPU usage. For us, however, the standard isolation was not enough. But more on that below.

Running containers directly on servers is only part of the problem. Another part is related to the placement of containers on servers. You need to understand which container can be placed on which server. This is not such an easy task, because the containers must be placed on servers as densely as possible, while not reducing their speed. Such placement can also be complex from a fault tolerance point of view. Often we want to place replicas of the same service in different racks or even in different rooms of the data center so that if a rack or room fails, we do not immediately lose all the replicas of the service.

Manually distributing containers is not an option when you have 8k servers and 8-16k containers.

In addition, we wanted to give developers more autonomy in the allocation of resources so that they can host their services on production themselves, without the help of an administrator. At the same time, we wanted to maintain control so that some secondary service would not consume all the resources of our data centers.

Obviously, we need a control layer that would do this automatically.

So we came to a simple and understandable picture that all architects adore: three squares.

[diagram: the three components: one-cloud masters, minions running the agent, and Docker containers]

The one-cloud masters form a failover cluster responsible for cloud orchestration. A developer sends the master a manifest containing all the information necessary to host a service. Based on it, the master gives commands to selected minions (machines intended to run containers). Our agent runs on each minion; it receives the commands, issues its own commands to Docker, and Docker configures the Linux kernel to start the corresponding container. Besides executing commands, the agent continuously informs the master about changes in the state of both the minion machine and the containers running on it.

Resource allocation

Now let's deal with the more complex problem of allocating resources across many minions.

A computing resource in one-cloud consists of:

  • CPU power consumed by a specific task.
  • The amount of memory available to the task.
  • Network traffic. Each minion has a network interface with limited bandwidth, so tasks cannot be placed without taking into account how much data they transmit over the network.
  • Disks. In addition to the obvious disk space for the tasks, we also allocate the disk type: HDD or SSD. A disk can serve a finite number of requests per second (IOPS), so for tasks that generate more IOPS than a single disk can serve we also allocate "spindles", i.e. disk devices that must be exclusively reserved for the task.

Then for some service, for example user-cache, we can record the resources it consumes like this: 400 processor cores, 2.5 TB of memory, 50 Gbit/s of traffic in both directions, 6 TB of HDD space spread over 100 spindles. Or in a more familiar form:

alloc:
    cpu: 400
    mem: 2500
    lan_in: 50g
    lan_out: 50g
    hdd: 100x6T

The user-cache service consumes only a fraction of all the resources available in the production infrastructure. Therefore, I want to be sure that, whether through an operator error or not, user-cache never consumes more resources than it has been allocated. That is, we must limit resources. But what should the quota be attached to?

Let's go back to our highly simplified component interaction diagram and redraw it with more detail, like this:

[diagram: detailed component interaction: web and music fronts, application servers, caches, data storage, and the management layer]

What catches the eye:

  • The web frontend and the music API use isolated clusters of the same application server.
  • We can identify the logical layers these clusters belong to: fronts, caches, data storage, and the management layer.
  • The frontend is heterogeneous; it consists of different functional subsystems.
  • Caches can also be grouped by the subsystem whose data they cache.

Let's redraw the picture again:

[diagram: the same components arranged as a hierarchy of functional subsystems and groups]

Aha, we see a hierarchy! This means that resources can be distributed in larger chunks: assign a responsible developer to the node of this hierarchy that corresponds to a functional subsystem (like "music" in the picture), and attach a quota to the same level of the hierarchy. The hierarchy also lets us organize services more flexibly for ease of management. For example, since web is a very large grouping of servers, we subdivide it into several smaller groups, shown in the picture as group1 and group2.

By removing the extra lines, we can write down each node of our picture in a flatter way: group1.web.front, api.music.front, user-cache.cache.

So we arrive at the concept of a "hierarchical queue". It has a name like "group1.web.front", and it is assigned a resource quota and user rights. A person from DevOps gets the right to send a service to the queue, so such an employee can launch something in it, while a person from OpsDev gets admin rights and can manage the queue: assign people to it, grant them rights, and so on. Services running on this queue run within the queue's quota. If the queue's computing quota is not enough to execute all of its services at once, they are executed sequentially, thereby forming an actual queue.

Let's take a closer look at services. A service has a fully qualified name, which always includes the name of its queue. Then the web front service will be named ok-web.group1.web.front, and the application server service it talks to will be named ok-app.group1.web.front. Each service has a manifest, which specifies all the information needed to place it on specific machines: how many resources this task consumes, what configuration it needs, how many replicas there should be, and properties for handling failures of this service. After the service is placed on machines, its instances appear. They too are named unambiguously, as an instance number plus the service name: 1.ok-web.group1.web.front, 2.ok-web.group1.web.front, ...

This is very convenient: looking only at the name of the running container, we can immediately find out a lot.
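
To make the manifest more concrete, here is a purely illustrative sketch in the spirit of the alloc notation above; the field names and values are assumptions, not the actual one-cloud format:

name: ok-web.group1.web.front
replicas: 3
image: ok/web:1.2.3       # hypothetical image reference
restart: always           # restart policy, see the fault tolerance section below
alloc:
    cpu: 4
    mem: 16
    lan_in: 1g
    lan_out: 1g
    hdd: 1x100G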

Now let's take a closer look at the entity these instances actually are: tasks.

Task isolation classes

All tasks in OK (and, probably, everywhere) can be divided into groups:

  • Short-latency tasks: prod. For such tasks and services, response latency matters a great deal: how quickly the system processes each request. Examples: web fronts, caches, application servers, OLTP storage, etc.
  • Computation tasks: batch. Here the processing speed of any particular request is unimportant; what matters is how much computation the task gets done over a certain (long) period of time (throughput). These are MapReduce, Hadoop, machine learning, and statistics tasks.
  • Background tasks: idle. Neither latency nor throughput matters much for these tasks. They include various tests, migrations, recalculations, and conversions of data from one format to another. On the one hand they resemble batch tasks; on the other, it does not really matter to us how quickly they finish.

Let's see how such tasks consume resources, for example, the central processor.

Short-latency tasks. For such a task, the CPU consumption pattern looks something like this:

[graph: CPU consumption of a short-latency task: brief bursts using all available cores, idle in between]

A user request arrives for processing, the task starts using all available CPU cores, finishes processing, returns a response, then waits for the next request while doing nothing. The next request arrives, again everything available is used up, it is processed, and we wait for the next one.

To guarantee minimal latency for such a task, we must take the maximum resources it consumes and reserve the corresponding number of cores on the minion (the machine that will run the task). Then the reservation formula for our task is:

alloc: cpu = 4 (max)

and if we have a minion machine with 16 cores, exactly four such tasks can be placed on it. Note especially that the average CPU consumption of such tasks is often very low, which is obvious, since the task spends a significant part of its time waiting for a request and doing nothing.

Computation (batch) tasks. Their pattern is slightly different:

[graph: CPU consumption of a batch task: consistently high utilization]

The average CPU consumption of such tasks is quite high. We often want a batch task to finish within a certain time, so we need to reserve the minimum number of processors it requires for the entire computation to complete in a reasonable time. Its reservation formula looks like this:

alloc: cpu = [1,*)

"Please place it on a minion that has at least one free core; beyond that, it will gobble up however many there are."

Here resource utilization is already much better than with short-latency tasks. But the gain is much larger if we combine both types of tasks on one minion machine and distribute its resources on the fly: when a short-latency task needs the processor, it gets it immediately, and when the resources are no longer needed they go to the batch task, i.e. something like this:

[graph: combined CPU usage when a short-latency task and a batch task share the same machine]

But how to do that?

First, let's deal with prod and its alloc: cpu = 4. We need to reserve four cores. In docker run this can be done in two ways:

  • Using the option --cpuset-cpus, i.e. allocating four specific cores of the machine to the task.
  • Using --cpu-quota=400000 --cpu-period=100000 to assign a quota of processor time, i.e. specifying that in every 100 ms of real time the task consumes no more than 400 ms of processor time, which works out to the same four cores.
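
For illustration, the two options look roughly like this on the command line (a minimal sketch; the ok-app image name is hypothetical):

# Option 1: pin the container to four specific cores
docker run -d --cpuset-cpus="1-4" ok-app

# Option 2: an equivalent CPU-time quota of four cores
docker run -d --cpu-quota=400000 --cpu-period=100000 ok-app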

But which of these methods is right?

cpuset looks quite attractive: the task gets four dedicated cores, so the processor caches work as efficiently as possible. But there is a downside: we would have to take over from the OS the job of distributing computations across the unloaded cores of the machine, and that is a rather non-trivial task, especially if we try to place batch tasks on such a machine. Tests showed that the quota option is better here: it gives the operating system more freedom in choosing which core to run a task on at any given moment, and processor time is distributed more efficiently.

Now let's figure out how to reserve a minimum number of cores in Docker. A quota is not applicable to batch tasks, because there is no need to limit their maximum; it is enough to guarantee the minimum. A good option here is docker run --cpu-shares.

We agreed that if a batch task requires a guarantee of at least one core, we specify --cpu-shares=1024, and if at least two cores, --cpu-shares=2048. CPU shares do not interfere with the distribution of processor time at all as long as there is enough of it. Thus, if prod is not currently using all four of its cores, nothing limits the batch tasks, and they can use the spare processor time. But when the processor is scarce, if prod has consumed all four of its cores and hit its quota, the remaining processor time is divided in proportion to cpu-shares: given three free cores, one goes to the task with 1024 cpu-shares, and the remaining two go to the task with 2048 cpu-shares.
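
A sketch of the batch side under this convention (ok-batch-a and ok-batch-b are hypothetical images):

# guaranteed roughly one core under contention (1024 shares = one core by our convention)
docker run -d --cpu-shares=1024 ok-batch-a
# guaranteed roughly two cores under contention
docker run -d --cpu-shares=2048 ok-batch-b
# While prod is idle, both batch containers may use all spare CPU time;
# once prod hits its quota, the remaining cores are split 1:2 between them.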

But using quotas and shares is not enough. We need to make sure that a short-latency task gets priority over a batch task when processor time is allocated. Without such prioritization, the batch task will grab all the CPU time at exactly the moment prod needs it. There are no container prioritization options in docker run, but Linux CPU scheduler policies come to the rescue. You can read more about them here, and in this article we will go through them briefly:

  • SCHED_OTHER
    The default policy, which all normal user processes on a Linux machine receive.
  • SCHED_BATCH
    Designed for resource-intensive processes. When placing a task on the processor, a so-called activation penalty is applied: such a task is less likely to receive processor resources if a SCHED_OTHER task is currently using them.
  • SCHED_IDLE
    A background policy with very low priority, even lower than nice 19. We use our open-source library one-nio to set the required policy when starting a container, by calling:

one.nio.os.Proc.sched_setscheduler( pid, Proc.SCHED_IDLE )

But even if you are not a Java programmer, you can do the same with the chrt command:

chrt -i -p 0 $pid

Let's summarize all our isolation levels in one table for clarity:

Isolation class | alloc example | docker run options                     | sched_setscheduler (chrt*)
prod            | cpu = 4       | --cpu-quota=400000 --cpu-period=100000 | SCHED_OTHER
batch           | cpu = [1, *)  | --cpu-shares=1024                      | SCHED_BATCH
idle            | cpu = [2, *)  | --cpu-shares=2048                      | SCHED_IDLE

*If you run chrt from inside a container, you may need the sys_nice capability, because by default Docker does not grant it to the container.

But tasks consume not only the processor but also network traffic, which affects the latency of a network task even more than a misallocation of processor resources. We therefore naturally want to get exactly the same picture for traffic. That is, when a prod task sends packets to the network, we cap the maximum rate (formula alloc: lan=[*,500mbps) ) at which prod may do so, while for batch we guarantee only a minimum throughput and do not cap the maximum (formula alloc: lan=[10Mbps,*) ). At the same time, prod traffic must get priority over batch tasks.
Here Docker has no primitives we could use, but Linux Traffic Control comes to the rescue. We were able to achieve the desired result with the Hierarchical Fair Service Curve (hfsc) discipline. With its help we distinguish two classes of traffic: high-priority prod and low-priority batch/idle. As a result, the configuration for outgoing traffic looks like this:

[diagram: hfsc class hierarchy for outgoing traffic: root 1:0, host-wide class 1:1, shared batch/idle class 1:2, and per-container prod classes]

Here 1:0 is the root qdisc of the hfsc discipline; 1:1 is an hfsc child class with a total bandwidth limit of 8 Gbit/s, under which the child classes of all containers are placed; 1:2 is an hfsc child class shared by all batch and idle tasks with a "dynamic" limit, which is discussed below. The remaining hfsc child classes are dedicated classes for the prod containers currently running, with limits corresponding to their manifests: 450 and 400 Mbit/s. Each hfsc class is assigned an fq or fq_codel qdisc, depending on the Linux kernel version, to avoid packet loss during traffic bursts.
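
A rough sketch of how such a hierarchy could be built with tc (the interface name, rates, class ids, and container address are illustrative assumptions, not the exact one-cloud configuration):

# root hfsc qdisc; unclassified traffic falls into class 1:2
tc qdisc add dev eth0 root handle 1: hfsc default 2
# total limit for the whole host
tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 8gbit ul rate 8gbit
# shared class for batch/idle traffic with a periodically updated limit
tc class add dev eth0 parent 1:1 classid 1:2 hfsc sc rate 1gbit ul rate 1gbit
# dedicated class for one prod container (450 Mbit/s from its manifest)
tc class add dev eth0 parent 1:1 classid 1:11 hfsc sc rate 450mbit ul rate 450mbit
# per-class queues to smooth out bursts
tc qdisc add dev eth0 parent 1:2 fq_codel
tc qdisc add dev eth0 parent 1:11 fq_codel
# steer the prod container's traffic into its class by source IP
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip src 1.1.1.1/32 flowid 1:11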

Typically, tc disciplines are used to prioritize only outgoing traffic. But we want to prioritize incoming traffic as well: after all, a batch task can easily saturate the entire incoming channel, for example while receiving a large batch of input data for map&reduce. For this we use the ifb module, which creates a virtual ifbX interface for each network interface and redirects incoming traffic on the interface into outgoing traffic on ifbX. For ifbX, the same disciplines for controlling outgoing traffic then apply, and the hfsc configuration is very similar:

[diagram: hfsc configuration on the ifb interface for incoming traffic]
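
The ingress redirection itself might look roughly like this (again an illustrative sketch with assumed interface names):

# load the ifb module and bring up the mirror interface
modprobe ifb numifbs=1
ip link set dev ifb0 up
# attach an ingress qdisc to the physical interface
tc qdisc add dev eth0 handle ffff: ingress
# redirect all incoming packets to ifb0, where the hfsc hierarchy is applied
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0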

In the course of experiments, we found that hfsc shows the best results when the 1:2 class of non-priority batch/idle traffic on a minion machine is limited to no more than a certain free band; otherwise the non-priority traffic affects the latency of prod tasks too much. miniond determines the amount of currently free bandwidth every second by measuring the average traffic consumption of all prod tasks of the given minion and subtracting it, together with a small margin, from the bandwidth of the network interface:

limit(1:2) = bandwidth(interface) - sum(average prod traffic) - margin

The bands for incoming and outgoing traffic are determined independently, and in accordance with the new values miniond reconfigures the limit of the non-priority class 1:2.
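
A simplified sketch of what this periodic update could look like on a minion (the interface speed, margin, and the measure_prod_traffic helper are assumptions):

# recompute the batch/idle limit from the measured prod consumption (values in Mbit/s)
IFACE_MBIT=8000                      # physical interface bandwidth
MARGIN_MBIT=500                      # small safety margin
PROD_MBIT=$(measure_prod_traffic)    # hypothetical helper: average prod traffic over the last second
LIMIT=$(( IFACE_MBIT - PROD_MBIT - MARGIN_MBIT ))
# apply the new limit to the shared batch/idle class 1:2
tc class change dev eth0 parent 1:1 classid 1:2 hfsc \
    sc rate "${LIMIT}mbit" ul rate "${LIMIT}mbit"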

Thus, we have implemented all three isolation classes: prod, batch, and idle. These classes strongly influence the execution characteristics of tasks, so we decided to place this attribute at the top of the hierarchy, so that a glance at the name of a hierarchical queue immediately makes it clear what we are dealing with:

[diagram: the hierarchy with the isolation classes prod, batch, and idle at the top level]

All our familiar web and music fronts are then placed in the hierarchy under prod. For example, under batch let's place the music catalog service, which periodically compiles a catalog of tracks from the set of mp3 files uploaded to Odnoklassniki. And an example of a service under idle is the music transformer, which normalizes the volume level of the music.

With the extra lines removed again, we can write our service names more flatly by appending the task isolation class to the end of the full service name: web.front.prod, catalog.music.batch, transformer.music.idle.

And now, looking at the name of the service, we understand not only what function it performs, but also its isolation class, which means its criticality, etc.

Everything is wonderful, but there is one bitter truth. It is impossible to completely isolate tasks running on the same machine.

What we managed to achieve: if a batch task intensively consumes only the processor, the built-in Linux CPU scheduler does its job very well and there is almost no effect on the prod task. But as soon as this batch task starts actively working with memory, the mutual influence shows up. This happens because the batch task "washes out" the prod task's data from the processor's memory caches; as a result, cache misses increase and the processor executes the prod task more slowly. Such a batch task can increase the latency of our typical prod container by 10%.

It is even more difficult to isolate traffic due to the fact that modern network cards have an internal packet queue. If the packet from the batch task got there first, then it will be the first to be transmitted over the cable, and there's nothing to be done about it.

In addition, we have so far managed to solve only the problem of prioritizing TCP traffic: for UDP, the hsfc approach does not work. And even in the case of TCP traffic, if the batch task generates a lot of traffic, this also gives about 10% increase in the latency of the prod task.

Fault tolerance

One of the goals of developing one-cloud was to improve the fault tolerance of Odnoklassniki, so next I would like to consider possible failure and accident scenarios in more detail. Let's start with a simple scenario: a container failure.

A container can fail in several ways. It could be some experiment, a bug, or an error in the manifest, because of which the prod task starts consuming more resources than the manifest specifies. We had a case: a developer implemented one complex algorithm, reworked it many times, overthought it, and got so tangled up that in the end the task looped in a very non-trivial way. And since the prod task had a higher priority than all the others on the same minions, it began to consume all the available processor resources. In this situation isolation, or rather the processor time quota, saved us. If a task is allocated a quota, it will not consume more. Therefore, the batch and other prod tasks running on the same machine noticed nothing.

The second possible trouble is a container crash. Here restart policies save us; everyone knows them, and Docker handles this well on its own. Almost all prod tasks have an always restart policy. Sometimes we use on_failure for batch tasks or for debugging prod containers.

And what can be done if the whole minion is unavailable?

Obviously, run the container on a different machine. The most interesting question here is what happens to the IP address(es) assigned to the container.

We could assign containers the same IP addresses as the minion machines they run on. Then, when a container is launched on another machine, its IP address changes, and all clients have to understand that the container has moved and must now go to a different address, which requires a separate Service Discovery service.

Service Discovery is convenient. There are many solutions on the market, with varying degrees of fault tolerance, for organizing a registry of services. Such solutions often also implement load balancer logic, store additional configuration in the form of a KV store, and so on.
However, we wanted to avoid introducing a separate registry, because it would be a critical system used by every service in production. That makes it a potential point of failure, and you would need to choose or develop a very fault-tolerant solution, which is obviously very difficult, time-consuming, and expensive.

And one more big drawback: for our old infrastructure to work with the new one, absolutely all tasks would have to be rewritten to use some kind of Service Discovery system. That is a LOT of work, in some places bordering on the impossible when it comes to low-level services working at the OS kernel level or directly with hardware. Implementing this functionality with well-established patterns such as sidecars would in some places mean extra load, in others a complication of operations and additional failure scenarios. We did not want to complicate things, so we decided to make the use of Service Discovery optional.

In one-cloud, the IP follows the container, i.e. each task instance has its own IP address. This address is "static": it is assigned to each instance when the service is first sent to the cloud. If at some point during its life the service had a larger number of instances, it ends up with as many IP addresses assigned as it had instances at the maximum.

Subsequently, these addresses do not change: they are assigned once and continue to exist throughout the life of the service in production. IP addresses follow containers over the network. If the container is transferred to another minion, then the address will follow it.

Thus, the mapping from a service name to the list of its IP addresses changes very rarely. If we look again at the instance names mentioned at the beginning of the article (1.ok-web.group1.web.front.prod, 2.ok-web.group1.web.front.prod, ...), we notice that they resemble the FQDNs used in DNS. Indeed, we use the DNS protocol to map the names of service instances to their IP addresses. Moreover, this DNS returns all reserved IP addresses of all containers, both running and stopped (say, three replicas are in use and five addresses are reserved: all five will be returned). Clients, having received this information, will try to establish a connection with all five replicas and thereby determine which ones are working. This way of determining availability is much more reliable; neither DNS nor Service Discovery is involved in it, so there are no hard problems to solve to keep that information current and fault-tolerant. Moreover, for critical services on which the operation of the entire portal depends, we can avoid using DNS at all and simply put the IP addresses into the configuration.

Implementing this transfer of IP addresses along with containers can be non-trivial, so we will look at how it works with the following example:

[diagram: minion M1 announcing container address 1.1.1.1 through BIRD and route reflectors to the network hardware]

Suppose the one-cloud master instructs minion M1 to run 1.ok-web.group1.web.front.prod with address 1.1.1.1. BIRD runs on the minion and announces this address to special route reflector servers. The latter have a BGP session with the network hardware, into which the route for address 1.1.1.1 towards M1 is propagated. M1, in turn, routes packets into the container using Linux facilities. There are three route reflector servers, since this is a very critical part of the one-cloud infrastructure: without them the network in one-cloud will not work. We place them in different racks, where possible in different rooms of the data center, to reduce the likelihood of all three failing at once.

Now let's assume that the connection between the one-cloud master and minion M1 is lost. The one-cloud master will now act on the assumption that M1 has failed completely. That is, it will give minion M2 the command to launch web.group1.web.front.prod with the same address 1.1.1.1. Now we have two conflicting routes for 1.1.1.1 in the network: via M1 and via M2. To resolve such conflicts we use the Multi Exit Discriminator, which is specified in the BGP announcement. It is a number indicating the weight of the announced route; of the conflicting routes, the one with the lower MED value wins. The one-cloud master maintains the MED as an integral part of container IP addresses. The first time an address is issued, it gets a sufficiently large MED = 1 000 000. In the situation of such an emergency container transfer, the master reduces the MED, and M2 receives the command to announce the address 1.1.1.1 with MED = 999 999. The instance running on M1 remains without a connection, and its further fate is of little interest to us until communication with the master is restored, at which point it will be stopped as a stale duplicate.
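
To make this a bit more tangible, here is a minimal, purely illustrative BIRD-style configuration for the minion side; the interface name, AS number, and neighbor address are assumptions, and the real one-cloud setup is certainly more involved:

# announce the /32 addresses of locally running containers (illustrative only)
protocol static container_routes {
    route 1.1.1.1/32 via "br0";          # hypothetical bridge leading into the container
}

protocol bgp to_route_reflector {
    local as 65000;
    neighbor 10.0.0.201 as 65000;        # one of the three route reflectors
    import none;
    export filter {
        if proto = "container_routes" then {
            bgp_med = 1000000;           # the master lowers this value on emergency relocation
            accept;
        }
        reject;
    };
}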

Accidents

All data center management systems handle minor failures to an acceptable degree; a container dropping out is the norm almost everywhere.

Let's take a look at how we handle an emergency, such as a power failure in one or more rooms in a data center.

What does an accident mean for a data center management system? First of all, it is a massive one-time failure of many machines, and the management system needs to migrate a lot of containers at once. If the accident is very large, it may turn out that not all tasks can be relocated to other minions, because the resource capacity of the data center drops below 100% of the load.

Accidents are often accompanied by a failure of the control layer as well. This can happen because its equipment fails, but more often because accidents are not tested for, and the control layer itself collapses under the increased load.

What can be done about all this?

Bulk migrations mean that a lot of actions, migrations, and deployments happen in the infrastructure at once. Each migration takes some time to deliver and unpack container images to the minions, and to launch and initialize the containers, etc. Therefore, it is desirable that more important tasks start before less important ones.

Let's take another look at the familiar service hierarchy and try to decide which tasks we want to run first.

[diagram: the service hierarchy annotated with placement priorities]

Of course, these are the processes directly involved in handling user requests, i.e. prod. We express this with a placement priority: a number that can be assigned to a queue. If a queue has a higher priority, its services are placed first.

For prod we assign a higher priority, 0; for batch a bit lower, 100; for idle lower still, 200. Priorities are applied hierarchically: all tasks lower in the hierarchy get the corresponding priority. If we want the caches inside prod to be launched before the frontends, we assign priority 0 to cache and 1 to the front subqueues. If, for example, we want the main portal to be launched from the fronts first, and the music front only after it, we can give the latter a lower priority: 10.

The next problem is a lack of resources. A large amount of equipment has failed, entire rooms of the data center are down, and we have relaunched so many services that now there are not enough resources for everyone. We need to decide which tasks to sacrifice so that the main, critical services keep working.

[diagram: the service hierarchy annotated with preemption priorities]

Unlike placement priority, we cannot indiscriminately sacrifice all batch tasks; some of them are important for the portal to work. Therefore, we introduced a separate preemption priority. When being placed, a task with a higher priority can preempt, i.e. stop, a task with a lower priority if there are no free minions left. A task with a low priority will then most likely remain unplaced, i.e. there will no longer be a suitable minion for it with enough free resources.

In our hierarchy it is very easy to specify a preemption priority such that prod and batch tasks preempt or stop idle tasks, but not each other, by setting the priority for idle to 200. Just as with placement priority, we can use the hierarchy to describe more complex rules. For example, we can specify that we sacrifice the music function if there are not enough resources for the main web portal, by setting a lower priority for the corresponding nodes: 10.

Whole DC accidents

Why can an entire data center fail? The elements: there was a good post about how a hurricane affected the work of a data center. The elements can also include the homeless people who once burned the optical cable in a utility duct, after which the data center completely lost contact with the other sites. Failures can also be caused by the human factor: an operator issues a command and the entire data center goes down. This can happen because of a big bug. In general, data centers going down is not uncommon; it happens to us every few months.

And here is what we do so that no one has to post #okzhivi on Twitter.

The first strategy is isolation. Each one-cloud instance is isolated and can only manage machines in one data center. That is, the loss of a cloud due to bugs or an incorrect operator command means the loss of only one data center. We are prepared for this: there is a redundancy policy under which replicas of applications and data are placed in all data centers. We use fault-tolerant databases and periodically test for failures.
Since today we have four data centers, we have four separate, completely isolated one-cloud instances.

This approach not only protects against physical failure, but can also protect against operator errors.

What else can be done about the human factor? When an operator gives the cloud a strange or potentially dangerous command, they may suddenly be asked to solve a small problem to check how well they are thinking. For example, if it is some kind of mass stop of many replicas, or simply a strange command such as reducing the number of replicas or changing the name of the image rather than just the version number in a new manifest.

[screenshot: confirmation challenge shown to the operator before a potentially dangerous command is executed]

Results

Distinctive features of one-cloud:

  • A hierarchical and descriptive naming scheme for services and containers, which lets you very quickly find out what a task is, what it belongs to, how it works, and who is responsible for it.
  • Our own technique of combining prod and batch tasks on minions to increase the efficiency of machine sharing. Instead of cpuset we use CPU quotas, shares, CPU scheduler policies, and Linux QoS.
  • It was not possible to completely isolate containers running on the same machine, but their mutual influence stays within 20%.
  • Organizing services into a hierarchy helps with automatic disaster recovery using placement and preemption priorities.

FAQ

Why didn't we take a ready-made solution?

  • Different task isolation classes require different logic when placed on minions. If prod tasks can be allocated by simply reserving resources, then batch and idle tasks must be allocated by tracking the actual utilization of resources on minion machines.
  • The need to take into account such resources consumed by tasks as:
    • network bandwidth;
    • types and "spindles" of disks.
  • The need to specify service priorities for accident recovery, as well as per-team rights and resource quotas, which is solved with hierarchical queues in one-cloud.
  • The need for human-readable container names to reduce the time it takes to respond to accidents and incidents.
  • The impossibility of a one-time, universal rollout of Service Discovery and the need to coexist for a long time with tasks hosted on bare-metal hosts, which is solved by "static" IP addresses following containers and, as a consequence, the need for custom integration with a large network infrastructure.

All these features would have required significant modifications to existing solutions, and, having estimated the amount of work, we realized that we could develop our own solution at roughly the same cost in labor. But our own solution is much easier to operate and evolve: it contains no unnecessary abstractions supporting functionality we do not need.

To those who have read to these last lines: thank you for your patience and attention!

Source: habr.com
