Network load balancer architecture in Yandex.Cloud

Hi, I'm Sergey Elantsev, and I develop the network load balancer in Yandex.Cloud. Previously, I led the development of the L7 balancer for the Yandex portal - my colleagues joke that no matter what I work on, it turns out to be a balancer. I'll tell Habr readers how to manage load in a cloud platform, how we see the ideal tool for this, and how we are moving toward building that tool.

First, let's introduce some terms:

  • VIP (Virtual IP) - IP address of the balancer
  • Server, backend, instance - a virtual machine with a running application
  • RIP (Real IP) - server IP address
  • Healthcheck - server readiness check
  • Availability Zone, AZ - isolated infrastructure in the data center
  • Region - a group of several AZs

Load balancers solve three main tasks: they perform the balancing itself, improve the fault tolerance of the service, and simplify its scaling. Fault tolerance is ensured by automatic traffic management: the balancer monitors the state of the application and excludes from balancing any instances that fail the liveness check. Scaling is provided by even load distribution across instances, as well as by updating the list of instances on the fly. If the load is not distributed evenly enough, some instances will be loaded beyond their capacity, and the service will become less reliable.

A load balancer is often classified by the protocol layer of the OSI model on which it runs. The Cloud Balancer operates at the TCP layer, which corresponds to the fourth layer, L4.

Let's move on to an overview of the architecture of the Cloud balancer. We will gradually increase the level of detail. We divide balancer components into three classes. The config plane class is responsible for user interaction and stores the target state of the system. The control plane stores the current state of the system and manages systems from the data plane class, which are directly responsible for delivering traffic from clients to your instances.

data plane

Traffic arrives at expensive devices called border routers. To increase fault tolerance, several such devices operate simultaneously in each data center. From there, traffic goes to the balancers, which announce an anycast IP address to all AZs via BGP for clients.


Traffic is distributed using ECMP, a routing strategy under which there can be several equally good routes to a destination (in our case, the anycast IP address is the destination), and packets can be sent along any of them. We also support multiple availability zones with the following scheme: we announce the address in each zone, traffic enters the nearest one and does not leave it. Later in the post, we will take a closer look at what happens to the traffic.
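
To make the idea of ECMP more concrete, here is a minimal Go sketch (not router code, just an illustration with made-up names): packets of one flow stay on one path because a stable flow key is hashed and used to index the list of equal-cost next hops.

package main

import (
	"fmt"
	"hash/fnv"
)

// pickNextHop sketches ECMP-style path selection: hash a stable flow key and
// use it to choose one of several equal-cost routes, so that packets of the
// same flow always follow the same path.
func pickNextHop(flowKey string, nextHops []string) string {
	h := fnv.New32a()
	h.Write([]byte(flowKey))
	return nextHops[h.Sum32()%uint32(len(nextHops))]
}

func main() {
	hops := []string{"balancer-a", "balancer-b", "balancer-c"}
	// The same flow always maps to the same balancer.
	fmt.Println(pickNextHop("198.51.100.7:34512->203.0.113.10:443/tcp", hops))
}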

config plane

 
The key component of the config plane is the API, through which the main operations on balancers are performed: creating and deleting balancers, changing the composition of instances, obtaining healthcheck results, and so on. On the one hand it is a REST API; on the other hand, in the Cloud we very often use the gRPC framework, so we "translate" REST into gRPC and then work only with gRPC. Any request results in a series of asynchronous idempotent tasks that are executed on a common pool of Yandex.Cloud workers. Tasks are written so that they can be suspended at any time and then restarted. This ensures scalability, repeatability, and logging of operations.
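
As an illustration of this task model, here is a minimal Go sketch with hypothetical names (it is not the actual Yandex.Cloud worker API): a task is a sequence of idempotent steps, and a restarted worker simply continues from the last persisted step.

package main

import (
	"context"
	"fmt"
)

// Step is one idempotent stage of a task: running it twice must be safe,
// so the task as a whole can be suspended and restarted at any point.
type Step struct {
	Name string
	Run  func(ctx context.Context) error
}

// runTask executes the steps in order, starting from the last completed one.
// A real worker would persist the position in durable storage between steps.
func runTask(ctx context.Context, steps []Step, done int) (int, error) {
	for i := done; i < len(steps); i++ {
		if err := steps[i].Run(ctx); err != nil {
			return i, err // restart later from step i
		}
	}
	return len(steps), nil
}

func main() {
	steps := []Step{
		{"reserve-vip", func(ctx context.Context) error { fmt.Println("reserve VIP"); return nil }},
		{"configure-healthchecks", func(ctx context.Context) error { fmt.Println("configure checks"); return nil }},
		{"announce-route", func(ctx context.Context) error { fmt.Println("announce route"); return nil }},
	}
	pos, err := runTask(context.Background(), steps, 0)
	fmt.Println("completed steps:", pos, "err:", err)
}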


As a result, a task from the API makes a request to the loadbalancer service controller, which is written in Go. It can add and remove balancers and change the composition of backends and settings.


The service stores its state in Yandex Database, a distributed managed database that you will soon be able to use yourself. In Yandex.Cloud, as we have already said, the dogfooding concept applies: if we use our own services ourselves, our customers will also be happy to use them. Yandex Database is an example of this concept: we store all our data in YDB and do not have to think about maintaining and scaling the database - these problems are solved for us, and we use the database as a service.

Back to the balancer controller. Its task is to save information about the balancer and to send the healthcheck controller a task to check the readiness of the virtual machines.

Healthcheck controller

It receives requests to change check rules, saves them to YDB, distributes tasks among the healthcheck nodes and aggregates the results, which are then saved to the database and sent to the loadbalancer controller. The latter, in turn, sends a request to change the composition of the cluster in the data plane to the loadbalancer-node, which I will discuss below.
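
A minimal Go sketch of result aggregation, assuming a simple majority policy (the real aggregation logic may differ):

package main

import "fmt"

// aggregate decides whether a target is healthy from the verdicts of the
// healthcheck nodes that probe it (each check is performed from at least
// three nodes). Here a simple majority wins.
func aggregate(verdicts []bool) bool {
	healthy := 0
	for _, v := range verdicts {
		if v {
			healthy++
		}
	}
	return healthy*2 > len(verdicts)
}

func main() {
	fmt.Println(aggregate([]bool{true, true, false}))  // true: 2 of 3 nodes saw the target alive
	fmt.Println(aggregate([]bool{false, true, false})) // false
}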


Let's talk more about healthchecks. They can be divided into several classes with different success criteria. TCP checks must successfully establish a connection within a fixed amount of time. HTTP checks require both a successful connection and a response with a 200 status code.

Checks also differ in the class of action: they can be active or passive. Passive checks simply observe what happens to the traffic without taking any special action. This does not work very well at L4, because it depends on the logic of the higher-layer protocols: at L4 there is no information about how long an operation took, or whether a connection termination was good or bad. Active checks require the balancer to send requests to each server instance.
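
For illustration, minimal active checks in Go might look like this - a TCP check that only needs to establish a connection within a timeout, and an HTTP check that also requires a 200 response (a sketch, not the production checker):

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// tcpCheck succeeds if a TCP connection is established within the timeout.
func tcpCheck(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// httpCheck succeeds if the endpoint answers with status 200 within the timeout.
func httpCheck(url string, timeout time.Duration) bool {
	client := http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println(tcpCheck("10.0.0.1:80", 2*time.Second))
	fmt.Println(httpCheck("http://10.0.0.1/health", 2*time.Second))
}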

Most load balancers perform liveness checks themselves. At the Cloud, we decided to separate these parts of the system to improve scalability. This approach allows us to increase the number of balancers while keeping the number of healthcheck requests to the service constant. Checks are performed by separate healthcheck nodes, across which check targets are sharded and replicated. You cannot perform checks from a single host, because it may fail, and then we would not know the state of the instances it checked. We check each instance from at least three healthcheck nodes, and we shard targets between nodes using consistent hashing algorithms.
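
A rough Go sketch of this sharding, with hypothetical names: healthcheck nodes are placed on a hash ring, and for every check target we walk clockwise from its hash and take the first three distinct nodes.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32
	nodes  map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// newRing places every healthcheck node on the ring several times
// ("virtual nodes") to smooth the distribution.
func newRing(nodes []string, vnodes int) *ring {
	r := &ring{nodes: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.nodes[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns `replicas` distinct nodes responsible for a check target by
// walking clockwise from the target's position on the ring.
func (r *ring) pick(target string, replicas int) []string {
	t := hash32(target)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= t })
	seen := map[string]bool{}
	var out []string
	for i := 0; len(out) < replicas && i < len(r.points); i++ {
		n := r.nodes[r.points[(start+i)%len(r.points)]]
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	r := newRing([]string{"hc-node-1", "hc-node-2", "hc-node-3", "hc-node-4"}, 16)
	fmt.Println(r.pick("10.0.0.5:80", 3)) // three nodes will probe this RIP
}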


Separating balancing and healthchecking can lead to a problem. If a healthcheck node makes requests to an instance while bypassing the balancer (which is currently not serving traffic), a strange situation arises: the resource appears to be alive, but traffic will never reach it. We solve this by always routing healthcheck traffic through the balancers. In other words, the path of packets from clients and from healthchecks differs minimally: in both cases, the packets reach the balancers, which deliver them to the target resources.

The difference is that clients make requests to VIPs, while healthchecks go to each individual RIP. An interesting problem arises here: we give our users the ability to create resources in gray (private) IP networks. Imagine two different cloud owners who have hidden their services behind balancers. Each of them has resources in the 10.0.0.1/24 subnet, with the same addresses. You need to be able to distinguish them somehow, and here you have to dive into how the Yandex.Cloud virtual network is organized. More details can be found in the video from the about:cloud event; what matters for us now is that the network is multilayered and contains tunnels that can be distinguished by subnet id.

Healthcheck nodes reach the balancers using so-called quasi-IPv6 addresses. A quasi-address is an IPv6 address with the user's IPv4 address and subnet id embedded inside it. The traffic reaches the balancer, which extracts the resource's IPv4 address from it, rewrites IPv6 to IPv4, and sends the packet into the user's network.

Reverse traffic is handled the same way: the balancer sees that the destination is a gray network reached by the healthcheckers, and converts IPv4 back to IPv6.
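
The exact bit layout of quasi-addresses is not described here, but as a rough illustration, an IPv6 address under a dedicated prefix can carry both the subnet id and the IPv4 address; the prefix and field sizes below are made up.

package main

import (
	"fmt"
	"net"
)

// This sketch assumes a dedicated prefix with a 32-bit subnet id and the IPv4
// address packed into the lower 8 bytes of the IPv6 address.
var quasiPrefix = net.ParseIP("fd00:cafe::") // hypothetical prefix

func encodeQuasi(subnetID uint32, v4 net.IP) net.IP {
	ip := make(net.IP, net.IPv6len)
	copy(ip, quasiPrefix.To16())
	ip[8] = byte(subnetID >> 24)
	ip[9] = byte(subnetID >> 16)
	ip[10] = byte(subnetID >> 8)
	ip[11] = byte(subnetID)
	copy(ip[12:], v4.To4())
	return ip
}

func decodeQuasi(ip net.IP) (subnetID uint32, v4 net.IP) {
	ip = ip.To16()
	subnetID = uint32(ip[8])<<24 | uint32(ip[9])<<16 | uint32(ip[10])<<8 | uint32(ip[11])
	return subnetID, net.IPv4(ip[12], ip[13], ip[14], ip[15])
}

func main() {
	q := encodeQuasi(42, net.ParseIP("10.0.0.1"))
	fmt.Println(q)              // fd00:cafe::2a:a00:1
	fmt.Println(decodeQuasi(q)) // 42 10.0.0.1
}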

VPP is the heart of the data plane

The load balancer is implemented with Vector Packet Processing (VPP), a Cisco framework for processing network traffic in batches. In our case, the framework runs on top of a user-space library for managing network devices, the Data Plane Development Kit (DPDK). This provides high packet processing performance: there are far fewer interrupts in the kernel and no context switches between kernel space and user space.

VPP goes even further and squeezes even more performance out of the system by combining packets into batches. The performance gain comes from aggressive use of modern processor caches. Both data caches (packets are processed in "vectors", so the data lies close together) and instruction caches are used: in VPP, packet processing follows a graph whose nodes are functions that each perform one task.

For example, IP packets in VPP are processed in the following order: first, the packet headers are parsed in the parsing node, and then the packets are sent to the node that forwards them according to the routing tables.

A little hardcore. The VPP authors do not compromise on the use of processor caches, so typical vector-processing code contains manual vectorization: there is a processing loop that handles the case "we have four packets in the queue", then the same for two, then for one. Prefetch instructions are often used to load data into the caches and speed up access to it on the next iterations.

n_left_from = frame->n_vectors;          // number of packets in the incoming frame
while (n_left_from > 0)
{
    // reserve space in the frame of the next node in the graph
    vlib_get_next_frame (vm, node, next_index, to_next, n_left_to_next);
    // ...
    while (n_left_from >= 4 && n_left_to_next >= 2)
    {
        // processing multiple packets at once
        u32 next0 = SAMPLE_NEXT_INTERFACE_OUTPUT;
        u32 next1 = SAMPLE_NEXT_INTERFACE_OUTPUT;
        // ...
        /* Prefetch next iteration. */
        {
            vlib_buffer_t *p2, *p3;

            p2 = vlib_get_buffer (vm, from[2]);
            p3 = vlib_get_buffer (vm, from[3]);

            vlib_prefetch_buffer_header (p2, LOAD);
            vlib_prefetch_buffer_header (p3, LOAD);

            CLIB_PREFETCH (p2->data, CLIB_CACHE_LINE_BYTES, STORE);
            CLIB_PREFETCH (p3->data, CLIB_CACHE_LINE_BYTES, STORE);
        }
        // actually process data
        /* verify speculative enqueues, maybe switch current next frame */
        vlib_validate_buffer_enqueue_x2 (vm, node, next_index,
                to_next, n_left_to_next,
                bi0, bi1, next0, next1);
    }

    while (n_left_from > 0 && n_left_to_next > 0)
    {
        // processing packets one by one
    }

    // the batch is processed; hand the frame over to the next node
    vlib_put_next_frame (vm, node, next_index, n_left_to_next);
}

So, healthchecks reach VPP over IPv6, and VPP turns them into IPv4. This is done by a node of the graph that we call algorithmic NAT. For reverse traffic (translation from IPv4 back to IPv6) there is a corresponding algorithmic NAT node.


Traffic from the balancer's clients goes through the graph nodes that perform the balancing itself.


The first node handles sticky sessions. It stores the 5-tuple hash of established sessions. The 5-tuple includes the address and port of the client sending the traffic, the address and port of the resource accepting the traffic, and the network protocol.

The 5-tuple hash lets us do less computation in the subsequent consistent hashing node and handle changes to the list of resources behind the balancer better. When a packet for which there is no session arrives at the balancer, it is sent to the consistent hashing node. This is where balancing happens using consistent hashing: we select a resource from the list of available "live" resources. Next, the packets are sent to the NAT node, which actually replaces the destination address and recalculates the checksums. As you can see, we follow the VPP rule of "like with like": similar calculations are grouped together to increase the efficiency of processor caches.
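
A simplified Go sketch of this per-packet logic (the real implementation is a VPP graph node in C; the names here are hypothetical): look up an established session by the 5-tuple hash, and only on a miss fall through to backend selection.

package main

import (
	"fmt"
	"hash/fnv"
)

// FiveTuple identifies a flow: client address/port, destination (VIP)
// address/port and the transport protocol.
type FiveTuple struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            string
}

func (t FiveTuple) hash() uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%s", t.SrcIP, t.DstIP, t.SrcPort, t.DstPort, t.Proto)
	return h.Sum64()
}

// selectBackend first checks the sticky session table; only on a miss does it
// call choose, which stands in for the consistent hashing node.
func selectBackend(sessions map[uint64]string, t FiveTuple, choose func(uint64) string) string {
	key := t.hash()
	if backend, ok := sessions[key]; ok {
		return backend // sticky session: keep the flow on the backend it already uses
	}
	backend := choose(key)
	sessions[key] = backend
	return backend
}

func main() {
	sessions := map[uint64]string{}
	backends := []string{"10.0.0.1:80", "10.0.0.2:80"}
	// Placeholder selection; the real node uses consistent hashing, discussed below.
	choose := func(h uint64) string { return backends[h%uint64(len(backends))] }
	t := FiveTuple{"198.51.100.7", "203.0.113.10", 34512, 80, "tcp"}
	fmt.Println(selectBackend(sessions, t, choose))
	fmt.Println(selectBackend(sessions, t, choose)) // same backend for the same flow
}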

Consistent hashing

Why did we choose it and what is it all about? To begin with, consider the previous task - selecting a resource from the list. 


With inconsistent hashing, a hash is computed from the incoming packet, and a resource is selected from the list by the remainder of dividing this hash by the number of resources. As long as the list remains unchanged, this scheme works well: we always send packets with the same 5-tuple to the same instance. But if, for example, some resource stops responding to healthchecks, the choice changes for a significant share of the hashes. The client's TCP connections break: a packet that previously hit instance A may start hitting instance B, which knows nothing about the session this packet belongs to.

Consistent hashing solves this problem. The easiest way to explain the concept is this: imagine a ring onto which you distribute resources by hash (for example, by IP:port). Selecting a resource is then a rotation of the wheel by an angle determined by the hash of the packet.


This minimizes the redistribution of traffic when the composition of resources changes. Removing a resource only affects the part of the consistent hashing ring it was on. Adding a resource also changes the distribution, but we have the sticky sessions node, which allows us not to switch already established sessions to new resources.
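
The effect is easy to demonstrate with a small Go sketch that compares the two schemes: remove one backend out of five and count how many of 10,000 flows get remapped under "hash mod N" versus a hash ring (the numbers are illustrative, not from our production).

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// pickMod is the "inconsistent" scheme: backend = hash mod N.
func pickMod(key string, backends []string) string {
	return backends[hash32(key)%uint32(len(backends))]
}

// pickRing is consistent hashing: backends sit on a ring, and a key goes to
// the first backend clockwise from its hash.
func pickRing(key string, backends []string) string {
	type point struct {
		pos  uint32
		node string
	}
	var ring []point
	for _, b := range backends {
		for i := 0; i < 32; i++ { // virtual nodes smooth the distribution
			ring = append(ring, point{hash32(fmt.Sprintf("%s#%d", b, i)), b})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })
	k := hash32(key)
	idx := sort.Search(len(ring), func(i int) bool { return ring[i].pos >= k }) % len(ring)
	return ring[idx].node
}

// moved counts how many flows change their backend when the list shrinks.
func moved(pick func(string, []string) string, before, after []string) int {
	n := 0
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("flow-%d", i)
		if pick(key, before) != pick(key, after) {
			n++
		}
	}
	return n
}

func main() {
	before := []string{"a", "b", "c", "d", "e"}
	after := []string{"a", "b", "c", "d"} // backend "e" failed its healthchecks
	fmt.Println("hash mod N remapped:", moved(pickMod, before, after), "of 10000 flows")
	fmt.Println("consistent hash remapped:", moved(pickRing, before, after), "of 10000 flows")
}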

We have looked at what happens to direct traffic between the balancer and the resources. Now let's look at return traffic. It follows the same pattern as check traffic - through algorithmic NAT, that is, through reverse NAT 44 for client traffic and NAT 46 for healthcheck traffic. We stick to our own scheme: healthcheck traffic and real user traffic are unified.

Loadbalancer-node and the components put together

The composition of balancers and resources in VPP is reported by the local service loadbalancer-node. It subscribes to the event stream from the loadbalancer-controller and can compute the difference between the current VPP state and the target state received from the controller. We get a closed loop: events from the API reach the balancer controller, which gives the healthcheck controller tasks to check the "liveness" of resources. The healthcheck controller, in turn, puts tasks into the healthcheck nodes and aggregates the results, after which it hands them back to the balancer controller. Loadbalancer-node subscribes to events from the controller and changes the state of the VPP. In such a system, each service knows only what it needs to know about its neighbors. The number of connections is limited, and we can operate and scale different segments independently.
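
A minimal Go sketch of the diff step that such an agent might perform (hypothetical, not the actual loadbalancer-node code): compare the currently applied set of backends with the target set and produce the add/remove lists.

package main

import "fmt"

// diff computes which backends must be added to and removed from the data
// plane so that the applied state matches the target state from the controller.
func diff(current, target []string) (toAdd, toRemove []string) {
	cur := map[string]bool{}
	for _, b := range current {
		cur[b] = true
	}
	tgt := map[string]bool{}
	for _, b := range target {
		tgt[b] = true
		if !cur[b] {
			toAdd = append(toAdd, b)
		}
	}
	for _, b := range current {
		if !tgt[b] {
			toRemove = append(toRemove, b)
		}
	}
	return toAdd, toRemove
}

func main() {
	current := []string{"10.0.0.1:80", "10.0.0.2:80"}
	target := []string{"10.0.0.2:80", "10.0.0.3:80"} // new event from the controller
	add, remove := diff(current, target)
	fmt.Println("add:", add, "remove:", remove)
}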


What problems did we manage to avoid?

All of our control plane services are written in Go and scale well and reliably. Go has many open-source libraries for building distributed systems. We actively use gRPC; all components include an open-source implementation of service discovery - our services monitor each other's health and can change their composition dynamically, and we tied this into gRPC balancing. For metrics, we also use an open-source solution. In the data plane we got decent performance and a large resource margin: it turned out to be very difficult to assemble a test rig on which we would hit the limits of VPP rather than of the physical network card.

Problems and solutions

What didn't work so well? Go has automatic memory management, but memory leaks still happen. The easiest way to avoid them is to remember to terminate the goroutines you start. Conclusion: watch the memory consumption of your Go programs. Often a good indicator is the number of goroutines. There is an upside to this story: in Go it is easy to get runtime data - memory consumption, the number of running goroutines, and many other parameters.
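
For example, a long-lived service can periodically read these runtime numbers (a sketch; in production they would be exported as metrics rather than printed):

package main

import (
	"fmt"
	"runtime"
	"time"
)

// reportRuntime periodically prints the numbers worth watching in a long-lived
// Go service: the goroutine count and heap usage.
func reportRuntime(interval time.Duration) {
	var m runtime.MemStats
	for range time.Tick(interval) {
		runtime.ReadMemStats(&m)
		fmt.Printf("goroutines=%d heap_alloc=%d MiB\n",
			runtime.NumGoroutine(), m.HeapAlloc/1024/1024)
	}
}

func main() {
	go reportRuntime(time.Second)
	time.Sleep(3 * time.Second)
}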

Also, Go may not be the best choice for functional tests. They are quite verbose, and the standard "run everything in CI in a batch" approach does not suit them well. The thing is that functional tests are more demanding on resources and involve real timeouts. Because of this, tests may fail simply because the CPU is busy with unit tests. Conclusion: whenever possible, run "heavy" tests separately from unit tests.
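
One common way to do this in Go is the standard testing.Short() flag: heavy functional tests skip themselves in the fast -short run and are executed in a separate CI job (a sketch):

package example

import (
	"testing"
	"time"
)

// TestEndToEnd is a functional test with real timeouts; it is skipped in the
// fast unit-test run (`go test -short ./...`) and executed in a separate,
// less contended CI job.
func TestEndToEnd(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping functional test in -short mode")
	}
	time.Sleep(2 * time.Second) // stands in for a real scenario with timeouts
}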

A microservice, event-driven architecture is more complicated than a monolith: collecting logs from dozens of different machines is not very convenient. Conclusion: if you are building microservices, think about tracing right away.

Our plans

We will launch an internal balancer and an IPv6 balancer, add support for Kubernetes scenarios, continue to shard our services (currently only healthcheck-node and healthcheck-ctrl are sharded), add new healthchecks, and implement smart check aggregation. We are also considering making our services even more independent - so that they communicate not directly with each other but via a message queue. An SQS-compatible service, Yandex Message Queue, has recently appeared in the Cloud.

The public release of Yandex Load Balancer took place recently. Check out the documentation for the service, manage balancers in whatever way is convenient for you, and increase the fault tolerance of your projects!

Source: habr.com
