Load balancing and scaling long-lived connections in Kubernetes
This article will help you understand how load balancing works in Kubernetes, what happens when you scale applications that use long-lived connections, and why you should consider client-side load balancing if you use HTTP/2, gRPC, RSocket, AMQP, or other protocols that keep connections open.

A little about how traffic is distributed in Kubernetes

Kubernetes provides two convenient abstractions for rolling out applications: Services and Deployments.

A Deployment describes how many copies of your application should be running at any given time and how they should be rolled out. Each copy runs as a pod and is assigned its own IP address.

Services are similar in function to a load balancer. They are designed to distribute traffic across multiple pods.

Let's see what it looks like.

  1. In the diagram below, you see three instances of the same application and a load balancer:

  2. The load balancer is called a Service and is assigned an IP address. Any incoming request is redirected to one of the pods:

  3. The Deployment determines how many instances of the application should run. You almost never have to work with pods directly:

  4. Each pod is assigned its own IP address:

It is useful to think of services as a set of IP addresses. Each time you access the service, one of the IP addresses is selected from the list and used as the destination address.

It looks like this.

  1. A curl request is sent to the service IP address 10.96.45.152:

  2. The service selects one of three pod addresses as the destination:

  3. Traffic is redirected to a specific pod:

If your application consists of a front-end and a back-end, then you will have both a service and a deployment for each.

When the frontend makes a request to the backend, it doesn't need to know how many pods serve the backend: there can be one, ten, or a hundred.

Also, the frontend knows nothing about the addresses of the pods serving the backend.

When the frontend makes a request to the backend, it uses the backend's service IP address, which does not change.
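
As an illustration, here is a minimal Node.js sketch of a frontend calling the backend through its service name. The name "backend" and the path are placeholders; inside a cluster the name would be resolved by the internal DNS to the service IP:

const http = require('http');

// The frontend addresses the Service, not individual pods.
// Cluster DNS resolves "backend" to the Service IP address.
http.get('http://backend/api/items', (res) => {
  res.resume(); // consume the response body
});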

Here's how it looks.

  1. Pod 1 needs to call an internal backend component. Instead of choosing a specific backend pod, it makes a request to the service:

  2. The service selects one of the backend pods as the destination address:

  3. Traffic goes from pod 1 to pod 5, selected by the service:

  4. Pod 1 doesn't know exactly how many Pods like Pod 5 are hidden behind the service:

But how exactly does the service distribute requests? Is it round-robin balancing? Let's figure it out.

Balancing in Kubernetes services

Kubernetes services do not exist as processes. There is no process listening on the IP address and port allocated to a service.

You can verify this by going to any cluster node and running the netstat -ntlp command.

You won't find any process listening on the IP address allocated to the service.

The service's IP address lives in the control plane: the controller allocates it and records it in the database, etcd. The same address is used by another component: kube-proxy.
Kube-proxy gets a list of IP addresses for all services and generates a set of iptables rules on each cluster node.

These rules say: "If we see the IP address of the service, we need to modify the destination address of the request and send it to one of the pods."

The service IP address is used only as an entry point and is not served by any process listening on that IP address and port.

Let's see how this works.

  1. Consider a cluster of three nodes. Each node has pods:

  2. Connected pods, colored in beige, are part of the service. Since the service does not exist as a process, it is shown in grey:

  3. The first Pod requests a service and must hit one of the associated Pods:

  4. But the service doesn't exist, the process doesn't exist. How does it work?

  5. Before a request leaves the node, it goes through the iptables rules:

  6. The iptables rules know that the service is not present and replace its IP address with one of the IP addresses of the pods associated with this service:

  7. The request receives a valid IP address as the destination address and is processed normally:

  8. Depending on the network topology, the request eventually reaches the pod:

Can iptables load balance?

No, iptables is used for filtering and was not designed for balancing.

However, it is possible to write a set of rules that work like a pseudo-balancer.

And that's exactly what Kubernetes does.

If you have three pods, kube-proxy will write the following rules:

  1. Choose the first pod with a probability of 33%; otherwise, go to the next rule.
  2. Choose the second pod with a probability of 50%; otherwise, go to the next rule.
  3. Choose the third pod.

This system results in each pod being selected with a probability of 33%.
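
Why does this give an even split? The first rule matches with probability 1/3; if it doesn't, the second matches with probability 1/2 of the remaining 2/3, which is again 1/3; the third rule catches the last 1/3. A minimal JavaScript sketch of this selection logic (not the actual iptables syntax) looks like this:

function pickPod() {
  // Rule 1: pick the first pod with probability 1/3, otherwise fall through.
  if (Math.random() < 1 / 3) return 'pod-1';
  // Rule 2: pick the second pod with probability 1/2, otherwise fall through.
  if (Math.random() < 1 / 2) return 'pod-2';
  // Rule 3: pick the third pod unconditionally.
  return 'pod-3';
}

// Over many new connections each pod receives roughly a third of the traffic.
const counts = { 'pod-1': 0, 'pod-2': 0, 'pod-3': 0 };
for (let i = 0; i < 100000; i++) counts[pickPod()]++;
console.log(counts);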

And there is no guarantee that Pod 2 will be selected next after Pod 1.

Note: iptables uses the statistic module in random mode, so the balancing algorithm is based on random choice.

Now that you understand how services work, let's look at more interesting scenarios.

Long-lived connections in Kubernetes do not scale by default

Each HTTP request from the frontend to the backend is served by a separate TCP connection that is opened and closed.

If the frontend sends 100 requests per second to the backend, then 100 different TCP connections are opened and closed.

You can reduce request latency and server load by opening a single TCP connection and reusing it for subsequent HTTP requests.

The HTTP protocol has a feature called HTTP keep-alive, or connection reuse. In this case, one TCP connection is used to send and receive many HTTP requests and responses:

Load balancing and scaling long-lived connections in Kubernetes

This feature is not enabled by default: both server and client must be configured accordingly.

The setup itself is simple and available in most programming languages and HTTP client libraries.
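
For example, here is a minimal Node.js sketch; the backend host name is a placeholder. The client reuses one TCP connection via a keep-alive agent, and the server keeps idle connections open:

const http = require('http');

// Client side: a keep-alive agent reuses the same TCP connection
// for consecutive requests to the same host.
const keepAliveAgent = new http.Agent({ keepAlive: true });
http.get({ host: 'backend', path: '/', agent: keepAliveAgent }, (res) => {
  res.resume();
});

// Server side: keep idle connections open for up to 60 seconds.
const server = http.createServer((req, res) => res.end('ok'));
server.keepAliveTimeout = 60000;
server.listen(8080);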

What happens if we use keep-alive in a Kubernetes service?

Let's assume that both the frontend and the backend support keep-alive.

We have one copy of the frontend and three instances of the backend. The frontend makes the first request and opens a TCP connection to the backend. The request reaches the service, one of the backend pods is chosen as the destination address. The backend sends a response, and the frontend receives it.

Unlike the normal situation where the TCP connection is closed after receiving a response, it is now kept open for the next HTTP requests.

What happens if the frontend sends more requests to the backend?

These requests will be sent over the already open TCP connection, so all of them will go to the same backend pod that received the first request.

Shouldn't iptables redistribute the traffic?

Not in this case.

When a TCP connection is created, it goes through the iptables rules, which select the specific backend pod where the traffic will go.

Since all subsequent requests are over an already open TCP connection, the iptables rules are no longer invoked.

Let's see what it looks like.

  1. The first pod sends a request to the service:

  2. You already know what will happen next. The service does not exist, but there are iptables rules that will process the request:

  3. One of the backend pods will be selected as the destination:

  4. The request reaches the pod. At this point, a persistent TCP connection between the two pods is established:

  5. Any next request from the first pod will go over the already established connection:

As a result, you get faster response and higher throughput, but you lose the ability to scale your backend.

Even if your backend has two pods, with a persistent connection the traffic will always go to one of them.

Can this be fixed?

Since Kubernetes doesn't know how to balance persistent connections, this task is up to you.

A service is backed by a set of IP addresses and ports called endpoints.

Your application can get a list of endpoints from a service and decide how to distribute requests among them. You can open a persistent connection to each pod and balance requests between these connections using round-robin.

Or apply more complex balancing algorithms.

The client-side code responsible for balancing should follow this logic (a minimal sketch is shown after the list):

  1. Get a list of endpoints from the service.
  2. For each endpoint, open a persistent connection.
  3. When you need to make a request, use one of the open connections.
  4. Regularly update the list of endpoints, create new or close old persistent connections if the list changes.
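
Here is a minimal sketch of this logic in Node.js. The endpoint list is hard-coded as an assumption; in a real application it would come from the Kubernetes Endpoints API or from resolving a headless service, and it would be refreshed periodically:

const http = require('http');

// Assumed pod IP addresses; in practice, fetch and refresh this list.
let endpoints = ['10.0.1.10', '10.0.1.11', '10.0.1.12'];
let next = 0;

// One keep-alive agent per endpoint maintains a persistent connection to each pod.
const agents = new Map(
  endpoints.map((ip) => [ip, new http.Agent({ keepAlive: true })])
);

// Plain round-robin: each request uses the next endpoint and its open connection.
function request(path, callback) {
  const ip = endpoints[next++ % endpoints.length];
  http.get({ host: ip, port: 8080, path: path, agent: agents.get(ip) }, callback);
}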

Here's what it will look like.

  1. Instead of the first Pod sending a request to the service, you can balance requests on the client side:

  2. You need to write code that asks which pods are part of the service:

  3. Once you have the list, save it on the client side and use it to connect to pods:

  4. You are responsible for the load balancing algorithm:

Now the question is: does this problem only apply to HTTP keep-alive?

Load balancing on the client side

HTTP is not the only protocol that can use persistent TCP connections.

If your application uses a database, then a TCP connection is not opened every time you need to make a query or get a document from the database. 

Instead, a persistent TCP connection to the database is opened and used.

If your database is deployed on Kubernetes and access is provided as a service, then you will run into the same problems as described in the previous section.

One database replica will be loaded more than the others. Kube-proxy and Kubernetes won't help balance connections. You must take care of balancing requests to your database.

Depending on which library you are using to connect to the database, you may have different options for solving this problem.

The following is an example of accessing a MySQL DB cluster from Node.js:

var mysql = require('mysql');
var poolCluster = mysql.createPoolCluster();

// The list of endpoints is assumed to be an array of connection configs,
// one per backend pod (for example, obtained from a headless service).
var endpoints = /* retrieve endpoints from the Service */

// Register each replica in the pool cluster under its own name.
for (var [index, endpoint] of endpoints.entries()) {
  poolCluster.add(`mysql-replica-${index}`, endpoint);
}

// Make queries to the clustered MySQL database

There are many other protocols that use persistent TCP connections:

  • WebSockets and secure WebSockets
  • HTTP/2
  • gRPC
  • RSocket
  • AMQP

You should already be familiar with most of these protocols.

But if these protocols are so popular, why isn't there a standardized balancing solution? Why is it necessary to change the client logic? Is there a native Kubernetes solution?

Kube-proxy and iptables are designed to cover the most common use cases for applications deployed to Kubernetes. They are there for convenience.

If you are running a web service that exposes a REST API, you are in luck: persistent TCP connections are not used in this case, and you can use any Kubernetes service.

But as soon as you start using persistent TCP connections, you will have to figure out how to evenly distribute the load on the backends. Kubernetes does not contain ready-made solutions for this case.

However, of course, there are options that can help.

Balancing long-lived connections in Kubernetes

There are four types of services in Kubernetes:

  1. ClusterIP
  2. NodePort
  3. LoadBalancer
  4. Headless

The first three types of service are based on a virtual IP address, which kube-proxy uses to build its iptables rules.

The headless service does not have any IP address associated with it and only provides a mechanism for getting a list of IP addresses and ports of the pods (endpoints) associated with it.

All services are based on the headless service.

The ClusterIP service is a headless service with some additions: 

  1. The control plane assigns an IP address to it.
  2. Kube-proxy generates the necessary iptables rules.

This means you can bypass kube-proxy and use the list of endpoints obtained from a headless service directly to balance load in your application.
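
For example, in Node.js the pod addresses behind a headless service can be obtained with an ordinary DNS lookup; the service name below is a placeholder:

const dns = require('dns').promises;

// A headless service's DNS name resolves to one A record per pod.
async function getEndpoints() {
  return dns.resolve4('backend-headless.default.svc.cluster.local');
}

getEndpoints().then((ips) => console.log(ips)); // e.g. ['10.0.1.10', '10.0.1.11']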

But how to add similar logic to all applications deployed in a cluster?

If your application is already deployed, then this task may seem impossible. However, there is an alternative.

Service Mesh will help you

You've probably already noticed that the client-side load balancing strategy is pretty standard.

When the application starts, it:

  1. Gets a list of IP addresses from the service.
  2. Opens and maintains a connection pool.
  3. Updates the pool periodically by adding or removing endpoints.

As soon as the application wants to make a request, it:

  1. Selects an available connection using some logic (eg round-robin).
  2. Executes the request.

These steps work for WebSockets, gRPC, and AMQP connections.

You can separate this logic into a separate library and use it in your applications.

However, service meshes such as Istio or Linkerd can be used instead.

A service mesh augments your application with a new process that:

  1. Automatically discovers service IP addresses.
  2. Inspects connections such as WebSockets and gRPC.
  3. Balances requests using the right protocol.

A service mesh helps manage traffic within the cluster, but it is quite resource-intensive. Alternatives include third-party libraries such as Netflix Ribbon or programmable proxies such as Envoy.

What happens if balancing issues are ignored?

You can choose not to deal with load balancing and still not notice any problems. Let's look at a few scenarios where that works.

If you have more clients than servers, this is not such a big problem.

Suppose there are five clients that connect to two servers. Even if there is no balancing, both servers will be used:

Connections can be unevenly distributed: perhaps four clients connected to the same server, but there is a good chance that both servers will be used.

What is more problematic is the opposite scenario.

If you have fewer clients and more servers, your resources may be underutilized and there will be a potential bottleneck.

Suppose there are two clients and five servers. At best, there will be two permanent connections to two servers out of five.

The rest of the servers will be idle:

If those two servers can't handle client requests, scaling out won't help.

Conclusion

Kubernetes services are designed to work in most common web application scenarios.

However, once you start working with application protocols that use persistent TCP connections, such as databases, gRPC, or WebSockets, services are no longer a good fit. Kubernetes does not provide internal mechanisms for balancing persistent TCP connections.

This means you must write applications with client-side balancing in mind.

Translation prepared by the Kubernetes aaS team from Mail.ru.

What else to read on the topic:

  1. Three levels of autoscaling in Kubernetes and how to use them effectively
  2. Kubernetes in the spirit of piracy with a template for implementation.
  3. Our Telegram channel about digital transformation.

Source: habr.com
