How to Connect Kubernetes Clusters in Different Data Centers
Welcome to the Kubernetes Quick Start series: a regular column featuring the most interesting questions we receive online and in our trainings, answered by a Kubernetes expert.
Today's expert is Daniele Polencic, an instructor and software developer at Learnk8s.
Quite often, infrastructure is replicated and distributed across different regions, especially in regulated environments.
If one region is unavailable, traffic is redirected to another to avoid interruptions.
With Kubernetes, you can use a similar strategy and distribute workloads across different regions.
You can have one or more clusters per team, region, environment, or a combination of these.
Your clusters can be hosted across multiple clouds and on-premises.
But how do you plan the infrastructure for this kind of geographical spread?
Should you create one large cluster spanning several cloud environments over a single network?
Or have many small clusters and find a way to control and synchronize them?
One big cluster
Creating a single cluster over a combined network is not so easy.
Imagine an outage in which connectivity between cluster segments is lost.
If you have a single master node, half of your resources will not be able to receive new commands, because they cannot reach the master.
At the same time, you are stuck with stale routing tables (kube-proxy cannot download new ones) and no new pods (kubelet cannot poll for updates).
Even worse, when Kubernetes loses sight of a node, it marks the node as unavailable and reschedules the missing pods onto the remaining nodes.
As a result, you end up with twice as many pods.
If you run one master server per region, you will run into problems with the consensus algorithm in the etcd database. (Ed. note: in fact, etcd does not have to live on the master servers; it can run on a separate group of servers within a single region. That keeps it fast, but turns that region into a single point of failure for the cluster.)
etcd uses the Raft algorithm to agree on a value before writing it to disk.
That is, a majority of instances must reach consensus before the state can be written to etcd.
If the latency between etcd instances skyrockets, as happens with three etcd instances spread across different regions, it takes a long time to agree on a value and write it to disk.
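To make the effect concrete, here is a minimal Python sketch (not from the article; the RTT numbers are purely illustrative) of why cross-region etcd is slow: a Raft leader can only commit a write once a majority of the cluster, its own vote included, has acknowledged it, so every write waits on the fastest followers needed to complete that majority.

```python
def raft_commit_latency_ms(follower_rtts_ms, cluster_size):
    """Approximate time for a Raft leader to commit one write.

    The leader's own vote counts toward the majority, so it only
    needs acknowledgements from (majority - 1) followers. The commit
    latency is then the round-trip time of the slowest follower
    within that fastest-responding group.
    """
    majority = cluster_size // 2 + 1
    acks_needed = majority - 1
    return sorted(follower_rtts_ms)[acks_needed - 1]

# Three etcd members in one region: ~1 ms follower RTTs.
print(raft_commit_latency_ms([1.2, 1.5], 3))     # 1.2

# Three etcd members in three regions: ~80 ms and ~150 ms RTTs.
# Every single write to etcd now waits on a cross-region round trip.
print(raft_commit_latency_ms([80.0, 150.0], 3))  # 80.0
```

The point of the toy model: with everything in one region the commit cost is negligible, while spreading the same three members across continents puts a wide-area round trip on the critical path of every write.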
This slowness is reflected in the Kubernetes controllers as well.
The controller manager needs more time to learn about a change and write the response back to the database.
And since there is not one controller but several, you get a chain reaction, and the entire cluster becomes very slow.
There is currently no good, proven way to stretch a single cluster over such a large network.
Basically, the developer community and the SIG-cluster group are trying to figure out how to orchestrate clusters in the same way that Kubernetes orchestrates containers.
The first attempt to manage a collection of clusters as a single object was the Kubernetes Federation tool (kubefed).
It was a promising start, but in the end kube federation never became popular, because it did not support all resource types.
It supported federated Deployments and Services, but not StatefulSets, for example.
Also, the federation configuration was passed in the form of annotations, which was not flexible.
Imagine trying to describe the split of replicas for each cluster in a federation using a single annotation.
It turned out to be a complete mess.
After kubefed v1, SIG-cluster regrouped and decided to approach the problem from a different angle.
Instead of annotations, they decided to release a controller that is installed on the clusters and configured using Custom Resource Definitions (CRDs).
For each resource that will be federated, you create a custom resource with three sections:
a standard resource definition, such as a Deployment;
a placement section, which defines how the resource is distributed across the federation;
an override section, where you can override the placement settings, such as replica counts, for a specific cluster.
Here is an example of a federated Deployment with placement and override sections.
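The manifest below is a sketch in the style of the kubefed v2 user guide; the cluster names and replica counts are illustrative, and the exact apiVersion may differ between kubefed releases.

```yaml
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  template:                 # the standard Deployment definition
    metadata:
      labels:
        app: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
          - image: nginx
            name: nginx
  placement:                # which clusters receive the resource
    clusters:
    - name: cluster1
    - name: cluster2
  overrides:                # per-cluster tweaks to the template
  - clusterName: cluster2
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
```

Here cluster1 gets the template as-is (three replicas), while the override bumps cluster2 to five replicas, which is exactly the kind of per-cluster split that a single annotation could not express.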
The developers at Booking.com went a different way from kubefed v2 and came up with Shipper, an operator for deployments across multiple clusters, regions, and clouds.
Both tools allow you to customize your multi-cluster deployment strategy (which clusters are used and how many replicas they have).
But Shipper's focus is on reducing the risk of rollout errors.
In Shipper, you define a series of steps that describe how replicas and incoming traffic are split between the previous and the current deployment.
When you push a resource to a cluster, the Shipper controller incrementally deploys that change to all the federated clusters.
That said, Shipper is quite limited.
For example, it takes Helm charts as input and does not support vanilla resources.
In general terms, Shipper works as follows.
Instead of a standard Deployment, you create an Application resource that includes a Helm chart:
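The sketch below loosely follows the examples in Shipper's documentation; the chart name, repository URL, and rollout step values are illustrative, so treat the exact fields as indicative rather than authoritative.

```yaml
apiVersion: shipper.booking.com/v1alpha1
kind: Application
metadata:
  name: super-server
spec:
  revisionHistoryLimit: 3
  template:
    chart:                  # Shipper deploys a Helm chart, not raw manifests
      name: super-server
      repoUrl: https://charts.example.com   # hypothetical chart repository
      version: "0.0.1"
    clusterRequirements:
      regions:
      - name: local
    strategy:
      steps:                # gradual rollout: capacity and traffic are split
      - name: staging       # between the old (incumbent) and new (contender)
        capacity:
          incumbent: 100
          contender: 1
        traffic:
          incumbent: 100
          contender: 0
      - name: full on
        capacity:
          incumbent: 0
          contender: 100
        traffic:
          incumbent: 0
          contender: 100
    values:
      replicaCount: 3
```

The strategy steps are the risk-mitigation piece: the "staging" step brings up a single replica of the new release with no live traffic, and only the "full on" step shifts capacity and traffic over completely.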
multi-cluster-scheduler, a project from Admiralty, takes another approach: instead of inventing a new way to interact with the cluster and wrapping resources in custom definitions, it injects itself into the standard Kubernetes lifecycle and intercepts all calls that create pods.
Each created pod is immediately replaced with a dummy pod.
The original pod goes through another scheduling cycle in which, after polling the entire federation, a placement decision is made.
Finally, the pod is delivered to the target cluster.
As a result, you have an extra pod that does nothing but take up space.
The advantage is that you do not have to write new resource types to federate your deployments.
Every resource that creates a pod is automatically ready to be federated.
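In older Admiralty releases this opt-in was expressed as a pod annotation; the sketch below shows the idea, but the annotation key is an assumption based on Admiralty's documentation at the time, so check the repository for the current mechanism.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        # Opt this pod into federated scheduling: the mutating webhook
        # replaces it with a dummy (proxy) pod and places the real one
        # in whichever cluster the federation-wide scheduler picks.
        multicluster.admiralty.io/elect: ""
    spec:
      containers:
      - name: nginx
        image: nginx
```

Note that this is still a plain Deployment: the only change is the annotation, which is exactly why any pod-creating resource becomes federable without new custom resource types.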
This is interesting, because you suddenly have deployments spread across several regions without even noticing. However, it is also quite risky, because here everything rests on magic.
While Shipper mainly tries to mitigate the impact of rollouts, multi-cluster-scheduler is more general-purpose and perhaps better suited for batch jobs.
It does not have an advanced gradual rollout mechanism.
More about multi-cluster-scheduler can be found on the official repository page.
If you want to see multi-cluster-scheduler in action, Admiralty has an interesting use case with Argo: Kubernetes workflows, events, CI and CD.
Other tools and solutions
Connecting and managing multiple clusters is a complex task, and there is no one-size-fits-all solution.
If you want to learn more about this topic, here are some resources:
Submariner by Rancher is a tool that connects the overlay networks of different Kubernetes clusters.
Cilium, a container network interface (CNI) plugin, offers a Cluster Mesh feature that allows you to combine several clusters.
That's all for today
Thanks for reading to the end!
If you know a more efficient way to connect multiple clusters, tell us, and we will add your method to the links.
Special thanks to Chris Nesbitt-Smith and Vincent De Smet (reliability engineer at swatmobile.io) for reading the article and sharing useful information about how federation works.