How to Connect Kubernetes Clusters in Different Data Centers

Welcome to the Kubernetes Quick Start series, a regular column in which a Kubernetes expert answers the most interesting questions we get online and in our trainings.

Today's expert is Daniele Polencic, an instructor and software developer at Learnk8s.

If you want an answer to your question in the next post, contact us by email or Twitter: @learnk8s.

Missed previous posts? Look for them here.

How to connect Kubernetes clusters in different data centers?

In short: kubefed v2 is coming soon, and I also advise you to look at Shipper and the multi-cluster-scheduler project.

Quite often, infrastructure is replicated and distributed across different regions, especially in regulated environments.

If one region is unavailable, traffic is redirected to another to avoid interruptions.

With Kubernetes, you can use a similar strategy and distribute workloads across different regions.

You can have one or more clusters per team, region, environment, or a combination of these.

Your clusters can be hosted across multiple clouds and on-premises.

But how should you plan the infrastructure for such a geographical spread?
Should you create one large cluster spanning several cloud environments over a single network?
Or should you have many small clusters and find a way to control and synchronize them?

One cluster to rule them all

Creating one cluster over a single network is not so easy.

Imagine there is an outage and connectivity between cluster segments is lost.

If you have one master server, half of the resources will not be able to receive new commands because they will not be able to contact the master.

At the same time, you are left with stale routing tables (kube-proxy cannot download new ones) and no new pods (the kubelet cannot poll for updates).

Even worse, if Kubernetes cannot see a node, it marks it as lost and reschedules the missing pods onto the existing nodes.

As a result, you have twice as many pods.

If you run one master server per region, you will have problems with the consensus algorithm in the etcd database. (Editor's note: in fact, etcd does not have to live on the master servers; it can run on a separate group of servers in the same region. That, however, gives you a single point of failure for the cluster. But a fast one.)

etcd uses the Raft algorithm to agree on a value before writing it to disk.
That is, a majority of the instances must reach consensus before the state can be written to etcd.

If the latency between etcd instances skyrockets, as it does with three etcd instances spread across different regions, it takes a long time to agree on a value and write it to disk.
This is reflected in the Kubernetes controllers as well.

The controller manager needs more time to learn about the change and write the response to the database.

And since there is not just one controller but several, you get a chain reaction, and the entire cluster becomes painfully slow.

etcd is so latency-sensitive that the official documentation recommends using SSDs instead of regular hard drives.
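
If you do experiment with etcd members spread across regions, the knobs you end up tuning are the heartbeat interval and the election timeout. Here is a minimal, illustrative sketch of an etcd configuration file (passed with --config-file); the member names, addresses and values are assumptions, not recommendations:

# Illustrative etcd config; names, endpoints and timings are hypothetical.
name: 'etcd-eu-west'
initial-cluster: 'etcd-eu-west=https://10.0.1.10:2380,etcd-us-east=https://10.1.1.10:2380,etcd-ap-south=https://10.2.1.10:2380'
initial-cluster-state: 'new'
# The defaults (100 ms heartbeat, 1000 ms election timeout) assume a local
# network; cross-region links force you to raise them, which in turn makes
# every write and every leader election slower.
heartbeat-interval: 500
election-timeout: 5000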

There are currently no good examples of running a single cluster over such a large network.

Basically, the developer community and the SIG-cluster group are trying to figure out how to orchestrate clusters in the same way that Kubernetes orchestrates containers.

Option 1: federate clusters with kubefed

The official answer from SIG-cluster is kubefed2, a new version of the original kube federation client and operator.

The first attempt to manage a collection of clusters as a single object was the kube federation tool.

It was a promising start, but in the end kube federation never became popular because it did not support all resources.

It supported federated Deployments and Services, but not StatefulSets, for example.
Also, the federation configuration was passed as annotations and was not flexible.

Imagine trying to describe how replicas should be split across each cluster in a federation using only annotations.

It turned out to be a complete mess.

SIG-cluster did a great job after kubefed v1 and decided to approach the problem from a different angle.

Instead of annotations, they decided to release a controller that is installed on the clusters and configured using Custom Resource Definitions (CRDs).

For each resource that you want to federate, you have a custom CRD definition with three sections:

  • a standard resource definition, such as a Deployment;
  • a placement section, where you define how the resource should be distributed across the federation;
  • an overrides section, where you can override the weights and parameters from placement for a specific cluster.

Here is an example of a FederatedDeployment with placement and overrides sections:

apiVersion: types.federation.k8s.io/v1alpha1
kind: FederatedDeployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  template:
    metadata:
      labels:
        app: nginx
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nginx
      template:
        metadata:
          labels:
            app: nginx
        spec:
          containers:
            - image: nginx
              name: nginx
  placement:
    clusterNames:
      - cluster2
      - cluster1
  overrides:
    - clusterName: cluster2
      clusterOverrides:
        - path: spec.replicas
          value: 5

As you can see, the deployment is spread across two clusters: cluster1 and cluster2.

The first cluster gets three replicas, while the second has the value overridden to 5.

If you need more control over the number of replicas, kubefed2 provides a new ReplicaSchedulingPreference object where replicas can be weighted:

apiVersion: scheduling.federation.k8s.io/v1alpha1
kind: ReplicaSchedulingPreference
metadata:
  name: test-deployment
  namespace: test-ns
spec:
  targetKind: FederatedDeployment
  totalReplicas: 9
  clusters:
    A:
      weight: 1
    B:
      weight: 2
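
In this example, the nine replicas are distributed in proportion to the weights: cluster A ends up with three replicas and cluster B with six.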

The CRD structure and API are not quite ready yet, and active work is underway in the official project repository.

Keep an eye out for kubefed2, but keep in mind that it's not yet good enough for a production environment.

Learn more about kubefed2 in the official article about kubefed2 on the Kubernetes blog and in the official kubefed project repository.

Option 2: Clustering Booking.com Style

The developers at Booking.com took a different route from kubefed v2 and came up with Shipper, an operator for deployments across multiple clusters, multiple regions, and multiple clouds.

Shipper is somewhat similar to kubefed2.

Both tools allow you to customize your multi-cluster deployment strategy (which clusters are used and how many replicas they have).

But Shipper is designed to reduce the risk of rollout errors.

In Shipper, you can define a series of steps that describe how replicas and incoming traffic are split between the previous and the current deployment.

When you push a resource to a cluster, the Shipper controller incrementally deploys that change to all the federated clusters.

That said, Shipper is quite limited.

For example, it takes Helm charts as input and does not support vanilla resources.
In general terms, Shipper works as follows.

Instead of a standard Deployment, you create an Application resource that includes a Helm chart:

apiVersion: shipper.booking.com/v1alpha1
kind: Application
metadata:
  name: super-server
spec:
  revisionHistoryLimit: 3
  template:
    chart:
      name: nginx
      repoUrl: https://storage.googleapis.com/shipper-demo
      version: 0.0.1
    clusterRequirements:
      regions:
        - name: local
    strategy:
      steps:
        - capacity:
            contender: 1
            incumbent: 100
          name: staging
          traffic:
            contender: 0
            incumbent: 100
        - capacity:
            contender: 100
            incumbent: 0
          name: full on
          traffic:
            contender: 100
            incumbent: 0
    values:
      replicaCount: 3
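
Reading the strategy in this example: the staging step brings up the new release (the contender) at minimal capacity while the incumbent keeps serving all the traffic, and the full on step then shifts both capacity and traffic entirely to the new release.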

Shipper is a good option for managing multiple clusters, but its tight coupling to Helm only gets in the way.

What if we all switch from Helm to kustomize or kapitan?

Learn more about Shipper and its philosophy in the official announcement.

If you want to dig into the code, go to the official project repository.

Option 3: "magic" cluster merging

Kubefed v2 and Shipper both approach cluster federation by giving clusters new resource types through Custom Resource Definitions.

But what if you don't want to rewrite all your Deployments, StatefulSets, DaemonSets, and so on just to federate them?

How do you include an existing cluster in a federation without changing its YAML?

multi-cluster-scheduler is an Admiralty project for scheduling workloads across clusters.

Instead of inventing a new way to interact with the cluster and wrapping resources in custom definitions, multi-cluster-scheduler hooks into the standard Kubernetes lifecycle and intercepts all the calls that create pods.

Each created pod is immediately replaced with a dummy.

multi-cluster-scheduler uses mutating admission webhooks to intercept the call and create an idle dummy pod.
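
For context, registering such an interception point looks roughly like this. The sketch below is not multi-cluster-scheduler's actual configuration; the names, namespace, service and path are made up purely to illustrate how a mutating admission webhook hooks into pod creation:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-interceptor                  # hypothetical name
webhooks:
  - name: pods.interceptor.example.com   # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      # intercept every API call that creates a pod
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        namespace: scheduler-system      # hypothetical namespace
        name: mutating-webhook           # hypothetical service that mutates the pod
        path: /mutate-pods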

The original pod goes through another scheduling cycle in which, after polling the entire federation, a placement decision is made.

Finally, the pod is delivered to the target cluster.

As a result, you have an extra pod that does nothing, just takes up space.

The advantage is that you don't have to write new resource types to federate your deployments.

Each resource that creates a pod is automatically ready to be federated.

This is interesting because you suddenly have deployments distributed across several regions without even noticing. However, it is also quite risky, because everything here rests on magic.

But while Shipper mostly tries to mitigate the impact of rollouts, multi-cluster-scheduler is more general-purpose and perhaps better suited to batch jobs.

It does not have an advanced gradual rollout mechanism.

You can find out more about multi-cluster-scheduler on the official repository page.

If you want to see multi-cluster-scheduler in action, Admiralty has an interesting use case with Argo: Kubernetes-native workflows, events, CI and CD.

Other tools and solutions

Connecting and managing multiple clusters is a complex task, and there is no one-size-fits-all solution.

If you want to learn more about this topic, here are some resources:

That's all for today

Thanks for reading to the end!

If you know a more efficient way to connect multiple clusters, tell us about it.

We will add your method to the links.

Special thanks to Chris Nesbitt-Smith and Vincent De Smet (reliability engineer at swatmobile.io) for reading the article and sharing useful insights into how federation works.

Source: habr.com
