Upgrading a Kubernetes Cluster Without Downtime


The upgrade process for your Kubernetes cluster

At some point, when operating a Kubernetes cluster, you will need to update the running nodes. This may include package updates, kernel upgrades, or the deployment of new VM images. In Kubernetes terminology, this is called a "voluntary disruption".

This post is part of a series of 4 posts:

  1. This post.
  2. Graceful shutdown of pods in a Kubernetes cluster
  3. Delayed termination of a pod when it is deleted
  4. How to Avoid Downtime in a Kubernetes Cluster with PodDisruptionBudgets

(Translator's note: translations of the remaining articles in the series are expected in the near future.)

In this article, we will describe the tools Kubernetes provides to achieve zero downtime for your applications while the nodes in your cluster are being updated.

Identify the problem

We will start with a naive approach, identify its problems and potential risks, and build up the knowledge needed to solve each of the problems we encounter throughout the series. The result is a configuration that uses lifecycle hooks, readiness probes, and PodDisruptionBudgets to achieve our zero-downtime goal.

To start our journey, let's take a concrete example. Say we have a two-node Kubernetes cluster running an application with two pods behind a Service:


Let's start with two nginx pods and a Service running on our two Kubernetes cluster nodes.

We want to update the kernel version of the two worker nodes in our cluster. How will we do it? A simple solution would be to bring up new nodes with the updated configuration and then shut down the old ones once the new nodes are running. While this will work, there are a few problems with this approach:

  • When you shut down the old nodes, the pods running on them are shut down as well. What if a pod needs to clean up before shutting down gracefully? The virtualization system you are using may not wait for the cleanup process to complete.
  • What if you shut down all the nodes at the same time? You will get significant downtime while the pods are rescheduled onto the new nodes.

We need a way to gracefully migrate pods off the old nodes and make sure none of our workloads are running on a node while we make changes to it. Or, when we do a full cluster replacement as in the example (that is, we replace the VM images), we want to move running applications from the old nodes to the new ones. In both cases, we want to prevent new pods from being scheduled on the old nodes and then evict all running pods from them. To achieve these goals, we can use the kubectl drain command.

Redistributing all pods from a node

The drain operation evicts all pods from a node. During a drain, the node is first marked as unschedulable (cordoned), which prevents new pods from being scheduled on it. Then drain starts evicting pods from the node, shutting down the containers currently running on it by sending a TERM signal to the containers in each pod.
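
For example, draining a node and re-enabling it after maintenance might look like this (node-1 is a placeholder node name; --ignore-daemonsets is usually required because DaemonSet pods cannot be evicted by drain):

# Cordon the node and evict its pods.
kubectl drain node-1 --ignore-daemonsets

# After the maintenance is done, allow pods to be scheduled on the node again.
kubectl uncordon node-1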

Although kubectl drain does a great job of evicting pods, there are two other factors that can still cause disruption during the drain operation:

  • Your application must be able to shut down gracefully when it receives a TERM signal. When pods are evicted, Kubernetes sends a TERM signal to their containers and waits a specified grace period for them to stop; if they have not stopped by then, it terminates them forcibly. Either way, if your container does not handle the signal correctly, pods can still be shut down incorrectly while they are in the middle of work (for example, while a database transaction is in flight); see the sketch after this list.
  • You lose all the pods that run your application at once. Your application may be unavailable while the new containers start on the new nodes, or, if your pods are deployed without a controller, they may never restart at all.
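
As a preview of what the following posts in the series cover in detail, the first problem is usually addressed in the pod spec itself. This is a minimal sketch, assuming the nginx pod template shown later in this post; the 60-second grace period and the sleep-based preStop hook are illustrative values, not a definitive recommendation:

# Fragment of the pod template (Deployment .spec.template.spec) from the
# example later in this post; values here are illustrative only.
spec:
  terminationGracePeriodSeconds: 60   # time Kubernetes waits after TERM before sending KILL (default is 30)
  containers:
  - name: nginx
    image: nginx:1.15
    lifecycle:
      preStop:
        exec:
          # Pause briefly before TERM is sent so in-flight requests
          # can drain from the Service endpoints.
          command: ["sleep", "15"]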

Avoid Downtime

To minimize downtime from voluntary disruptions, such as a drain operation on a node, Kubernetes provides the following disruption handling features:

  • Graceful termination and lifecycle hooks
  • Readiness probes
  • PodDisruptionBudgets
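
For instance, the last of these can be expressed as a small standalone resource. This is a minimal sketch, assuming the app: nginx labels from the example below; the resource name is hypothetical, and depending on your Kubernetes version the API group may be policy/v1 or the older policy/v1beta1:

---
apiVersion: policy/v1          # use policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb              # hypothetical name for illustration
spec:
  minAvailable: 1              # never evict below one running nginx pod during a drain
  selector:
    matchLabels:
      app: nginx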

In the rest of the series, we will use these Kubernetes features to mitigate the impact of migrating pods. To make it easier to follow the main idea, we will use our example above with the following resource configuration:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15
        ports:
        - containerPort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: nginx-service
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    targetPort: 80
    port: 80

This configuration is a minimal example of a Deployment that manages the nginx pods in the cluster, together with a Service that can be used to access them.
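
If you want to follow along, the manifest can be applied and inspected with kubectl. This is a sketch in which nginx.yaml is a hypothetical filename for the configuration above:

# Apply the Deployment and Service defined above (saved here as nginx.yaml).
kubectl apply -f nginx.yaml

# Check that both replicas are running and registered behind the Service.
kubectl get deployment nginx-deployment
kubectl get endpoints nginx-service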

Throughout the series, we will iteratively expand this configuration until it includes all the features Kubernetes provides to reduce downtime.

For a fully implemented and tested version of zero-downtime Kubernetes cluster updates on AWS, as well as other resources, visit gruntwork.io.


Source: habr.com
