Best Practices for Kubernetes Containers: Health Checks

TL;DR

  • To achieve high observability of containers and microservices, logs and basic metrics alone are not enough.
  • For faster recovery and improved fault tolerance, applications should apply the High Observability Principle (HOP).
  • At the application level, HOP requires: proper logging, close monitoring, health checks, and performance/transition tracing.
  • At the Kubernetes level, use the readinessProbe and livenessProbe checks as HOP elements.

What is a Health Check Pattern?

When designing a mission-critical, highly available application, it is very important to think about fault tolerance. An application is considered fault-tolerant if it quickly recovers from a failure. A typical cloud application uses a microservices architecture, where each component is placed in a separate container. To make sure that an application on k8s is highly available, you need to follow certain patterns when you design the cluster. One of them is the Health Check Pattern. It defines how the application reports its health to k8s. This includes not only whether the pod is running, but also whether it receives requests and responds to them. The more Kubernetes knows about the health of a pod, the smarter the decisions it makes about traffic routing and load balancing. This is where the High Observability Principle comes in: the application must expose enough information about its state for the platform to respond to problems in a timely manner.

High Observability Principle (HOP)

The principle of high observability is one of the design principles for containerized applications. In a microservice architecture, services do not care how their request is processed (and rightly so), but it matters how they get responses from the receiving services. For example, to authenticate a user, one container sends an HTTP request to another, expecting a response in a certain format - that's all. Node.js can handle the request, and Python Flask can respond. Containers are black boxes with hidden contents to each other. However, the HOP principle requires each service to expose several API endpoints that report how healthy it is, as well as its availability and failover status. Kubernetes queries these endpoints to decide on the next steps for routing and load balancing.

A well-designed cloud application logs its main events to the standard output streams STDOUT and STDERR. A helper service such as Filebeat, Logstash, or Fluentd then delivers the logs to a centralized monitoring system (such as Prometheus) and a log collection system (the ELK stack). The diagram below shows how a cloud application works according to the Health Check Pattern and the High Observability Principle.

[Diagram: a cloud application built according to the Health Check Pattern and the High Observability Principle]
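
As a sketch of the log-shipping part of this setup, a Fluentd DaemonSet similar to the one below is often used; the image tag, Elasticsearch address, and namespace here are illustrative assumptions rather than values from the article.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system            # assumed namespace
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      containers:
        - name: fluentd
          # assumed image/tag of the community Fluentd DaemonSet build
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST   # where collected logs are shipped
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
          volumeMounts:
            - name: varlog
              mountPath: /var/log             # container logs are read from the node
      volumes:
        - name: varlog
          hostPath:
            path: /var/log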

How to apply the Health Check Pattern in Kubernetes?

Out of the box, k8s monitors the status of pods through one of its controllers (Deployments, ReplicaSets, DaemonSets, StatefulSets, etc.). Having found that a pod has crashed for some reason, the controller tries to restart it or reschedule it to another node. However, a pod may report that it is up and running while not actually functioning. Let's take an example: your application uses Apache as a web server, and the component is installed on several pods in the cluster. Because a library was configured incorrectly, all requests to the application are answered with a 500 code (internal server error). If you check the Deployment, the pod status looks fine, but clients see otherwise. We can describe this undesirable situation as follows:

[Diagram: pods reported as healthy by the controller while clients receive 500 errors]
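
To make the scenario concrete, such a deployment might look roughly like the sketch below; the httpd image and replica count are illustrative assumptions. No probes are defined, so the controller only sees whether the Apache process is running and keeps the pods in rotation even though every request fails.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: httpd              # assumed image; a misconfiguration makes it return 500s
          ports:
            - containerPort: 80
          # no livenessProbe or readinessProbe: the pod stays "Running" regardless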

In our example, k8s performs a process health check: the kubelet constantly monitors the state of the process in the container. As soon as it detects that the process has stopped, the kubelet restarts it. If the error can be fixed by simply restarting the application, and the program is designed to shut itself down on any error, then a process health check is all you need to follow HOP and the Health Check Pattern. Unfortunately, not all errors go away after a restart. For such cases, k8s offers two deeper ways to detect problems with a pod: livenessProbe and readinessProbe.

livenessProbe

With a livenessProbe, the kubelet can perform three types of checks, determining not only whether the pod is running, but also whether it is ready to receive requests and respond to them adequately:

  • Sending an HTTP GET request to the pod. The response must carry an HTTP status code in the 200-399 range; 5xx and 4xx codes indicate that the pod has problems, even if the process is running.
  • Establishing a TCP connection, used to check pods running non-HTTP services (such as the Postfix mail server).
  • Executing an arbitrary command inside the container. The check is considered successful if the command's exit code is 0 (see the sketch after this list).
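
A minimal sketch of the last two check types, assuming a hypothetical mail container listening on port 25 and a hypothetical /tmp/healthy marker file; these are fragments of .spec.containers[].livenessProbe, not complete manifests:

# TCP check: succeeds if the kubelet can open a connection to the given port
livenessProbe:
  tcpSocket:
    port: 25
  initialDelaySeconds: 10

# Command check: succeeds if the command exits with code 0
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  initialDelaySeconds: 5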

Here is an example of how the HTTP check works in practice. The following pod definition contains a NodeJS application that returns a 500 error for every HTTP request. To make sure the container is restarted when such an error is received, we use the livenessProbe parameter:

apiVersion: v1
kind: Pod
metadata:
 name: node500
spec:
 containers:
   - image: magalix/node500
     name: node500
     ports:
       - containerPort: 3000
         protocol: TCP
     livenessProbe:
       httpGet:
         path: /
         port: 3000
       initialDelaySeconds: 5

This is no different from any other pod definition, except that we add the .spec.containers[].livenessProbe object. The httpGet parameter specifies the path to which the HTTP GET request is sent (in our example it is /, but in production scenarios it might be something like /api/v1/status). The livenessProbe also accepts an initialDelaySeconds parameter, which tells the probe to wait the specified number of seconds before the first check. The delay is needed because the container takes time to start, and after a restart it will be unavailable for a while.
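
Besides initialDelaySeconds, a probe accepts a few other standard timing and threshold fields. The sketch below shows them with illustrative values (the numbers are assumptions, not taken from the example above):

livenessProbe:
  httpGet:
    path: /
    port: 3000
  initialDelaySeconds: 5   # wait this long before the first check
  periodSeconds: 10        # how often the check runs
  timeoutSeconds: 2        # how long to wait for a response
  failureThreshold: 3      # consecutive failures before the container is restarted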

To apply this setting to a cluster, use:

kubectl apply -f pod.yaml

After a few seconds, you can check the contents of the pod with the following command:

kubectl describe pods node500

The events at the end of the output show what happened.

As you can see, livenessProbe initiated an HTTP GET request, the container returned a 500 error (which it was programmed to do), and kubelet restarted it.

If you're wondering how the NodeJS app was programmed, here are the app.js and Dockerfile that were used:

app.js

var http = require('http');

var server = http.createServer(function(req, res) {
    // Every request is answered with HTTP 500, so the liveness probe fails
    res.writeHead(500, { "Content-type": "text/plain" });
    res.end("We have run into an error\n");
});

server.listen(3000, function() {
    console.log('Server is running at 3000')
})

Dockerfile

FROM node
COPY app.js /
EXPOSE 3000
ENTRYPOINT [ "node","/app.js" ]

It's important to note: livenessProbe only restarts the container when the probe fails. If a restart does not fix the error that is preventing the container from working, the kubelet cannot take any action to correct the problem.

readinessProbe

readinessProbe works similarly to livenessProbe (GET requests, TCP connections, and command execution), but the corrective action differs: a container in which a failure is detected is not restarted but is isolated from incoming traffic. Imagine that one of the containers is performing heavy calculations or is under heavy load, which increases response times. With a livenessProbe, the response-time check fails (controlled by the timeoutSeconds probe parameter), and the kubelet restarts the container. When the container starts, it resumes the resource-intensive task and is restarted again. This can be critical for applications where response speed matters. For example, a car on the road is waiting for a response from the server; the response is delayed, and the car gets into an accident.

Let's write a readinessProbe definition that sets the response timeout for a GET request to no more than two seconds, while the application responds to the GET request only after 5 seconds. The pod.yaml file should look like this:

apiVersion: v1
kind: Pod
metadata:
 name: nodedelayed
spec:
 containers:
   - image: afakharany/node_delayed
     name: nodedelayed
     ports:
       - containerPort: 3000
         protocol: TCP
     readinessProbe:
       httpGet:
         path: /
         port: 3000
       timeoutSeconds: 2

Deploy a pod with kubectl:

kubectl apply -f pod.yaml

Let's wait a couple of seconds, and then we'll see how readinessProbe worked:

kubectl describe pods nodedelayed

The events at the end of the output show what happened.

As you can see, Kubernetes did not restart the pod when the check time exceeded 2 seconds. Instead, the probe failed and the pod was marked as not ready: incoming connections are redirected to other, working pods.

Note that once the pod is no longer overloaded, traffic is routed to it again: the GET responses are no longer delayed.

For comparison, below is the modified app.js file:

var http = require('http');

var server = http.createServer(function(req, res) {
   const sleep = (milliseconds) => {
       return new Promise(resolve => setTimeout(resolve, milliseconds))
   }
    // Delay every response by 5 seconds so the 2-second readiness probe times out
    sleep(5000).then(() => {
        res.writeHead(200, { "Content-type": "text/plain" });
        res.end("Hello\n");
    })
});

server.listen(3000, function() {
   console.log('Server is running at 3000')
})

TL;DR
Before the advent of cloud applications, logs were the main means of monitoring and checking the state of applications. However, there was no way to take corrective action. Logs are still useful today; they should be collected and sent to a log collection system for analyzing emergencies and making decisions. [All of this could be done without cloud applications using monit, for example, but with k8s it became much easier 🙂 – ed.]

Today, corrections have to be made almost in real time, so applications can no longer be black boxes. Instead, they should expose endpoints that allow monitoring systems to query and collect valuable data about the state of processes, so that problems can be responded to instantly when necessary. This is called the Health Check Design Pattern, which follows the High Observability Principle (HOP).

Kubernetes offers two types of health checks by default: readinessProbe and livenessProbe. Both use the same types of checks (HTTP GET requests, TCP connections, and command execution). They differ in what decisions they make in response to problems in pods: livenessProbe restarts the container in the hope that the error will not reoccur, while readinessProbe isolates the pod from incoming traffic until the cause of the problem is eliminated.

Proper application design should include both types of checks and ensure that they collect enough data, especially when an exception is thrown. It should also expose the necessary API endpoints that provide important health metrics to the monitoring system (such as Prometheus).
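
As a closing sketch of that recommendation, both probes can live in one container spec roughly as follows; the image name and endpoint paths are illustrative assumptions:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-probes
spec:
  containers:
    - name: app
      image: example/app            # assumed image, for illustration only
      ports:
        - containerPort: 3000
      livenessProbe:                # restart the container if it stops responding
        httpGet:
          path: /healthz            # assumed health endpoint
          port: 3000
        initialDelaySeconds: 5
        periodSeconds: 10
      readinessProbe:               # take the pod out of rotation while it is not ready
        httpGet:
          path: /ready              # assumed readiness endpoint
          port: 3000
        timeoutSeconds: 2
        periodSeconds: 5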

Source: habr.com
