Liveness probes in Kubernetes can be dangerous

Translator's note: Henning Jacobs, a lead engineer at Zalando, has repeatedly noticed that Kubernetes users misunderstand the purpose of liveness (and readiness) probes and how to apply them correctly. He therefore compiled his thoughts into this detailed note, which will eventually become part of the K8s documentation.


Status checks, known in Kubernetes as liveness probes (literally, "vitality tests" - transl. note), can be quite dangerous. I recommend avoiding them where possible: the only exceptions are cases where they are really necessary and you fully understand the specifics and consequences of using them. This article covers liveness and readiness checks and explains when each should and should not be used.

My colleague Sandor recently shared on Twitter the most common mistakes he encounters, including those related to the use of readiness/liveness probes:


A misconfigured livenessProbe can exacerbate high-load situations (cascading restarts combined with potentially long container/application startup) and lead to other negative consequences such as failing dependencies (see also my recent article about limiting the number of requests in the K3s + ACME bundle). It's even worse when the liveness probe is combined with a health check on an external dependency such as a database: a single database outage will restart all of your containers!

The blanket advice "don't use liveness probes" isn't very helpful on its own, so let's look at what readiness and liveness checks are actually for.

Note: most of the text below was originally part of Zalando's internal developer documentation.

Readiness and Liveness Checks

Kubernetes provides two important mechanisms called liveness probes and readiness probes. They periodically perform some action—such as sending an HTTP request, opening a TCP connection, or executing a command in a container—to confirm that the application is working as expected.

Kubernetes uses readiness probes to understand when a container is ready to receive traffic. A pod is considered ready if all of its containers are ready. One use of this mechanism is to control which pods serve as backends for Kubernetes Services (and especially Ingress).

Liveness probes help Kubernetes know when it is time to restart a container. For example, such a check can catch a deadlock where an application is stuck in one place. Restarting the container in this state can make the application available again despite the bug, but it can also lead to cascading failures (see below).
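For reference, a liveness probe is declared the same way as a readiness probe. Here is a minimal sketch (the container name, path, and port are placeholders, not part of the original example):

```yaml
# hypothetical pod template fragment; names, path, and port are placeholders
podTemplate:
  spec:
    containers:
    - name: my-container
      livenessProbe:
        httpGet:
          path: /alive
          port: 8080
```

A failed livenessProbe causes a container restart, which is exactly why the rest of this article urges caution before adding one.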

If you try to deploy an application update that fails its liveness/readiness checks, the rollout will stall as Kubernetes waits for all pods to become Ready.

Example

Here is an example of a readiness probe that checks the path /health over HTTP with the default settings (interval: 10 seconds, timeout: 1 second, success threshold: 1, failure threshold: 3):

# part of the overall deployment/stack description
podTemplate:
  spec:
    containers:
    - name: my-container
      # ...
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
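The same probe with the default parameters spelled out explicitly looks like this (these field names and values are the documented Kubernetes defaults):

```yaml
podTemplate:
  spec:
    containers:
    - name: my-container
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10      # interval between checks
        timeoutSeconds: 1      # each request must answer within 1 second
        successThreshold: 1    # one success marks the pod ready
        failureThreshold: 3    # three consecutive failures mark it not-ready
```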

Recommendations

  1. For microservices with an HTTP endpoint (REST, etc.), always define a readiness probe that checks whether the application (pod) is ready to receive traffic.
  2. Make sure the readiness probe covers the readiness of the actual web server port:
    • if you use a separate "admin" or "management" port (for example, 9090) for the readinessProbe, make sure the endpoint returns OK only when the main HTTP port (e.g. 8080) is ready to receive traffic*;

      * I know of at least one case at Zalando where this did not happen: the readinessProbe checked the "management" port, but the server itself never started serving due to problems loading the cache.

    • pointing a readiness probe at a separate port can mean that congestion on the main port is not reflected in the health check (i.e., the server's thread pool is full, but the health check still reports that everything is OK).
  3. Make sure your readiness probe reflects database initialization/migrations:
    • the easiest way to achieve this is to start the HTTP server only after initialization has completed (for example, Flyway database migrations); that is, instead of changing the health check status, simply do not start the web server until the database migration has finished*.

      * You can also run database migrations from init containers outside of the pod. I'm still a fan of self-contained applications, that is, those in which the application container, without external coordination, knows how to bring the database into the desired state.

  4. Use httpGet for readiness checks against a typical health-check endpoint (for example, /health).
  5. Understand the default check parameters (interval: 10s, timeout: 1s, successThreshold: 1, failureThreshold: 3):
    • with the defaults, a pod becomes not-ready after about 30 seconds (3 failed health checks).
  6. Use a separate port for "admin" or "management" if the technology stack (e.g. Java/Spring) allows it, to separate "health" and metrics management from regular traffic:
    • but don't forget point 2.
  7. If necessary, a readiness probe can be used for cache warm-up/loading: return a 503 status code until the container has "warmed up":
    • I also recommend taking a look at the new startupProbe check, introduced in version 1.16 (we wrote about it in Russian here - transl. note).
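For slow-starting containers, the startupProbe mentioned in point 7 can be sketched like this (the endpoint and thresholds here are illustrative, not prescribed by the original text):

```yaml
podTemplate:
  spec:
    containers:
    - name: my-container
      startupProbe:            # runs first; other probes wait for it to succeed
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # allow up to 30 * 10s = 300s for startup
      readinessProbe:          # takes over once the startup probe has passed
        httpGet:
          path: /health
          port: 8080
```

This keeps a long cache warm-up from being misread as a failure, without loosening the readiness thresholds used during normal operation.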

Caveats

  1. Don't rely on external dependencies (such as data stores) in your readiness/liveness probes - this can lead to cascading failures:
    • as an example, take a stateful REST service with 10 pods that depends on a single Postgres database: if the probe depends on a working database connection, all 10 pods can fail when there is latency on the network/database side - which usually makes things worse than they need to be;
    • note that Spring Data checks the database connection by default*;

      * This is the default behavior of Spring Data Redis (at least it was the last time I checked), and it led to a "catastrophic" failure: when Redis was briefly unavailable, all the pods went down.

    • "external" in this sense can also mean other pods in the same application, so ideally the check should not depend on the state of other pods in the same cluster to prevent cascading crashes:
      • results may vary for applications with distributed state (e.g. in-memory caching in pods).
  2. Do not use a liveness probe for pods (the exceptions are cases when they are really needed and you are fully aware of the specifics and consequences of their use):
    • a liveness probe can help recover hung containers, but since you have full control over your application, things like hung processes and deadlocks ideally should not happen: a better alternative is to deliberately crash the application to return it to a known steady state;
    • a failed liveness probe will cause the container to restart, thereby potentially exacerbating the effects of boot-related errors: restarting the container will result in idle time (at least for the duration of the application startup, say 30+ seconds), causing new errors, increasing load on other containers and making them more likely to fail, etc.;
    • a liveness check combined with an external dependency is the worst possible combination, threatening cascading failures: a slight delay on the database side will restart all of your containers!
  3. Parameters of liveness and readiness checks should be different:
    • you can use a liveness probe with the same health check but a higher threshold (failureThreshold): for example, mark the pod not-ready after 3 failed attempts, and consider the liveness probe failed only after 10 attempts;
  4. Don't use exec checks, as there are known issues with them that lead to zombie processes.
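If you do decide to add a liveness probe, the different-thresholds advice from point 3 above can be sketched as follows (same endpoint for both probes; the threshold values are illustrative):

```yaml
podTemplate:
  spec:
    containers:
    - name: my-container
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 3    # not-ready after 3 failed checks
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 10   # restart only after 10 failed checks
```

With these settings the pod is first taken out of load balancing; it is restarted only if the problem persists much longer, which reduces the risk of restart storms.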

Summary

  • Use readiness probes to determine when a pod is ready to receive traffic.
  • Only use liveness probes when you really need them.
  • Incorrect use of readiness/liveness probes can lead to reduced availability and cascading failures.


Additional materials on the topic

Update #1 from 2019-09-29

About init containers for database migration: footnote added.

EJ reminded me about PDBs: one of the problems with liveness checks is the lack of coordination between pods. Kubernetes has Pod Disruption Budgets (PDBs) to limit the number of concurrent failures an application can experience; however, probes do not take PDBs into account. Ideally, we could tell K8s: "restart one pod if its probe fails, but don't restart them all and make things worse."
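For completeness, a PodDisruptionBudget looks like this (using the policy/v1 API available in current Kubernetes versions; the names and label are placeholders). Note that, as said above, it only limits voluntary disruptions and is not consulted by probe-triggered restarts:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb           # placeholder name
spec:
  maxUnavailable: 1          # at most one pod may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: my-app            # placeholder label selecting the application's pods
```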

Bryan put it very well: "Use liveness probes when you know that the best thing you can do is kill the application" (again, don't get carried away).


Update #2 from 2019-09-29

Regarding reading the documentation before use: I created a feature request to improve the documentation on liveness probes.


Source: habr.com
