Is Kafka on Kubernetes good?

Welcome, Habr!

At one time we were the first to bring Kafka to the Russian market, and we continue to follow its development. In particular, we found the topic of how Kafka and Kubernetes interact especially interesting. A broad (and rather cautious) article on this subject was published on the Confluent blog back in October of last year, authored by Gwen Shapira. Today we want to draw your attention to a more recent article from April by Johann Geiger, who, despite the question mark in the title, treats the subject in a more concrete way, accompanying the text with interesting links. Please forgive us the free translation of "chaos monkey", if you can!

Is Kafka on Kubernetes good?

Introduction

Kubernetes is designed for stateless workloads. Typically, such workloads take the form of a microservice architecture: they are lightweight, scale well horizontally, follow the principles of 12-factor applications, and play nicely with circuit breakers and chaos monkeys.

Kafka, on the other hand, essentially acts as a distributed database. So when working with it you have to deal with state, and it is much heavier than a microservice. Kubernetes supports stateful workloads, but as Kelsey Hightower points out in two of his tweets, they should be handled with care:

It seems to some that if you roll a stateful workload onto Kubernetes, it turns into a fully managed database that can compete with RDS. This is wrong. Maybe, if you work hard enough, bolt on additional components and bring in a team of SRE engineers, you will be able to build RDS on top of Kubernetes.

I always recommend that everyone exercise extreme caution when running stateful workloads on Kubernetes. Most of those who ask "can I run stateful workloads on Kubernetes" do not have sufficient experience with Kubernetes, and often with the workload they are asking about, either.

So, should you run Kafka on Kubernetes? Counter question: will Kafka work better without Kubernetes? That is why in this article I want to highlight how Kafka and Kubernetes complement each other, and what pitfalls you may run into when combining them.

Runtime

Let's start with the basics: the runtime environment itself.

Processors

Kafka brokers are easy on the CPU. TLS may introduce some overhead. Kafka clients can be more CPU intensive if they use encryption, but that does not affect the brokers.

Memory

Kafka brokers are memory hungry. The JVM heap can usually be limited to 4-5 GB, but you will also need plenty of system memory, because Kafka makes very heavy use of the page cache. In Kubernetes, set container resource limits and requests accordingly.
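
As a minimal sketch of what that might look like (the image, heap size and resource figures below are illustrative assumptions, not recommendations), a broker container in a Pod or StatefulSet spec can request memory well above the heap to leave room for the page cache:

    # Fragment of a Pod/StatefulSet container spec; image and sizes are illustrative.
    containers:
      - name: kafka
        image: confluentinc/cp-kafka:5.2.1     # example image, pick your own
        env:
          - name: KAFKA_HEAP_OPTS              # keep the JVM heap modest...
            value: "-Xms4g -Xmx4g"
        resources:
          requests:
            cpu: "1"
            memory: 8Gi                        # ...and leave headroom for the page cache
          limits:
            cpu: "2"
            memory: 8Gi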

Data store

Storage in containers is ephemeral: data is lost after a restart. You can use an emptyDir volume for Kafka data, and the effect will be the same: your broker data is gone when the container terminates. Your messages may still be available on other brokers as replicas, so after a restart the failed broker first has to replicate all of its data, and that can take a long time.

That's why you should use durable storage: non-local persistent block storage with XFS or, rather, ext4 as the file system. Don't use NFS. I warned you. Neither NFS v3 nor v4 will work. In short, a Kafka broker will crash if it cannot delete the data directory because of the "silly rename" problem that plagues NFS. If I haven't convinced you yet, read this article very carefully. The storage should be non-local so that Kubernetes can pick a new node more flexibly after a restart or relocation.
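
A minimal sketch of how a StatefulSet can request non-local persistent storage through volume claim templates (the storage class name here is a placeholder; use whatever network-backed block storage your cluster offers):

    # Fragment of a StatefulSet spec; the storage class name is a placeholder.
    volumeClaimTemplates:
      - metadata:
          name: kafka-data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: ssd          # hypothetical class backed by network block storage
          resources:
            requests:
              storage: 500Gi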

Network

As with most distributed systems, Kafka's performance depends heavily on keeping network latency as low as possible and bandwidth as high as possible. Don't try to host all the brokers on the same node, as that reduces availability: if that Kubernetes node fails, the entire Kafka cluster fails with it. Also, don't stretch the Kafka cluster across whole data centers; the same goes for the Kubernetes cluster. A good compromise is to use different availability zones.
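
One way to express this in Kubernetes is pod anti-affinity. A sketch, assuming the broker Pods carry an app: kafka label (the label and the zone key are assumptions; newer clusters use topology.kubernetes.io/zone):

    # Fragment of a Pod template spec; the "app: kafka" label is an assumption.
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: kafka
            topologyKey: kubernetes.io/hostname          # never two brokers on one node
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: failure-domain.beta.kubernetes.io/zone   # prefer spreading across AZs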

Configuration

Plain manifests

The Kubernetes website has a very good tutorial on how to set up ZooKeeper using manifests. Since ZooKeeper is part of Kafka, it is a convenient place to start getting familiar with the Kubernetes concepts that apply here. Once you have understood them, you can use the same concepts with a Kafka cluster; a minimal sketch follows the list below.

  • Pod: a Pod is the smallest deployable unit in Kubernetes. A Pod contains your workload and corresponds to a process in your cluster. A Pod contains one or more containers. Each ZooKeeper server in the ensemble and each broker in the Kafka cluster runs in a separate Pod.
  • StatefulSet: a StatefulSet is a Kubernetes object that handles multiple stateful workloads that require coordination. StatefulSets provide guarantees about the ordering and uniqueness of their Pods.
  • Headless services: services decouple Pods from clients behind a logical name, with Kubernetes taking care of load balancing. However, with stateful workloads such as ZooKeeper and Kafka, clients need to talk to a specific instance. This is where headless services come in: the client still uses a logical name, but reaches the Pod directly.
  • Persistent volumes: these volumes are needed for the non-local persistent block storage configuration mentioned above.
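
Here is the minimal sketch mentioned above: a headless Service plus a StatefulSet for the brokers. It is deliberately stripped down (names, image and ports are illustrative, and it omits storage, probes and configuration), just to show how the pieces fit together:

    # Minimal, non-production sketch: headless Service + StatefulSet for Kafka brokers.
    apiVersion: v1
    kind: Service
    metadata:
      name: kafka-headless
    spec:
      clusterIP: None                 # headless: gives each broker a stable DNS name
      selector:
        app: kafka
      ports:
        - name: broker
          port: 9092
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: kafka
    spec:
      serviceName: kafka-headless
      replicas: 3
      selector:
        matchLabels:
          app: kafka
      template:
        metadata:
          labels:
            app: kafka
        spec:
          containers:
            - name: kafka
              image: confluentinc/cp-kafka:5.2.1   # example image
              ports:
                - containerPort: 9092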

Yolean provides a comprehensive set of manifests to help you get started with Kafka on Kubernetes.

Helm charts

Helm is a package manager for Kubernetes, comparable to OS package managers like yum, apt, Homebrew or Chocolatey. It makes it easy to install predefined software packages described in Helm charts. A well-crafted Helm chart takes the pain out of correctly configuring all the parameters for running Kafka on Kubernetes. There are several Kafka charts: the official one is in incubator status, there is one from Confluent, and another one from Bitnami.
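
The typical workflow is to override only the values you care about in a values.yaml file and pass it to helm install -f values.yaml. The keys below are purely illustrative assumptions; they differ between charts and versions, so always check the values.yaml that ships with the chart you pick:

    # Hypothetical values.yaml overrides for a Kafka Helm chart;
    # real keys vary between charts and versions - check the chart's own values.yaml.
    replicaCount: 3
    heapOpts: "-Xms4g -Xmx4g"
    persistence:
      enabled: true
      size: 500Gi
    resources:
      requests:
        cpu: "1"
        memory: 8Gi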

Operators

Since Helm has its drawbacks, another tool is gaining popularity: Kubernetes operators. An operator not only packages software for Kubernetes, but also lets you deploy and manage that software.

The awesome operators list mentions two operators for Kafka. One of them is Strimzi. With Strimzi, getting a Kafka cluster up and running takes a matter of minutes. Almost no configuration is required, and the operator itself provides some nice extras, for example point-to-point TLS encryption within the cluster. Confluent also provides an operator of its own.
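
With Strimzi, the whole cluster is described by a single custom resource. The sketch below follows the shape of the Strimzi examples, but the apiVersion and supported fields depend on the operator version, so treat it as an assumption and check the Strimzi documentation:

    # Rough sketch of a Strimzi Kafka custom resource; verify the apiVersion and
    # fields against the documentation of the operator version you run.
    apiVersion: kafka.strimzi.io/v1beta1
    kind: Kafka
    metadata:
      name: my-cluster
    spec:
      kafka:
        replicas: 3
        listeners:
          tls: {}                    # TLS-encrypted listener inside the cluster
        storage:
          type: persistent-claim
          size: 100Gi
      zookeeper:
        replicas: 3
        storage:
          type: persistent-claim
          size: 10Gi
      entityOperator:
        topicOperator: {}            # enables topic management via KafkaTopic resources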

Performance

It is very important to run performance tests to benchmark your Kafka installation. They will help you find potential bottlenecks before problems start. Luckily, Kafka already ships with two performance testing tools: kafka-producer-perf-test.sh and kafka-consumer-perf-test.sh. Use them actively. For reference, you can look at the results described in this post by Jay Kreps, or use this review of Amazon MSK by Stephane Maarek as a baseline.
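
One convenient way to run such a test from inside the cluster is a short-lived Kubernetes Job. A sketch, where the image, topic name and bootstrap address are assumptions (in the Confluent images the tool is on the PATH without the .sh suffix):

    # Sketch of a one-off producer load test run as a Kubernetes Job;
    # image, topic and bootstrap address are assumptions.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: kafka-producer-perf-test
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: perf-test
              image: confluentinc/cp-kafka:5.2.1
              command:
                - kafka-producer-perf-test
                - --topic
                - perf-test
                - --num-records
                - "1000000"
                - --record-size
                - "1024"
                - --throughput
                - "-1"                          # -1 = no throttling
                - --producer-props
                - bootstrap.servers=kafka-headless:9092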

Operations

Monitoring

Transparency in the system is very important; otherwise you will not understand what is happening in it. Today there is a solid ecosystem of tools that provide metrics-based monitoring in the cloud-native style. Two popular tools for this purpose are Prometheus and Grafana. Prometheus can collect metrics from all the Java processes (Kafka, ZooKeeper, Kafka Connect) quite simply, using a JMX exporter. If you add cAdvisor metrics, you get a fuller picture of how resources are used in Kubernetes.
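
A minimal sketch of a Prometheus scrape job that discovers the broker Pods; it assumes the Pods carry an app: kafka label and the common prometheus.io/scrape: "true" annotation, and that the JMX exporter port is exposed on the Pod (all of which are assumptions about your setup):

    # Fragment of prometheus.yml; assumes broker Pods are labelled app=kafka
    # and annotated with prometheus.io/scrape: "true".
    scrape_configs:
      - job_name: kafka
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: kafka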

Strimzi has a very handy example of a Grafana dashboard for Kafka. It visualizes key metrics, such as under-replicated partitions or partitions that are offline. Everything is very clear there. These metrics are complemented by resource usage and performance information, as well as stability indicators. So you get basic Kafka cluster monitoring for free!

[Screenshot: Strimzi's Grafana dashboard for Kafka]

Source: streamzi.io/docs/master/#kafka_dashboard

It would be nice to complement all this with client monitoring (consumer and producer metrics), lag monitoring (there is Burrow for that) and end-to-end monitoring, for which you can use Kafka Monitor.

Logging

Logging is another critical task. Make sure all the containers in your Kafka installation log to stdout and stderr, and make sure your Kubernetes cluster aggregates all logs into a central logging infrastructure, such as Elasticsearch.
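
As an illustration, the broker's log4j configuration can be kept in a ConfigMap that writes to the console only; the property names follow the standard log4j configuration Kafka ships with, while the ConfigMap name and how you mount it are left up to you:

    # Sketch: a ConfigMap with a log4j configuration that sends broker logs to
    # stdout only, so the cluster's log collector can pick them up.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: kafka-log4j
    data:
      log4j.properties: |
        log4j.rootLogger=INFO, stdout
        log4j.appender.stdout=org.apache.log4j.ConsoleAppender
        log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
        log4j.appender.stdout.layout.ConversionPattern=[%d] %p %m (%c)%n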

Health checks

Kubernetes uses liveness and readiness probes to check whether your Pods are working properly. If the liveness probe fails, Kubernetes stops that container and automatically restarts it, provided the restart policy is set accordingly. If the readiness probe fails, Kubernetes stops routing requests to that Pod. So in these cases no manual intervention is needed at all, which is a big plus.
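
A sketch of what such probes might look like for a broker container; the port, timings and check command are assumptions (Helm charts and operators ship their own variants):

    # Fragment of a broker container spec; port, timings and commands are assumptions.
    readinessProbe:
      tcpSocket:
        port: 9092                 # ready once the broker accepts connections
      initialDelaySeconds: 30
      periodSeconds: 10
    livenessProbe:
      exec:
        command: ["sh", "-c", "kafka-broker-api-versions --bootstrap-server localhost:9092"]
      initialDelaySeconds: 60
      periodSeconds: 30
      timeoutSeconds: 10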

Rolling out updates

StatefulSets support automatic updates: when choosing the RollingUpdate strategy, each Kafka Pod will be updated in turn. In this way, downtime can be reduced to zero.
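
In the StatefulSet spec this boils down to a single setting, shown below; whether the cluster really stays available during the rollout still depends on your replication factor and min.insync.replicas settings:

    # Fragment of a StatefulSet spec: replace broker Pods one at a time.
    updateStrategy:
      type: RollingUpdate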

Scaling

Scaling a Kafka cluster is not an easy task. However, Kubernetes makes it very easy to scale Pods to a given number of replicas, which means you can declaratively define as many Kafka brokers as you want. The hardest part is reassigning partitions after scaling up or before scaling down; Kafka provides tooling for that, but Kubernetes will not handle it for you.
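
The scale-out itself is declarative, as in the sketch below; the partition reassignment afterwards is a Kafka-level operation (for example with kafka-reassign-partitions.sh), not something Kubernetes triggers:

    # Fragment of a StatefulSet spec: going from 3 to 5 brokers is a one-line change,
    # but existing partitions still have to be reassigned to the new brokers.
    spec:
      replicas: 5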

Administration

Administrative tasks for your Kafka cluster, such as creating topics and reassigning partitions, can be done with the bundled shell scripts by opening a command line interface into your Pods. However, this solution is not very pretty. Strimzi supports managing topics with a separate operator. There is room for improvement here.
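
With Strimzi's Topic Operator, a topic becomes a declarative resource instead of a shell command. A sketch (again, verify the apiVersion against your Strimzi version, and the cluster label must match the name of your Kafka resource):

    # Sketch of a Strimzi KafkaTopic resource managed by the Topic Operator;
    # verify the apiVersion against the Strimzi version you run.
    apiVersion: kafka.strimzi.io/v1beta1
    kind: KafkaTopic
    metadata:
      name: my-topic
      labels:
        strimzi.io/cluster: my-cluster   # must match the name of the Kafka resource
    spec:
      partitions: 12
      replicas: 3
      config:
        retention.ms: 604800000          # 7 days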

Backup & Restore

Now the availability of Kafka also depends on the availability of Kubernetes. If your Kubernetes cluster goes down, in the worst case your Kafka cluster goes down with it. According to Murphy's law, this will definitely happen, and you will lose data. To mitigate this kind of risk, have a good backup concept in place. You can use MirrorMaker; another option is to use S3, as described in this post from Zalando.

Conclusion

For small and medium-sized Kafka clusters, Kubernetes is definitely worth using: it provides additional flexibility and simplifies operational work. If you have very strict non-functional requirements around latency and/or throughput, a different deployment option might be a better choice.

Source: habr.com
