Logs in Kubernetes (and not only) today: expectations and reality

It was 2019, and we still didn't have a standard solution for log aggregation in Kubernetes. In this article, we would like to share our search, the problems we encountered, and their solutions, using real-life examples.

However, to begin with, I should note that different customers mean very different things by "collecting logs":

  • some want to see security and audit logs;
  • others want centralized logging of the entire infrastructure;
  • and for some it is enough to collect only application logs, excluding, for example, load balancers.

Below is how we implemented these various wishes and what difficulties we encountered along the way.

Theory: about logging tools

Background on the components of the logging system

Logging has come a long way, and the methodologies for collecting and analyzing logs that we use today grew out of that history. Back in the 1950s, Fortran introduced an analogue of standard input/output streams that helped programmers debug their programs. These were the first computer logs, and they made life easier for the programmers of those times. Today we see in them the first component of a logging system: the source, or producer, of logs.

Computer science did not stand still: computer networks appeared, then the first clusters... Complex systems consisting of several computers began to operate. Now system administrators were forced to collect logs from several machines, and in special cases they could add OS kernel messages in case they needed to investigate a system failure. To describe centralized logging systems, RFC 3164 was published in the early 2000s, standardizing remote_syslog. So another important component appeared: the log collector and its storage.

As the volume of logs grew and web technologies spread, the question arose of how to show logs to users conveniently. Simple console tools (awk/sed/grep) were replaced by more advanced log viewers - the third component.

As the volume of logs kept growing, something else became clear: logs are needed, but not all of them. And different logs require different retention: some can be lost after a day, while others must be stored for 5 years. So a component for filtering and routing data flows was added to the logging system - let's call it the filter.

Storage also made a major leap: from ordinary files to relational databases, and then to document-oriented stores (for example, Elasticsearch). So the storage was separated from the collector.

In the end, the very concept of a log has expanded to an abstract stream of events that we want to keep for history - or rather, for the cases when we need to conduct an investigation or draw up an analytical report...

As a result, in a relatively short period of time, log collection has developed into an important subsystem that can rightfully be called one of the subsections of Big Data.

If simple print statements were once enough for a "logging system", the situation has now changed a lot.

Kubernetes and logs

When Kubernetes arrived in our infrastructure, the already existing problem of collecting logs did not pass it by. In a sense, it became even more painful: managing the infrastructure platform was simplified and complicated at the same time. Many old services started migrating to microservices. In the context of logs, this manifested itself in a growing number of log sources, their peculiar life cycles, and the need to trace the interconnections of all system components through logs...

Looking ahead, I can say that, unfortunately, there is still no standardized logging option for Kubernetes that compares favorably with all the others. The most popular schemes in the community come down to the following:

  • someone deploys the EFK stack (Elasticsearch, Fluentd, Kibana);
  • someone tries the recently released Loki or uses the Logging operator;
  • we (and perhaps not only we?..) are largely satisfied with our own development, loghouse...

As a rule, these are the bundles we use in K8s clusters (for self-hosted solutions).

However, I will not dwell on instructions for installing and configuring them. Instead, I will focus on their shortcomings and on more global conclusions about the situation with logs in general.

Practice with logs in K8s


"Everyday logs", how many of you? ..

Centralized collection of logs from a sufficiently large infrastructure requires considerable resources for collecting, storing, and processing them. While operating various projects, we have run into various requirements and the operational problems that result from them.

Let's try ClickHouse

Consider a centralized repository on a project with an application that generates a lot of logs: more than 5,000 lines per second. Let's start working with its logs by adding them to ClickHouse.

As soon as we try to do this in near real time, the 4-core server with ClickHouse becomes overloaded on the disk subsystem:

[Graph: disk load on the ClickHouse server]

This kind of load is due to the fact that we are trying to write to ClickHouse as quickly as possible, and the database reacts with increased disk load, which can cause errors like this:

DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts

The fact is that MergeTree tables in ClickHouse (which hold the log data) have their own difficulties with write operations. The data inserted into them generates a temporary partition, which is then merged with the main table. As a result, writes are very demanding on the disk, and they are also subject to the limitation we were notified about above: no more than 300 subpartitions can be merged per second (in effect, this means 300 inserts per second).

To avoid this behavior, you should write to ClickHouse in chunks that are as large as possible and no more often than once per second. However, writing in large batches means writing to ClickHouse less frequently, which in turn can lead to buffer overflows and the loss of logs. The solution is to increase Fluentd's buffer, but then memory consumption will grow as well.
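To make the idea more concrete, here is a minimal Python sketch of such batching. In our setups this is normally handled by Fluentd's buffer, so this is only an illustration of the principle; the clickhouse-driver package and the logs.entries table are assumptions for the example.

import threading
import time
from clickhouse_driver import Client  # assumed driver; any client that supports bulk inserts would do

class LogBatcher:
    """Accumulates log rows in memory and flushes them to ClickHouse
    in one large INSERT, no more often than once per flush_interval."""

    def __init__(self, client, flush_interval=1.0, max_buffer=50000):
        self.client = client
        self.flush_interval = flush_interval
        self.max_buffer = max_buffer      # protects memory if flushing falls behind
        self.buffer = []
        self.lock = threading.Lock()

    def add(self, row):
        with self.lock:
            if len(self.buffer) >= self.max_buffer:
                self.buffer.pop(0)        # a real pipeline would spill to disk instead of dropping
            self.buffer.append(row)

    def run(self):
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                batch, self.buffer = self.buffer, []
            if batch:
                # One big insert instead of thousands of small ones keeps
                # the number of MergeTree parts (and the disk load) under control.
                self.client.execute(
                    "INSERT INTO logs.entries (ts, source, message) VALUES",
                    batch,
                )

client = Client("clickhouse.local")       # hypothetical host
batcher = LogBatcher(client)
threading.Thread(target=batcher.run, daemon=True).start()
batcher.add((time.time(), "nginx", "GET /staffs/265.png 200"))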

Note: another problematic side of our solution with ClickHouse was that partitioning in our case (loghouse) is implemented through external tables linked by a Merge table. This leads to excessive RAM usage when fetching large time intervals, since the metatable iterates over all partitions - even those that obviously do not contain the requested data. However, this approach can now safely be declared obsolete for current versions of ClickHouse (since 18.16).

As a result, it becomes clear that not every project will have enough resources to collect logs in ClickHouse in real time (more precisely, allocating them that way will not be appropriate). In addition, you will need an accumulator, which we will return to below. The case described above is a real one, and at that time we could not offer a reliable and stable solution that would suit the customer and allow collecting logs with minimal delay...

What about Elasticsearch?

Elasticsearch is known to handle heavy workloads. Let's try it on the same project. Now the load looks like this:

[Graph: server load with Elasticsearch]

Elasticsearch was able to digest the data stream; however, writing such volumes to it consumes a lot of CPU. This is solved by organizing a cluster. Technically this is not a problem, but it turns out that just to operate the logging system we already use about 8 cores and have an additional highly loaded component in the system...

Bottom line: this option can be justified, but only if the project is large and its management is ready to spend significant resources on a centralized logging system.

Then a natural question arises:

What logs are really needed?

Let's try to change the approach itself: logs should be informative without covering every event in the system.

Let's say we have a successful online store. Which logs are important? Collecting as much information as possible from, say, the payment gateway is a great idea. But not all logs from the image-resizing service in the product catalog are critical for us: errors and advanced monitoring are enough (for example, the percentage of 500 errors this component generates).

This brings us to the conclusion that centralized logging is not always justified. Very often the client wants to collect all the logs in one place, even though in reality only 5% of the messages are critical for the business:

  • Sometimes it is enough to configure, say, only the size of the container log plus an error collector (for example, Sentry).
  • To investigate incidents, an error notification and the large local log itself are often enough.
  • We had projects that got by entirely with functional tests and error collection systems. The developers did not need logs as such - they saw everything in the error traces.

A real-life illustration

Another story makes a good example. We received a request from the security team of one of our clients, who was already using a commercial solution developed long before Kubernetes was introduced.

The centralized log collection system had to be integrated with the corporate problem-detection sensor, QRadar. QRadar can receive logs via the syslog protocol or pick them up from FTP. However, integrating it with the remote_syslog plugin for fluentd did not work right away (as it turned out, we are not alone). The problems with setting up QRadar turned out to be on the side of the client's security team.

As a result, some of the business-critical logs were uploaded to QRadar over FTP, while the rest were redirected via remote syslog directly from the nodes. For this we even wrote a simple chart - perhaps it will help someone solve a similar problem... Thanks to the resulting scheme, the client received and analyzed the critical logs himself (using his favorite tools), and we were able to reduce the cost of the logging system by storing only the last month of data.

Another example is quite illustrative of what not to do. One of our clients wrote multi-line, unstructured output to the log for every event coming from the user. As you might guess, such logs were extremely inconvenient both to read and to store.

Criteria for logs

Such examples lead to the conclusion that, in addition to choosing a log collection system, you need to design the logs themselves! What are the requirements here?

  • Logs must be in a machine-readable format (for example, JSON).
  • Logs should be compact, with the ability to change the logging level in order to debug possible problems; in production, systems should run with a logging level like Warning or Error.
  • Logs must be normalized, that is, every field in a log record must always have the same data type (see the sketch after this list).
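As a minimal illustration of these requirements, here is a sketch of a JSON formatter for Python's standard logging module. The field names (request_id, status) are made up for the example; the point is that every field always keeps the same type and production runs at Warning level.

import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Renders every record as a single JSON line with a fixed set of fields,
    each of which always has the same type."""

    def format(self, record):
        entry = {
            "ts": round(record.created, 3),                         # always a float
            "level": record.levelname,                              # always a string
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": str(getattr(record, "request_id", "")),   # always a string
            "status": int(getattr(record, "status", 0)),            # always an int
        }
        return json.dumps(entry, ensure_ascii=False)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.WARNING, handlers=[handler])  # Warning and above in production

log = logging.getLogger("shop.payments")
log.warning("payment retried", extra={"request_id": 12345, "status": 502})
# -> {"ts": ..., "level": "WARNING", "logger": "shop.payments", "message": "payment retried", "request_id": "12345", "status": 502}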

Unstructured logs can lead to problems with uploading logs to the repository and to their processing stopping completely. As an illustration, here is an example with a 400 error that many have definitely encountered in fluentd logs:

2019-10-29 13:10:43 +0000 [warn]: dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch"

The error means that you are sending a field with an unstable type to an index that already has an established mapping. The simplest example is the field in the nginx log for the variable $upstream_status: it can contain either a number or a string. For example:

{ "ip": "1.2.3.4", "http_user": "-", "request_id": "17ee8a579e833b5ab9843a0aca10b941", "time": "29/Oct/2019:16:18:57 +0300", "method": "GET", "uri": "/staffs/265.png", "protocol": "HTTP/1.1", "status": "200", "body_size": "906", "referrer": "https://example.com/staff", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36", "request_time": "0.001", "cache_status": "-", "upstream_response_time": "0.001, 0.007", "upstream_addr": "127.0.0.1:9000", "upstream_status": "200", "upstream_response_length": "906", "location": "staff"}
{ "ip": "1.2.3.4", "http_user": "-", "request_id": "47fe42807f2a7d8d5467511d7d553a1b", "time": "29/Oct/2019:16:18:57 +0300", "method": "GET", "uri": "/staff", "protocol": "HTTP/1.1", "status": "200", "body_size": "2984", "referrer": "-", "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36", "request_time": "0.010", "cache_status": "-", "upstream_response_time": "0.001, 0.007", "upstream_addr": "10.100.0.10:9000, 10.100.0.11:9000", "upstream_status": "404, 200", "upstream_response_length": "0, 2984", "location": "staff"}

The logs show that server 10.100.0.10 responded with a 404 error and the request went to another content store. As a result, the value in the logs became this:

"upstream_response_time": "0.001, 0.007"

This situation is so widespread that it has even earned a separate mention in the documentation.
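One possible way out is to normalize such fields before they reach the index. In practice this is usually done in a fluentd filter or in the nginx log format itself; the Python sketch below only shows the transformation, and the field list is an assumption based on the example above.

def _split_values(raw, cast):
    """Split nginx's comma-separated multi-values and cast each element;
    "-" means "no value" in nginx and is skipped."""
    result = []
    for part in str(raw).split(","):
        part = part.strip()
        if part and part != "-":
            result.append(cast(part))
    return result

def normalize_upstream_fields(record: dict) -> dict:
    """$upstream_status may arrive as 200 or as "404, 200" after a retry.
    Convert such fields to lists of a single stable type before indexing,
    so the Elasticsearch mapping never sees conflicting types."""
    for field, cast in (("upstream_status", int),
                        ("upstream_response_length", int),
                        ("upstream_response_time", float)):
        if field in record:
            record[field] = _split_values(record[field], cast)
    return record

# "404, 200" -> [404, 200]; "200" -> [200]; the field type never flips.
print(normalize_upstream_fields({"upstream_status": "404, 200",
                                 "upstream_response_time": "0.001, 0.007"}))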

And what about reliability?

There are times when all logs, without exception, are vital - and here the typical log collection schemes for K8s proposed and discussed above run into problems.

For example, fluentd cannot collect logs from short-lived containers. In one of our projects, the database migration container lived for less than 4 seconds and was then deleted, according to the corresponding annotation:

"helm.sh/hook-delete-policy": hook-succeeded

Because of this, the migration execution log did not make it into the storage. The before-hook-creation policy can help in this case.

Another example is Docker log rotation. Suppose there is an application that writes to its logs actively. Under normal conditions we manage to process all the logs, but as soon as a problem appears - for example, the wrong format, as described above - processing stops and Docker rotates the file. The bottom line: business-critical logs may be lost.

That is why it is important to separate log streams, embedding the sending of the most valuable ones directly into the application to ensure their safety. In addition, it would not hurt to create a log accumulator that can survive a brief storage outage while retaining critical messages.
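Here is a minimal sketch of what "a separate stream plus an accumulator" can look like on the application side: critical events go straight to a dedicated collector, and if it is unavailable, they are spooled to a local file and retried later. The collector address and spool path are hypothetical; a real setup would more likely use fluentd's forward protocol or a proper queue.

import json
import os
import socket
import time

SPOOL_PATH = "/var/spool/critical-logs.jsonl"   # hypothetical local accumulator
COLLECTOR_ADDR = ("logs.internal", 5170)        # hypothetical dedicated TCP collector

def send_critical(event: dict) -> None:
    """Ship a business-critical event in its own stream, independent of the
    general fluentd pipeline. On failure, append it to a local spool file
    so a short storage outage does not lose the message."""
    line = json.dumps({"ts": time.time(), **event}, ensure_ascii=False)
    try:
        with socket.create_connection(COLLECTOR_ADDR, timeout=1) as sock:
            sock.sendall((line + "\n").encode())
    except OSError:
        with open(SPOOL_PATH, "a", encoding="utf-8") as spool:
            spool.write(line + "\n")

def flush_spool() -> None:
    """Periodically retry whatever accumulated while the collector was down."""
    if not os.path.exists(SPOOL_PATH):
        return
    with open(SPOOL_PATH, encoding="utf-8") as spool:
        pending = spool.readlines()
    os.remove(SPOOL_PATH)
    for line in pending:
        send_critical(json.loads(line))

send_critical({"component": "payment-gateway", "level": "error", "msg": "charge failed"})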

Finally, we must not forget that it is important to monitor any subsystem properly. Otherwise it is easy to run into a situation where fluentd sits in a CrashLoopBackOff state and sends nothing, which means losing important information.

Conclusions

In this article we do not consider SaaS solutions like Datadog. Many of the problems described here have already been solved in one way or another by commercial companies that specialize in collecting logs, but not everyone can use SaaS for various reasons (the main ones being cost and compliance with 152-FZ).

Centralized log collection looks like a simple task at first, but it is not at all. It is important to remember that:

  • Only critical components should be logged in detail, and for other systems, monitoring and error collection can be configured.
  • Logs in production should be kept minimal so as not to create extra load.
  • Logs must be machine-readable, normalized, and strictly formatted.
  • Really critical logs should be sent in a separate stream, separate from the main ones.
  • It is worth considering a log accumulator, which can save you from bursts of high load and make the load on the storage more even.

These simple rules, if applied everywhere, would allow the schemes described above to work even though they lack important components (the log accumulator). If you do not adhere to such principles, the task will easily lead you and the infrastructure to yet another highly loaded (and at the same time inefficient) component of the system.

Source: habr.com
