Fluentd: why it's important to tune the output buffer

These days it is hard to imagine a Kubernetes-based project without an ELK stack storing the logs of both the applications and the system components of the cluster. In our practice we use the EFK stack, with Fluentd instead of Logstash.

Fluentd is a modern, general-purpose log collector. It has been steadily gaining popularity and has joined the Cloud Native Computing Foundation, so its development is strongly oriented toward use with Kubernetes.

Using Fluentd instead of Logstash does not change the overall picture, but Fluentd has nuances of its own that stem from its versatility.

For example, when we started using EFK on a busy project with a high logging rate, we noticed that some messages appeared in Kibana several times over. In this article we explain why this happens and how to solve the problem.

The problem of document duplication

In our projects, Fluentd is deployed as a DaemonSet (one instance is started automatically on every node of the Kubernetes cluster) and tails container stdout logs in /var/log/containers. After collection and processing, the logs are sent as JSON documents to ElasticSearch, which is deployed either as a cluster or standalone depending on the scale of the project and the requirements for performance and fault tolerance. Kibana is used as the graphical interface.
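
For context, here is a heavily simplified sketch of such a pipeline. The tag, host and parse settings are illustrative assumptions rather than our exact configuration, and in practice the records are also enriched with Kubernetes metadata (which is where fields such as namespace and container_name in the documents below come from):

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  <buffer>
    # output buffering parameters, discussed below
  </buffer>
</match>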

When using Fluentd with a buffering output plugin, we ran into a situation where some documents in ElasticSearch had exactly the same content and differed only in their ID. You can verify that this really is the same message being repeated using an Nginx log as an example. In the log file the message exists in a single copy:

127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00 +0900] "GET / HTTP/1.1" 200 777 "-" "Opera/12.0" -

However, there are several documents in ElasticSearch containing this message:

{
  "_index": "test-custom-prod-example-2020.01.02",
  "_type": "_doc",
  "_id": "HgGl_nIBR8C-2_33RlQV",
  "_version": 1,
  "_score": 0,
  "_source": {
    "service": "test-custom-prod-example",
    "container_name": "nginx",
    "namespace": "test-prod",
    "@timestamp": "2020-01-14T05:29:47.599052886 00:00",
    "log": "127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00  0900] "GET / HTTP/1.1" 200 777 "-" "Opera/12.0" -",
    "tag": "custom-log"
  }
}

{
  "_index": "test-custom-prod-example-2020.01.02",
  "_type": "_doc",
  "_id": "IgGm_nIBR8C-2_33e2ST",
  "_version": 1,
  "_score": 0,
  "_source": {
    "service": "test-custom-prod-example",
    "container_name": "nginx",
    "namespace": "test-prod",
    "@timestamp": "2020-01-14T05:29:47.599052886 00:00",
    "log": "127.0.0.1 192.168.0.1 - [28/Feb/2013:12:00:00  0900] "GET / HTTP/1.1" 200 777 "-" "Opera/12.0" -",
    "tag": "custom-log"
  }
}

Moreover, there can be more than two repetitions.

While this problem was being fixed, a large number of warnings with the following content could be seen in the Fluentd logs:

2020-01-16 01:46:46 +0000 [warn]: [test-prod] failed to flush the buffer. retry_time=4 next_retry_seconds=2020-01-16 01:46:53 +0000 chunk="59c37fc3fb320608692c352802b973ce" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>"elasticsearch", :port=>9200, :scheme=>"http", :user=>"elastic", :password=>"obfuscated"}): read timeout reached"

These warnings appear when ElasticSearch cannot return a response to a request within the time set by the request_timeout parameter, so the buffer chunk that was sent cannot be flushed. Fluentd then tries to send the buffer chunk to ElasticSearch again, and after some number of retries the operation succeeds:

2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded. chunk_id="59c37fc3fb320608692c352802b973ce" 
2020-01-16 01:47:05 +0000 [warn]: [test-prod] retry succeeded. chunk_id="59c37fad241ab300518b936e27200747" 
2020-01-16 01:47:05 +0000 [warn]: [test-dev] retry succeeded. chunk_id="59c37fc11f7ab707ca5de72a88321cc2" 
2020-01-16 01:47:05 +0000 [warn]: [test-dev] retry succeeded. chunk_id="59c37fb5adb70c06e649d8c108318c9b" 
2020-01-16 01:47:15 +0000 [warn]: [kube-system] retry succeeded. chunk_id="59c37f63a9046e6dff7e9987729be66f"

However, ElasticSearch treats each of the retransmitted buffer chunks as unique and assigns it its own _id value during indexing. This is how the duplicate messages appear.

In Kibana it looks like this:

[Screenshot: the same log message shown as several separate documents in Kibana]

The solution

There are several ways to solve this problem. One of them is the mechanism built into the fluent-plugin-elasticsearch plugin that generates a unique hash for each document. With it, ElasticSearch recognizes retransmissions at indexing time and does not create duplicate documents. However, this approach only treats the symptom and does not eliminate the underlying timeout error, so we decided against using it.
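
For reference, according to the fluent-plugin-elasticsearch documentation this mechanism is enabled roughly as follows (the tag is illustrative): a filter adds a hash to every record, and the output uses that hash as the document _id, so a retransmitted chunk updates the already indexed documents instead of creating new ones:

<filter kubernetes.**>
  @type elasticsearch_genid
  hash_id_key _hash              # add a unique hash to each record
</filter>

<match kubernetes.**>
  @type elasticsearch
  id_key _hash                   # use the hash as the ElasticSearch document _id
  remove_keys _hash              # do not store the hash field itself in the index
  # host, port and <buffer> settings as in the main configuration
</match>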

We use a buffering plugin on the Fluentd output to avoid losing logs during short network outages or logging spikes. If for some reason ElasticSearch cannot write a document to the index immediately, the document is queued in a buffer that is stored on disk. Therefore, in our case, to eliminate the root cause of the error described above, we need buffering parameters with which the Fluentd output buffer is large enough and, at the same time, manages to be flushed within the allotted time.

Note that the parameter values discussed below are specific to each particular use of buffering in output plugins, since they depend on many factors: how intensively services write log messages, disk performance, network load and bandwidth. To arrive at buffer settings that are adequate for a given case but not excessive, and to avoid a long blind search, you can use the debug information that Fluentd writes to its log at runtime and obtain suitable values relatively quickly.

At the time we were fixing the problem, the configuration looked like this:

 <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.test.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 8M
        queue_limit_length 8
        overflow_action block
      </buffer>

In the course of solving the problem, we manually tuned the values of the following parameters:

  • chunk_limit_size - the size of the chunks into which messages in the buffer are split.
  • flush_interval - the interval at which the buffer is flushed.
  • queue_limit_length - the maximum number of chunks in the queue.
  • request_timeout - how long Fluentd waits for ElasticSearch to respond to a request before the request is considered failed.

The total buffer size is obtained by multiplying queue_limit_length by chunk_limit_size, which can be read as "the maximum number of chunks in the queue, each of a given size". If the buffer size is insufficient, the following warning appears in the logs:

2020-01-21 10:22:57 +0000 [warn]: [test-prod] failed to write data into buffer by buffer overflow action=:block

It means that the buffer cannot be flushed within the allotted time, so the data arriving at the full buffer is blocked, which leads to the loss of part of the logs.
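
For reference, with the configuration shown above the total buffer size works out to:

  queue_limit_length × chunk_limit_size = 8 × 8 MB = 64 MB per buffered output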

You can increase the buffer in two ways: either by increasing the size of each chunk in the queue, or by increasing the number of chunks that can be in the queue.

If you set chunk_limit_size to more than 32 megabytes, ElasticSearch will not accept the chunk because the incoming request will be too large. Therefore, if the buffer needs to grow further, it is better to increase the maximum queue length, queue_limit_length.
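
For example, keeping the chunk size at 8 MB and quadrupling the queue length (the new value is purely illustrative) would look like this inside the <buffer> section:

  chunk_limit_size 8M
  queue_limit_length 32        # total buffer: 8 MB × 32 = 256 MB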

When the buffer stops overflowing and only the timeout warnings remain, you can start increasing the request_timeout parameter. However, once it is raised above 20 seconds, flushes that previously timed out can take longer than the slow_flush_log_threshold (20 seconds by default), so warnings like the following appear in the Fluentd logs:

2020-01-21 09:55:33 +0000 [warn]: [test-dev] buffer flush took longer time than slow_flush_log_threshold: elapsed_time=20.85753920301795 slow_flush_log_threshold=20.0 plugin_id="postgresql-dev" 

This message does not affect the operation of the system in any way; it simply means that flushing the buffer took longer than the slow_flush_log_threshold value. This is debug information, and we use it when choosing the value of the request_timeout parameter.

The generalized selection algorithm is as follows:

  1. Set request_timeout to a value that is guaranteed to be larger than necessary (hundreds of seconds). At this stage, the main sign that the parameter is set correctly is the disappearance of the timeout warnings.
  2. Wait for messages about exceeding the slow_flush_log_threshold threshold. The elapsed_time field in the warning text shows the actual time it took to flush the buffer.
  3. Set request_timeout to a value greater than the maximum elapsed_time observed during the monitoring period. We calculate request_timeout as elapsed_time + 50%.
  4. To remove the long-buffer-flush warnings from the log, raise slow_flush_log_threshold. We calculate this value as elapsed_time + 25%. (An example of the resulting settings is shown after this list.)
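
As an illustration of the result: applied to the slow-flush warning shown earlier (elapsed_time of roughly 20.86 s), the algorithm gives request_timeout of about 31 s and slow_flush_log_threshold of about 26 s. Both parameters are set at the level of the output plugin, next to the <buffer> section; the tag and host below are illustrative:

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  request_timeout 31s              # maximum observed elapsed_time + 50%
  slow_flush_log_threshold 26.0    # maximum observed elapsed_time + 25%
  <buffer>
    # buffering parameters as shown above
  </buffer>
</match>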

As noted earlier, the final values of these parameters are individual for each case. By following the algorithm above, we are guaranteed to eliminate the error that leads to duplicated messages.

The table below shows how the number of errors per day that lead to duplicated messages changed as the parameter values described above were tuned (before / after):

                              node-1     node-2     node-3     node-4
failed to flush the buffer    1749 / 2   694 / 2    47 / 0     1121 / 2
retry succeeded               410 / 2    205 / 1    24 / 0     241 / 2

It should also be noted that these settings may lose their relevance as the project grows and the volume of logs increases accordingly. The primary sign that the configured timeout is no longer sufficient is the return of the long-buffer-flush messages to the Fluentd log, i.e. the slow_flush_log_threshold threshold being exceeded. From that point there is still a small margin before request_timeout itself is exceeded, so it is important to react to these messages in time and repeat the process of selecting the optimal settings described above.

Conclusion

Fine-tuning the Fluentd output buffer is one of the main steps in configuring the EFK stack: it determines the stability of the stack and the correct placement of documents in the indexes. By following the configuration algorithm described above, you can be sure that all logs will be written to the ElasticSearch index in the right order, without repetition or loss.

Source: habr.com
