How we at CIAN tamed terabytes of logs


Hello everyone, my name is Alexander. I work at CIAN as an engineer, handling system administration and automation of infrastructure processes. In the comments to one of our previous articles, we were asked to explain where we get 4 TB of logs per day and what we do with them. Yes, we have a lot of logs, and a separate infrastructure cluster has been built to process them, which lets us resolve problems quickly. In this article I will describe how, over the course of a year, we adapted it to handle an ever-growing stream of data.

Where we started


Over the past few years the load on cian.ru has grown very rapidly, and by the third quarter of 2018 the site's traffic reached 11.2 million unique users per month. At that time, at critical moments we were losing up to 40% of our logs, which meant we could not deal with incidents quickly and spent a lot of time and effort resolving them. Often we could not find the cause of a problem at all, and it would recur some time later. It was hell, and something had to be done about it.

At the time, we stored logs in a cluster of 10 data nodes running ElasticSearch 5.5.2 with typical index settings. It had been set up more than a year earlier as a popular and affordable solution: back then the log flow was not that large, and there was no reason to come up with non-standard configurations.

Incoming logs were processed by Logstash listening on different ports on five ElasticSearch coordinators. Every index, regardless of size, consisted of five shards. Rotation was hourly or daily, so about 100 new shards appeared in the cluster every hour. While there were not that many logs, the cluster coped, and nobody paid much attention to its settings.

Problems of rapid growth

The volume of generated logs grew very quickly because two processes overlapped. On the one hand, the service had more and more users. On the other hand, we began actively moving to a microservice architecture, carving up our old C# and Python monoliths. Several dozen new microservices that replaced parts of the monolith generated noticeably more logs for the infrastructure cluster.

It was this scaling that made the cluster practically unmanageable. When logs started arriving at a rate of 20k messages per second, the frequent, pointless rotation pushed the shard count to 6k, with more than 600 shards per node.

This caused problems with RAM allocation, and whenever a node crashed, all of its shards started relocating at once, multiplying traffic and loading the remaining nodes, which made it nearly impossible to write data to the cluster. During these periods we were left without logs at all. And when a server had a problem, we essentially lost a tenth of the cluster. The large number of small indexes only added complexity.

Without logs we did not understand the causes of an incident, and sooner or later we would step on the same rake again. In our team's ideology that is unacceptable, since everything about the way we work is geared toward the opposite: never repeat the same problem. For that we needed the full volume of logs, delivered almost in real time, because our on-call engineers monitor alerts not only from metrics but also from logs. To give a sense of scale, at that time the total volume of logs was about 2 TB per day.

We set ourselves the goal of completely eliminating log loss and reducing delivery time to the ELK cluster to at most 15 minutes even during force majeure (we later relied on this figure as an internal KPI).

New rotation mechanism and hot-warm nodes


We started the cluster transformation by upgrading ElasticSearch from 5.5.2 to 6.4.3. The version 5 cluster was down yet again, so we decided to shut it down and upgrade it completely: we had no logs anyway. The transition took us just a couple of hours.

The biggest change at this stage was introducing Apache Kafka as an intermediate buffer, deployed on three nodes alongside the coordinators. The message broker saved us from losing logs during problems with ElasticSearch. At the same time, we added 2 nodes to the cluster and switched to a hot-warm architecture, with three “hot” nodes placed in different racks in the data center. We routed the logs that must never be lost to them by mask: nginx logs and application error logs. Minor logs (debug, warning, and so on) went to the remaining nodes, and the “important” logs were moved off the “hot” nodes after 24 hours.
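
For reference, a hot-warm split like this is typically expressed through a custom node attribute in elasticsearch.yml; the attribute name box_type below is just a conventional name used for illustration, not necessarily the exact configuration in our cluster:

# elasticsearch.yml on a “hot” node (box_type is an assumed attribute name)
node.attr.box_type: hot
---
# elasticsearch.yml on a “warm” data node
node.attr.box_type: warm

Index templates for the critical logs can then require index.routing.allocation.require.box_type: hot at creation time, and the later move to the “warm” group comes down to changing that requirement (a Curator sketch for this appears further below).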

To stop accumulating small indexes, we switched from time-based rotation to the rollover mechanism. There was plenty of advice on the forums that rotation by index size is unreliable, so we decided to rotate by the number of documents in an index. We analyzed each index and recorded the document count at which rotation should trigger. That way we arrived at the optimal shard size: no more than 50 GB.

Cluster optimization


However, we had not gotten rid of the problems completely. Unfortunately, small indexes still appeared: they did not reach the specified document count, were never rotated, and were deleted by the global cleanup of indexes older than three days (since we had removed rotation by date). This led to data loss, because the index disappeared from the cluster entirely, and an attempt to write to a non-existent index broke the logic of Curator, which we used for index management: the write alias turned into an index and broke the rollover logic, causing some indexes to grow uncontrollably, up to 600 GB.

For example, with this rotation config:

curator-elk-rollover.yaml

---
actions:
  1:
    action: rollover
    options:
      name: "nginx_write"
      conditions:
        max_docs: 100000000
  2:
    action: rollover
    options:
      name: "python_error_write"
      conditions:
        max_docs: 10000000

When the rollover alias was missing, the following error occurred:

ERROR     alias "nginx_write" not found.
ERROR     Failed to complete action: rollover.  <type 'exceptions.ValueError'>: Unable to perform index rollover with alias "nginx_write".
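
For reference, the global three-day cleanup itself is an ordinary age-based delete; in Curator terms it looks roughly like the sketch below (the filter values here are illustrative, not our exact config):

actions:
  1:
    action: delete_indices
    options:
      ignore_empty_list: true
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 3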

We postponed solving this problem to the next iteration and took on a different issue: we switched Logstash, which processes incoming logs (stripping unnecessary information and enriching them), to pull logic. We packaged it in Docker, run it via docker-compose, and placed logstash-exporter alongside it, which exposes metrics to Prometheus for online monitoring of the log stream. This gave us the ability to smoothly change the number of Logstash instances responsible for processing each type of log.
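
The compose file itself is not part of the article, so the sketch below only illustrates the layout: one Logstash service per log type reading from the Kafka buffer, plus an exporter scraped by Prometheus. The image tags, the exporter project, its flags and its port are assumptions made for illustration:

version: "3"
services:
  logstash-nginx:
    image: docker.elastic.co/logstash/logstash:6.4.3
    restart: always
    environment:
      LS_JAVA_OPTS: "-Xms2g -Xmx2g"
    volumes:
      # pipeline directory containing a kafka input and an elasticsearch output
      - ./pipeline/nginx:/usr/share/logstash/pipeline:ro

  logstash-exporter:
    # assumed to be the public BonnierNews/logstash_exporter image; it polls
    # the Logstash monitoring API (port 9600) and exposes /metrics for Prometheus
    image: bonniernews/logstash_exporter
    command: ["--logstash.endpoint=http://logstash-nginx:9600"]
    ports:
      - "9198:9198"

With a layout like this, changing the number of instances for a given log type is roughly a docker-compose up -d --scale logstash-nginx=N away, which is the kind of manual step described below.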

While we were improving the cluster, cian.ru traffic grew to 12.8 million unique users per month. As a result, our transformations lagged slightly behind the changes in production, and we ran into the fact that the “warm” nodes could not cope with the load and slowed down the entire delivery of logs. We received the “hot” data without failures, but we had to intervene in the delivery of the rest and do manual rollovers to distribute the indexes evenly.

At the same time, scaling and changing the settings of the Logstash instances in the cluster was complicated by the fact that it was a local docker-compose and all actions were performed by hand (to add new instances, we had to go through every server manually and run docker-compose up -d everywhere).

Redistribution of logs

In September of this year we were still carving up the monolith, the load on the cluster kept growing, and the log flow was approaching 30 thousand messages per second.


We started the next iteration with a hardware upgrade. We went from five coordinators to three, replaced the data nodes, and came out ahead in both money and storage space. We use two configurations for the nodes:

  • For hot nodes: E3-1270 v6 / 960 GB SSD / 32 GB RAM x 3 x 2 (3 nodes for the Hot1 group and 3 for Hot2).
  • For warm nodes: E3-1230 v6 / 4 TB SSD / 32 GB RAM x 4.

At this iteration we moved the index with the microservice access logs, which takes up as much space as the front-end nginx logs, to the second group of three “hot” nodes. We now keep data on the “hot” nodes for 20 hours and then move it to the “warm” nodes, where the rest of the logs live.
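
How exactly the 20-hour move is triggered is not shown in the article; one common way is Curator's allocation action, sketched here with the same assumed box_type attribute as above and an illustrative index prefix:

actions:
  1:
    action: allocation
    description: "Re-tag indexes older than 20 hours so their shards relocate to the warm nodes"
    options:
      key: box_type
      value: warm
      allocation_type: require
      wait_for_completion: false
    filters:
      - filtertype: pattern
        kind: prefix
        value: microservices_access    # illustrative prefix, not the real index name
      - filtertype: age
        source: creation_date
        direction: older
        unit: hours
        unit_count: 20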

We solved the problem of disappearing small indexes by reconfiguring their rotation. Indexes are now rotated every 23 hours no matter what, even if they contain little data. This slightly increased the number of shards (to about 800), but in terms of cluster performance it is tolerable.
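
One way to express this in the Curator config shown earlier is to add a max_age condition next to max_docs, for example:

actions:
  1:
    action: rollover
    options:
      name: "python_error_write"
      conditions:
        max_docs: 10000000
        max_age: 23h    # rotate anyway after 23 hours, even if the index is small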

As a result, the cluster has six "hot" and only four "warm" nodes. This causes a slight latency on requests over long time intervals, but increasing the number of nodes in the future will solve this problem.

In this iteration we also addressed the lack of semi-automatic scaling by deploying an infrastructure Nomad cluster, similar to the one already running in our production. For now the number of Logstash instances does not change automatically with the load, but we will get there.


Plans for the future

The implemented configuration scales well, and we now store 13.3 TB of data: all logs for 4 days, which is what we need for emergency analysis of alerts. We convert some of the logs into metrics that we push to Graphite. To make the engineers' work easier, we have metrics for the infrastructure cluster and scripts for semi-automatic fixes of common problems. After the increase in the number of data nodes planned for next year, we will move from 4 to 7 days of storage. This will be enough for operational work, since we always try to investigate incidents as soon as possible, and telemetry data is there for long-term investigations.

In October 2019, cian.ru traffic reached 15.3 million unique users per month. This was a serious test of our log delivery architecture.

Now we are preparing to upgrade ElasticSearch to version 7. For that, though, we will have to update the mappings of many indexes, since they came over from version 5.5 and use mapping types that were declared deprecated in version 6 (and simply do not exist in version 7). This means that during the upgrade there will inevitably be some kind of force majeure that leaves us without logs while the issue is resolved. What we are most looking forward to in version 7 is Kibana, with its improved interface and new filters.

We achieved the main goal: we stopped losing logs and reduced the downtime of the infrastructure cluster from 2-3 crashes per week to a couple of hours of maintenance per month. All of this work is almost invisible in production. But now we can determine exactly what is happening with our service, do it quickly and calmly, and not worry that logs will be lost. Overall we are satisfied, happy, and preparing for new feats, which we will tell you about later.

Source: habr.com
