Yury Bushmelev "Map of a rake in the field of collecting and delivering logs" - transcript of the report

Logs are an important part of any system: they let you see whether it works (or does not work) as expected. In a microservice architecture, working with logs becomes a discipline of its own, a kind of Special Olympics. There is a whole set of questions to answer:

  • how to write logs from the application;
  • where to write logs;
  • how to deliver logs for storage and processing;
  • how to process and store logs.

Add the currently fashionable containerization technologies on top, and the field of rakes you can step on while choosing a solution only gets wider.

That is exactly what Yury Bushmelev's talk "Map of a rake in the field of collecting and delivering logs" is about; the transcript follows.

If you are interested, read on.

My name is Yury Bushmelev. I work at Lazada. Today I will talk about how we set up our logging, how we collect logs, and what we write into them.

Who are we and where are we from? Lazada is the #1 online retailer in six countries of Southeast Asia. These countries are spread across data centers; there are 4 data centers in total now. Why does this matter? Because some of our decisions were driven by the fact that the links between the data centers are very weak. We have a microservice architecture. I was surprised to discover that we already have 80 microservices; when I started the logging project there were only about 20. On top of that, there is a rather large chunk of PHP legacy that I also have to live with and put up with. All of this currently generates more than 6 million messages per minute for the system as a whole. Below I will show how we are trying to live with that, and why it is so.

Somehow we have to live with these 6 million messages. What do we do with them? 6 million messages need to be:

  • sent from the application;
  • accepted for delivery;
  • delivered for analysis and storage;
  • analyzed;
  • stored somehow.

When there were three million messages, I looked about the same, because we started with tiny numbers. Obviously, application logs are written there: for example, could not connect to the database, or connected to the database but could not read something. But besides that, each of our microservices also writes an access log: every request that arrives at a microservice ends up in the log. Why do we do this? Developers want to be able to trace requests. Each access-log entry contains a traceid field, by which a special interface then unwinds the whole chain and displays the trace nicely. The trace shows how the request traveled, and that helps our developers deal with any unexplained garbage faster.
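
To make the access-log idea concrete, here is a minimal Go sketch of middleware that writes one access-log entry per request and carries a trace id. The header name, field names and the plain standard logger used here are illustrative assumptions, not the ones Lazada actually uses.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// accessLog writes one access-log entry per request and propagates a trace id
// so the whole request chain can be reassembled later in a trace viewer.
func accessLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		traceID := r.Header.Get("X-Trace-Id") // assumed header name
		next.ServeHTTP(w, r)
		// One JSON line per request; transport (file, socket, etc.) is out of scope here.
		log.Printf(`{"type":"access","traceid":"%s","method":"%s","path":"%s","duration_ms":%d}`,
			traceID, r.Method, r.URL.Path, time.Since(start).Milliseconds())
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	log.Fatal(http.ListenAndServe(":8080", accessLog(mux)))
}
```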

How do you live with this? Now I will briefly survey the field of options: how the problem of collecting, transferring and storing logs is usually solved.

How do you write logs from the application? Clearly, there are different ways. In particular, there is the best practice, as the fashionable comrades tell us. There are two kinds of old school, as the grandfathers did it. And there are other ways.

With collecting logs the situation is roughly the same: there are not that many options for solving this particular part. There are more of them already, but still not that many.

But with delivery and the subsequent analysis, the number of variations explodes. I will not describe every option now; I think the main ones are well known to everyone who has taken an interest in the topic.

I'll show you how we did it at Lazada and how it all started.

A year ago I came to Lazada and was assigned to the logging project. This is what it looked like there. Application logs were written to stdout and stderr, everything done the fashionable way. But the developers simply dumped them to the standard streams, expecting the infrastructure people to figure it out somehow. Between the infrastructure people and the developers there are also release engineers, who said: "uh... well, let's just redirect the streams into a file with the shell, and that's it." And since all of this runs in containers, they did the redirect right inside the container, mounted a directory inside and put the files there. I think it is pretty obvious to everyone what came of that.

Let's look a little further: how we delivered these logs. Someone had picked td-agent, which is essentially fluentd, but not quite fluentd. I still do not understand the relationship between these two projects, but they seem to be about the same thing. This fluentd, written in Ruby, read the log files and parsed them into JSON using some regular expressions. Then they were sent to Kafka, where we had 4 separate topics per API. Why 4? Because there is live and there is staging, and because there is stdout and stderr. Developers produce the logs, and the infrastructure people have to create the topics in Kafka. On top of that, Kafka was managed by another department, so you had to file a ticket for them to create those 4 topics for each API. Everyone kept forgetting about it. In short, it was a complete mess.

What did we do with it next? We sent it to Kafka. From Kafka, half of the logs went to Logstash. The other half was split: some went to one Graylog, some to another. In the end all of it landed in a single Elasticsearch cluster, which is where the whole mess ended up. Do not do that!

This is what it looks like from above. Do not do that! The problem areas are marked with numbers here. There are actually more of them, but 6 are the really problematic ones that something had to be done about. I will go over them one by one now.

Here (1, 2, 3) we write files, and accordingly there are three rakes at once.

The first (1) is that we have to write the files somewhere. It is not always desirable to give an API the ability to write directly to a file; it is better for the API to be isolated in a container, and better still for the container to be read-only. I am a system administrator, so I have a slightly alternative view of these things.

The second point (2, 3) is that a lot of requests come into the API, and the API writes a lot of data to a file. The files grow, so we need to rotate them, because otherwise no amount of disks will be enough. Rotating them is painful, because the streams are redirected via the shell into a directory; there is no proper way to rotate that. You cannot tell the application to reopen its file descriptors, because the developers will look at you like a fool: "What descriptors? We just write to stdout." So copytruncate was enabled in logrotate, which simply makes a copy of the file and truncates the original. Accordingly, between those copy operations the disk space usually runs out.

(4) We had different formats in different APIs. They differed only slightly, but the regexps had to be written differently for each. Since all of this was managed by Puppet, there was a big pile of classes, each with its own quirks. On top of that, td-agent could spend most of its time eating memory, acting dumb, or simply pretending to work while doing nothing, and from the outside there was no way to tell that it was doing nothing. At best it would crash and someone would pick it up later. More precisely, an alert would come in, and someone would go and restart it by hand.

(6) And the biggest mess of all was Elasticsearch. Because it was an old version, and because we had no dedicated master nodes at the time. We had heterogeneous logs whose fields could overlap: different applications could write logs with the same field names but different kinds of data inside. That is, one log arrives with an Integer in a field, for example level, while another log arrives with a String in the same level field. In the absence of a static mapping, a wonderful thing happens. If, after an index rotation, the first message to reach Elasticsearch has a string, then we live normally. But if the first one arrives with an Integer, then all subsequent messages that arrive with a String are simply discarded, because the field type does not match.
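
The mapping conflict is easy to reproduce. Below is a small Go sketch that posts two documents with a conflicting level field to a local Elasticsearch with dynamic mapping; the index name is made up, and the exact error message and endpoint shape depend on the Elasticsearch version (this assumes a recent one).

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// index posts one JSON document and prints the HTTP status it got back.
func index(doc string) {
	resp, err := http.Post(
		"http://localhost:9200/logs-demo/_doc", // hypothetical index name
		"application/json",
		bytes.NewBufferString(doc),
	)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(doc, "->", resp.Status)
}

func main() {
	// The first document fixes "level" as a numeric field via dynamic mapping.
	index(`{"msg": "db connected", "level": 3}`)
	// The second document has a string in "level" and is rejected
	// with a mapping error, so the message is effectively lost.
	index(`{"msg": "db connected", "level": "info"}`)
}
```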

We started asking ourselves these questions. We decided not to look for someone to blame.

But something had to be done! The obvious thing is that we needed to establish standards. Some standards we already had; some we introduced a bit later. Fortunately, a single log format for all APIs had already been approved by then, and it is written directly into the service interaction standards. Accordingly, whoever wants their logs to be collected should write them in this format. If someone does not write logs in this format, we guarantee nothing.
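
As an illustration only, such a standardized format might be expressed as a small struct like the one below. The field names are assumptions, since the talk does not spell the format out; it only mentions a trace id, a timestamp generated by the API itself, and a truncation flag.

```go
package logformat

import (
	"encoding/json"
	"time"
)

// Entry is a sketch of what a single, standardized log format might look like.
type Entry struct {
	Timestamp time.Time `json:"timestamp"` // generated by the API itself, not by rsyslog
	Level     string    `json:"level"`     // always a string, to avoid mapping conflicts
	Message   string    `json:"message"`
	TraceID   string    `json:"traceid,omitempty"` // lets the trace UI stitch requests together
	Truncated bool      `json:"truncated"`         // set when the message had to be cut to fit a datagram
}

// Marshal renders an entry as a single JSON line.
func Marshal(e Entry) ([]byte, error) {
	return json.Marshal(e)
}
```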

Next, I wanted a single standard for how logs are written, delivered and collected: that is, where to write them and how to deliver them. The ideal situation is when all projects use the same library. There is a separate logging library for Go and a separate one for PHP, and everyone we have should use them. At the moment I would say we are about 80 percent there, but some keep on eating the cactus.

And over there (on the slide) an "SLA for log delivery" is barely starting to appear. It does not exist yet, but we are working on it. It is very convenient when infra can say: if you write in such-and-such a format to such-and-such a place, at no more than N messages per second, then we will most likely deliver it over there. That removes a lot of headaches. If there is an SLA, it is just great!

How did we start solving the problem? The main rake was td-agent: it was unclear where our logs went. Are they delivered? Are they on their way? Where are they at all? So replacing td-agent became item number one. The options for what to replace it with I have briefly listed here.

Fluentd. First, I had dealt with it at a previous job, and it crashed periodically there too. Second, it is essentially the same thing, just seen from the side.

filebeat. What was good about it for us? It is written in Go, and we have strong Go expertise, so if anything came up we could patch it ourselves. That is exactly why we did not take it: so there would not even be a temptation to start rewriting it for ourselves.

The obvious solution for a sysadmin is the various syslogs, which exist in abundance (syslog-ng/rsyslog/nxlog).

Or write something of our own, but we discarded that, just like filebeat. If you are going to write something, it is better to write something useful for the business; for delivering logs it is better to take something ready-made.

So the choice really came down to syslog-ng versus rsyslog. I leaned towards rsyslog simply because we already had Puppet classes for rsyslog, and I did not find an obvious difference between them. A syslog is a syslog. Yes, one has worse documentation, the other better; one can do things this way, the other does them differently.

A little about rsyslog. First, it is great because it has a lot of modules. It has human-readable RainerScript (a modern configuration language). An awesome bonus is that we could emulate the behavior of td-agent with its standard tools, so nothing changed for the applications: we swap td-agent for rsyslog, do not touch anything else yet, and immediately get working delivery. Next, mmnormalize is the great thing in rsyslog. It lets you parse logs without Grok and regexps: it builds an abstract syntax tree and parses logs in much the same way a compiler parses source code. This makes it very fast and light on CPU; in general, it is just a very cool thing. There are a bunch of other bonuses, but I will not dwell on them.

rsyslog also has plenty of disadvantages, roughly as many as it has bonuses. The main problems are that you need to know how to cook it, and you need to pick the right version.

We decided that we would write logs to a unix socket. Not to /dev/log, because that is a jumble of system logs and journald sits in that pipeline. So let's write to a custom socket, attach it to a separate ruleset, and not interfere with anything; everything will be transparent and understandable. That is what we actually did. The directory with these sockets is standardized and mounted into all containers. A container sees the socket it needs, opens it and writes to it.
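
A minimal sketch of the application side, assuming the socket directory is mounted into the container at a made-up path; the talk only says that the directory with sockets is standardized and forwarded to all containers.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Connect to the custom unix datagram socket that rsyslog listens on
	// via a dedicated ruleset. The path here is an illustrative assumption.
	conn, err := net.Dial("unixgram", "/run/log-sockets/api.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// One datagram per log line; rsyslog picks it up and forwards it on.
	if _, err := conn.Write([]byte(`{"level":"info","message":"hello from the container"}`)); err != nil {
		log.Println("write failed:", err)
	}
}
```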

Why not a file? Because everyone has read the article about Badushechka, who tried to forward a file into docker and found that after an rsyslog restart the file descriptor changes and docker loses that file: it keeps something else open, not the file that is actually being written to. We decided we would sidestep that problem and, at the same time, sidestep the blocking problem.

Rsyslog performs the actions shown on the slide and sends the logs either to a relay or to Kafka. Kafka is the old way. The relay is my attempt to use pure rsyslog to deliver logs: without a message queue, using standard rsyslog tools. Basically, it works.

But there are nuances with how to later stuff them into this part (Logstash/Graylog/ES). This part (rsyslog-to-rsyslog) is used between data centers. It is a compressed TCP link, which saves bandwidth and accordingly raises the chances that we will receive at least some logs from another data center when the channel is saturated. Because we have Indonesia, where everything is bad. That is where the constant problem lies.

We thought about how to actually monitor with what probability the logs written by the application reach the far end. We decided to add metrics. Rsyslog has its own statistics collection module with various counters: for example, it can show you the queue size, or how many messages came in for such-and-such an action. You can already get something out of that. It also has custom counters you can configure, which will show, for example, the number of messages some API has written. Then I wrote rsyslog_exporter in Python, and we sent it all to Prometheus and plotted graphs. We really wanted Graylog metrics too, but so far we have not had time to set them up.
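
The rsyslog_exporter mentioned here was written in Python; below is only a minimal Go sketch of the general idea, that rsyslog impstats counters get re-exposed as Prometheus metrics. It assumes impstats is configured to emit JSON lines and that those lines are piped into the program's stdin; the metric name, JSON field names and port are illustrative assumptions.

```go
package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var queueSize = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "rsyslog_queue_size",
		Help: "Current size of an rsyslog queue as reported by impstats.",
	},
	[]string{"queue"},
)

func main() {
	prometheus.MustRegister(queueSize)

	// Consume stats lines from stdin and update the gauges.
	go func() {
		scanner := bufio.NewScanner(os.Stdin)
		for scanner.Scan() {
			var stat struct {
				Name string  `json:"name"`
				Size float64 `json:"size"`
			}
			if err := json.Unmarshal(scanner.Bytes(), &stat); err != nil {
				continue // not a stats line; skip it
			}
			queueSize.WithLabelValues(stat.Name).Set(stat.Size)
		}
	}()

	// Expose the metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9104", nil))
}
```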

What problems came up? We discovered (SUDDENLY!) that our live APIs write 50k messages per second. That is the live APIs alone, without staging. Yet Graylog was showing us only 12 thousand messages per second. A reasonable question arose: where is the rest? We concluded that Graylog simply could not keep up. We looked, and indeed, Graylog together with Elasticsearch could not handle this flow.

Next, the other discoveries we made along the way.

Writes to the socket can block. How did that happen? When I was using rsyslog for delivery, at some point the channel between the data centers broke. Delivery stalled in one place, then stalled in another, and all of that backed up to a machine with APIs writing to the rsyslog socket. A queue formed. Then the queue for writing to the unix socket filled up (by default it is 128 packets), and the next write() in the application blocks. When we looked at the library we use in Go applications, it said that writing to the socket happens in non-blocking mode. We were sure nothing could block, because we had read the article about Badushechka, who wrote about this. But there is a catch: there was also an infinite loop around that call, constantly retrying to push the message into the socket, and we had not noticed it. I had to rewrite the library. It has changed several times since then, but by now we have gotten rid of blocking in all subsystems. So you can stop rsyslog and nothing will fall over.
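
A sketch of what a non-blocking writer of this kind can look like: a bounded buffer, a single draining goroutine, and a counter instead of blocking. This is an illustration of the approach, not the actual Lazada library; the type names and parameters are assumptions.

```go
package asynclog

import (
	"net"
	"sync/atomic"
)

// Writer buffers log messages and never blocks the caller: messages go into a
// bounded channel, one goroutine drains it into the unix socket, and anything
// that does not fit is dropped and counted.
type Writer struct {
	buf     chan []byte
	dropped uint64
}

func New(socketPath string, bufSize int) (*Writer, error) {
	conn, err := net.Dial("unixgram", socketPath)
	if err != nil {
		return nil, err
	}
	w := &Writer{buf: make(chan []byte, bufSize)}
	go func() {
		for msg := range w.buf {
			// If rsyslog is stopped or slow, this write may fail;
			// the loss is counted instead of blocking the application.
			if _, err := conn.Write(msg); err != nil {
				atomic.AddUint64(&w.dropped, 1)
			}
		}
	}()
	return w, nil
}

// Write never blocks: if the buffer is full, the message is dropped and counted.
func (w *Writer) Write(msg []byte) {
	select {
	case w.buf <- msg:
	default:
		atomic.AddUint64(&w.dropped, 1)
	}
}

// Dropped reports how many messages were lost; exporting this counter to
// monitoring is what makes the loss visible.
func (w *Writer) Dropped() uint64 {
	return atomic.LoadUint64(&w.dropped)
}
```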

You have to monitor the size of the queues; that helps you avoid stepping on this rake. First, we can see when we start losing messages. Second, we can see that we have delivery problems at all.

Another unpleasant point: in a microservice architecture it is very easy to get a tenfold amplification. We do not have that many incoming requests, but because of the graph those requests travel along, and because of the access logs, we actually multiply the logging load by roughly ten. Unfortunately I did not have time to calculate the exact numbers, but microservices are what they are. This has to be kept in mind. It turns out that at the moment the log collection subsystem is the most loaded subsystem in Lazada.

How do you solve the Elasticsearch problem? If you need to quickly get all the logs in one place, so that you do not have to run around every machine collecting them, use file storage. It is guaranteed to work, and it can be done on any server: you just need to stick disks into it and install syslog. After that, you are guaranteed to have all the logs in one place. Then you can take your time configuring Elasticsearch, Graylog or something else, but you will already have all the logs, and on top of that you can keep them for as long as the disk arrays hold out.

By the time of this talk, the scheme had come to look like this. We have practically stopped writing to files, and now we will most likely turn off what remains: on the local machines running the APIs, we will stop writing to files altogether. First, there is the file storage, which works very well. Second, those machines keep running out of space, and you have to monitor them constantly.

This part with Logstash and Graylog is a real pain, so we need to get rid of it; we have to pick one of the two.

We decided to drop Logstash and Kibana. Why? Because we have a security department. What is the connection? The connection is that Kibana without X-Pack and without Shield does not let you differentiate access rights to the logs. So we took Graylog, which has all of that. I do not like it, but it works. We bought new hardware, installed a fresh Graylog there, and moved all the logs with strict formats to a separate Graylog. The problem of different types in the same fields we solved organizationally.

What exactly went into the new Graylog? We simply put everything in docker. We took a bunch of servers, rolled out three Kafka instances and 7 Graylog servers of version 2.3 (because I wanted Elasticsearch version 5). All of this was brought up on HDD RAIDs. We saw an indexing rate of up to 100 thousand messages per second, and a figure of 140 terabytes of data per week.

And again a rake! We had two sales coming up. The volume went past 6 million messages. Graylog could not chew through it in time. Somehow we had to survive again.

This is how we survived: we added a few more servers and SSDs. At the moment we live like this. We are now chewing through 160k messages per second. We have not hit the ceiling yet, so it is unclear how much we can realistically squeeze out of this.

These are our plans for the future. The most important of them is probably high availability. We do not have it yet: several machines are configured identically, but so far everything goes through a single machine. We need to spend the time to set up failover between them.

Collect metrics from Graylog.

Add a rate limit, so that a single API gone crazy does not kill our bandwidth and everything else.

And finally, sign some kind of SLA with the developers saying that we can serve this much; if you write more, sorry.

And write documentation.

Briefly, the takeaways from everything we went through. First: standards. Second: syslog rocks. Third: rsyslog works exactly as written on the slide. And now, on to the questions.

Questions.

Question: Why did you decide not to take ... (filebeat?)

Response: It needs the logs to be written to a file, and I really did not want that. When your API writes thousands of messages per second, even rotating once an hour is still not an option. You could write to a pipe, but the developers asked me: "What happens if the process we are writing to falls over?" I simply could not find an answer, so I said: "Well, ok, let's not do that."

Question: Why don't you just write logs to HDFS?

Response: That is the next step. We thought about it at the very beginning, but since there are no resources to deal with it at the moment, it sits in our long-term plans.

Question: A columnar format would be more suitable.

Response: I understand. We are all for it.

Question: You write to rsyslog. It supports both TCP and UDP. But if it is UDP, how do you guarantee delivery?

Response: There are two points. First, I tell everyone right away that we do not guarantee log delivery. Because when developers come and say: "Let's start writing financial data there, and you will put it somewhere for us in case something happens," we answer: "Great! Let's start blocking on the write to the socket and do it in transactions, so that you are guaranteed to have put it into the socket for us and to have confirmed that we received it on the other side." And at that moment everyone immediately stops needing it. And if it is not needed, then what questions can there be to us? If you do not want to guarantee the write to the socket, why would we guarantee delivery? We make a best effort: we really try to deliver as much as possible and as well as possible, but we do not give a 100% guarantee. So do not write financial data there; there are transactional databases for that.

Question: When an API writes a message to the log and hands control over to other microservices, have you run into the problem of messages from different microservices arriving in the wrong order? It causes confusion.

Response: It is normal that they arrive in a different order; you have to be ready for that. Network delivery does not guarantee ordering, or you have to spend extra resources on it. If we take the file storage, each API saves its logs into its own file. Or rather, rsyslog sorts them into directories there. Each API has its own logs there, which you can go and look at and then compare using the timestamp inside the log. If people go looking in Graylog, there the logs will be sorted by timestamp, and everything will be fine.

Question: The timestamps may differ by milliseconds.

Response: The timestamp is generated by the API itself; that is, in fact, the whole point. We have NTP. The API puts the timestamp into the message itself; rsyslog does not add it.

Question: The interaction between data centers is not quite clear. Within a data center it is clear how the logs are collected and processed. But how do the data centers interact? Or does each data center live its own life?

Response: Almost. Each country is hosted in one data center; we do not currently spread a single country across different data centers, so there is no need to combine them. Inside each data center there is a Log Relay. It is an rsyslog server, in fact two management machines set up identically, but for now traffic just goes through one of them. It aggregates all the logs and has a disk queue just in case. It compresses the logs and sends them to the central data center (Singapore), where they are then pushed into Graylog. And each data center has its own file storage, so in case we lose connectivity, all the logs stay there. They will remain there; they will be stored there.

Question: Do you get logs from there during abnormal situations?

Response: You can go there (to the file storage) and look.

Question: How do you monitor that you are not losing logs?

Response: We actually are losing them, and we monitor it. The monitoring started a month ago. The library the Go APIs use has metrics: it can count how many times it failed to write to the socket. At the moment there is a tricky heuristic: there is a buffer, and the library tries to write messages from it to the socket. If the buffer overflows, it starts dropping them, and it counts how many it has dropped. If those counters start to climb, we will know about it. They now also go to Prometheus, and you can see the graphs in Grafana. You can set up alerts, but it is not yet clear whom to send them to.

Question: In Elasticsearch you store logs with redundancy. How many replicas do you have?

Response: One replica.

Question: So that is just a single copy?

Response: This is the master and replica. The data is stored in duplicate.

Question: Did you tweak the size of the rsyslog buffer somehow?

Response: We write datagrams into a custom unix socket. That immediately imposes a limit of 128 kilobytes on us; we cannot write more than that into it. We have written this into the standard: whoever wants to get into storage writes at most 128 kilobytes. On top of that, the libraries truncate the message and set a flag saying it was truncated. The message standard itself has a special field showing whether the message was cut off when it was written or not, so we have a way to track that.
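
A minimal sketch of that truncation logic, with the exact limit constant and the flag handling as assumptions:

```go
package logformat

// maxDatagram is the per-message size limit mentioned in the talk for a
// single unix socket datagram.
const maxDatagram = 128 * 1024

// Truncate cuts a message so that it fits into one datagram and reports
// whether it had to be cut; the caller records that in the "truncated"
// field of the standard format, so the loss stays visible in storage.
func Truncate(msg []byte) (out []byte, truncated bool) {
	if len(msg) <= maxDatagram {
		return msg, false
	}
	return msg[:maxDatagram], true
}
```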

Question: Do you write broken JSON?

Response: Broken JSON is discarded either at the relay, because the packet is too large, or by Graylog, because it cannot parse the JSON. But there are nuances here that need fixing, and they are mostly tied to rsyslog. I have already filed a few issues there that still need to be worked through.

Question: Why Kafka? Have you tried RabbitMQ? Does Graylog not hold up under such loads?

Response: With Graylog it does not quite work out; Graylog does fold under the load. It is genuinely hard for it. It is a thing of its own and, in fact, not really needed. I would rather write from rsyslog straight to Elasticsearch and then look at the data in Kibana. But we need to settle the issue with the security folks. That is a possible direction of our development: throwing out Graylog and using Kibana. Logstash would make no sense, because I can do the same with rsyslog, which has a module for writing to Elasticsearch. For now we are trying to live with Graylog somehow; we even tuned it a bit. But there is still room for improvement.

About Kafka: that is how it happened historically. When I arrived, it was already there and logs were already being written to it. We just brought up our own cluster and moved the logs into it. We manage it, we know how it feels. As for RabbitMQ... we are having trouble with RabbitMQ, and RabbitMQ is still maturing in our setup. We have it in production, and there were problems with it. Now, before a sale, it gets shamanized into working normally, but before that I was not ready to release it into production. There is one more point: Graylog can read AMQP 0.9 and rsyslog can write AMQP 1.0, and there is not a single solution in the middle that can do both; it is either one or the other. So for now it is only Kafka. But there are nuances there too: the omkafka of the rsyslog version we use can lose the entire message buffer it has scooped up from rsyslog. For now we put up with it.

Question: Are you using Kafka just because you already had it? It is not used for anything else?

Response: The Kafka in question was the one used by the Data Science team. That is a completely separate project, about which, unfortunately, I cannot say anything; I do not know, it was run by the Data Science team. When the logging effort started, they decided to use it so they would not have to stand up their own. Then we updated Graylog and lost compatibility, because that Kafka was an old version, so we had to set up our own. At the same time we got rid of those four topics per API: we made one wide topic for everything live and one wide topic for everything staging, and we just shoot everything in there. Graylog rakes all of it out in parallel.

Question: Why all this shamanism with sockets? Have you tried using the syslog log driver for containers?

Response: At the time we asked ourselves that question, our relationship with docker was tense. It was docker 1.0 or 0.9, and docker itself was flaky. Secondly, if you also shove logs into it... I have an unverified suspicion that it pipes all the logs through itself, through the docker daemon. If one API goes crazy, the rest of the APIs will run into the fact that they cannot send stdout and stderr. I do not know where that would lead. I have a gut-level suspicion that the docker syslog driver should not be used in this place. Our functional testing department has its own Graylog cluster with logs; they use the docker log drivers and everything seems fine there, but they write GELF straight to Graylog. At the moment we started all this, we needed it to just work. Perhaps later, when someone comes and says it has been working fine for a hundred years, we will try it.

Question: You deliver between data centers using rsyslog. Why not over Kafka?

Response: We do both, and that is really how it is, for two reasons. If the channel is completely dead, then all our logs, even compressed, will not squeeze through it, and Kafka lets us simply lose some of them along the way. That way we avoid those logs piling up; in that case we just use Kafka directly. If we have a good channel and want to free it up, then we use rsyslog. But in fact you can configure rsyslog so that it drops whatever did not get through. At the moment we just use direct rsyslog delivery in some places and Kafka in others.

Source: habr.com
