Bioyino - distributed, scalable metrics aggregator

So, you collect metrics. So do we. We collect metrics too, the ones the business needs, of course. Today we will talk about the very first link of our monitoring system: bioyino, a statsd-compatible aggregation server, why we wrote it, and why we abandoned brubeck.


From our previous articles (1, 2) you can learn that until a certain point we collected metrics using brubeck. It is written in C. From a code point of view it is dead simple (which matters when you want to contribute) and, most importantly, it handles our volumes of 2 million metrics per second (MPS) at peak without any problems. The documentation claims support for 4 million MPS, with an asterisk: you will get the stated figure only if you configure the network on Linux correctly. (How many MPS you get if you leave the network as is, we don't know.) Despite these advantages, we had several serious complaints about brubeck.

Claim 1. GitHub, the developer of the project, stopped supporting it: publishing patches and fixes, and accepting our (and not only our) PRs. In the last few months (roughly since February-March 2018) activity has resumed, but before that there were almost two years of complete silence. In addition, the project is developed for GitHub's internal needs, which can become a serious obstacle to introducing new features.

Claim 2. Accuracy of calculations. Brubeck collects a total of 65536 values for aggregation. In our case, for some metrics, far more values than that can arrive during the aggregation period (30 seconds). As a result of this sampling, the maximum and minimum values turn out to be useless. For example, like this:

[Chart: as it was]

[Chart: how it should have been]

For the same reason, sums are calculated incorrectly in general. Add a bug with a 32-bit float overflow, which reliably sends the server into a segfault when it receives a seemingly innocent metric, and everything becomes just great. That bug, by the way, has still not been fixed.
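A quick illustration of the float problem (a generic Rust sketch, not code from brubeck or bioyino): a 32-bit float accumulator silently stops growing once it crosses 2^24, and a couple of large but perfectly legal values push it straight to infinity.

```rust
// A small demonstration of why summing counters in 32-bit floats goes wrong.
// Generic illustration only, not code from brubeck or bioyino.
fn main() {
    // 1. Precision loss: above 2^24, f32 cannot represent every integer,
    //    so adding 1.0 thirty million times stops changing the accumulator.
    let mut sum32: f32 = 0.0;
    let mut sum64: f64 = 0.0;
    for _ in 0..30_000_000u32 {
        sum32 += 1.0;
        sum64 += 1.0;
    }
    println!("f32 sum = {sum32}, f64 sum = {sum64}");
    // Prints: f32 sum = 16777216, f64 sum = 30000000

    // 2. Overflow: adding two large but valid values pushes the accumulator
    //    to infinity, which downstream code may not expect.
    let big: f32 = 3.0e38;
    println!("{}", big + big); // inf
}
```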

Finally, Claim X. At the time of writing, we are ready to present it to all 14 more or less working statsd implementations we were able to find. Imagine that a single piece of infrastructure has grown so much that accepting 4 million MPS is no longer enough. Or even if it has not grown yet, but the metrics are already so important to you that even short, 2-3 minute dips in the charts become critical and cause bouts of insurmountable depression among managers. Since treating depression is a thankless task, technical solutions are needed.

Firstly, fault tolerance, so that a sudden problem on one server does not cause a psychiatric zombie apocalypse in the office. Secondly, scaling, to be able to accept more than 4 million MPS without digging deep into the Linux network stack, calmly growing "in breadth" to the required size.

Since we still had headroom for scaling, we decided to start with fault tolerance. "Oh! Fault tolerance! That's easy, we can do it," we thought, and launched two servers, running a copy of brubeck on each. To do this, we had to copy the traffic with metrics to both servers and even write a small utility for it. This solved the fault tolerance problem, but... not very well. At first everything seemed great: each brubeck collects its own version of the aggregation and writes data to Graphite once every 30 seconds, overwriting the previous interval (this is done on the Graphite side). If one server suddenly fails, we always have a second one with its own copy of the aggregated data. But here's the problem: when a server fails, a "saw" appears on the graphs. This is because brubeck's 30-second intervals are not synchronized, and at the moment of a crash one of them is not overwritten. When the second server starts, the same thing happens. Quite tolerable, but we want better! The scalability problem has not gone away either: all metrics still "fly" to a single server, so we are limited to the same 2-4 million MPS, depending on the network setup.
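We are not publishing that utility here, but for a sense of scale, a minimal sketch of such a traffic duplicator could look like this: listen on a UDP port and forward every datagram, unchanged, to two statsd instances. The port and addresses below are placeholders.

```rust
// A minimal sketch of a UDP "traffic duplicator": every incoming datagram is
// forwarded, unchanged, to two statsd/brubeck instances. Addresses are
// placeholders; this is an illustration, not the utility we actually used.
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let listen = UdpSocket::bind("0.0.0.0:8125")?;
    let targets = ["10.0.0.1:8126", "10.0.0.2:8126"]; // two brubeck servers

    let mut buf = [0u8; 65536];
    loop {
        let (len, _src) = listen.recv_from(&mut buf)?;
        for target in &targets {
            // Best effort: a failed send to one replica must not stop the other.
            let _ = listen.send_to(&buf[..len], target);
        }
    }
}
```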

If you think about the problem a little while shoveling snow, the following obvious idea may come to mind: you need a statsd that can work in distributed mode, one that synchronizes nodes in both time and metrics. "Of course, such a solution probably already exists," we said, and went to Google... And found nothing. After going through the documentation for the various statsd implementations (https://github.com/etsy/statsd/wiki#server-implementations, as of 11.12.2017), we found absolutely nothing. Apparently, neither the developers nor the users of these solutions have encountered THIS many metrics yet, otherwise they would surely have come up with something.

And then we remembered the "toy" statsd, bioyino, which was written at the Just for Fun hackathon (the name of the project was generated by a script before the start of the hackathon), and realized that we urgently needed our own statsd. Why?

  • because there are too few statsd clones in the world,
  • because it is possible to provide the desired or close to the desired fault tolerance and scalability (including synchronizing aggregated metrics between servers and solving the problem of sending conflicts),
  • because it is possible to calculate metrics more accurately than brubeck does,
  • because you can collect more detailed statistics yourself, which brubeck practically did not provide to us,
  • because I got the chance to write my own hyper-performant, distributed, scalable application, which will not completely repeat the architecture of another similar hyper... well, that's it.

What should we write it in? Rust, of course. Why?

  • because there was already a prototype solution,
  • because the author of the article already knew Rust at that time and was eager to write something in it for production with the opportunity to put it in open-source,
  • because languages with a GC do not suit us due to the nature of the incoming traffic (almost realtime), and GC pauses are practically unacceptable,
  • because we need maximum performance, comparable to C,
  • because Rust gives us fearless concurrency, and if we had started writing this in C/C++, we would have raked in even more vulnerabilities, buffer overflows, race conditions and other scary words than brubeck has.

There was also an argument against Rust. The company had no experience creating projects in Rust, and we still do not plan to use it in the main product. Therefore, there were serious doubts that anything would come of it, but we decided to take a chance and tried.

Time passed...

Finally, after several failed attempts, the first working version was ready. What happened? This is what happened.

[Architecture diagram]

Each node receives its own set of metrics and accumulates them, without aggregating metrics of the types for which the full set of values is required for final aggregation. The nodes are connected to each other by a kind of distributed-lock protocol, which allows selecting among them the single one (here we cried) worthy of sending metrics to the Great One. This problem is currently solved by Consul, but in the future the author's ambitions extend to our own implementation of Raft, where the most worthy node will, of course, be the consensus leader. Besides the consensus, nodes quite often (once per second by default) send their neighbors the parts of pre-aggregated metrics they managed to collect during that second. It turns out that scaling and fault tolerance are preserved: each node still holds a full set of metrics, but the metrics are sent already aggregated, over TCP and encoded into a binary protocol, so the cost of duplication is significantly lower than with UDP. Despite the fairly large number of incoming metrics, accumulation requires very little memory and even less CPU. For our highly compressible metrics this is only a few tens of megabytes of data. As an additional bonus, we get no unnecessary data rewrites in Graphite, as was the case with brubeck.
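To give a feel for the leader-election part, here is a minimal sketch of the usual Consul session/lock pattern over its HTTP API. This is not bioyino's actual code: the key name and TTL are made up, and the blocking reqwest client (with its "json" feature) and serde_json are assumed dependencies.

```rust
// Illustrative sketch of leader election via a Consul session lock over the
// HTTP API. Not bioyino's actual code: key name, TTL and dependencies
// (blocking reqwest with the "json" feature, serde_json) are assumptions.
use std::{thread, time::Duration};

const CONSUL: &str = "http://127.0.0.1:8500";
const LOCK_KEY: &str = "bioyino/leader"; // hypothetical key name

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let http = reqwest::blocking::Client::new();

    // 1. Create a session with a TTL: if the node dies, the session expires
    //    and the lock is released automatically.
    let session: serde_json::Value = http
        .put(format!("{CONSUL}/v1/session/create"))
        .json(&serde_json::json!({ "TTL": "15s", "Behavior": "release" }))
        .send()?
        .json()?;
    let session_id = session["ID"]
        .as_str()
        .ok_or("Consul did not return a session ID")?
        .to_owned();

    loop {
        // 2. Try to take the lock; Consul answers with plain `true` or `false`.
        let is_leader: bool = http
            .put(format!("{CONSUL}/v1/kv/{LOCK_KEY}?acquire={session_id}"))
            .send()?
            .json()?;

        // Only the lock holder ships aggregated metrics to Graphite;
        // everyone else keeps accumulating and exchanging snapshots.
        println!("leader = {is_leader}");

        // 3. Keep the session alive and re-check periodically.
        http.put(format!("{CONSUL}/v1/session/renew/{session_id}"))
            .send()?;
        thread::sleep(Duration::from_secs(5));
    }
}
```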

UDP packets with metrics are balanced between nodes on the network equipment with a simple Round Robin. Of course, the network hardware does not parse the contents of the packets and can therefore handle much more than 4M packets per second, to say nothing of metrics, about which it knows nothing at all. If we take into account that metrics do not arrive one per packet, we do not foresee any performance problems in this place. If a server crashes, the network device quickly (within 1-2 seconds) detects this and removes the crashed server from rotation. As a result, passive (i.e., non-leader) nodes can be turned on and off with practically no visible drawdowns on the charts. The most we lose is the part of the metrics that arrived in the last second. A sudden loss/shutdown/switch of the leader will still create a minor anomaly (the 30-second interval is still out of sync), but if there is communication between nodes, these problems can be minimized, for example, by sending out synchronization packets.

A little about the internal structure. The application is, of course, multithreaded, but the threading architecture differs from the one used in brubeck. The threads in brubeck are all the same: each of them is responsible for both collecting and aggregating information. In bioyino, workers are divided into two groups: those responsible for the network and those responsible for aggregation. This division lets you tune the application more flexibly depending on the type of metrics: where intensive aggregation is required, you add aggregators; where there is a lot of network traffic, you add network threads. At the moment, on our servers we run 8 network and 4 aggregation threads.
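As a rough illustration of this split (a toy sketch, not bioyino's worker code; crossbeam-channel is an assumed dependency used here as a multi-consumer queue):

```rust
// A toy sketch of the two worker groups: network threads do nothing but read
// datagrams into buffers, counting threads parse and aggregate them.
// Not bioyino's actual code; crossbeam-channel is an assumed dependency.
use std::net::UdpSocket;
use std::thread;

const NET_THREADS: usize = 8; // the numbers mentioned in the article
const AGG_THREADS: usize = 4;

fn main() -> std::io::Result<()> {
    let (tx, rx) = crossbeam_channel::unbounded::<Vec<u8>>();

    // Counting threads: receive filled buffers and aggregate them.
    for _ in 0..AGG_THREADS {
        let rx = rx.clone();
        thread::spawn(move || {
            for buf in rx.iter() {
                // parse statsd lines and update per-metric accumulators here
                let _ = buf;
            }
        });
    }

    // Network threads: read from the socket and hand the buffers off.
    let socket = UdpSocket::bind("0.0.0.0:8125")?;
    let mut handles = Vec::new();
    for _ in 0..NET_THREADS {
        let socket = socket.try_clone()?;
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let mut buf = vec![0u8; 65536];
            loop {
                if let Ok(len) = socket.recv(&mut buf) {
                    let _ = tx.send(buf[..len].to_vec());
                }
            }
        }));
    }
    for h in handles {
        let _ = h.join();
    }
    Ok(())
}
```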

The counting (aggregation) part is quite boring. The buffers filled by the network threads are distributed among the counting threads, where they are subsequently parsed and aggregated. On request, metrics are handed over for sending to other nodes. All of this, including sending data between nodes and working with Consul, is performed asynchronously and runs on the Tokio framework.
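For reference, the parsing step deals with the plain statsd text format, one metric per line in the form name:value|type, optionally followed by a sample rate. A minimal sketch of such a parser (bioyino's real parser works over raw byte buffers; this string-based version is only for illustration):

```rust
// A minimal sketch of parsing one statsd line of the form "name:value|type"
// (optionally "...|@rate"). Not bioyino's actual parser.
#[derive(Debug)]
struct Metric<'a> {
    name: &'a str,
    value: f64,
    kind: &'a str, // "c", "ms", "g", "s", ...
    sample: Option<f64>,
}

fn parse(line: &str) -> Option<Metric<'_>> {
    let (name, rest) = line.split_once(':')?;
    let mut parts = rest.split('|');
    let value: f64 = parts.next()?.parse().ok()?;
    let kind = parts.next()?;
    let sample = parts
        .next()
        .and_then(|s| s.strip_prefix('@'))
        .and_then(|s| s.parse().ok());
    Some(Metric { name, value, kind, sample })
}

fn main() {
    // A counter incremented by 1000, sampled at 50%.
    println!("{:?}", parse("backend.requests:1000|c|@0.5"));
    // A timer value in milliseconds.
    println!("{:?}", parse("backend.response_time:320|ms"));
}
```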

Much more trouble during development came from the network part responsible for receiving metrics. The main goal of splitting the network threads into separate entities was the desire to reduce the time a thread spends not reading data from the socket. Options using asynchronous UDP and plain recvmsg quickly fell away: the first eats too much user-space CPU on event processing, the second requires too many context switches. So recvmmsg is now used, with large buffers (and buffers, gentlemen officers, are no small thing!). Support for plain UDP is reserved for light cases where recvmmsg is not needed. In multimessage mode it is possible to achieve the main thing: the vast majority of the time the network thread is draining the OS queue, reading data from the socket and moving it into a userspace buffer, and only occasionally switches to hand the filled buffer over to the aggregators. The queue in the socket practically does not build up, and the number of dropped packets practically does not grow.
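For the curious, here is a bare-bones, Linux-only sketch of what reading in multimessage mode looks like through libc::recvmmsg: one syscall fills a whole batch of buffers, and MSG_WAITFORONE makes it block only until at least one packet has arrived. The batch size, buffer size and port are illustrative; this is not bioyino's actual network code.

```rust
// Illustrative only: batch-receiving UDP packets with a single recvmmsg(2)
// call instead of one recv per packet. Linux-only, requires the libc crate.
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

const BATCH: usize = 64;      // packets drained per syscall (illustrative)
const BUF_SIZE: usize = 1500; // one MTU-sized buffer per packet

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:8125")?;
    let fd = socket.as_raw_fd();

    // One buffer and one iovec per message slot in the batch.
    let mut buffers = vec![[0u8; BUF_SIZE]; BATCH];
    let mut iovecs: Vec<libc::iovec> = buffers
        .iter_mut()
        .map(|buf| libc::iovec {
            iov_base: buf.as_mut_ptr() as *mut libc::c_void,
            iov_len: BUF_SIZE,
        })
        .collect();
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut msg: libc::mmsghdr = unsafe { std::mem::zeroed() };
            msg.msg_hdr.msg_iov = iov as *mut libc::iovec;
            msg.msg_hdr.msg_iovlen = 1;
            msg
        })
        .collect();

    loop {
        // Block until at least one packet arrives, then take up to BATCH at once.
        let received = unsafe {
            libc::recvmmsg(
                fd,
                msgs.as_mut_ptr(),
                BATCH as libc::c_uint,
                libc::MSG_WAITFORONE as _,
                std::ptr::null_mut(),
            )
        };
        if received < 0 {
            return Err(std::io::Error::last_os_error());
        }
        for i in 0..received as usize {
            let len = msgs[i].msg_len as usize;
            // Hand buffers[i][..len] over to an aggregation thread here.
            let _packet = &buffers[i][..len];
        }
    }
}
```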

Note

In the default settings, the buffer size is set quite large. If you decide to try the server yourself, you may run into a situation where, after sending a small number of metrics, they do not arrive in Graphite and remain sitting in the network thread's buffer. To work with a small number of metrics, set bufsize and task-queue-size to smaller values in the config.

Finally, some charts for chart lovers.

Statistics on the number of incoming metrics for each server: more than 2 million MPS.


Disabling one of the nodes and redistributing incoming metrics.


Statistics on outgoing metrics: only one node always sends - the raid boss.


Statistics of the operation of each node, taking into account errors in various system modules.


Detailing of incoming metrics (metric names are hidden).


What are we planning to do with all this next? Write code, of course, damn it...! The project was originally planned to be open source and will remain so throughout its life. Our immediate plans include switching to our own implementation of Raft, changing the peer protocol to a more portable one, adding more internal statistics, new metric types, bug fixes and other improvements.

Of course, everyone is welcome to help with the project's development: create PRs and issues, and we will respond and improve things as far as we can.

With that being said, that's all folks, buy our elephants!



Source: habr.com
