Another monitoring system

16 modems, 4 cellular operators = Upload speed 933.45 Mbps

Introduction

Hello! This article is about how we wrote a new monitoring system for ourselves. It differs from existing systems in its ability to collect metrics synchronously at high frequency and in its very low resource consumption. The polling interval can be as short as 0.1 milliseconds, with a timing accuracy of 10 nanoseconds between metrics. All the binaries together take up 6 megabytes.

About Orchard

We have a rather specific product: a comprehensive solution for aggregating the throughput and fault tolerance of data transmission channels. When there are several channels, say Operator1 (40 Mbps) + Operator2 (30 Mbps) + something else (5 Mbps), the result is a single stable and fast channel whose speed works out to roughly (40+30+5)×0.92 = 75×0.92 = 69 Mbps.
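
As a back-of-the-envelope check of that arithmetic, here is a tiny Go function (our own illustration; the function name and the 0.92 efficiency factor taken as a parameter are assumptions lifted from the example above) that estimates the speed of a bonded link:

    package main

    import "fmt"

    // aggregateMbps estimates the speed of a bonded link: the sum of the
    // individual channel speeds multiplied by a bonding efficiency factor
    // (0.92 in the example above).
    func aggregateMbps(channels []float64, efficiency float64) float64 {
        var sum float64
        for _, c := range channels {
            sum += c
        }
        return sum * efficiency
    }

    func main() {
        // Operator1 (40 Mbps) + Operator2 (30 Mbps) + something else (5 Mbps)
        fmt.Printf("%.1f Mbps\n", aggregateMbps([]float64{40, 30, 5}, 0.92)) // prints 69.0 Mbps
    }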

Such solutions are in demand wherever the capacity of any single channel is insufficient: transport, video surveillance and real-time video streaming, live television and radio broadcasting, and any out-of-town sites where the only telecom options are the Big Four operators and the speed of a single modem/channel is not enough.
For each of these areas we release a separate line of devices, but their software is almost identical, and a high-quality monitoring system is one of its core modules; without a correct implementation of it the product would not be possible.

Over several years we have managed to build a multi-level, fast, cross-platform and lightweight monitoring system, which is what we want to share with the community.

Formulation of the problem

The monitoring system collects metrics of two fundamentally different classes: real-time metrics and everything else. The requirements for it were as follows:

  1. High-frequency synchronous acquisition of real-time metrics and their transfer to the communication control system without delay.
    High frequency and synchronization of different metrics are not just important, they are vital for analyzing the entropy of data transmission channels. If the average delay in one channel is 30 milliseconds, then a synchronization error of just one millisecond between the other metrics degrades the resulting channel speed by about 5%. If we miss the timing by 1 millisecond on 4 channels, the speed degradation can easily reach 30%. In addition, the entropy in the channels changes very quickly, so if we measure it less often than once every 0.5 milliseconds, we get severe speed degradation on fast channels with low delay. Of course, such accuracy is not needed for all metrics and not in all conditions. When the delay in a channel is 500 milliseconds, and we do work with such channels, an error of 1 millisecond is hardly noticeable. Likewise, for the metrics of the life-support subsystems a polling and synchronization interval of 2 seconds is enough for us, but the monitoring system itself must be able to work with ultra-high polling rates and ultra-precise synchronization of metrics (see the sketch after this list).
  2. Minimal resource consumption and a single stack.
    The end device can be either a powerful on-board system that analyzes the road situation or performs biometric identification of people, or a palm-sized single-board computer that a special forces soldier wears under a bulletproof vest to transmit video in real time over poor communication links. Despite this variety of architectures and computing power, we want the same software stack everywhere.
  3. Umbrella architecture.
    Metrics must be collected and aggregated on the end device, with local storage and both real-time and retrospective visualization. When there is a connection, the data is transferred to the central monitoring system; when there is no connection, the send queue must accumulate without consuming RAM.
  4. An API for integration into the customer's monitoring system, because nobody needs many monitoring systems. The customer must be able to collect data from any devices and networks into a single monitoring system.
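
To make requirement 1 more concrete, here is a minimal sketch in Go (our own illustration, not the actual code of the system; all names are hypothetical) of sampling several metrics on a shared 0.5-millisecond tick and stamping each reading with a nanosecond-resolution timestamp:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // Sample is one reading of one metric, stamped with a nanosecond-resolution time.
    type Sample struct {
        Metric string
        Value  float64
        TS     time.Time
    }

    // collect starts one goroutine per metric on every tick; all readings of a tick
    // are taken as close to simultaneously as the scheduler allows.
    func collect(readers map[string]func() float64, period time.Duration, out chan<- Sample) {
        tick := time.NewTicker(period)
        defer tick.Stop()
        for range tick.C {
            var wg sync.WaitGroup
            for name, read := range readers {
                wg.Add(1)
                go func(name string, read func() float64) {
                    defer wg.Done()
                    out <- Sample{Metric: name, Value: read(), TS: time.Now()}
                }(name, read)
            }
            wg.Wait()
        }
    }

    func main() {
        out := make(chan Sample, 1024)
        go collect(map[string]func() float64{
            "rtt_ms":   func() float64 { return 30.0 }, // stub readers for the sketch
            "loss_pct": func() float64 { return 0.1 },
        }, 500*time.Microsecond, out) // 0.5 ms polling interval

        for s := range out {
            fmt.Println(s.TS.UnixNano(), s.Metric, s.Value)
        }
    }

In a real system the readers would query network interfaces rather than return stubs, and the achievable accuracy depends on the OS scheduler and clock source.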

What happened

So as not to overload an already impressive long read, I will not give examples and measurements of all the monitoring systems we tried; that would need a separate article. Suffice it to say that we could not find a monitoring system able to take two metrics simultaneously with an error of less than 1 millisecond while working equally well on an ARM board with 64 MB of RAM and an x86_64 machine with 32 GB of RAM. So we decided to write our own, which can do exactly that. Here is what we got:

The summation of the throughput of three channels for different network topologies


Visualization of some key metrics


Architecture

As the main programming language, both on the device and in the data center, we use Golang. It has made life much easier with its built-in concurrency and the ability to ship one statically linked executable binary per service. As a result, we save significantly on resources, on the traffic and tooling needed to deploy services to end devices, and on development and debugging time.

The system is implemented according to the classical modular principle and contains several subsystems:

  1. Registration of metrics.
    Each metric is served by its own thread and synchronized with the other metrics. We managed to achieve synchronization accuracy of 10 nanoseconds.
  2. Storage of metrics
    We chose between writing our own time-series storage and using something off the shelf. The database is needed for retrospective data that will later be visualized. That is, it does not hold channel-delay readings taken every 0.5 milliseconds or error indications from the transport network, but it does hold the speed on each interface every 500 milliseconds. Besides the strict requirements for cross-platform support and low resource consumption, it is extremely important for us to be able to process data where it is stored; this saves a lot of computing resources. We have been using the Tarantool DBMS in this project since 2016, and so far we see no replacement for it on the horizon. It is flexible, with reasonable resource consumption and more than adequate technical support. Tarantool also has a GIS module; it is not as powerful as PostGIS, but it is enough for our task of storing location-related metrics (relevant for transport).
  3. Visualization of metrics
    Everything here is relatively simple. We take data from the storage and show it either in real time or retrospectively.
  4. Synchronization of data with the central monitoring system.
    The central monitoring system receives data from all devices, stores it for a configured retention period, and passes it through the API to the customer's monitoring system. Unlike classical monitoring systems, where the "head" goes around and collects data, we use the reverse scheme: devices push data themselves when there is a connection. This is a very important point, since it lets us receive data from a device for the periods when it was unreachable, and avoids loading channels and resources while the device is unavailable. We use InfluxDB as the central monitoring server. Unlike its analogues, it can import retrospective data, that is, data with a timestamp different from the moment the metric was received (a small example follows this list). The collected metrics are visualized by Grafana, hand-tuned to our needs. This standard stack was also chosen because it has ready-made APIs for integration with almost any customer monitoring system.
  5. Data synchronization with the central device management system.
    The device management system implements Zero Touch Provisioning (firmware updates, configuration, etc.) and, unlike the monitoring system, receives only device problems. These are triggers from the on-board hardware watchdog services and all the life-support metrics: CPU and SSD temperature, CPU load, free space and SMART health of the disks. This subsystem's storage is also built on Tarantool. This gives us significant speed when aggregating time series across thousands of devices, and it completely solves the issue of data synchronization with those devices. Tarantool has a great queue system with guaranteed delivery, and we got this important feature out of the box.
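
As an illustration of the retrospective import mentioned in point 4, here is a small Go sketch (our own example; the database name, measurement and device names are made up) that writes a point with an explicit past timestamp to InfluxDB 1.x over its HTTP write API using line protocol:

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "strings"
        "time"
    )

    // writeRetrospective pushes a batch of points to InfluxDB 1.x over its HTTP
    // write API using line protocol. Each point carries its own nanosecond
    // timestamp, so data collected while the device was offline keeps the time
    // at which it was actually measured.
    func writeRetrospective(baseURL, db string, lines []string) error {
        url := fmt.Sprintf("%s/write?db=%s&precision=ns", baseURL, db)
        body := strings.NewReader(strings.Join(lines, "\n"))
        resp, err := http.Post(url, "text/plain", body)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusNoContent {
            return fmt.Errorf("influxdb returned %s", resp.Status)
        }
        return nil
    }

    func main() {
        // A point measured 10 minutes ago, while the device had no uplink.
        ts := time.Now().Add(-10 * time.Minute).UnixNano()
        line := fmt.Sprintf("iface_speed,device=dev42,iface=wwan0 mbps=38.5 %d", ts)

        if err := writeRetrospective("http://127.0.0.1:8086", "telemetry", []string{line}); err != nil {
            log.Fatal(err)
        }
    }

Because each point carries its own timestamp, metrics buffered while the device was offline land in the database at the time they were actually measured, not at the time of upload.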

Network management system


What's next

So far, our weakest link is the central monitoring system. It is 99.9% implemented on a standard stack and has a number of disadvantages:

  1. InfluxDB loses data on a power outage. As a rule, the customer quickly pulls everything that arrives from the devices, so the database rarely holds data older than 5 minutes, but in the future this could become a pain.
  2. Grafana has a number of problems with data aggregation and display synchronization. The most common one: the database contains a time series with a 2-second interval starting, say, at 00:00:00, and Grafana starts aggregating the data from +1 second. As a result, the user sees a jumping graph.
  3. An excessive amount of code in the API integration with third-party monitoring systems. It could be made much more compact and, of course, rewritten in Go :)

I suppose all of you know perfectly well what Grafana looks like and know its problems without my help, so I will not overload the post with pictures.

Conclusion

I deliberately did not go into technical details here and only described the load-bearing structure of this system. First, a full technical description of the system would require another article. Second, not everyone will find it interesting. Write in the comments which technical details you would like to know more about.

If anyone has questions outside of this article, feel free to email me at a.rodin @ qedr.com

Source: habr.com
