Return the missing scooter, or the story of one IoT monitoring

A year ago, we launched a pilot version of a promotional project for decentralized rental of electric scooters.

Initially, the project was called Road-To-Barcelona, later it became Road-To-Berlin (hence R2B in the screenshots), and in the end it was simply called xRide.

The main idea of the project was the following: instead of a centralized car or scooter rental service (here "scooters" means electric mopeds, not kick scooters), we wanted to build a decentralized rental platform. We have written before about the difficulties we faced.

Initially, the project focused on cars, but due to deadlines, extremely long communication with manufacturers and a huge number of safety restrictions, electric scooters were chosen for the pilot.

The user installed an iOS or Android application on the phone, approached the scooter he liked, after which the phone and the scooter established a peer-to-peer connection, ETH was exchanged, and the user could start the trip by turning on the scooter through the phone. At the end of the trip, it was also possible to pay for the trip with Ethereum from the user's wallet on the phone.

In addition to scooters, the application also showed the user "smart chargers" - stations the user could visit to swap a dead battery themselves.

This is what our pilot, launched last September in two German cities, Bonn and Berlin, looked like.

[Image: the xRide pilot]

And then, early one morning in Bonn, our support team (stationed on site to keep the scooters in working condition) was woken by an alarm: one of the scooters had disappeared without a trace.

How to find and return it?

In this article, I will talk about this, but first, about how we built our own IoT platform and how we monitored it.

What and why to monitor: scooters, infrastructure, charging?

So, what did we want to monitor in our project?

First of all, the scooters themselves: electric scooters are quite expensive, and you cannot launch such a project without being sufficiently prepared, so you want to collect as much information about them as possible: their location, charge level, and so on.

In addition, we wanted to monitor the state of our own IT infrastructure - databases, services and everything they need to run. It was also necessary to monitor the status of the "smart chargers", in case they broke down or ran out of charged batteries.

Scooters

What were our scooters and what did we want to know about them?


The first and most important thing is the GPS coordinates: thanks to them, we can see where the scooters are and where they are moving.

Next is the battery charge: it lets us determine that a scooter is running low and send a juicer, or at least warn the user.

Of course, we also need to check what is happening with our hardware components:

  • does Bluetooth work?
  • does the GPS module itself work?
    • we also had a problem with the GPS sending incorrect coordinates and "sticking", which could only be detected by additional checks on the scooter, and we needed to notify support as soon as possible so they could fix the problem

And finally, checks of the software side, starting with the OS, processor load, network and disk usage, and ending with checks of our own, more project-specific modules (Jolocom, Keycloak).

Hardware


What did our hardware look like?

Given the extremely tight timeline and the need for rapid prototyping, we chose the option that was easiest to implement and source components for - the Raspberry Pi.
In addition to the RPi itself, we had a custom board (which we designed ourselves and ordered from China to speed up assembly of the final solution) and a set of components - a relay (to turn the scooter on/off), a battery charge reader, a modem, and antennas. All of this was packed compactly into a special "xRide box".

It should also be noted that the whole box was powered by an additional power bank, which in turn was powered by the main battery of the scooter.

This made it possible to monitor and turn on the scooter, even after the end of the trip, since the main battery was disconnected immediately after turning the ignition key to the "off" position.

Docker? Plain Linux? And deployment

Let's return to monitoring. So, Raspberry Pi - what do we have on it?

One of the first things we wanted to use to speed up the process of deploying, updating and delivering components to physical devices was Docker.

Unfortunately, it quickly became clear that although Docker works on the RPi, it adds a lot of overhead, in particular in terms of power consumption.

The difference compared to running on the "native" OS was not dramatic, but still enough to make us wary of draining the battery too quickly.

The second reason was one of our partners' libraries, written in Node.js (sic!) - the only component of the system that was not written in Go/C/C++.

The authors of the library did not have time to provide a working version in any of the "native" languages.

Not only is Node.js itself not the most elegant solution for low-powered devices, but the library itself was very resource-hungry.

We realized that with all our desire, using Docker would be too much overhead for us. The choice was made in favor of the native OS and work under it directly.

OS

As a result, we, again, chose the simplest option as the OS and used Raspbian (a Debian build for Pi).

We write all our software in Go, so we also wrote the main hardware agent module in our system in Go.

It is the agent that is responsible for working with GPS and Bluetooth, reading the charge, turning the scooter on, and so on.
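To give a feel for what the agent did, here is a minimal sketch of the relay-control part in Go. The go-rpio library, the pin number and the timings are assumptions for illustration only; our actual agent was structured differently and also handled GPS, Bluetooth and charge reading.

```go
package main

import (
	"log"
	"time"

	"github.com/stianeikeland/go-rpio/v4"
)

// relayPin is a hypothetical GPIO pin wired to the relay on our custom board.
const relayPin = 17

func main() {
	// Open memory-mapped GPIO access on the Raspberry Pi.
	if err := rpio.Open(); err != nil {
		log.Fatalf("cannot open GPIO: %v", err)
	}
	defer rpio.Close()

	pin := rpio.Pin(relayPin)
	pin.Output()

	// Power the scooter on for a ride and switch it off afterwards.
	// In the real agent this is driven by rental events, not a timer.
	pin.High()
	log.Println("scooter powered on")
	time.Sleep(30 * time.Minute)
	pin.Low()
	log.Println("scooter powered off")
}
```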

Deploy

The question immediately arose of how to deliver updates to the devices (OTA) - both updates to our agent/application itself and updates to the OS/"firmware" (since new versions of the agent could require kernel or system component updates, new libraries, etc.).

After quite a long market analysis, it turned out that there are quite a few solutions for delivering updates to the device.

They range from relatively simple utilities, mostly focused on updates/dual-boot, like swupd/SWUpdate/OSTree, to full-fledged platforms like Mender and Balena.

First of all, we decided that we were interested in end-to-end solutions, so the choice immediately fell on platforms.

Balena itself was excluded because it essentially uses the same Docker inside its balenaEngine.

But I will note that despite this, we ended up constantly using their product balenaEtcher for flashing firmware onto SD cards - a simple and extremely convenient utility for the job.

Therefore, in the end, the choice fell on Mender. Mender is a complete platform for building, delivering and installing firmware.

Overall the platform looks great, but it took us about a week and a half just to build the correct version of our firmware using the Mender builder.
And the more we immersed ourselves in the intricacies of its use, the more it became clear that it would take us much more time to fully deploy it than we had.

Alas, our tight deadlines meant that we were forced to abandon the use of Mender and opt for an even simpler path.

Ansible

The easiest solution in our situation was to use Ansible. A couple of playbooks were enough to get started.

They essentially boiled down to connecting from a host (the CI server) via SSH to our Raspberries and pushing updates to them.

At the very beginning, everything was simple - you had to be on the same network as the devices, and updates were delivered over Wi-Fi.

The office had a dozen or so test Raspberries connected to the same network, and each device had a static IP address listed in the Ansible inventory.

It was Ansible that delivered our monitoring agent to the end devices.
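Roughly, it looked like the sketch below: a static inventory (host name to IP, as described above) plus a playbook that copies a freshly built agent binary and restarts its service. The paths and the systemd unit name here are invented for illustration.

```yaml
# deploy-agent.yml - illustrative playbook, not our exact one
- hosts: scooters
  become: true
  tasks:
    - name: Copy the freshly built agent binary
      copy:
        src: build/xride-agent
        dest: /usr/local/bin/xride-agent
        mode: "0755"

    - name: Restart the agent service
      systemd:
        name: xride-agent
        state: restarted
        daemon_reload: true
```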

3G / LTE

Unfortunately, this use case for Ansible could only work in development mode before we had real scooters.

Because scooters, as you can imagine, do not sit connected to a single Wi-Fi router constantly waiting for updates over the network.

In the field, scooters have no connectivity other than mobile 3G/LTE (and even that is not constant).

This immediately imposes many problems and restrictions, such as low connection speed and unstable communication.

But most importantly, in a 3G/LTE network, we can't just rely on a static IP assigned on the network.

Some SIM card providers partially solve this; there are even special SIM cards designed for IoT devices with static IP addresses. But we did not have access to such SIM cards and could not rely on IP addresses.

Of course, there were ideas to register IP addresses, i.e. some kind of service discovery in something like Consul, but such ideas had to be abandoned: in our tests the IP address changed too often, which led to great instability.

For this reason, the most convenient way to deliver metrics was not a pull model, where we would go to the devices for the metrics we needed, but a push model, with metrics delivered from the device directly to the server.

VPN

As a solution to this problem, we chose a VPN - specifically WireGuard.

Clients (the scooters) connected to the VPN server at system startup, so the server retained the ability to reach them. This tunnel was used to deliver updates.


In theory, the same tunnel could be used for monitoring, but such a connection was more complicated and less reliable than a simple push.
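On the scooter side, the configuration boiled down to something like the following sketch (keys, addresses and the endpoint are placeholders); the PersistentKeepalive option is what keeps the NAT-ed LTE connection open so the server can still reach the device.

```ini
# /etc/wireguard/wg0.conf on the scooter (placeholder keys and addresses)
[Interface]
PrivateKey = <scooter-private-key>
Address    = 10.8.0.42/24

[Peer]
PublicKey           = <server-public-key>
Endpoint            = vpn.example.com:51820
AllowedIPs          = 10.8.0.0/24
# Keep the tunnel alive through NAT so the server can always reach the device
PersistentKeepalive = 25
```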

Cloud resources

The last thing: we also need to monitor our cloud services and databases. Since they run in Kubernetes, ideally deploying the monitoring into the cluster should be as simple as possible - ideally via Helm, since that is what we use for most deployments. And, of course, the cloud should be monitored with the same solutions as the scooters themselves.

Summing up

Phew, we've more or less covered the background - now let's list what we needed in the end:

  • A quick solution, since it is necessary to monitor already during the development process
  • Volume/quantity - need a lot of metrics
  • Logging is required
  • Reliability - data is critical to launch success
  • Cannot use pull model - need push
  • We need a unified monitoring of not only hardware, but also clouds

The final picture looked like this

[Image: the final monitoring architecture]

Stack selection

So, we are faced with the question of choosing a stack for monitoring.

First of all, we were looking for the most complete all-in-one solution that would simultaneously cover all our requirements, but at the same time be flexible enough to tailor its use to our needs. Still, we had a lot of restrictions imposed on us by hardware, architecture and timing.

There is a huge variety of monitoring solutions, from full-fledged systems like Nagios or Zabbix to ready-made fleet management solutions.


Initially, the latter seemed like an ideal solution for us, but some did not have full-fledged monitoring, others severely curtailed the features of their free versions, and still others simply did not cover our wish list or were not flexible enough to fit our scenarios. Some were simply outdated.

After analyzing a number of such solutions, we quickly came to the conclusion that it would be easier and faster to build a similar stack on our own. Yes, it would be a little more difficult than deploying a full-fledged fleet management platform, but we would not have to compromise.

Almost certainly, somewhere in this huge abundance of solutions there are ready-made ones that would have suited us completely, but in our case it was much faster to assemble a stack on our own and tailor it to our needs than to test finished products.

With all this, we did not seek to assemble a complete platform for monitoring on our own, but were looking for the most functional "ready-made" stacks, only with the ability to flexibly configure them.

(B)ELK?

The first solution that was actually considered was the well-known ELK stack.
In fact, it should be called BELK, because everything starts with Beats (https://www.elastic.co/what-is/elk-stack).


Of course, ELK is one of the most famous and powerful solutions in the field of monitoring, and just as much so for collecting and processing logs.

We assumed that ELK would be used for collecting logs, as well as for long-term storage of metrics obtained from Prometheus.

Grafana could be used for visualization.

In fact, a recent ELK stack can collect metrics on its own (Metricbeat), and Kibana can also display them.

But still, initially ELK grew out of logs and so far the metrics functionality has a number of serious drawbacks:

  • Significantly slower than Prometheus
  • Integrates with far fewer systems than Prometheus
  • It is difficult to set up alerting on them
  • Metrics take up a lot of space
  • Setting up metric dashboards in Kibana is much more difficult than in Grafana

In general, metrics in ELK are heavyweight and not yet as convenient as in other solutions, of which there are now far more than just Prometheus: other TSDBs, VictoriaMetrics, Cortex, and so on. Of course, I would have loved to have a full-fledged all-in-one solution right away, but in the case of Metricbeat there were too many compromises.

And the ELK stack itself has a number of pain points:

  • It is heavy, sometimes very heavy, if you have a fairly large amount of data
  • You need to know how to "cook" it - it has to be scaled, which is not trivial
  • Stripped-down free version - the free version has no proper alerting, and at the time of our selection it had no authentication either

I must say that the last point has improved recently: in addition to X-Pack (including authentication) being moved to open source, the pricing model itself began to change.

But at the moment when we were about to deploy this solution, there was no alerting at all.
Perhaps we could have tried building something with ElastAlert or other community solutions, but we still decided to consider other alternatives.

Loki-Grafana-Prometheus

Nowadays, a monitoring stack can be built on pure Prometheus as a metrics provider, Loki for logs, and Grafana for visualization.

Unfortunately, at the time the pilot launched (September-October 2019), Loki was still in beta (versions 0.3-0.4), and at the time development started it could not be considered a production solution at all.

I don't have real experience using Loki in serious projects yet, but I can say that Promtail (the log-shipping agent) works great both on bare metal and for pods in Kubernetes.

TICK

Perhaps the most worthy (the only?) full-featured alternative to the ELK stack today is the TICK stack: Telegraf, InfluxDB, Chronograf, Kapacitor.


I'll describe all the components below in more detail, but the general idea is this:

  • Telegraf - agent for collecting metrics
  • InfluxDB - Metrics Database
  • Kapacitor - real-time metrics processor for alerting
  • Chronograf - web dashboard for visualization

There are official helm charts for InfluxDB, Kapacitor and Chronograf that we used to deploy them.
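For reference, deploying them looked roughly like this (Helm 3 syntax, chart names from the influxdata repository as it existed for the 1.x line; the namespace and any values are illustrative):

```sh
helm repo add influxdata https://helm.influxdata.com/
helm repo update

helm install influxdb   influxdata/influxdb   --namespace monitoring
helm install kapacitor  influxdata/kapacitor  --namespace monitoring
helm install chronograf influxdata/chronograf --namespace monitoring
```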

It should be noted that in the latest version of Influx 2.0 (beta), Kapacitor and Chronograf have become part of InfluxDB and no longer exist separately.

Telegraf


Telegraf is a very lightweight agent for collecting metrics on the end machine.

It can monitor a huge variety of things, up to and including a Minecraft server.

It has a number of cool benefits:

  • Fast and lightweight (written in Go)
    • Eats a minimum amount of resources
  • Pushes metrics by default
  • Collects all necessary metrics
    • System metrics without any settings
    • Hardware metrics such as sensor readings
    • Very easy to add custom metrics
  • Lots of plugins out of the box
  • Collects logs

Since pushing metrics was essential for us, all the other benefits were more than welcome additions.

Collection of logs by the agent itself is also very convenient, since there is no need to connect additional utilities for tailing logs.

Influx offers the best logging experience if you use syslog.
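A trimmed-down sketch of what such an agent-side Telegraf config might look like is shown below; the InfluxDB address, database name and the custom metrics script are invented for illustration.

```toml
# /etc/telegraf/telegraf.conf (illustrative values)
[agent]
  interval = "10s"        # how often inputs are collected
  flush_interval = "10s"  # how often metrics are pushed out

# Push everything straight to InfluxDB - nothing pulls from the device
[[outputs.influxdb]]
  urls     = ["http://influx.example.com:8086"]
  database = "xride"

# System metrics come out of the box
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.disk]]
[[inputs.net]]

# Custom scooter metrics via an external command printing line protocol
[[inputs.exec]]
  commands    = ["/usr/local/bin/xride-agent --metrics"]
  data_format = "influx"

# Logs from the device, collected as syslog
[[inputs.syslog]]
  server = "tcp://127.0.0.1:6514"
```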

Telegraf is generally a great agent for collecting metrics, even if you don't use the rest of the ICK stack.

Many people combine it with ELK or various other time-series databases, since it can write metrics almost anywhere.

InfluxDB


InfluxDB is the core of the TICK stack, namely a time-series database for metrics.
In addition to metrics, Influx can also store logs, although for it logs are essentially just more metrics, except that instead of the usual numeric values the main field holds a line of log text.

InfluxDB is also written in Go and seems to run much faster than ELK on our (not the most powerful) cluster.

Among the cool advantages of Influx I would also mention a very convenient and rich API for querying data, which we used very actively.
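As an illustration of that API, here is roughly how the latest battery charge per scooter could be fetched from Go using the official 1.x client; the address, database name and measurement schema are invented for this example.

```go
package main

import (
	"fmt"
	"log"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	// Connect to the InfluxDB 1.x HTTP API (the address is a placeholder).
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://influx.example.com:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Last known battery charge for every scooter (hypothetical schema).
	q := client.NewQuery(`SELECT last("charge") FROM "battery" GROUP BY "scooter"`, "xride", "s")
	resp, err := c.Query(q)
	if err != nil {
		log.Fatal(err)
	}
	if resp.Error() != nil {
		log.Fatal(resp.Error())
	}

	for _, series := range resp.Results[0].Series {
		fmt.Println(series.Tags["scooter"], series.Values[0])
	}
}
```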

Drawbacks - $$$ or scaling?

The TICK stack has only one drawback that we found - it is expensive. Very expensive, in fact.

What's in the paid version that the free version doesn't?

As far as we were able to understand, the only difference between the paid version of the TICK stack and the free version is the scaling capabilities.

Namely, you can run a high-availability cluster only in the Enterprise version.

If you want full-fledged HA, you either have to pay or build some workarounds. There are a couple of community solutions - for example, influxdb-ha, which looks like a smart solution but explicitly states that it is not suitable for production, and influx-spout, a simple solution that pumps data through NATS (it will also have to be scaled, but that is solvable).

It's a pity, but both of them seem to be abandoned - there are no fresh commits. I assume the reason is the upcoming release of the new Influx 2.0, in which much will be different (so far there is no information about scaling in it).

Officially, the free version has InfluxDB Relay - in fact, this is a primitive form of HA achieved only through balancing, since all data is written to every InfluxDB instance behind the load balancer. It has some shortcomings, such as potential problems with overwriting points and the need to create databases for metrics in advance (which happens automatically during normal work with InfluxDB).

Besides, sharding is not supported, which means additional overhead for duplicated metrics (both processing and storage) that you might not need, with no way to split them across instances.

VictoriaMetrics?

As a result, even though in every respect other than paid scaling the TICK stack suited us completely, we decided to see whether there were any free solutions that could replace the InfluxDB database while keeping the rest of the T_CK components.


There are many time-series databases, but the most promising one is VictoriaMetrics, which has a number of advantages:

  • Fast and lightweight, at least according to benchmark results
  • There is a cluster version, which even has good reviews now
    • It can shard
  • Supports the InfluxDB protocol

We were not planning to build a completely custom stack around Victoria; the main hope was that we could use it as a drop-in replacement for InfluxDB.

Unfortunately, this is not possible: although the InfluxDB protocol is supported, it only works for writing metrics - for reading, only the Prometheus API is exposed, which means Chronograf will not work with it.

Moreover, only numeric values are supported for metrics (we used string values for custom metrics - more on that in the "Admin" section).

Obviously, for the same reason, VictoriaMetrics cannot store logs the way Influx does.

Also, it should be noted that at the time we were searching for the optimal solution, VictoriaMetrics was not yet so popular, the documentation was much thinner, and the functionality was weaker (I don't recall a detailed description of the cluster version and sharding).

Database selection

As a result, it was decided that for the pilot we would still limit ourselves to a single InfluxDB node.

There were several main reasons for this choice:

  • We really liked the functionality of the TICK stack in its entirety
  • We already managed to deploy it and it worked great
  • Deadlines were running out and there was not much time left to test other options
  • We did not expect such a large load

We did not have many scooters for the first phase of the pilot, and testing during development did not reveal any performance problems.

Therefore, we decided that for this project, one Influx node would be enough for us without the need for scaling (see conclusions at the end).

We decided on the stack and base - now about the remaining components of the TICK stack.

Kapacitor


Kapacitor is part of the TICK stack: a service that can watch metrics entering the database in real time and perform various actions based on rules.

In general, it is positioned as a tool for potential anomaly tracking and machine learning (I'm not sure if these functions are in demand), but the most popular use case for it is more banal - it's alerting.

That is what we used it for. We set up Slack notifications for when a particular scooter went offline, and did the same for the smart chargers and important infrastructure components.
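The "scooter stopped reporting" rule can be expressed with Kapacitor's built-in deadman helper; the sketch below is illustrative, with made-up measurement, tag and channel names (and it assumes a Slack handler is configured in Kapacitor).

```
// TICKscript sketch: alert when a scooter stops sending metrics
stream
    |from()
        .measurement('scooter_heartbeat')
        .groupBy('scooter')
    // No points for 5 minutes => the scooter is considered dead
    |deadman(0.0, 5m)
        .stateChangesOnly()      // notify on dead/alive transitions only
        .slack()
        .channel('#xride-monitoring')
```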


This allowed us to quickly respond to problems, as well as receive notifications that everything was back to normal.

A simple example: if the additional battery powering our "box" breaks down or runs out for some reason, then after installing a new one we should, a little later, receive a notification that the scooter is operational again.

In Influx 2.0, Kapacitor became part of the database itself.

Chronograf


I have seen many different UI solutions for monitoring, but I can say that in terms of functionality and UX, nothing compares to Chronograf.

Oddly enough, we started using the TICK stack with Grafana as the web interface.
I will not describe its functionality; everyone knows its broad capabilities for setting up almost anything.

However, Grafana is still a very versatile tool, while Chronograf is mainly tailored for use with Influx.

And of course, thanks to this, Chronograf can afford much more tricky or convenient functionality.

Perhaps the main convenience of working with Chronograf is that you can view the insides of your InfluxDB through Explore.

It would seem that Grafana has nearly identical functionality, but in reality a dashboard in Chronograf can be set up with a few mouse clicks (while looking at the visualization right there), whereas in Grafana you will sooner or later have to edit the JSON configuration (Chronograf, of course, also lets you upload hand-made dashboards and edit them as JSON if needed - but I have never had to touch one once it was created in the UI).

Kibana has much richer opportunities for creating dashboards and controls for them, but the UX for such operations is very complicated.

It takes a good deal of effort to create a convenient dashboard for yourself there. And although Chronograf has fewer dashboard options, they are much easier to create and configure.

The dashboards themselves, in addition to a pleasant visual style, actually do not differ from dashboards in Grafana or Kibana:

[Image: example dashboards]

This is what the query window looks like:

[Image: the query window]

It is also important to note that, knowing the field types in the InfluxDB database, Chronograf can sometimes automatically help you write a query or choose the right aggregation function, such as mean.

And of course, Chronograf is the most convenient for viewing logs. It looks like this:

[Image: the log view in Chronograf]

By default, logs in Influx are geared towards syslog, and therefore they have an important parameter - severity.

The graph at the top is especially useful: you can see errors as they occur, and the color immediately makes it clear when the severity is higher.

A couple of times in this way we caught important bugs by going to the logs for the last week and seeing a red spike.

Of course, ideally we would have set up alerting for such errors, since we already had everything needed for that.

We even turned it on for a while, but while preparing the pilot it turned out that we were getting quite a lot of errors (including system ones, such as LTE network unavailability), which spammed the Slack channel too much without bringing much benefit.

The correct solution would be to process most of these types of errors, set the severity for them, and only then enable alerting.

That way, only new or important bugs would get into Slack. Given the tight deadlines, there was simply not enough time for such a setup.

Authentication

Separately, it is worth mentioning that Chronograf supports OAuth and OIDC as authentication.

This is very convenient, as it allows you to easily tie it to your authentication server and get full-fledged SSO.

In our case, that server was Keycloak - it was used to log in to monitoring, and the same server was also used to authenticate the scooters and back-end requests.

“Admin”

The last component I will describe is our self-written "admin panel" in Vue.
In general, this is just a separate service that displays information about scooters simultaneously from our own databases, microservices, and metrics data from InfluxDB.

In addition, many administrative functions have been moved there, such as an emergency reboot or remote opening of the lock for the support team.

There were also maps. I already mentioned that we started with Grafana instead of Chronograf, because map plugins are available for Grafana on which you can view scooter coordinates. Unfortunately, the capabilities of Grafana's map widgets are very limited, so it turned out to be much easier to write our own web application with maps in a few days - not only to see the current coordinates, but also to display the route a scooter had taken, filter the data on the map, and so on (all the functionality that we could not set up in a simple dashboard).

One of the already mentioned pros of Influx is the ability to easily create your own metrics.
This allows it to be used in a wide variety of scenarios.

We tried to record all the useful information there: battery charge, lock status, sensor health, Bluetooth, GPS, and many other health checks.
We displayed all of this in the admin panel.

Of course, the most important criterion for us was the state of the scooter - in fact, Influx checks this itself and shows it in the Nodes section with "green lights".

This is handled by the deadman function - we used it to check whether our box was alive and to send those same alerts to Slack.

By the way, we named the scooters after characters from The Simpsons - it made them easy to tell apart.

And yes, it was more fun that way. Phrases like "Guys, Smithers is dead!" were heard constantly.


String metrics

It is important that InfluxDB allows you to store more than just numeric values, unlike VictoriaMetrics.

It would seem that this is not so important - after all, apart from logs, any metric can be stored as a number (just add a mapping for the known states, a kind of enum), right?

In our case, there was at least one scenario where string metrics were very useful.
It just so happened that the supplier of our "smart chargers" was a third party; we had no control over the development process or over what information these chargers could provide.

As a result, the charger API was far from ideal, but the main problem was that we could not always understand their state.

This is where Influx came to the rescue. We simply wrote the string status that came to us into an InfluxDB field without any changes.
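Writing such a status is just an ordinary point write with a string field. A sketch with the official 1.x Go client is shown below; the database, measurement and tag names are invented for the example.

```go
package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb1-client/v2"
)

// writeChargerStatus stores the raw status string from the charger API as-is.
// Database, measurement and tag names are illustrative, not our real schema.
func writeChargerStatus(c client.Client, chargerID, status string) error {
	bp, err := client.NewBatchPoints(client.BatchPointsConfig{
		Database:  "xride",
		Precision: "s",
	})
	if err != nil {
		return err
	}

	pt, err := client.NewPoint(
		"charger_status",
		map[string]string{"charger": chargerID},
		map[string]interface{}{"status": status}, // string field: "online", "offline", "disconnected", ...
		time.Now(),
	)
	if err != nil {
		return err
	}
	bp.AddPoint(pt)

	return c.Write(bp)
}

func main() {
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: "http://influx.example.com:8086"})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	if err := writeChargerStatus(c, "charger-07", "disconnected"); err != nil {
		log.Fatal(err)
	}
}
```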

For a while, only values like "online" and "offline" ended up there; information in our admin panel was displayed based on them, and notifications were sent to Slack. At some point, however, values like "disconnected" started showing up as well.

As it turned out later, this status was sent once after the connection was lost, if the charger could not establish a connection with the server after a certain number of attempts.

Thus, if we had used only a fixed set of values, we might not have noticed this change in the firmware at the right time.

In general, string metrics open up many more possibilities - you can write virtually any information into them. Although, of course, this tool should be used carefully.


In addition to the usual metrics, we also recorded information about the GPS location in InfluxDB. It was incredibly handy for monitoring the location of scooters in our admin panel.
In fact, we always knew which scooter was where at any moment we needed.
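In line-protocol terms this was just another measurement, something like the following (the coordinates and timestamps are made up):

```
gps,scooter=ralph    lat=50.7374,lon=7.0982,speed=12.5 1569963600000000000
gps,scooter=smithers lat=52.5208,lon=13.4095,speed=0.0 1569963610000000000
```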

It was very useful to us when we were looking for a scooter (see conclusions at the end).

Infrastructure monitoring

In addition to the scooters themselves, we needed to monitor our entire (rather extensive) infrastructure.

A very generalized architecture looked something like this:

[Image: generalized architecture]

If we single out just the monitoring stack, it looks like this:

[Image: the monitoring stack]

What we wanted to check in the cloud:

  • Databases
  • Keycloak
  • Microservices

Since all our cloud services are in Kubernetes, it would be nice to collect information about its state.

Fortunately, Telegraf can collect a huge amount of metrics about the state of the Kubernetes cluster out of the box, and Chronograf immediately offers beautiful dashboards for this.

We mainly monitored pod health and memory consumption. In case of a crash, we got alerts in Slack.

There are two ways to track pods in Kubernetes: DaemonSet and Sidecar.
Both methods are described in detail in this blog post.

We used Telegraf Sidecar and collected pod logs in addition to metrics.
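Schematically, the sidecar variant is just a second container in the pod template that ships metrics and logs for its pod; the fragment below is illustrative, with made-up image tags and ConfigMap names.

```yaml
# Fragment of a Deployment pod template with a Telegraf sidecar (illustrative)
spec:
  containers:
    - name: my-service           # the application container itself
      image: registry.example.com/xride/my-service:latest
    - name: telegraf             # sidecar that ships metrics/logs for this pod
      image: telegraf:1.14
      volumeMounts:
        - name: telegraf-config
          mountPath: /etc/telegraf
  volumes:
    - name: telegraf-config
      configMap:
        name: telegraf-sidecar-config
```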

In our case, we had to tinker with the logs. Although Telegraf can pull logs from the Docker API, we wanted log collection to be uniform with our end devices, so we set up syslog for the containers. Perhaps this solution was not elegant, but it worked without complaints and the logs were displayed nicely in Chronograf.

Monitor monitoring???

In the end, the age-old question of monitoring monitoring systems arose, but fortunately, or unfortunately, we simply did not have enough time for this.

Although Telegraf can easily report its own metrics, or collect metrics from the InfluxDB instance itself and send them either to the same Influx or somewhere else.

Conclusions

What conclusions did we draw from the results of the pilot?

How can monitoring be done

First of all, the TICK stack fully met our expectations, and gave us even more opportunities than we originally expected.

All the functionality that we needed was present. Everything we did with it worked without problems.

Performance

The main problem with the TICK stack in the free version is the lack of scaling capabilities. For us, this was not a problem.

We didn't collect exact load data/figures, but we did collect data from about 30 scooters at the same time.

Each of them collected more than three dozen metrics. At the same time, logs were collected from devices. The collection and sending of data occurred every 10 seconds.

It is important to note that after a week and a half of the pilot, when most of the "teething problems" had been fixed and the most important issues resolved, we had to reduce the frequency of sending data to the server to once every 30 seconds. This became necessary because the data allowance on our LTE SIM cards was melting away quickly.

Logs ate up the bulk of the traffic; the metrics themselves, even at a 10-second interval, consumed practically none.

As a result, after some more time, we completely disabled the collection of logs on devices, since specific problems were already obvious even without constant collection.

In some cases, if viewing the logs was still necessary, we simply connected to the device over the WireGuard VPN.

I will also add that each environment was separated from the others, and the load described above applied only to the production environment.

In the development environment, we had a separate InfluxDB instance raised that continued to collect data every 10 seconds and we did not run into any performance problems.

TICK is ideal for small to medium projects

Based on this information, I would conclude that the TICK stack is ideal for relatively small projects or projects that definitely do not expect any HighLoad.

Unless you have thousands of pods or hundreds of machines, even a single InfluxDB instance can handle the load just fine.

In some cases, you may be satisfied with Influx Relay as a primitive High Availability solution.

And, of course, no one prevents you from setting up "vertical" scaling and simply allocating different servers for different types of metrics.

If you are not sure about the expected load on monitoring services, or you are guaranteed to have / will have a very "heavy" architecture, I would not recommend using the free version of the TICK stack.

Of course, a simple solution would be to buy InfluxDB Enterprise - but I can't really comment on it, since I am not familiar with the details myself. Besides, it is very expensive and definitely not suitable for small companies.

In this case, today, I would recommend looking towards collecting metrics through Victoria Metrics and logs using Loki.

True, I will note again that Loki/Grafana are much less convenient (due to their greater versatility) than the ready-made TICK stack, but they are free.

Important: everything described here applies to Influx 1.8; at the moment, Influx 2.0 is about to be released.

I have not yet had a chance to try it in production conditions, so it is difficult to draw conclusions about the improvements, but the interface has definitely become even better, the architecture has been simplified (Kapacitor and Chronograf are no longer separate components), and templates have appeared (a "killer feature": you can track Fortnite players and get notified when your favorite player wins a game). But, unfortunately, at the moment version 2 lacks the key thing we chose the first version for - there is no log collection.

This functionality will also appear in Influx 2.0, but we could not find any dates, even approximate ones.

How not to make IoT platforms (now)

In the end, to launch the pilot we assembled our own full-fledged IoT stack, for lack of an alternative that met our standards.

However, OpenBalena has recently become available in beta - it's a pity it wasn't around when we started the project.

We are completely satisfied with the end result and the Ansible + TICK + WireGuard platform we assembled ourselves. But today I would recommend taking a closer look at Balena before trying to build your own IoT platform.

Because, in the end, it can do most of what we did, while OpenBalena is free and the code is open.

It can not only roll out updates - a VPN is already built in and tuned for use in an IoT environment.

And more recently, they even released their own Hardware, which easily connects to their ecosystem.

Hey, what about the missing scooter?

So the scooter, Ralph, disappeared without a trace.

We immediately ran to look at the map in our "admin panel", with GPS metrics data from InfluxDB.

Thanks to the monitoring data, we easily determined that the scooter left the parking lot at about 21:00 the previous day, drove for about half an hour to some area and was parked until 5 in the morning next to some German house.

After 5 a.m. no monitoring data was received - this meant either that the additional battery was completely discharged, or that the thief had figured out how to remove the smart electronics from the scooter.
Despite this, the police were still called to the address where the scooter was located. The scooter was not there.

However, the owner of the house was just as surprised, since he really had ridden this scooter home from the office the night before.

As it turned out, one of the support staff had arrived early in the morning, found the scooter with a completely dead extra battery, and walked it back to the parking lot. And the extra battery had failed because of moisture.

We had stolen the scooter from ourselves. By the way, I don't know how or who later sorted out the police case, but the monitoring worked perfectly...

Source: habr.com
