Anger, bargaining and depression when working with InfluxDB

If you use a time series database (wiki) as the main storage for a statistics site, you can end up with a pile of headaches instead of a solution. I work on a project that uses such a database, and InfluxDB, which this article is about, sometimes presented completely unexpected surprises.

Disclaimer: the issues described here apply to InfluxDB 1.7.4.

Why time series?

The project tracks transactions in various blockchains and displays statistics. Specifically, we look at the emission and burning of stablecoins (wiki). Based on these transactions, we need to build charts and show pivot tables.

While analyzing the transactions, an idea came up: use the InfluxDB time series database as the main storage. Transactions are points in time, and they fit the time series model well.

The aggregation functions also looked very convenient: they are ideal for building charts over long periods. The user wants a chart for a year, while the database holds a data set with a five-minute timeframe. Sending all one hundred thousand points to the user is pointless: besides taking a long time to process, they would not fit on the screen. You can either write your own timeframe-upscaling logic or use the aggregation functions built into Influx. With them, you can group the data by day and send the desired 365 points.
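As a sketch, a downsampling query of this kind might look as follows (the measurement and field names are borrowed from the examples later in this article):

```sql
-- Group five-minute points into daily averages: 365 points for a one-year chart
SELECT MEAN("supply") FROM "transactions"
WHERE time >= now() - 365d
GROUP BY time(1d), "symbol"
```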

It was slightly concerning that such databases are usually used to collect metrics: server monitoring, IoT devices, anything that streams millions of points of the form [<time> - <metric value>]. But if the database handles a large data stream well, why would a small volume cause problems? With this thought, we took InfluxDB into the project.

What else is convenient in InfluxDB

In addition to the aforementioned aggregation functions, there is another great feature: continuous queries (doc). This is a scheduler built into the database that processes data on a schedule. For example, every 24 hours you can group all the records for the day, compute the average, and write one new point to another table, without reinventing the wheel.

There are also retention policies (doc): a setting that deletes data after a certain period. This is useful when, for example, you need to store the CPU load for a week with one-second measurements, but over a span of a couple of months such accuracy is not needed. In that situation you can do this:

  1. create a continuous query to aggregate data into another table;
  2. for the first table, define a policy for deleting metrics that are older than that same week.

And Influx will reduce the size of the data and delete unnecessary data on its own.
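The two steps above could be sketched in InfluxQL roughly like this (the database, measurement, and field names are illustrative, not from the project):

```sql
-- 1. Aggregate per-second CPU points into per-minute averages in another table
CREATE CONTINUOUS QUERY "cq_cpu_1m" ON "metrics"
BEGIN
  SELECT MEAN("usage") INTO "cpu_usage_1m" FROM "cpu_usage" GROUP BY time(1m)
END

-- 2. Keep the raw per-second points for one week only
CREATE RETENTION POLICY "one_week" ON "metrics" DURATION 7d REPLICATION 1 DEFAULT
```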

About stored data

Not much data is stored: about 70 thousand transactions plus another million points with market information. New entries arrive at no more than 3,000 points per day. There are also site metrics, but there is little data there and, per the retention policy, it is kept for no more than a month.

Problems

During development and subsequent testing of the service, more and more critical problems surfaced in InfluxDB's operation.

1. Deleting data

There is a series of data with transactions:

SELECT time, amount, block, symbol FROM transactions WHERE symbol='USDT'

Result:

[screenshot: query result]

I send a command to delete data:

DELETE FROM transactions WHERE symbol='USDT'

Then I query the data that was just deleted. And instead of an empty response, Influx returns a portion of the data that should have been removed.

I'm trying to delete the entire table:

DROP MEASUREMENT transactions

I check the deletion of the table:

SHOW MEASUREMENTS

The table is gone from the list, but a new data query still returns the same set of transactions.

The problem came up only once for me, since deletion is an isolated case. But this behavior clearly does not fit the definition of "correct" operation. Later I found an open ticket on GitHub, almost a year old, on this topic.

In the end, deleting and then restoring the entire database is what helped.

2. Floating point numbers

Mathematical calculations using InfluxDB's built-in functions produce precision errors. Not that this is unusual, but it is unpleasant.

In my case, the data has a financial component and I would like to process it with high accuracy. Because of this, we plan to abandon continuous queries.
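The underlying issue is ordinary binary floating point, not something specific to Influx. A minimal Python illustration of why float aggregates are risky for financial data, and why an exact decimal type helps:

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 exactly, so sums drift:
float_total = sum([0.1] * 10)
print(float_total)    # 0.9999999999999999, not 1.0

# An exact decimal type keeps financial sums precise:
decimal_total = sum([Decimal("0.1")] * 10)
print(decimal_total)  # 1.0
```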

3. Continuous queries cannot be adapted to different time zones

The service has a table with daily transaction statistics. For each day, all of that day's transactions must be grouped together. But each user's day starts at a different time, and therefore the set of transactions differs. There are 37 UTC offsets for which the data needs to be aggregated.

In InfluxDB, when grouping by time, you can additionally specify a shift, for example, for Moscow time (UTC + 3):

SELECT MEAN("supply") FROM transactions GROUP BY symbol, time(1d, 3h) fill(previous)

But the query result will be incorrect. For some reason, the data grouped by day will start as far back as 1677 (the earliest year InfluxDB officially supports):

[screenshot: query result with day buckets starting in 1677]

To work around this problem, the service was temporarily switched to UTC+0.
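InfluxQL also has a tz() clause that shifts the time(1d) bucket boundaries to a named IANA time zone. I have not verified whether it avoids the 1677 problem in 1.7.4, but it may be worth trying before hard-coding an offset:

```sql
-- tz() converts grouping boundaries to the given time zone
SELECT MEAN("supply") FROM transactions
GROUP BY symbol, time(1d) fill(previous) tz('Europe/Moscow')
```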

4. Performance

There are many benchmarks on the Internet comparing InfluxDB with other databases. At first glance they looked like marketing materials, but now I think there is some truth in them.

Let me tell you my case.

The service provides an API method that returns statistics for the last 24 hours. While computing them, the method queries the database three times with the following queries:

SELECT * FROM coins_info WHERE time <= NOW() GROUP BY symbol ORDER BY time DESC LIMIT 1

SELECT * FROM dominance_info ORDER BY time DESC LIMIT 1

SELECT * FROM transactions WHERE time >= NOW() - 24h ORDER BY time DESC

Explanation:

  1. The first request gets the latest point for each coin with market data: eight points for eight coins in my case.
  2. The second request gets the single newest point.
  3. The third requests the list of transactions for the last 24 hours; there may be several hundred of them.

Note that InfluxDB automatically builds an index on tags and on time, which speeds up queries. In the first query, symbol is a tag.
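For reference, in InfluxDB line protocol a tag is written before the space and fields after it, so here symbol is indexed while amount and block are not (the values are made up for illustration):

```
transactions,symbol=USDT amount=1000000,block=9000000i 1571967208000000000
```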

I ran a stress test on this API method. At 25 RPS, the server showed full load on six CPUs:

[screenshot: CPU load during the stress test]

Meanwhile, the NodeJS process produced almost no load at all.

Execution speed degraded already at 7-10 RPS: where one client could get a response in 200 ms, 10 clients had to wait a second. 25 RPS was the threshold beyond which stability suffered and clients started receiving 500 errors.

With such performance, it is impossible to use Influx in our project. Moreover, in a project where monitoring must be shown to many clients, similar problems may appear and the metrics server will be overloaded.

Conclusions

The most important takeaway from this experience: you cannot bring an unfamiliar technology into a project without sufficient analysis. A simple scan of the open tickets on GitHub could have provided enough information not to pick InfluxDB as the main data store.

InfluxDB seemed like a good fit for my project's tasks, but as practice has shown, this database does not meet our needs and breaks in many ways.

Version 2.0.0-beta can already be found in the project repository; one can only hope that the second version brings significant improvements. In the meantime, I am off to read the TimescaleDB documentation.

Source: habr.com
