Go optimizations in VictoriaMetrics. Alexander Valyalkin

I suggest reading the transcript of Alexander Valyalkin's late-2019 talk "Go optimizations in VictoriaMetrics".

VictoriaMetrics is a fast and scalable DBMS for storing and processing data in the form of time series (each record is a timestamp and a set of values corresponding to that time, obtained, for example, by periodically polling sensor status or collecting metrics).

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Here is a link to the video of this report - https://youtu.be/MZ5P21j_HLE

Slideshow

Go optimizations in VictoriaMetrics. Alexander Valyalkin

A few words about myself. I am Alexander Valyalkin. Here is my GitHub account. I'm passionate about Go and performance optimization. I've written a lot of libraries, some more useful than others. They start with either the fast or the quick prefix.

I am currently working on VictoriaMetrics. What is it and what am I doing there? I will talk about this in this presentation.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

The outline of the report is as follows:

  • First, I will tell you what VictoriaMetrics is.
  • Then I will tell you what time series are.
  • Then I will tell you how the time series database works.
  • Next, I will talk about the architecture of the database: what it consists of.
  • And then let's move on to the optimizations that are in VictoriaMetrics. These are an optimization for the inverted index and an optimization for the bitset implementation in Go.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Does anyone in the audience know what VictoriaMetrics is? Wow, a lot of people already know. That's good news. For those who don't, it is a database for time series. It is based on the ClickHouse architecture and on some ClickHouse implementation details, such as MergeTree, parallel computation on all available CPU cores, and performance optimization by working on blocks of data that fit into the CPU cache.

VictoriaMetrics provides the best data compression compared to other time series databases.

It scales vertically, that is, you can add more processors and more RAM on a single computer. VictoriaMetrics will successfully utilize these available resources, and its performance will improve nearly linearly.

Also, VictoriaMetrics scales horizontally β€” that is, you can add additional nodes to the VictoriaMetrics cluster, and its performance will grow almost linearly.

As you might have guessed, VictoriaMetrics is a fast database, because I can't write any other kind. And it's written in Go, which is why I'm talking about it at this meetup.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Who knows what a time series is? Quite a lot of people again. A time series is a series of (timestamp, value) pairs, sorted by time. The value is a floating point number, float64.

Each time series is uniquely identified by a key. What is this key? It consists of a non-empty set of key-value pairs.

Here is an example of a time series. The key of this series is the list of pairs: __name__="cpu_usage" is the name of the metric, instance="my-server" is the computer on which this metric is collected, datacenter="us-east" - this is the data center where this computer is located.

We get a time series key consisting of three key-value pairs. This key corresponds to a list of (timestamp, value) pairs: t1, t2, t3, ..., tN are timestamps, and 10, 20, 12, ..., 15 are the corresponding values, i.e. the current cpu_usage readings for the given series.
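To make the structure concrete, here is a minimal Go sketch of such a time series. The type and field names are purely illustrative and are not VictoriaMetrics internals.

```go
package main

import "fmt"

// Sample is one (timestamp, value) point; the value is a float64,
// and the timestamp is assumed to be in milliseconds here.
type Sample struct {
	Timestamp int64
	Value     float64
}

// TimeSeries is identified by a non-empty set of label=value pairs
// and holds samples sorted by timestamp.
type TimeSeries struct {
	Labels  map[string]string
	Samples []Sample
}

func main() {
	ts := TimeSeries{
		Labels: map[string]string{
			"__name__":   "cpu_usage",
			"instance":   "my-server",
			"datacenter": "us-east",
		},
		Samples: []Sample{{1, 10}, {2, 20}, {3, 12}, {4, 15}},
	}
	fmt.Printf("%s: %d samples\n", ts.Labels["__name__"], len(ts.Samples))
}
```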

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Where can time series be used? Does anyone have any idea?

  • In DevOps, you can measure CPU, RAM, network, rps, errors, etc.
  • IoT - we can measure temperature, pressure, geo coordinates and something else.
  • Also finance - we can monitor prices for all sorts of stocks and currencies.
  • In addition, time series can be used in monitoring production processes in factories. We have users who use VictoriaMetrics to monitor wind turbines and robots.
  • Also, time series are useful for collecting information from sensors of various devices. For example, for the engine; to measure tire pressure; to measure speed, distance; for measuring gasoline consumption, etc.
  • Also, time series can be used to monitor aircraft. Each aircraft has a black box that collects time series for different aircraft health parameters. Time series are also used in the aerospace industry.
  • Healthcare: blood pressure, pulse, etc.

Maybe there are still applications that I forgot about, but I hope you understand that time series are actively used in the modern world. And their use is growing every year.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

What is a time series database for? Why can't you use a regular relational database to store time series?

Because time series usually involve a huge amount of data that is difficult to store and process in conventional databases. That is why specialized time series databases appeared. These databases efficiently store (timestamp, value) points for a given key. They provide an API for reading the stored data by key: by a single key-value pair, by several such pairs, or by a regexp. For example, if you want to find the CPU load of all your services in a data center in America, you would use a pseudo-query like this.

Time series databases typically come with specialized query languages, because SQL is not well suited for time series. Although there are databases that support SQL, it is not a very good fit. Query languages such as PromQL, InfluxQL, Flux, and Q are better suited. I hope someone has heard of at least one of these languages. Many of you have probably heard of PromQL, the Prometheus query language.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Here is what the architecture of a modern database for time series looks like using VictoriaMetrics as an example.

It consists of two parts: a store for the inverted index and a store for the time series values. These stores are separate.

When a new record arrives in the database, we first go to the inverted index to find the time series identifier for the given set of label=value pairs of this metric. We find that identifier and store the value in the data store.

When a query comes in to fetch data from the TSDB, we first go to the inverted index and get all the timeseries_ids that match the given set of label=value pairs. Then we fetch all the necessary data from the data store, indexed by those timeseries_ids.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's look at an example of how a time series database handles an incoming select query.

  • First, it gets from the inverted index all the timeseries_ids that contain the given label=value pairs or satisfy the given regular expression.
  • Then it gets from the data store all the data points for the found timeseries_ids on the given time interval.
  • After that, the database performs calculations on these data points according to the user's query and returns the answer.

In this talk I will cover the first part: looking up timeseries_ids in the inverted index. For the second and third parts, you can look at the VictoriaMetrics sources later, or wait until I prepare other talks πŸ™‚

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's start with the inverted index. To many, this may seem simple. Who knows what an inverted index is and how it works? Oh, not so many people anymore. Let's try to understand what it is.

In fact, everything is simple. It's just a dictionary that maps a key to a value. The key is a label=value pair, where label and value are strings. The value is the set of timeseries_ids of the time series that include the given label=value pair.

The inverted index allows you to quickly find all timeseries_ids that have the given label=value.

It also allows you to quickly find the timeseries_ids of time series for multiple label=value pairs, or for label=regexp pairs. How? By intersecting the sets of timeseries_ids for each label=value pair.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's consider various implementations of the inverted index, starting with the simplest naive implementation. It looks like this.

The getMetricIDs function takes a list of strings. Each string contains label=value. The function returns a list of metricIDs.

How does it work? There is a global variable called invertedIndex. It is an ordinary dictionary (map) that maps a string to a slice of ints. The string contains label=value.

The implementation of the function: get the metricIDs for the first label=value, then go through all the remaining label=value pairs and get the metricIDs for them, calling the intersectInts function, which will be discussed below. The function returns the intersection of these lists.
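A minimal sketch of this naive implementation, reconstructed from the description above rather than copied from the slides. I use uint64 ids here for consistency with the rest of the talk, while the slide reportedly used plain ints.

```go
package invertedindex

// invertedIndex maps a "label=value" string to the ids of all time
// series containing that pair. It lives purely in RAM.
var invertedIndex = map[string][]uint64{}

// getMetricIDs returns the ids of time series that contain every one of
// the given "label=value" strings.
func getMetricIDs(tfs []string) []uint64 {
	if len(tfs) == 0 {
		return nil
	}
	metricIDs := invertedIndex[tfs[0]]
	for _, tf := range tfs[1:] {
		metricIDs = intersectInts(metricIDs, invertedIndex[tf])
	}
	return metricIDs
}

// intersectInts returns the values present in both slices. This is the
// naive nested-loop version; faster variants are compared later in the talk.
func intersectInts(a, b []uint64) []uint64 {
	var result []uint64
	for _, x := range a {
		for _, y := range b {
			if x == y {
				result = append(result, x)
				break
			}
		}
	}
	return result
}
```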

Go optimizations in VictoriaMetrics. Alexander Valyalkin

As you can see, the implementation of the inverted index is not very complicated. But this is a naive implementation. What are its shortcomings? The main drawback is that this inverted index lives entirely in RAM. After restarting the application, we lose the index; it is never saved to disk. Such an inverted index is hardly suitable for a database.

The second disadvantage is also related to memory: the inverted index must fit in RAM. If it exceeds the available RAM, we will obviously get an out-of-memory error and the program won't work.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

This problem can be solved using ready-made solutions such as LevelDB or RocksDB.

In short, we need a database that allows us to do three things quickly.

  • The first operation is writing key-value pairs to the database. It does this very quickly. Keys and values are arbitrary strings.
  • The second operation is a quick search for a value by a given key.
  • And the third operation is a quick search for all values ​​by a given prefix.

LevelDB and RocksDB were developed by Google and Facebook. LevelDB came first. Then the folks at Facebook took LevelDB, started improving it, and made RocksDB. Now almost all internal databases at Facebook run on top of RocksDB; they even migrated MySQL to RocksDB and named the result MyRocks.

An inverted index can be implemented on top of LevelDB. How? We store label=value as the key, and the identifier of a time series that contains this label=value pair as the value.

If we have many time series with the same label=value pair, there will be many rows in the database with the same key and different timeseries_ids. To get a list of all timeseries_ids for rows starting with a given label=value prefix, we do a range scan, which this database is optimized for. That is, we select all rows that begin with the label=value prefix and collect the necessary timeseries_ids.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Here is an example of what the implementation would look like in Go. We have an inverted index; this time it is LevelDB.

The function is the same as in the naive implementation and repeats it almost line by line. The only difference is that instead of accessing a map, we access the inverted index. We get all the values for the first label=value, then go through all the remaining label=value pairs and get the corresponding sets of metricIDs for them. Then we find the intersection.
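A rough sketch of what the per-key lookup looks like over such a store. The kvStore interface and its RangeScan method are hypothetical stand-ins for a real LevelDB binding; only the overall shape follows the talk.

```go
package invertedindex

import "encoding/binary"

// kvStore is a hypothetical stand-in for a LevelDB-like key-value store
// that supports fast prefix (range) scans.
type kvStore interface {
	// RangeScan returns the raw values of all rows whose key starts
	// with the given prefix.
	RangeScan(prefix string) [][]byte
}

// getValues collects the time series ids for one "label=value" pair.
// Each matching row stores a single 8-byte id in its value, so the id
// set has to be rebuilt row by row on every call; this is exactly the
// slowness compared to the in-memory map that is discussed below.
func getValues(db kvStore, labelValue string) []uint64 {
	var ids []uint64
	for _, v := range db.RangeScan(labelValue) {
		ids = append(ids, binary.BigEndian.Uint64(v))
	}
	return ids
}
```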

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Everything seems fine, but this solution has drawbacks. VictoriaMetrics initially implemented the inverted index on top of LevelDB, but in the end I had to give it up.

Why? Because LevelDB is slower than the naive implementation. In the naive implementation, for a given key we immediately get the entire metricIDs slice. This is a very fast operation: the whole slice is ready for use.

In LevelDB, on each call to GetValues you need to go through all the rows that begin with label=value, extract the timeseries_id from each row, and collect those ids into a slice. Obviously, this is much slower than simply accessing an ordinary map by key.

The second drawback is that LevelDB is written in C++. Calling C/C++ functions from Go via cgo is not very fast: it takes hundreds of nanoseconds. Compared to a regular Go function call, which takes 1-5 nanoseconds, the difference is tens of times. For VictoriaMetrics, this was a fatal flaw πŸ™‚

Go optimizations in VictoriaMetrics. Alexander Valyalkin

So I wrote my own implementation of the inverted index and called it mergeset.

Mergeset is based on the MergeTree data structure borrowed from ClickHouse. It is optimized for fast retrieval of timeseries_ids by a given key. Mergeset is written entirely in Go. You can find the VictoriaMetrics sources on GitHub; the mergeset implementation is in the /lib/mergeset folder. You can try to figure out what's going on there.

The mergeset API is very similar to that of LevelDB and RocksDB: it allows you to quickly save new records and quickly select records by a given prefix.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

We will talk about the disadvantages of mergeset later. Now let's talk about the problems that came up with VictoriaMetrics in production when implementing the inverted index.

Why did they arise?

The first reason is a high churn rate. Simply put, this is frequent turnover of time series: a time series ends and a new one begins, or many new time series begin, and this happens often.

The second reason is the large number of time series. In the beginning, when monitoring was gaining popularity, the number of time series was small. For example, on each computer you monitored the load of the processor, memory, network, and disk: 4 time series per computer. Say you have 100 computers, that is 400 time series. This is very little.

Over time, people have come up with the idea that more detailed information can be measured. For example, to measure the load not of the entire processor, but separately of each processor core. If you have 40 processor cores, then, accordingly, you have 40 times more time series to measure processor load.

But that is not all. Each processor core can be in several states, such as idle when it is doing nothing, as well as running in user space, running in kernel space, and other states. Each such state can also be measured as a separate time series. This additionally multiplies the number of time series by 7-8.

From one metric we got 40 x 8 = 320 metrics for a single computer. Multiply by 100 and we get 32,000 time series instead of 400.

Then Kubernetes came along and made things even worse, because many different services can be hosted in Kubernetes, and each service consists of many pods, all of which need to be monitored. On top of that, new versions of your services are deployed constantly, and each new version creates new time series. As a result, the number of time series grows exponentially, and we face the problem of a large number of time series, which is called high cardinality. VictoriaMetrics handles it well compared to other time series databases.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's take a closer look at the high churn rate. What causes a high churn rate in production? The values of some labels and tags are constantly changing.

Take Kubernetes, for example, which has the concept of a deployment, i.e. rolling out a new version of your application. For some reason, the Kubernetes developers decided to put the deployment id into a label.

What does this lead to? With each new deployment, all the old time series are cut off and new time series begin in their place, with a new deployment_id label value. There can be hundreds of thousands or even millions of such time series.

An important feature of all this is that the total number of time series keeps growing, while the number of currently active time series, i.e. those receiving data, stays roughly constant. This state is called a high churn rate.

The main challenge with a high churn rate is maintaining constant lookup speed for all time series matching a given set of labels over some time interval, usually the last hour or the last day.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

How can this be solved? The first option is to split the inverted index into independent parts by time. That is, when a time interval passes, we finish working with the current inverted index and create a new one. Another interval passes, and we create another one, and so on.

When querying, we find the set of inverted indexes that fall within the given interval and select the time series ids from them.

This saves resources because we don't have to look at parts that fall outside the given interval. That is, if we query data for the last hour, we skip the indexes for previous time intervals.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Another solution is to store, for each day, a separate list of ids of the time series seen on that day.

The advantage of this solution over the previous one is that we do not duplicate the time series information that does not disappear over time; it is permanent and does not change.

The disadvantage is that this solution is more difficult to implement and to debug. Still, VictoriaMetrics chose it; that is how it happened historically. And it holds up well compared to the previous option: the first option was rejected because it requires duplicating, in every partition, the data for time series that do not change, that is, that do not disappear over time. VictoriaMetrics was optimized primarily for disk space consumption, and the first implementation would have made disk space consumption worse. This implementation is better at minimizing disk space, so it was chosen.

I did have to fight with it, though. The trouble is that in this implementation you still have to select a much larger number of timeseries_ids for the data than when the inverted index is partitioned by time.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

How did we solve this problem? In an original way: by storing several time series identifiers in each inverted index entry instead of just one. That is, we have the key label=value, which occurs in each of those time series, and now we store several timeseries_ids in one entry.

Here is an example. Where we used to have N entries, we now have a single entry whose prefix is the same as all of them and whose value contains all the ids of those time series.

This made it possible to speed up scanning of the inverted index by up to 10x and to reduce cache memory consumption, because the label=value string is now stored in the cache only once instead of N times. And this string can be large if you have long values stored in tags and labels, which Kubernetes loves to shove in there.
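A hedged sketch of the idea: instead of N rows that each map the same label=value prefix to a single id, one row maps it to a packed list of ids. The plain fixed-width encoding below is illustrative only; the real VictoriaMetrics on-disk format is different.

```go
package invertedindex

import "encoding/binary"

// marshalIDs packs several time series ids into one value blob, so a
// single "label=value" entry can carry all of them.
func marshalIDs(ids []uint64) []byte {
	buf := make([]byte, 0, len(ids)*8)
	var tmp [8]byte
	for _, id := range ids {
		binary.BigEndian.PutUint64(tmp[:], id)
		buf = append(buf, tmp[:]...)
	}
	return buf
}

// unmarshalIDs restores the ids from a blob produced by marshalIDs.
func unmarshalIDs(buf []byte) []uint64 {
	ids := make([]uint64, 0, len(buf)/8)
	for len(buf) >= 8 {
		ids = append(ids, binary.BigEndian.Uint64(buf[:8]))
		buf = buf[8:]
	}
	return ids
}
```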

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Another option to speed up inverted index lookups is sharding: creating several inverted indexes instead of one and sharding the data between them by key, i.e. by the set of label=value pairs. This way we get several independent inverted indexes that we can query in parallel on several CPUs. The previous implementations only allow single-CPU operation, i.e. scanning data on one core. This solution allows data to be scanned on several cores at once, the way ClickHouse likes to do it. This is something we plan to implement.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

And now back to the matter at hand: the timeseries_ids intersection function. Let's look at possible implementations. This function lets you find the timeseries_ids for a given set of label=value pairs.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

The first option is the naive implementation: two nested loops. The intersectInts function takes two slices, a and b, as input and must return their intersection.

The naive implementation looks like this: we iterate over all values of slice a, and inside this loop we go through all the values of slice b and compare them. If they match, we have found an intersection and save it to result.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

What are the disadvantages? Quadratic complexity is the main one. For example, if slices a and b have a million elements each, this function will effectively never return, because it would need to perform a trillion iterations, which is a lot even for modern computers.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

The second implementation is based on a map. We create a map and put all the values from slice a into it. Then, in a separate loop, we run through slice b and check whether each value from b is in the map. If it is, we add it to the result.
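A sketch of this map-based variant; it is a reconstruction of the approach described, not the exact VictoriaMetrics code.

```go
package invertedindex

// intersectIntsMap intersects two id slices in linear time: load a into
// a map, then keep the elements of b that are present in it. Every map
// operation pays the hashing cost discussed below.
func intersectIntsMap(a, b []uint64) []uint64 {
	m := make(map[uint64]struct{}, len(a))
	for _, x := range a {
		m[x] = struct{}{}
	}
	result := make([]uint64, 0, len(b))
	for _, y := range b {
		if _, ok := m[y]; ok {
			result = append(result, y)
		}
	}
	return result
}
```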

Go optimizations in VictoriaMetrics. Alexander Valyalkin

What are the benefits? The advantage is linear complexity, so the function executes much faster on large slices. For slices of a million elements, it runs in about 2 million iterations, as opposed to the trillion iterations of the previous function.

The disadvantage is that this function requires extra memory to create the map.

The second disadvantage is the large hashing overhead. This shortcoming is not obvious, and it wasn't obvious to us either, so at first the intersection in VictoriaMetrics was implemented via a map. But then profiling showed that most of the CPU time was spent writing to the map and checking for the presence of values in it.

Why do these places burn CPU time? Because on these lines Go performs a hashing operation: it computes a hash of the key in order to access the corresponding slot in the hash map. The hash computation takes tens of nanoseconds, which is slow for VictoriaMetrics.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

So I decided to implement a bitset optimized specifically for this case. This is what the intersection of two slices looks like now: we create a bitset, add the elements from the first slice to it, then check the presence of the elements of the second slice in it and add the matches to the result. It hardly differs from the previous example; the only change is that the map accesses are replaced with custom add and has functions.
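A sketch of the same intersection written against the lib/uint64set package described later in the talk. The method names (Add, Has) follow the talk's API listing; treat the exact import path and signatures as assumptions and check the VictoriaMetrics sources for the authoritative versions.

```go
package invertedindex

import "github.com/VictoriaMetrics/VictoriaMetrics/lib/uint64set"

// intersectIDs mirrors the map-based variant above, but uses the
// bitset-backed set instead of a Go map, avoiding per-element hashing.
func intersectIDs(a, b []uint64) []uint64 {
	var s uint64set.Set // zero value is assumed to be ready for use
	for _, x := range a {
		s.Add(x)
	}
	result := make([]uint64, 0, len(b))
	for _, y := range b {
		if s.Has(y) {
			result = append(result, y)
		}
	}
	return result
}
```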

Go optimizations in VictoriaMetrics. Alexander Valyalkin

At first glance it seems this should be slower: a standard map was used before, and now some other functions are called instead. But profiling shows that this thing is 10 times faster than the standard map for the VictoriaMetrics case.

In addition, it uses much less memory than the map implementation, because we store bits instead of eight-byte values.

The disadvantage of this implementation is that it is less obvious and not trivial.

Another disadvantage that many may not notice is that this implementation may work poorly in some cases. It is optimized for a specific case, the intersection of VictoriaMetrics time series ids, and that does not mean it suits all cases. If used incorrectly, we will get not a performance gain but an out-of-memory error and a slowdown.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's consider the implementation of this structure. If you want to see it, it is in the VictoriaMetrics sources, in the lib/uint64set folder. It is optimized specifically for the VictoriaMetrics case, where a timeseries_id is a 64-bit value whose upper 32 bits are basically constant and only the lower 32 bits change.

This data structure is not stored on disk, it only works in memory.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Here is its API. It's not very complex and is tailored to the specific use case of VictoriaMetrics: there are no extra functions, only the ones VictoriaMetrics actually uses.

There is an add function, which adds new values; a has function, which checks whether a value is present; and a del function, which removes values. There is a helper len function, which returns the size of the set. The clone function clones the set, and appendto converts the set into a slice of timeseries_ids.
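A small usage sketch of the API listed above. The signatures are written as I understand them from the talk (exported in Go as Add, Has, Del, Len, Clone, AppendTo) and may differ in detail from the current lib/uint64set sources.

```go
package main

import (
	"fmt"

	"github.com/VictoriaMetrics/VictoriaMetrics/lib/uint64set"
)

func main() {
	var s uint64set.Set // zero value assumed usable
	s.Add(1)
	s.Add(2)
	s.Add(3)
	s.Del(2) // remove a value

	fmt.Println(s.Len())  // 2
	fmt.Println(s.Has(1)) // true
	fmt.Println(s.Has(2)) // false after Del

	c := s.Clone()         // independent copy of the set
	ids := s.AppendTo(nil) // convert the set into a []uint64 of timeseries ids
	fmt.Println(c.Len(), ids)
}
```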

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Here is what the implementation of this data structure looks like. The set has two fields:

  • ItemsCount is an auxiliary field for quickly returning the number of elements in the set. We could have done without it, but we added it because VictoriaMetrics frequently queries the length of the bitset in its algorithms.

  • The second field is buckets, a slice of bucket32 structures. Each bucket32 contains a hi field holding the upper 32 bits, and two slices: b16his and buckets, the latter holding bucket16 structures.

b16his stores the upper 16 bits of the lower half of the 64-bit value, and the bucket16 structures store bitsets over the lowest 16 bits of each value.

bucket16 consists of an array of uint64 words whose length is computed from these constants. One bucket16 can hold at most 2^16 = 65536 bits. Divided by 8, that is 8 kilobytes; divided by 8 again, that is 1024 uint64 values. In other words, bucket16 is our 8-kilobyte structure.
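A reconstruction of this layout in Go, with field names as they are described in the talk; the real lib/uint64set code differs in detail.

```go
package uint64set

// wordsPerBucket is the number of uint64 words needed to hold 2^16 bits:
// 65536 bits / 64 bits per word = 1024 words = 8 KiB per bucket16.
const wordsPerBucket = (1 << 16) / 64

// Set stores 64-bit ids, grouped by their upper 32 bits.
type Set struct {
	itemsCount int        // cached element count; queried often by VictoriaMetrics
	buckets    []bucket32 // one bucket32 per distinct upper-32-bit value
}

// bucket32 groups the ids that share the same upper 32 bits.
type bucket32 struct {
	hi      uint32     // the shared upper 32 bits
	b16his  []uint16   // upper 16 bits of the lower half, one per bucket16
	buckets []bucket16 // parallel slice of bitsets over the lowest 16 bits
}

// bucket16 is a flat bitset over the lowest 16 bits of a value.
type bucket16 [wordsPerBucket]uint64
```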

Go optimizations in VictoriaMetrics. Alexander Valyalkin

Let's look at how one of this structure's methods, adding a new value, is implemented.

Everything starts with a uint64 value. We compute its upper 32 bits and its lower 32 bits, then go through all the buckets, comparing the upper 32 bits of each bucket with those of the value being added. If they match, we call the add function on that bucket32 structure and add the lower 32 bits there. If it returns true, it means the value was not there before and has been added, so we increment the element count of the structure. If it returns false, the value already existed.

If we didn't find a bucket with the desired hi value, we call the addAlloc function, which creates a new bucket and adds the value to it.
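Continuing the same reconstruction, here is roughly what that add path looks like; addAlloc is sketched inline, and the bucket32 and bucket16 add methods follow below.

```go
// Add inserts x into the set. It looks for the bucket32 whose hi field
// matches the upper 32 bits of x and adds the lower 32 bits there;
// itemsCount is incremented only when the value was not present before.
func (s *Set) Add(x uint64) {
	hi := uint32(x >> 32)
	lo := uint32(x)
	for i := range s.buckets {
		b := &s.buckets[i]
		if b.hi == hi {
			if b.add(lo) {
				s.itemsCount++
			}
			return
		}
	}
	s.addAlloc(hi, lo)
}

// addAlloc creates a new bucket32 for an unseen upper-32-bit value and
// adds the lower 32 bits to it.
func (s *Set) addAlloc(hi, lo uint32) {
	s.buckets = append(s.buckets, bucket32{hi: hi})
	if s.buckets[len(s.buckets)-1].add(lo) {
		s.itemsCount++
	}
}
```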

Go optimizations in VictoriaMetrics. Alexander Valyalkin

This is the implementation of the bucket32 add function. It is similar to the previous one: we compute the upper 16 bits and the lower 16 bits.

Then we go through all the stored upper-16-bit values looking for a match. If there is one, we call the add method of the matching bucket16, which we will look at on the next slide.
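The corresponding bucket32 add in the same reconstruction: match on the upper 16 bits of the lower half, then delegate to the bucket16 bitset.

```go
// add inserts the lower 32 bits of a value. It finds (or allocates) the
// bucket16 for the upper 16 bits and sets the bit for the lower 16 bits.
// It reports whether the value was newly added.
func (b *bucket32) add(x uint32) bool {
	hi16 := uint16(x >> 16)
	lo16 := uint16(x)
	for i, h := range b.b16his {
		if h == hi16 {
			return b.buckets[i].add(lo16)
		}
	}
	// No bucket16 for these upper 16 bits yet: allocate one.
	b.b16his = append(b.b16his, hi16)
	b.buckets = append(b.buckets, bucket16{})
	return b.buckets[len(b.buckets)-1].add(lo16)
}
```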

Go optimizations in VictoriaMetrics. Alexander Valyalkin

And here is the lowest level, which should be optimized as much as possible. For the value we compute the index of the uint64 word in the slice that holds its bit, as well as a bitmask: the mask for that 64-bit word which lets us check whether the bit is set or set it. We check whether the bit is set, set it, and return whether it was already present. This is the implementation that let us speed up the intersection of time series ids by 10x compared to ordinary maps.
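And the lowest level of the reconstruction: compute which uint64 word holds the bit and the mask within that word, then test and set it.

```go
// add sets the bit for the lowest 16 bits of a value and reports whether
// that bit was previously unset (i.e. the value is new to the set).
func (b *bucket16) add(x uint16) bool {
	wordIdx := x / 64             // index of the uint64 word holding the bit
	mask := uint64(1) << (x % 64) // bitmask selecting the bit inside that word
	word := b[wordIdx]
	if word&mask != 0 {
		return false // already present
	}
	b[wordIdx] = word | mask
	return true
}
```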

Go optimizations in VictoriaMetrics. Alexander Valyalkin

In addition to this one, VictoriaMetrics has many other optimizations. Most of them were not added on a hunch, but only after profiling the code in production.

This is the main rule of optimization: do not add an optimization on the assumption that there will be a bottleneck, because it may turn out that there is none. Optimization usually degrades code quality, so it is worth optimizing only after profiling, preferably in production, so the data is real. Those who are interested can look at the VictoriaMetrics source code and explore the other optimizations there.

Go optimizations in VictoriaMetrics. Alexander Valyalkin

I have a question about the bitset. It is very similar to the C++ vector<bool> / optimized bitset implementation. Did you take the implementation from there?

No, not from there. When implementing this bitset, I was guided by knowledge of the structure of the timeseries ids used in VictoriaMetrics: the upper 32 bits are basically constant, the lower 32 bits change, and the lower the bit, the more often it can change. So this implementation is optimized for that data. The C++ implementation, as far as I know, is optimized for the general case, and an optimization for the general case will not be the most optimal for a specific case.

I recommend watching Alexey Milovidov's talk. About a month ago he spoke about optimizations in ClickHouse for specific specializations. His point is that in the general case, a C++ or some other implementation is tuned for good average-case behavior. It may perform worse than a specialized implementation that exploits specific knowledge, like ours, where we know that the upper 32 bits are mostly constant.

I have a second question. What is the fundamental difference from InfluxDB?

There are many fundamental differences. In terms of performance and memory consumption, InfluxDB shows 10 times higher memory consumption in tests for high-cardinality time series, when you have a lot of them, say millions. For example, VictoriaMetrics consumes 1 GB per million active time series, while InfluxDB consumes 10 GB. That's a big difference.

The second fundamental difference is that InfluxDB has rather odd query languages, Flux and InfluxQL. They are not very convenient for working with time series compared to PromQL, which VictoriaMetrics supports. PromQL is the query language from Prometheus.

Another difference is that InfluxDB has a slightly strange data model, where each row can store several fields with a different set of tags, and these rows are further split into all sorts of tables. These extra complications make subsequent work with the database harder; it is difficult to maintain and understand.

Everything is much simpler in VictoriaMetrics: each time series is a key-value pair. The value is a set of (timestamp, value) points, and the key is a set of label=value pairs. There is no separation into fields and measurements. This lets you select any data and then combine, add, subtract, multiply, or divide it, unlike InfluxDB, where, as far as I know, calculations between different series are still not implemented, and even where they are, it is difficult and you have to write a bunch of code.

I have a clarifying question. Did I understand correctly that there was a problem you talked about, namely that this inverted index does not fit into memory, so partitioning is used there?

First, I showed a naive implementation of the inverted index on a standard Go map. That implementation is not suitable for a database because the inverted index is not saved to disk, while a database must persist it so the data remains available after a restart. With that implementation, when you restart the application the inverted index disappears and you lose access to all the data, because you can no longer find it.

Hello! Thanks for the talk! My name is Pavel, I'm from Wildberries. I have a few questions for you. First question: do you think that if you had chosen a different principle when building your application's architecture and had partitioned the data by time, you could have done the intersection during search relying only on the fact that one partition contains data for one period of time, i.e. one time interval, and would not have to worry about your pieces being scattered differently? Second question: since you implemented your own algorithm with the bitset and everything else, have you perhaps tried using processor instructions? Have you tried such optimizations?

I will answer the second one. We haven't gotten there yet. But if need be, we will. What was the first question?

You discussed two scenarios and said you chose the second one, with the more complex implementation, rather than the first one, where the data is partitioned by time.

Yes. In the first case, the total volume of the index would be larger, because in each partition we would have to store duplicate data for the time series that continue through all these partitions. And if the churn rate for time series is small, that is, the same series are used all the time, then in the first case we would lose much more disk space than in the second.

So yes, partitioning by time is a good option; Prometheus uses it. But Prometheus has another drawback: when merging these pieces of data, it needs to keep meta-information about all labels and time series in memory. So if the pieces being merged are large, memory consumption grows dramatically during merging, unlike in VictoriaMetrics, which consumes almost no memory when merging, just a couple of kilobytes regardless of the size of the pieces being merged.

Your algorithm uses memory: it marks the time series that have values, and that's how you check presence in one data array against the other and find out whether an intersection happened. Databases typically implement cursors, i.e. iterators that hold their current value and walk over sorted data, because of the low complexity of these operations.

Why don't we use cursors to traverse data?

Yes.

In LevelDB or in mergeset, plain sorted rows are stored. We could walk them with a cursor and find the intersection. Why don't we? Because it's slow: cursors imply calling a function for every row. A function call is about 5 nanoseconds, and if you have 100 million rows, it turns out we'd spend half a second on function calls alone.

Fair enough, yes. And my last question, which may sound a little strange. Why not compute all the necessary aggregates at the moment the data arrives and save them in the required form? Why save huge volumes into systems like VictoriaMetrics, ClickHouse, etc., only to spend a lot of time on them later?

Let me give an example to make it clearer. Take how a simple toy speedometer works: it records the distance you have traveled, continuously adding it into one value, and the elapsed time into a second value, then divides one by the other and gets the average speed. You could do pretty much the same: add up all the necessary facts on the fly.

Okay, I understand the question. Your example is valid: if you know which aggregates you need, that is the best approach. But the problem is that people save these metrics, this data, into ClickHouse without yet knowing how they will aggregate and filter it in the future, so they have to save all the raw data. If you know you need to compute something, then by all means compute it instead of storing a bunch of raw values. But that only works if you know exactly what you need.

By the way, time series databases do support computing aggregates. For example, Prometheus supports recording rules, so it can be done if you know which aggregates you need. VictoriaMetrics does not have this yet, but it is usually fronted by Prometheus, where you can do this with recording rules.

For example, at a previous job I had to count the number of events in a sliding window over the last hour. The problem is that I had to write a custom implementation in Go, a service for counting this, and it ended up being non-trivial, because this is hard to count. The implementation can be simple if you need aggregates at fixed time intervals, but counting events in a sliding window is not as easy as it seems. I think this still hasn't been implemented in ClickHouse or in time series databases, because it is difficult to implement.

One more question. We were just talking about averaging, and I remembered that there was once such a thing as Graphite with the Carbon backend. It knew how to thin out old data, i.e. keep one point per minute, one point per hour, and so on. In principle, this is quite convenient if you need raw data only for, say, a month, and everything older can be thinned out. But Prometheus and VictoriaMetrics do not support this functionality. Is there a plan to support it? If not, why not?

Thanks for the question. Our users ask it periodically, asking when we will add support for thinning (downsampling). There are several problems here. First, every user understands downsampling differently: one wants an arbitrary point on a given interval, another wants the maximum, minimum, or average values. If many systems write data into your database, you can't treat them all the same; you may find that each system needs its own downsampling. And that is hard to implement.

Second, VictoriaMetrics, like ClickHouse, is optimized for working with large amounts of raw data, so it can shovel through a billion rows in under a second if you have enough cores in your system. Time series point scanning in VictoriaMetrics runs at about 50,000,000 points per second per core, and that performance scales with the available cores. So with 20 cores, for example, you will scan a billion points per second. This property of VictoriaMetrics and ClickHouse reduces the need for downsampling.

Another benefit is that VictoriaMetrics compresses this data efficiently. Compression in production averages 0.4 to 0.8 bytes per point. Each point is a timestamp plus a value, and it compresses to less than one byte on average.

Sergey here. I have a question. What is the minimum time quantum for recording?

One millisecond. We recently had a conversation with developers of other time series databases; their minimum time quantum is one second. In Graphite, for example, it is also one second, and in OpenTSDB it is also one second. InfluxDB has nanosecond precision. VictoriaMetrics has one millisecond because Prometheus has one millisecond, and VictoriaMetrics was originally developed as remote storage for Prometheus. But now it can store data from other systems as well.

The people I spoke to say that one-second accuracy is enough for them, because it depends on the kind of data stored in the time series database. If it is DevOps data or infrastructure data collected at intervals of 30 seconds or a minute, then one-second accuracy is enough; you don't need finer resolution. But if you collect data from high-frequency trading systems, you need nanosecond precision.

Millisecond accuracy in VictoriaMetrics is suitable for the DevOps case and for most of the cases I mentioned at the beginning of the talk. The only thing it may not be suitable for is high-frequency trading systems.

Thank you! One more question. How compatible is it with PromQL?

Full backward compatibility. VictoriaMetrics fully supports PromQL. In addition, it adds extra advanced functionality on top of PromQL, called MetricsQL. There is a talk on YouTube about this extended functionality; I spoke about it at the Monitoring Meetup in St. Petersburg in the spring.

Telegram channel VictoriaMetrics.


What is stopping you from switching to VictoriaMetrics as long-term storage for Prometheus? (Write in the comments, I will add it to the poll.)

  • Not using Prometheus: 71.4% (5 votes)
  • Didn't know about VictoriaMetrics: 28.6% (2 votes)

7 users voted. 12 users abstained.

Source: habr.com
