ClickHouse + Graphite: How to Significantly Reduce Disk Space Consumed


Greetings, Habr.

If you operate graphite-web and have run into whisper storage performance problems (IO, disk space consumed), the chance that you have looked at ClickHouse as a replacement should be close to one. This statement assumes that a third-party implementation, for example carbonwriter or go-carbon, is already used as the receiving daemon.

ClickHouse solves the described problems well. For example, after migrating 2TiB of data from whisper, they fit into 300GiB. I will not dwell on the comparison in detail; there are enough articles on this topic. Besides, until recently not everything was perfect with our ClickHouse storage either.

Consumed Space Issues

At first glance, everything should work well. Following the documentation, you create a config for the metrics storage scheme (hereinafter, retention), then create a table as recommended for the chosen graphite-web backend: carbon-clickhouse+graphite-clickhouse or graphouse, depending on which stack is used. And… a time bomb is armed.

In order to understand which one, you need to know how inserts work, and what the further life path of data is, in tables of the *MergeTree engine family in ClickHouse (diagrams taken from presentations by Alexey Zatelepin):

  • A block of data is inserted. In our case, these are metrics.
  • Before being written to disk, each such block is sorted according to the ORDER BY key specified when the table was created.
  • After sorting, a part of the data is written to disk.
  • The server monitors in the background so that there are not too many such parts, and launches background merges.
  • The server stops launching merges on its own as soon as data stops actively flowing into the partition, but you can start the process manually with the OPTIMIZE command.
  • If there is only one part left in the partition, you will not be able to start a merge with the usual command; you must use OPTIMIZE ... FINAL.

So, the first metrics arrive and take up some space. What happens next may vary depending on several factors:

  • The partition key can be either very small (a day) or very large (several months).
  • The retention config may fit several significant data aggregation thresholds inside the active partition (the one metrics are being written to), or it may not.
  • If there is a lot of data, then the oldest parts, which may already be huge due to background merges (with a non-optimal partitioning key), will not merge with fresh small parts on their own.

And it always ends the same way. The space occupied by metrics in ClickHouse only grows unless you:

  • apply OPTIMIZE ... FINAL manually, or
  • insert data into all partitions on an ongoing basis, so that sooner or later a background merge starts

The second method seemed the easiest to implement, and therefore it is wrong, and it was tested first.
I wrote a fairly simple Python script that sent dummy metrics for every day of the past 4 years and ran hourly via cron.
Since the whole operation of the ClickHouse DBMS is based on the premise that the system will do all the background work sooner or later, but it is unknown when, I never managed to wait for the moment when the old huge parts would deign to start merging with new small ones. It became clear that we needed to find a way to automate forced optimizations.
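For illustration, a minimal sketch of such a backfill script (the original script is not shown in the article; the metric name and the carbon endpoint here are assumptions):

```python
import socket
import time

CARBON_HOST = "localhost"  # assumed carbon-clickhouse/go-carbon endpoint
CARBON_PORT = 2003         # plaintext carbon protocol port

def dummy_metrics(now=None, days=4 * 365):
    """Yield one plaintext-protocol line per day for the past `days` days."""
    now = int(now if now is not None else time.time())
    for day in range(days):
        ts = now - day * 86400
        # carbon plaintext format: "<path> <value> <timestamp>\n"
        yield f"test.dummy.daily 1 {ts}\n"

def send_metrics(lines):
    """Push the generated lines into the receiving daemon."""
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall("".join(lines).encode())
```

Run from cron, the whole job is then just `send_metrics(dummy_metrics())`: one point per day lands in every daily partition, nudging ClickHouse to schedule background merges there.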


Information in ClickHouse system tables

Let's take a look at the structure of the system.parts table. It contains comprehensive information about every part of every table on the ClickHouse server, including, among other things, the following columns:

  • database name (database);
  • table name (table);
  • partition name and ID (partition & partition_id);
  • when the part was created (modification_time);
  • minimum and maximum date in the part (partitioning is by day) (min_date & max_date);

There is also a table system.graphite_retentions, with the following interesting fields:

  • database name (Tables.database);
  • table name (Tables.table);
  • the age of the metric when the next aggregation should be applied (age);

So:

  1. We have a table of parts and a table of aggregation rules.
  2. We join them and get all the *GraphiteMergeTree tables.
  3. We look for all partitions in which:
    • there is more than one part,
    • or it is time to apply the next aggregation rule, and modification_time is older than that moment.

Implementation

This query

SELECT
    concat(p.database, '.', p.table) AS table,
    p.partition_id AS partition_id,
    p.partition AS partition,
    -- The "oldest" rule that may already apply to the partition,
    -- but not one from the future, see (*)
    max(g.age) AS age,
    -- The number of parts in the partition
    countDistinct(p.name) AS parts,
    -- 00:00:00 of the next day is taken as the oldest metric in the partition
    toDateTime(max(p.max_date + 1)) AS max_time,
    -- When the partition should be optimized
    max_time + age AS rollup_time,
    -- When the oldest part in the partition was last modified
    min(p.modification_time) AS modified_at
FROM system.parts AS p
INNER JOIN
(
    -- All rules for all *GraphiteMergeTree tables
    SELECT
        Tables.database AS database,
        Tables.table AS table,
        age
    FROM system.graphite_retentions
    ARRAY JOIN Tables
    GROUP BY
        database,
        table,
        age
) AS g ON
    (p.table = g.table)
    AND (p.database = g.database)
WHERE
    -- Only active parts
    p.active
    -- (*) And only rows where the aggregation rules should already have been applied
    AND ((toDateTime(p.max_date + 1) + g.age) < now())
GROUP BY
    table,
    partition
HAVING
    -- Only partitions last modified before their optimization moment
    (modified_at < rollup_time)
    -- Or with more than one part
    OR (parts > 1)
ORDER BY
    table ASC,
    partition ASC,
    age ASC

returns each of the *GraphiteMergeTree table partitions that should be merged to free up disk space. All that remains is the small stuff: go over all of them with an OPTIMIZE ... FINAL query. The final implementation also takes into account that partitions with active writes should not be touched.
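That follow-up step can be sketched roughly as follows (a hedged illustration in Python, not the actual implementation; the row keys mirror the columns returned by the query above):

```python
def build_optimize_queries(rows):
    """Turn rows of the partition-selection query into OPTIMIZE statements.

    Each row is expected to carry the `table`, `partition_id` and `parts`
    columns of the SELECT above. When only one part is left, a plain
    OPTIMIZE is a no-op, so FINAL is required to force the merge.
    """
    queries = []
    for row in rows:
        final = " FINAL" if row["parts"] == 1 else ""
        queries.append(
            f"OPTIMIZE TABLE {row['table']} PARTITION ID '{row['partition_id']}'{final}"
        )
    return queries
```

The resulting statements can then be fed to the server through any ClickHouse client, one partition at a time.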

This is what the graphite-ch-optimizer project does. Former colleagues from Yandex.Market tested it in production; the result of its work can be seen below.


If you run the program on a server with ClickHouse, it simply starts working in daemon mode. Once an hour it executes the query, checking whether there are new partitions older than three days that can be optimized.
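The three-day threshold can be expressed as a simple predicate (a hypothetical sketch; graphite-ch-optimizer itself is written in Go, and the names here are made up):

```python
from datetime import datetime, timedelta

OPTIMIZE_INTERVAL = timedelta(hours=1)  # how often the daemon re-runs the query
MIN_AGE = timedelta(days=3)             # younger partitions may still receive writes

def is_optimizable(max_time, now=None):
    """Return True once the newest data in the partition is older than MIN_AGE,
    so that partitions with an active record are left alone."""
    now = now or datetime.now()
    return now - max_time >= MIN_AGE
```

Here `max_time` corresponds to the `max_time` column of the query above: 00:00:00 of the day following the newest metric in the partition.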

In the near future, I plan to provide at least deb packages and, if possible, rpm as well.

Instead of a conclusion

For the past 9+ months I have spent a lot of time inside my company, InnoGames, cooking at the junction of ClickHouse and graphite-web. It was a good experience, which resulted in a quick transition from whisper to ClickHouse as metrics storage. I hope this article is something like the start of a series about the improvements we have made to various parts of this stack, and about what will be done in the future.

Several liters of beer and admin-days went into developing this query together with v0devil, to whom I want to express my gratitude, both for that and for reviewing this article.

Project page on GitHub

Source: habr.com
