Greetings, Habr.
If you are operating a Graphite system and have run into whisper's storage performance problems, then ClickHouse solves the described problems well. For example, after loading 2TiB of data from whisper, it fit into 300GiB. I will not dwell on the comparison in detail; there are enough articles on that topic. Besides, until recently, not everything was perfect with our ClickHouse storage either.
Consumed Space Issues
At first glance, everything should work well. Following the documentation, we create a config with the retention (rollup) scheme, then create a table as recommended by the chosen graphite-web backend.
In order to understand what is going on, you need to know how inserts work, and what the further life path of data is, in ClickHouse tables of the *MergeTree engine family:
- A block of data is inserted; in our case, these are metrics.
- Before being written to disk, each such block is sorted according to the ORDER BY key specified when the table was created.
- After sorting, a part of the data is written to disk.
- The server monitors in the background so that there are not too many such parts, and launches background merges.
- The server stops launching merges on its own as soon as data stops actively flowing into the partition, but you can start the process manually with the OPTIMIZE command.
- If only one part is left in the partition, you will not be able to start a merge with the usual command; you must use OPTIMIZE ... FINAL.
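The lifecycle above can be sketched as a toy model. This is an illustration only, not real ClickHouse internals: every insert becomes a sorted part, and a forced merge collapses all parts of a partition into one.

```python
# Toy model of the *MergeTree insert/merge lifecycle (illustration only):
# an inserted block is sorted by the ORDER BY key and written as a new part;
# OPTIMIZE ... FINAL merges every part of a partition into a single part.

def insert_block(partition_parts, rows, order_by=lambda r: r):
    """An inserted block is sorted before being written as a new part."""
    partition_parts.append(sorted(rows, key=order_by))

def optimize_final(partition_parts):
    """Like OPTIMIZE ... FINAL: merge all parts into a single sorted part."""
    merged = sorted(row for part in partition_parts for row in part)
    partition_parts.clear()
    partition_parts.append(merged)

parts = []                      # one partition, modeled as a list of parts
insert_block(parts, [3, 1, 2])  # first insert -> one sorted part
insert_block(parts, [5, 4])     # second insert -> a second part
assert len(parts) == 2
optimize_final(parts)           # forced merge leaves exactly one part
assert parts == [[1, 2, 3, 4, 5]]
```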
So, the first metrics arrive. And they take up some space. Subsequent events may vary somewhat depending on many factors:
- The partition key can be either very small (a day) or very large (several months).
- The retention config can fit several significant data aggregation thresholds inside the active partition (where the metrics are written), or maybe not.
- If there is a lot of data, then the earliest parts, which may already be huge due to background merges (when a non-optimal partitioning key was chosen), will not merge on their own with fresh small parts.
And it always ends the same way: the space occupied by metrics in ClickHouse only grows unless you:
- apply OPTIMIZE ... FINAL manually, or
- insert data into all partitions on an ongoing basis, so that sooner or later a background merge starts.
The second method seemed the easiest to implement, and therefore it was wrong; naturally, it was tested first.
I wrote a fairly simple Python script that sent dummy metrics for every day of the past 4 years and ran every hour from cron.
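A minimal sketch of such a script might look like this. The metric name, host, and port here are my assumptions, not the original script's; the essential part is producing one carbon plaintext line (`name value timestamp`) per day.

```python
import socket
import time

DAY = 86400  # seconds in a day

def dummy_metric_lines(days_back=4 * 365, name="fake.touch", now=None):
    """One carbon plaintext protocol line per day for the past `days_back` days.

    The metric name is a placeholder; any value works for touching partitions.
    """
    now = int(now if now is not None else time.time())
    return [f"{name} 1 {now - d * DAY}" for d in range(days_back)]

def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send the lines to a carbon receiver; host and port are assumptions."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(("\n".join(lines) + "\n").encode())

lines = dummy_metric_lines(days_back=3, now=DAY * 10)
# Three days, one data point each, newest first.
assert lines == ["fake.touch 1 864000", "fake.touch 1 777600", "fake.touch 1 691200"]
```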
Since the whole operation of the ClickHouse DBMS is built on the assumption that the system will do all background work sooner or later, at an unknown time, I never managed to wait for the moment when the old huge parts would deign to start merging with the fresh small ones. It became clear that we needed to find a way to automate forced optimizations.
Information in ClickHouse system tables
Let's take a look at the structure of the system.parts table. The columns of interest are:

- database name (database);
- table name (table);
- partition name and ID (partition & partition_id);
- when the part was last modified (modification_time);
- minimum and maximum date in the part (partitioning is by day) (min_date & max_date).
There is also the system.graphite_retentions table, with the following fields of interest:

- database name (Tables.database);
- table name (Tables.table);
- the metric age at which the next aggregation rule should be applied (age).
So:
- We have a table of parts and a table of aggregation rules.
- We join them on the intersection of database and table and get all *GraphiteMergeTree tables.
- We look for all partitions in which:
  - there is more than one part,
  - or the time has come to apply the next aggregation rule, and modification_time is older than that moment.
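The selection logic for a single partition can be sketched in Python like this. The function is an illustration of the conditions, not part of ClickHouse; the argument names mirror the aliases used in the system tables and the query.

```python
from datetime import datetime, timedelta

def partition_needs_optimize(parts_count, modified_at, max_time, age_seconds,
                             now=None):
    """A partition should be optimized if it has more than one part, or if
    the next rollup threshold has passed but the partition was last
    modified before that threshold."""
    now = now or datetime.now()
    rollup_time = max_time + timedelta(seconds=age_seconds)
    if rollup_time >= now:  # the aggregation rule is not due yet
        return False
    return parts_count > 1 or modified_at < rollup_time

t0 = datetime(2019, 1, 1)
# Two parts, rollup threshold already in the past -> optimize.
assert partition_needs_optimize(2, t0, t0, 3600, now=t0 + timedelta(days=2))
# One part, already modified after the rollup threshold -> nothing to do.
assert not partition_needs_optimize(
    1, t0 + timedelta(hours=2), t0, 3600, now=t0 + timedelta(days=2))
```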
Implementation
The following query
```sql
SELECT
    concat(p.database, '.', p.table) AS table,
    p.partition_id AS partition_id,
    p.partition AS partition,
    -- The "oldest" rule that could already be applied to the partition,
    -- but not in the future, see (*)
    max(g.age) AS age,
    -- The number of parts in the partition
    countDistinct(p.name) AS parts,
    -- 00:00:00 of the next day is taken as the oldest metric in the partition
    toDateTime(max(p.max_date + 1)) AS max_time,
    -- When the partition should be optimized
    max_time + age AS rollup_time,
    -- When the oldest part in the partition was last modified
    min(p.modification_time) AS modified_at
FROM system.parts AS p
INNER JOIN
(
    -- All rollup rules for all *GraphiteMergeTree tables
    SELECT
        Tables.database AS database,
        Tables.table AS table,
        age
    FROM system.graphite_retentions
    ARRAY JOIN Tables
    GROUP BY
        database,
        table,
        age
) AS g ON
    (p.table = g.table)
    AND (p.database = g.database)
WHERE
    -- Only active parts
    p.active
    -- (*) And only rows where the aggregation rules should already have been applied
    AND ((toDateTime(p.max_date + 1) + g.age) < now())
GROUP BY
    table,
    partition
HAVING
    -- Only partitions modified before the rollup moment
    (modified_at < rollup_time)
    -- Or with more than one part
    OR (parts > 1)
ORDER BY
    table ASC,
    partition ASC,
    age ASC
```
returns every partition of the *GraphiteMergeTree tables that should be merged to free up disk space. Only a small matter remains: to go over them all with an OPTIMIZE ... FINAL query. The final implementation also takes into account that partitions with active writes must not be touched.
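Walking over the result set then boils down to issuing one statement per row. A sketch, where the executor callable is a placeholder for whatever ClickHouse client you use; the statement syntax itself (`OPTIMIZE TABLE ... PARTITION ID '...' FINAL`) is real ClickHouse.

```python
def make_optimize_query(table, partition_id):
    """Build the forced-merge statement for one partition of one table."""
    return f"OPTIMIZE TABLE {table} PARTITION ID '{partition_id}' FINAL"

def optimize_partitions(rows, execute):
    """`rows` come from the selection query above; `execute` is any callable
    that runs a statement against ClickHouse (client library, HTTP, etc.)."""
    for row in rows:
        execute(make_optimize_query(row["table"], row["partition_id"]))

issued = []
optimize_partitions(
    [{"table": "graphite.data", "partition_id": "20190101"}],
    issued.append,
)
assert issued == ["OPTIMIZE TABLE graphite.data PARTITION ID '20190101' FINAL"]
```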
This is what the project does.
If you run the program on a server with ClickHouse, it will simply start working in daemon mode. Once an hour, a query will be executed, checking if there are new partitions older than three days that can be optimized.
In the near future, I plan to provide at least deb packages and, if possible, rpm as well.
Instead of a conclusion
For the past 9+ months, I have been working on this inside my company. Several liters of beer and several admin-days were spent on developing the query, together with
Source: habr.com