Continuing the topic of recording large data streams raised earlier, this time we talk about TOAST settings and data alignment. "On average", these methods won't save all that many resources, but they require no changes to the application code at all.
Our experience with them turned out to be very productive, since the storage of almost any monitoring system is by nature mostly append-only in terms of recorded data. And if you are wondering how to teach a database that writes 200MB/s to disk to write half as much - welcome under the cut.
Little secrets of big data
Given our line of work, PostgreSQL constantly receives large volumes of monitoring data. And since most of it is never modified after being written, the way it is physically stored matters a great deal.
Let's look at the structure of one of the tables into which we write "raw" data - that is, the original text of a log entry:
CREATE TABLE rawdata_orig(
pack -- PK
uuid NOT NULL
, recno -- PK
smallint NOT NULL
, dt -- partition key
date
, data -- the most important part
text
, PRIMARY KEY(pack, recno)
);
A typical table like this (already partitioned, of course, so this is a partition template), where the most important thing is the text. And sometimes it is quite large.
Recall that the "physical" size of a single record in PG cannot exceed one page of data, but its "logical" size is another matter. To write a bulky value (varchar/text/bytea) into a field, PostgreSQL uses TOAST technology. As the documentation puts it:
PostgreSQL uses a fixed page size (usually 8KB) and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only a small impact on most of the server code. The technique is known as TOAST...
In fact, for every table with "potentially large" fields, a paired TOAST table is automatically created with roughly the following structure:
TOAST(
chunk_id
integer
, chunk_seq
integer
, chunk_data
bytea
, PRIMARY KEY(chunk_id, chunk_seq)
);
That is, if we have to write a row with a "large" value in data, then the actual write will go not only to the main table and its PK, but also to the TOAST table and its PK.
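You can see which TOAST table backs a given relation by asking the system catalog; a small sketch (the relation name follows the example table above - adjust it for your schema):

```sql
-- Find the TOAST table paired with rawdata_orig;
-- reltoastrelid is 0 if the table has no TOAST table.
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'rawdata_orig';
```

The result is a pg_toast.pg_toast_NNN relation, which has exactly the chunk_id/chunk_seq/chunk_data structure shown above.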
Reducing TOAST influence
But most of our records are still not that big - they should fit in 8KB. How can we save on this?
This is where the STORAGE attribute of a column comes to the rescue:
- EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression is attempted first, then out-of-line storage if the row is still too big.
- MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort, when there is no other way to shrink the row enough to fit on a page.)
In fact, this is exactly what we need for data: compress it as much as possible, and move it to TOAST only if it doesn't fit at all. It can be done right on the fly, with a single command:
ALTER TABLE rawdata_orig ALTER COLUMN data SET STORAGE MAIN;
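To check that the change took effect, you can inspect pg_attribute; a sketch (attstorage is a one-letter code):

```sql
-- 'm' = MAIN, 'x' = EXTENDED, 'e' = EXTERNAL, 'p' = PLAIN
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'rawdata_orig'::regclass
  AND attnum > 0
  AND NOT attisdropped;
```

After the ALTER above, the data column should show 'm'.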
How to evaluate the effect
Since the data flow changes day to day, we cannot compare absolute numbers, only relative ones: the smaller the share we wrote to TOAST, the better. But there is a danger here: the larger the "physical" size of each individual record, the "wider" the index becomes, because more data pages have to be covered.
Partition before the changes:
heap = 37GB (39%)
TOAST = 54GB (57%)
PK = 4GB ( 4%)
Partition after the changes:
heap = 37GB (67%)
TOAST = 16GB (29%)
PK = 2GB ( 4%)
In fact, we started writing to TOAST half as often, which unloaded not only the disk but also the CPU:
Note that we also started "reading" the disk less, not just "writing": when inserting a record into a table, part of the tree of each of its indexes has to be "read" as well, in order to determine the record's future position in them.
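One way to obtain a heap/TOAST/index breakdown like the one above is with the object size functions; a sketch (this omits the TOAST table's own index, so the shares are approximate):

```sql
-- Approximate size breakdown for one partition
SELECT pg_size_pretty(pg_relation_size(c.oid))                        AS heap,
       pg_size_pretty(coalesce(pg_relation_size(c.reltoastrelid), 0)) AS toast,
       pg_size_pretty(pg_indexes_size(c.oid))                         AS indexes
FROM pg_class c
WHERE c.oid = 'rawdata_orig'::regclass;
```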
Who should live well on PostgreSQL 11
After upgrading to PG11, we decided to continue "tuning" TOAST and noticed that starting with this version the toast_tuple_target parameter became available for tuning:
The TOAST code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2KB). The TOAST code will compress and/or move field values out of the table until the row value is shorter than TOAST_TUPLE_TARGET bytes (adjustable, also 2KB by default) or no more gains can be had.
We decided that our data is usually either "very short" or "very long" right away, so we limited ourselves to the minimum possible value:
ALTER TABLE rawplan_orig SET (toast_tuple_target = 128);
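The per-table setting can be verified via reloptions; a quick sketch:

```sql
-- reloptions should contain {toast_tuple_target=128}
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'rawplan_orig';
```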
Let's see how the new settings affected disk load after reconfiguration:
Not bad! The average disk queue shrank by about 1.5x, and disk "busyness" dropped by 20 percent! But maybe this somehow affected the CPU?
At least it didn't get any worse. Although it is hard to judge when even such volumes still cannot push the average CPU load above 5%.
Changing the order of the summands does change the sum!
As you know, a penny saved is a penny earned, and with storage volumes on the order of 10TB/month even a small optimization can yield a good profit. So we paid attention to the physical structure of our data - how exactly the fields are "laid out" inside a record of each table.
Because of data alignment, this directly affects the resulting row size:
Many architectures align data on machine-word boundaries. For example, on a 32-bit x86 system, integers (type integer, 4 bytes) are aligned on a 4-byte word boundary, as are double-precision floating-point numbers (type double precision, 8 bytes). On a 64-bit system, double values are aligned on 8-byte word boundaries. This is another reason for incompatibility.
Due to alignment, the size of a table row depends on the order of its fields. Usually this effect is not very noticeable, but in some cases it can lead to a significant increase in size. For example, if you mix fields of types char(1) and integer, 3 bytes will usually be wasted between them.
Let's start with synthetic models:
SELECT pg_column_size(ROW(
'0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
, '2019-01-01'::date
));
-- 48 bytes
SELECT pg_column_size(ROW(
'2019-01-01'::date
, '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
));
-- 46 bytes
Where did the couple of extra bytes in the first case come from? Simple: the 2-byte smallint is aligned to a 4-byte boundary before the next field, while when it is the last field there is nothing to align after it, so no padding is needed.
In theory everything is fine and you can rearrange the fields as you like. Let's check on real data, using as an example one of our tables whose daily partition occupies 10-15GB.
Source structure:
CREATE TABLE public.plan_20190220
(
-- Inherited from table plan: pack uuid NOT NULL,
-- Inherited from table plan: recno smallint NOT NULL,
-- Inherited from table plan: host uuid,
-- Inherited from table plan: ts timestamp with time zone,
-- Inherited from table plan: exectime numeric(32,3),
-- Inherited from table plan: duration numeric(32,3),
-- Inherited from table plan: bufint bigint,
-- Inherited from table plan: bufmem bigint,
-- Inherited from table plan: bufdsk bigint,
-- Inherited from table plan: apn uuid,
-- Inherited from table plan: ptr uuid,
-- Inherited from table plan: dt date,
CONSTRAINT plan_20190220_pkey PRIMARY KEY (pack, recno),
CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
CONSTRAINT plan_20190220_dt_check CHECK (dt = '2019-02-20'::date)
)
INHERITS (public.plan)
Partition after reordering the columns - exactly the same fields, only the order is different:
CREATE TABLE public.plan_20190221
(
-- Inherited from table plan: dt date NOT NULL,
-- Inherited from table plan: ts timestamp with time zone,
-- Inherited from table plan: pack uuid NOT NULL,
-- Inherited from table plan: recno smallint NOT NULL,
-- Inherited from table plan: host uuid,
-- Inherited from table plan: apn uuid,
-- Inherited from table plan: ptr uuid,
-- Inherited from table plan: bufint bigint,
-- Inherited from table plan: bufmem bigint,
-- Inherited from table plan: bufdsk bigint,
-- Inherited from table plan: exectime numeric(32,3),
-- Inherited from table plan: duration numeric(32,3),
CONSTRAINT plan_20190221_pkey PRIMARY KEY (pack, recno),
CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
CONSTRAINT plan_20190221_dt_check CHECK (dt = '2019-02-21'::date)
)
INHERITS (public.plan)
The total size of a partition is determined by the number of "facts" and depends only on external processes, so let's divide the heap size (pg_relation_size) by the number of records in it - that is, get the average size of an actually stored record:
Minus 6% in volume - great!
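The average stored-record size can be computed roughly like this (a sketch; reltuples is an estimate maintained by ANALYZE/autovacuum, and the partition name follows the example above):

```sql
-- Average bytes per heap row in one partition
SELECT pg_relation_size(c.oid) / greatest(c.reltuples, 1)::bigint AS avg_row_bytes
FROM pg_class c
WHERE c.oid = 'public.plan_20190221'::regclass;
```

For an exact figure you can use count(*) on the partition instead of reltuples, at the cost of a full scan.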
But everything, of course, is not so rosy: in indexes we cannot change the order of fields, so "overall" (pg_total_relation_size) ...
... we still saved about 1.5% here without changing a single line of application code. Yes indeed!
Note that the field arrangement above is not necessarily the most optimal one. After all, you may not want to "break up" certain blocks of fields for aesthetic reasons - for example, the (pack, recno) pair, which is the PK of this table.
In general, finding the "minimal" arrangement of fields is a fairly simple "brute force" task. So you may get even better results on your data than we did - try it!
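A heuristic that usually comes close to the brute-force optimum is to order columns by descending alignment requirement and size. The catalogs already contain what a first draft needs; a sketch (typalign codes: 'd' = 8-byte, 'i' = 4-byte, 's' = 2-byte, 'c' = 1-byte; variable-length types have typlen = -1 and usually go last):

```sql
-- Suggest a column order for table "plan" by alignment, then by size
SELECT a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'public.plan'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY CASE t.typalign
           WHEN 'd' THEN 8
           WHEN 'i' THEN 4
           WHEN 's' THEN 2
           ELSE 1
         END DESC,
         t.typlen DESC;
```

This is only a starting point: fixed-length types with 1-byte alignment (such as uuid) can be slotted in anywhere without padding, so checking candidate orders with pg_column_size(ROW(...)), as in the synthetic models above, is still worthwhile.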
Source: habr.com