Saving a pretty penny on large volumes in PostgreSQL

Continuing the topic of recording large data streams raised in the previous article about partitioning, in this one we will look at ways to reduce the "physical" size of what is stored in PostgreSQL, and at their impact on server performance.

We will talk about TOAST settings and data alignment. "On average" these methods will not save all that many resources, but they require no changes to the application code at all.

However, our experience turned out to be very productive in this respect, since the storage of almost any monitoring system is, by its nature, mostly append-only in terms of recorded data. And if you are wondering how to teach the database to write half as much to disk instead of 200 MB/s, read on.

Little secrets of big data

Due to the way our service works, it regularly receives text packages from logs.

And since the VLSI complex, whose database we monitor, is a multi-component product with complex data structures, queries tuned for maximum performance turn out quite "multi-volume", with complex algorithmic logic. So the size of each individual query text or of the resulting execution plan in the log that reaches us is "on average" quite large.

Let's look at the structure of one of the tables into which we write "raw" data, that is, the original text from a log entry:

CREATE TABLE rawdata_orig(
  pack -- PK
    uuid NOT NULL
, recno -- PK
    smallint NOT NULL
, dt -- partition key
    date
, data -- the most important part
    text
, PRIMARY KEY(pack, recno)
);

A fairly typical table (already partitioned, of course, so this is a partition template), where the most important thing is the text. Which is sometimes quite large.

Recall that a "physical" record in PostgreSQL cannot occupy more than one data page, but its "logical" size is a different matter. To write a bulky value (varchar/text/bytea) into a field, TOAST technology is used:

PostgreSQL uses a fixed page size (usually 8 KB) and does not allow tuples to span multiple pages. Therefore, it is impossible to store very large field values directly. To overcome this limitation, large field values are compressed and/or split across multiple physical rows. This happens transparently to the user and has little impact on most of the server code. This technique is known as TOAST...

In fact, for every table with "potentially large" fields, a paired table is automatically created that "slices" each "large" record into 2 KB segments:

TOAST(
  chunk_id
    integer
, chunk_seq
    integer
, chunk_data
    bytea
, PRIMARY KEY(chunk_id, chunk_seq)
);

That is, if we have to write a row with a "large" value in data, the actual write will go not only to the main table and its PK, but also to the TOAST table and its PK.
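By the way, you can check which TOAST table is paired with yours and how large it is; a small sketch using the table from the example above:

-- find the TOAST relation paired with the table and its current size
SELECT c.reltoastrelid::regclass                         AS toast_table
     , pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.oid = 'rawdata_orig'::regclass;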

Reducing the impact of TOAST

But most of our records are still not that big and should fit into 8 KB. How can we save on this?

This is where the STORAGE attribute of a table column comes to the rescue:

  • EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression is attempted first, then out-of-line storage if the row is still too large.
  • MAIN allows compression, but not out-of-line storage. (In fact, out-of-line storage will still be performed for such columns, but only as a last resort, when there is no other way to shrink the row so that it fits on the page.)

In fact, this is exactly what we need for the text: compress it as much as possible, and move it out to TOAST only if it does not fit at all. This can be done right on the fly, with a single command:

ALTER TABLE rawdata_orig ALTER COLUMN data SET STORAGE MAIN;
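Whether the setting has taken effect can be checked in pg_attribute; a quick sketch (attstorage is 'p' for PLAIN, 'm' for MAIN, 'e' for EXTERNAL, 'x' for EXTENDED):

-- per-column storage strategy after the ALTER
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'rawdata_orig'::regclass
  AND attnum > 0
  AND NOT attisdropped;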

How to evaluate the effect

Since the data flow changes from day to day, we cannot compare absolute numbers, only relative ones: the smaller the share we wrote to TOAST, the better. But there is a danger here: the larger the "physical" volume of each individual record, the "wider" the index becomes, because it has to cover more data pages.

Section before changes:

heap  = 37GB (39%)
TOAST = 54GB (57%)
PK    =  4GB ( 4%)

Section after changes:

heap  = 37GB (67%)
TOAST = 16GB (29%)
PK    =  2GB ( 4%)
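Such a breakdown can be obtained roughly as follows; a sketch using the table name from the example (the TOAST table's own index is not counted separately here):

-- rough heap / TOAST / index breakdown for a single partition
SELECT pg_size_pretty(pg_relation_size(c.oid))           AS heap
     , pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast
     , pg_size_pretty(pg_indexes_size(c.oid))            AS pk
FROM pg_class c
WHERE c.oid = 'rawdata_orig'::regclass;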

In fact, we started writing to TOAST 2 times less often, which relieved not only the disk, but also the CPU:

Note that we also started "reading" the disk less, not only "writing" it: when inserting a record into a table, we also have to read part of the tree of each of its indexes in order to determine the record's future position in them.

Who should live well on PostgreSQL 11

After upgrading to PG11, we decided to continue "tuning" TOAST and noticed that, starting with this version, the toast_tuple_target parameter became available for tuning:

The TOAST code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 KB). The TOAST code will compress and/or move field values out of the table until the row value is shorter than TOAST_TUPLE_TARGET bytes (adjustable, normally also 2 KB) or the size can no longer be reduced.

Since our data is usually either "very short" or "very long" right away, we decided to limit ourselves to the minimum possible value:

ALTER TABLE rawplan_orig SET (toast_tuple_target = 128);
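The current per-table storage parameters, including toast_tuple_target, can be checked via reloptions; a small sketch:

-- per-table storage parameters set via ALTER TABLE ... SET (...)
SELECT reloptions
FROM pg_class
WHERE oid = 'rawplan_orig'::regclass;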

Let's see how the new settings affected the disk load after the reconfiguration:

Not bad! The average disk queue shrank by about 1.5 times, and disk "busy" time by about 20 percent! But maybe this somehow affected the CPU?

At least it did not get any worse. Although it is hard to judge when even such volumes still cannot push the average CPU load above 5%.

Rearranging the terms... does change the sum!

As they say, a penny saved is a penny earned, and with our storage volumes of around 10 TB/month even a small optimization can yield a good profit. So we turned our attention to the physical structure of our data: how exactly the fields are "laid out" inside the record of each of the tables.

Because of data alignment, this directly affects the resulting volume:

Many architectures align data on machine word boundaries. For example, on a 32-bit x86 system, integers (type integer, 4 bytes) will be aligned on a 4-byte word boundary, as will double-precision floating-point numbers (type double precision, 8 bytes). On a 64-bit system, double values will be aligned on an 8-byte word boundary. This is another reason for incompatibility.

Due to alignment, the size of a table row depends on the order of the fields. Usually this effect is not very noticeable, but in some cases it can lead to a significant increase in size. For example, if you mix fields of char(1) and integer types, 3 bytes will usually be wasted between them.

Let's start with synthetic examples:

SELECT pg_column_size(ROW(
  '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
, '2019-01-01'::date
));
-- 48 bytes

SELECT pg_column_size(ROW(
  '2019-01-01'::date
, '0000-0000-0000-0000-0000-0000-0000-0000'::uuid
, 0::smallint
));
-- 46 bytes

Where did the couple of extra bytes in the first case come from? It's simple: the 2-byte smallint is aligned to a 4-byte boundary before the next field, whereas when it is the last field, there is nothing to align it for, and no need to.
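The alignment requirement and physical size of each type can be looked up in pg_type; a quick sketch (typalign: 'c' means no alignment, 's' 2 bytes, 'i' 4 bytes, 'd' 8 bytes; typlen = -1 marks variable-length types):

-- alignment and size of the types used in the example above
SELECT typname, typlen, typalign
FROM pg_type
WHERE typname IN ('uuid', 'int2', 'date');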

In theory everything is clear, and you can rearrange the fields as you like. Let's check it on real data, using as an example one of the tables whose daily partition occupies 10-15 GB.

Source structure:

CREATE TABLE public.plan_20190220
(
-- Inherited from table plan:  pack uuid NOT NULL,
-- Inherited from table plan:  recno smallint NOT NULL,
-- Inherited from table plan:  host uuid,
-- Inherited from table plan:  ts timestamp with time zone,
-- Inherited from table plan:  exectime numeric(32,3),
-- Inherited from table plan:  duration numeric(32,3),
-- Inherited from table plan:  bufint bigint,
-- Inherited from table plan:  bufmem bigint,
-- Inherited from table plan:  bufdsk bigint,
-- Inherited from table plan:  apn uuid,
-- Inherited from table plan:  ptr uuid,
-- Inherited from table plan:  dt date,
  CONSTRAINT plan_20190220_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190220_dt_check CHECK (dt = '2019-02-20'::date)
)
INHERITS (public.plan)

The partition after reordering the columns: exactly the same fields, only the order is different:

CREATE TABLE public.plan_20190221
(
-- Inherited from table plan:  dt date NOT NULL,
-- Inherited from table plan:  ts timestamp with time zone,
-- Inherited from table plan:  pack uuid NOT NULL,
-- Inherited from table plan:  recno smallint NOT NULL,
-- Inherited from table plan:  host uuid,
-- Inherited from table plan:  apn uuid,
-- Inherited from table plan:  ptr uuid,
-- Inherited from table plan:  bufint bigint,
-- Inherited from table plan:  bufmem bigint,
-- Inherited from table plan:  bufdsk bigint,
-- Inherited from table plan:  exectime numeric(32,3),
-- Inherited from table plan:  duration numeric(32,3),
  CONSTRAINT plan_20190221_pkey PRIMARY KEY (pack, recno),
  CONSTRAINT chck_ptr CHECK (ptr IS NOT NULL),
  CONSTRAINT plan_20190221_dt_check CHECK (dt = '2019-02-21'::date)
)
INHERITS (public.plan)

The total size of a partition is determined by the number of "facts" and depends only on external processes, so let's divide the heap size (pg_relation_size) by the number of records in it, that is, get the average size of an actually stored record:
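Roughly like this; a sketch using the partition name from the example:

-- average "physical" size of a stored record: heap size / row count
SELECT pg_relation_size('public.plan_20190220'::regclass)::numeric
       / NULLIF(count(*), 0) AS avg_row_bytes
FROM public.plan_20190220;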

Minus 6% of the volume. Great!

But, of course, not everything is so rosy: in indexes we cannot change the order of the fields, and therefore "overall" (pg_total_relation_size)...

...we still saved 1.5% here, without changing a single line of code. Yes indeed!


Note that the field arrangement shown above is not necessarily the most optimal one. After all, you may not want to "break up" certain blocks of fields for aesthetic reasons, for example the (pack, recno) pair, which is the PK of this table.

In general, finding the "minimal" arrangement of fields is a fairly simple "brute force" task. So you may get even better results on your data than we did - try it!
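As a starting point (a common heuristic, not necessarily the layout we ended up with), you can let the catalog suggest an order by listing the columns of an existing table by descending alignment requirement and size; variable-length types (typlen = -1) then naturally end up last within their alignment group:

-- a rough "column tetris" heuristic for the parent table from the example
SELECT a.attname, t.typname, t.typlen, t.typalign
FROM pg_attribute a
JOIN pg_type t ON t.oid = a.atttypid
WHERE a.attrelid = 'public.plan'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY CASE t.typalign WHEN 'd' THEN 8 WHEN 'i' THEN 4 WHEN 's' THEN 2 ELSE 1 END DESC
       , t.typlen DESC;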

Source: habr.com
