When VACUUM gives up, we clean the table manually

In PostgreSQL, VACUUM can remove from a table only what no one can see any more - that is, row versions for which there is not a single active query that started before they were changed.

But what if such an unpleasant workload exists (a long-running OLAP query against an OLTP database)? How do you clean up an actively changing table surrounded by long-running queries without stepping on a rake?
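
By the way, it is easy to spot such a "long player" while it is running: pg_stat_activity shows every session's transaction start time. A minimal diagnostic sketch, assuming the privileges to see other sessions' queries:

SELECT
  pid
, now() - xact_start duration
, state
, query
FROM
  pg_stat_activity
WHERE
  xact_start IS NOT NULL
ORDER BY
  xact_start
LIMIT 5;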

Laying out the rake

First, let's define what the problem we want to solve actually is, and how it can arise.

This situation usually arises with a relatively small table that nevertheless sees a lot of changes. Typically these are either counters/aggregates/ratings that are UPDATEd frequently, or a buffer queue for processing some constantly running stream of events, whose records are INSERTed and DELETEd all the time.

Let's try to reproduce the version with ratings:

CREATE TABLE tbl(k text PRIMARY KEY, v integer);
CREATE INDEX ON tbl(v DESC); -- we will build the rating from this index

INSERT INTO
  tbl
SELECT
  chr(ascii('a'::text) + i) k
, 0 v
FROM
  generate_series(0, 25) i;

And in parallel, in another connection, we start a long, long query that collects some complex statistics but does not touch our table:

SELECT pg_sleep(10000);

Now we update the value of one of the counters many, many times. For the purity of the experiment, we do this in separate transactions via dblink, just as it will happen in reality:

DO $$
DECLARE
  i integer;
  tsb timestamp;
  tse timestamp;
  d double precision;
BEGIN
  -- requires the dblink extension: CREATE EXTENSION dblink;
  PERFORM dblink_connect('dbname=' || current_database() || ' port=' || current_setting('port'));
  FOR i IN 1..10000 LOOP
    tsb = clock_timestamp();
    -- each UPDATE runs as its own short transaction over the dblink connection
    PERFORM dblink_exec($e$UPDATE tbl SET v = v + 1 WHERE k = 'a';$e$);
    tse = clock_timestamp();
    IF i % 1000 = 0 THEN
      -- report the duration of every 1000th UPDATE, in milliseconds
      d = (extract('epoch' from tse) - extract('epoch' from tsb)) * 1000;
      RAISE NOTICE 'i = %, exectime = %', lpad(i::text, 5), lpad(d::text, 5);
    END IF;
  END LOOP;
  PERFORM dblink_disconnect();
END;
$$ LANGUAGE plpgsql;

NOTICE:  i =  1000, exectime = 0.524
NOTICE:  i =  2000, exectime = 0.739
NOTICE:  i =  3000, exectime = 1.188
NOTICE:  i =  4000, exectime = 2.508
NOTICE:  i =  5000, exectime = 1.791
NOTICE:  i =  6000, exectime = 2.658
NOTICE:  i =  7000, exectime = 2.318
NOTICE:  i =  8000, exectime = 2.572
NOTICE:  i =  9000, exectime = 2.929
NOTICE:  i = 10000, exectime = 3.808

What happened? Why did the execution time of even the simplest single-row UPDATE degrade 7-fold, from 0.524 ms to 3.808 ms? And our rating is being built more and more slowly, too.
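
The "dead" row versions piling up can be watched through the standard statistics views; a sketch of such a check (the counters are approximate and lag slightly behind reality):

SELECT
  n_live_tup
, n_dead_tup
, last_autovacuum
FROM
  pg_stat_user_tables
WHERE
  relname = 'tbl';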

It's all MVCC's fault

It's all about the MVCC mechanism, which forces the query to look through all the previous versions of the row. So let's clean our table of "dead" versions:

VACUUM VERBOSE tbl;

INFO:  vacuuming "public.tbl"
INFO:  "tbl": found 0 removable, 10026 nonremovable row versions in 45 out of 45 pages
DETAIL:  10000 dead row versions cannot be removed yet, oldest xmin: 597439602

Oh, but there is nothing to clean! The query running in parallel is in our way: after all, it might someday want to access those versions (what if?), so they must remain available to it. That is why even VACUUM FULL won't help us.
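
Who exactly is holding the horizon back? Since 9.4, pg_stat_activity exposes backend_xmin, so a simple diagnostic query, for example, is:

SELECT
  pid
, backend_xmin
, age(backend_xmin) xmin_age
, state
, query
FROM
  pg_stat_activity
WHERE
  backend_xmin IS NOT NULL
ORDER BY
  age(backend_xmin) DESC;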

"Collapsing" the table

But we know for sure that our table is of no use to that query. So let's try to bring system performance back within reasonable bounds by throwing everything superfluous out of the table, if only "manually", since VACUUM gives up.

To make things clearer, let's take the case of a buffer table as our example. That is, there is a large INSERT/DELETE flow, and sometimes the table ends up completely empty. But when it is not empty, we must preserve its current contents.

#0: Assess the situation

Obviously, we could try to do something with the table after every single operation, but that makes little sense: the maintenance overhead would clearly outweigh the throughput of the queries we actually care about.

Let's formulate the criteria - "it's time to act" if:

  • VACUUM has not run for a long time
    We expect heavy load, so let it be 60 seconds since the last [auto]VACUUM.
  • the physical size of the table is larger than the target
    Let's define the target as twice the number of pages (8KB blocks) relative to the minimum size - 1 block for the heap + 1 block per index - of a potentially empty table. If we expect that a certain amount of data will "normally" always remain in the buffer, it makes sense to tune this formula.

The verification query

SELECT
  relpages
, ((
    SELECT
      count(*)
    FROM
      pg_index
    WHERE
      indrelid = cl.oid
  ) + 1) << 13 size_norm -- strictly speaking, this should be * current_setting('block_size')::bigint, but who ever changes the block size?..
, pg_total_relation_size(oid) size
, coalesce(extract('epoch' from (now() - greatest(
    pg_stat_get_last_vacuum_time(oid)
  , pg_stat_get_last_autovacuum_time(oid)
  ))), 1 << 30) vaclag
FROM
  pg_class cl
WHERE
  oid = $1::regclass -- tbl
LIMIT 1;

relpages | size_norm | size    | vaclag
-------------------------------------------
       0 |     24576 | 1105920 | 3392.484835

#1: VACUUM after all

We cannot know in advance how badly a parallel query is getting in our way, that is, exactly how many records have gone "stale" since it started. So once we do decide to process the table, we should in any case first run a plain VACUUM on it: unlike VACUUM FULL, it does not prevent parallel processes from reading and writing the data.

At the same time, it can immediately clean out most of what we would like to remove. Besides, subsequent queries against this table will hit a hot cache, which shortens them, and with them the total time other sessions are blocked by our maintenance transaction.
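
In our case this is just the plain, non-FULL form, exactly the same call the final script below makes:

VACUUM tbl;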

#2: Is anyone home?

Let's check if there is anything at all in the table:

TABLE tbl LIMIT 1;

If not a single record is left, we can save a lot of processing effort by simply doing a TRUNCATE:

It acts just like an unconditional DELETE of every row in the table, but is much faster, since it does not actually scan the table. Moreover, it frees disk space immediately, so there is no need to perform a VACUUM afterwards.

Whether you need to reset the table's sequence counter as well (RESTART IDENTITY) is up to you.
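
For our example table the whole step is then a single statement (RESTART IDENTITY is shown here only as the optional sequence reset just discussed):

TRUNCATE TABLE tbl RESTART IDENTITY;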

#3: One at a time!

Since we are working in a highly concurrent environment, someone may well have written something into the table while we were checking that it was empty. We must not lose that information, so what do we do? Right: we make sure that, for the moment, nobody can possibly write to it.

To do this, we enable SERIALIZABLE isolation for our transaction (yes, this is where we start the transaction) and lock the table "tightly":

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
LOCK TABLE tbl IN ACCESS EXCLUSIVE MODE;

This lock level is dictated by the operations we intend to perform on the table.

#4: Conflict of Interest

So here we come, wanting to "lock down" the table - but what if someone was active on it at that very moment, reading from it, say? We would "hang" waiting for that lock to be released, and others who want to read would pile up behind us...

To prevent this, we "sacrifice ourselves": if we fail to acquire the lock within a certain (acceptably small) time, we get an exception from the database, but at least we don't get in everyone else's way too much.

To do this, set the session variable lock_timeout (on 9.3+) and/or statement_timeout. The main thing to remember is that a statement_timeout value only takes effect starting from the next statement. That is, glued together like this, it won't work:

SET statement_timeout = ...;LOCK TABLE ...;

To avoid having to restore the "old" value of the variable afterwards, we use the SET LOCAL form, which limits the scope of the setting to the current transaction.
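
So a working variant looks like this (the 1s limits are simply the values used in the final script below, not a recommendation):

BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET LOCAL statement_timeout = '1s';
SET LOCAL lock_timeout = '1s';
LOCK TABLE tbl IN ACCESS EXCLUSIVE MODE;
-- ... the actual cleanup goes here ...
COMMIT;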

Also remember that statement_timeout applies to all subsequent statements, so the transaction cannot stretch out to unacceptable lengths if the table does turn out to contain a lot of data.

#5: Copy Data

If the table turns out not to be completely empty, its data will have to be re-saved through an auxiliary temporary table:

CREATE TEMPORARY TABLE _tmp_swap ON COMMIT DROP AS TABLE tbl;

The ON COMMIT DROP clause means the temporary table will cease to exist when the transaction ends, so there is no need to delete it manually within the connection.

Since we assume that there is not a lot of “live” data, this operation should be fairly fast.
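
Inside the same transaction, the swap is then completed by emptying the original and putting the saved rows back, exactly as in the final script below:

TRUNCATE TABLE tbl; -- no RESTART IDENTITY, we keep the sequence
INSERT INTO tbl TABLE _tmp_swap;
COMMIT;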

Well, that's all! After the transaction completes, don't forget to run ANALYZE to bring the table statistics back to normal if necessary.

Putting together the final script

We use this "pseudo-python":

# collect the table's statistics
stat <-
  SELECT
    relpages
  , ((
      SELECT
        count(*)
      FROM
        pg_index
      WHERE
        indrelid = cl.oid
    ) + 1) << 13 size_norm
  , pg_total_relation_size(oid) size
  , coalesce(extract('epoch' from (now() - greatest(
      pg_stat_get_last_vacuum_time(oid)
    , pg_stat_get_last_autovacuum_time(oid)
    ))), 1 << 30) vaclag
  FROM
    pg_class cl
  WHERE
    oid = $1::regclass -- table_name
  LIMIT 1;

# the table is larger than the target size, and VACUUM ran long ago
if stat.size > 2 * stat.size_norm and (stat.vaclag is None or stat.vaclag > 60):
  -> VACUUM %table;
  try:
    -> BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    # try to take an exclusive lock, waiting at most 1s
    -> SET LOCAL statement_timeout = '1s'; SET LOCAL lock_timeout = '1s';
    -> LOCK TABLE %table IN ACCESS EXCLUSIVE MODE;
    # the emptiness check must happen inside the transaction, under the lock
    row <- TABLE %table LIMIT 1;
    # if there is not a single "live" record in the table, truncate it completely; otherwise "re-insert" all records through a temporary table
    if row is None:
      -> TRUNCATE TABLE %table RESTART IDENTITY;
    else:
      # create a temporary table holding the original table's data
      -> CREATE TEMPORARY TABLE _tmp_swap ON COMMIT DROP AS TABLE %table;
      # truncate the original without resetting the sequence
      -> TRUNCATE TABLE %table;
      # insert all the data saved in the temporary table back
      -> INSERT INTO %table TABLE _tmp_swap;
    -> COMMIT;
  except Exception as e:
    # if we got an error but the connection is still "alive", we hit the timeout
    if not isinstance(e, InterfaceError):
      -> ROLLBACK;

Is it possible to avoid copying the data a second time? In principle, yes - provided nothing is tied to the table's oid itself, neither business logic on the application side nor foreign keys on the database side:

CREATE TABLE _swap_%table(LIKE %table INCLUDING ALL);
INSERT INTO _swap_%table TABLE %table;
DROP TABLE %table;
ALTER TABLE _swap_%table RENAME TO %table;

Let's run the script on the source table and check the metrics:

VACUUM tbl;
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  SET LOCAL statement_timeout = '1s'; SET LOCAL lock_timeout = '1s';
  LOCK TABLE tbl IN ACCESS EXCLUSIVE MODE;
  CREATE TEMPORARY TABLE _tmp_swap ON COMMIT DROP AS TABLE tbl;
  TRUNCATE TABLE tbl;
  INSERT INTO tbl TABLE _tmp_swap;
COMMIT;

relpages | size_norm | size   | vaclag
-------------------------------------------
       0 |     24576 |  49152 | 32.705771

Everything worked out! The table shrank more than 20-fold (from 1105920 to 49152 bytes), and all the UPDATEs run fast again.

Source: habr.com
