PostgreSQL Antipatterns: fighting hordes of "dead"

The peculiarities of PostgreSQL's internal mechanisms allow it to be very fast in some situations and "not very" in others. Today we will focus on a classic example of a conflict between how a DBMS works and what a developer does with it - UPDATE vs MVCC principles.

Brief story from great article:

When a row is modified with an UPDATE command, two operations are actually performed: DELETE and INSERT. IN current string version xmax is set to the number of the transaction that performed the UPDATE. Then created new version the same line; its xmin value is the same as the xmax value of the previous version.

Some time after the completion of this transaction, the old or new version, depending on COMMIT/ROOLBACK, will be recognized "dead" (dead tuples) when passing VACUUM according to the table and cleared.

PostgreSQL Antipatterns: fighting hordes of "dead"

But this will not happen immediately, but problems with the "dead" can be acquired very quickly - with repeated or bulk update records in a large table, and a little later to face the situation that VACUUM can't help.

#1: I Like To Move It

Let's say your method on the business logic works for itself, and suddenly realizes that it would be necessary to update the X field in some record:

UPDATE tbl SET X = <newX> WHERE pk = $1;

Then, in the course of execution, it finds out that the Y field should be updated too:

UPDATE tbl SET Y = <newY> WHERE pk = $1;

... and then also Z - why bother with trifles?

UPDATE tbl SET Z = <newZ> WHERE pk = $1;

How many versions of this entry do we have in the database now? Yep, 4 pieces! Of these, one is relevant, and 3 will have to clean up after you [auto]VACUUM.

Do not do it this way! Use updating all fields in one request - almost always, the logic of the method can be changed like this:

UPDATE tbl SET X = <newX>, Y = <newY>, Z = <newZ> WHERE pk = $1;

#2: Use IS DISTINCT FROM Luke!

So, you still want update many, many records in a table (during the application of a script or converter, for example). And something like this flies into the script:

UPDATE tbl SET X = <newX> WHERE pk BETWEEN $1 AND $2;

Approximately in this form, a request is encountered quite often and almost always not to fill in an empty new field, but to correct some errors in the data. At the same time, she the correctness of already existing data is not taken into account at all - but in vain! That is, the record is being rewritten, even if exactly what was wanted was there - but why? Let's fix it:

UPDATE tbl SET X = <newX> WHERE pk BETWEEN $1 AND $2 AND X IS DISTINCT FROM <newX>;

Many are not aware of the existence of such a wonderful operator, so here is a cheat sheet on IS DISTINCT FROM and other logical operators to help:
PostgreSQL Antipatterns: fighting hordes of "dead"
... and a little about operations on complex ROW()-expressions:
PostgreSQL Antipatterns: fighting hordes of "dead"

#3: And I recognize the cute one by… blocking

are being launched two identical parallel processes, each of which tries to mark on the record that it is "in progress":

UPDATE tbl SET processing = TRUE WHERE pk = $1;

Even if these processes do things that are independent of each other, but within the same ID, the second client will “lock” on this request until the first transaction ends.

Decision No.1: the task is reduced to the previous one

Just add again IS DISTINCT FROM:

UPDATE tbl SET processing = TRUE WHERE pk = $1 AND processing IS DISTINCT FROM TRUE;

In this form, the second request simply will not change anything in the database, there is already “everything is as it should” - therefore, blocking will not occur. Further, the fact of “not finding” the record is already processed in the applied algorithm.

Decision No.2: advisory locks

A big topic for a separate article in which you can read about ways to use and "rake" advisory locks.

Decision No.3: no[d]smart calls

But exactly, exactly, you should have concurrent work with the same record? Or did you mess up with the algorithms for calling business logic from the client side, for example? And if you think about it...

Source: habr.com

Add a comment