PostgreSQL Antipatterns: "There must be only one!"

In SQL, you describe "what" you want to get, not "how" it should be done. Therefore, the problem of developing SQL queries in the style of "as it is heard is how it is written" takes its place of honor, along with peculiarities of condition evaluation in SQL.

Today, using extremely simple examples, let's see what this can lead to in the context of using GROUP/DISTINCT и LIMIT with them.

That's if you wrote in the request “First connect these tablets, and then throw out all the duplicates, there should be only one instance for each key" - this is exactly how it will work, even if the connection was not needed at all.

And sometimes you are lucky and it “just works”, sometimes it has an unpleasant effect on performance, and sometimes it gives effects that are absolutely unexpected from the developer’s point of view.

PostgreSQL Antipatterns: "There must be only one!"
Well, maybe not as spectacular, but…

"Sweet couple": JOIN + DISTINCT

SELECT DISTINCT
  X.*
FROM
  X
JOIN
  Y
    ON Y.fk = X.pk
WHERE
  Y.bool_condition;

How would it be clear what they wanted select such records X, for which in Y there are associated with the fulfilled condition. Submitted a request via JOIN - received some values ​​of pk several times (exactly how many suitable records turned out to be in Y). How to remove? Certainly DISTINCT!

It is especially “pleasant” when for each X-record there are several hundred related Y-records, and then duplicates are heroically removed ...

PostgreSQL Antipatterns: "There must be only one!"

How to fix? To begin with, realize that the task can be modified to "select those records X for which there is AT LEAST ONE in Y associated with the condition being fulfilled" - after all, we don’t need anything from the Y-record itself.

Nested EXISTS

SELECT
  *
FROM
  X
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      Y
    WHERE
      fk = X.pk AND
      bool_condition
    LIMIT 1
  );

Some versions of PostgreSQL understand that in EXISTS it is enough to find the first record that comes across, older ones do not. Therefore, I prefer to always indicate LIMIT 1 inside EXISTS.

LATERAL JOIN

SELECT
  X.*
FROM
  X
, LATERAL (
    SELECT
      Y.*
    FROM
      Y
    WHERE
      fk = X.pk AND
      bool_condition
    LIMIT 1
  ) Y
WHERE
  Y IS DISTINCT FROM NULL;

The same option allows, if necessary, to immediately return some data from the found associated Y-record at the same time. A similar option is discussed in the article "PostgreSQL Antipatterns: rare record will reach the middle of JOIN".

"Why pay more": DISTINCT [ON] + LIMIT 1

An additional advantage of such query transformations is the ability to easily limit the enumeration of records if only one / few of them are needed, as in the following case:

SELECT DISTINCT ON(X.pk)
  *
FROM
  X
JOIN
  Y
    ON Y.fk = X.pk
LIMIT 1;

Now we read the request and try to understand what the DBMS is supposed to do:

  • we connect the plates
  • unique by X.pk
  • choose one of the remaining records

So what did you get? "Some one record" from the unique ones - and if you take this one of the non-unique ones, will the result somehow change? .. “And if there is no difference, why pay more?”

SELECT
  *
FROM
  (
    SELECT
      *
    FROM
      X
    -- сюда можно подсунуть подходящих условий
    LIMIT 1 -- +1 Limit
  ) X
JOIN
  Y
    ON Y.fk = X.pk
LIMIT 1;

And exactly the same theme with GROUP BY + LIMIT 1.

"I just have to ask": implicit GROUP + LIMIT

Similar things occur in different nonemptiness checks labels or CTEs as the request progresses:

...
CASE
  WHEN (
    SELECT
      count(*)
    FROM
      X
    LIMIT 1
  ) = 0 THEN ...

Aggregate functions (count/min/max/sum/...) are successfully executed on the entire set, even without explicitly specifying GROUP BY. Only here with LIMIT they are not very friendly.

The developer can think “Now, if there are records there, then I need no more than LIMIT”. But you don't have to! Because for the base it is:

  • count what they want on all records
  • give as many lines as they ask

Depending on the target conditions, it is appropriate to make one of the following substitutions:

  • (count + LIMIT 1) = 0 on NOT EXISTS(LIMIT 1)
  • (count + LIMIT 1) > 0 on EXISTS(LIMIT 1)
  • count >= N on (SELECT count(*) FROM (... LIMIT N))

"How much to hang in grams": DISTINCT + LIMIT

SELECT DISTINCT
  pk
FROM
  X
LIMIT $1

A naive developer may sincerely believe that the execution of a request will stop, as soon as we find the first $1 different values ​​that come across.

Sometime in the future, this may and will work thanks to a new node Index Skip Scan, the implementation of which is currently being worked out, but not yet.

So far first all records will be retrieved, are unique, and only as many of them as requested will be returned. It is especially sad if we wanted something like $1 = 4, and there are hundreds of thousands of records in the table ...

In order not to be sad in vain, we will use a recursive query "DISTINCT for the Poor" from PostgreSQL Wiki:

PostgreSQL Antipatterns: "There must be only one!"

Source: habr.com

Add a comment