PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

In complex ERP systems many entities are hierarchical in naturewhen homogeneous objects line up in parent-child relationship tree - this is the organizational structure of the enterprise (all these branches, departments and working groups), and the catalog of goods, and areas of work, and the geography of points of sale, ...

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

In fact, there is none business automation, where at least some hierarchy would not have appeared as a result. But even if you don't work for a business, you can still easily run into hierarchical relationships. Trite, even your family tree or floor plan in a shopping center is the same structure.

There are many ways to store such a tree in a DBMS, but today we will focus on only one option:

CREATE TABLE hier(
  id
    integer
      PRIMARY KEY
, pid
    integer
      REFERENCES hier
, data
    json
);

CREATE INDEX ON hier(pid); -- не забываем, что FK не подразумевает автосоздание индекса, в отличие от PK

And while you peer into the depths of the hierarchy, it patiently waits to see how [in]effective your “naive” ways of working with such a structure will turn out to be.

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
Let's take a look at the typical tasks that arise, their implementation in SQL, and try to improve their performance.

#1. How deep is the rabbit hole?

Let's, for definiteness, accept that this structure will reflect the subordination of departments in the structure of the organization: departments, divisions, sectors, branches, working groups, ... - whatever you call them.
PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

First we generate our 'tree' of 10K elements

INSERT INTO hier
WITH RECURSIVE T AS (
  SELECT
    1::integer id
  , '{1}'::integer[] pids
UNION ALL
  SELECT
    id + 1
  , pids[1:(random() * array_length(pids, 1))::integer] || (id + 1)
  FROM
    T
  WHERE
    id < 10000
)
SELECT
  pids[array_length(pids, 1)] id
, pids[array_length(pids, 1) - 1] pid
FROM
  T;

Let's start with the simplest task - to find all employees who work within a specific sector, or in terms of hierarchy - find all descendants of a node. And also it would be nice to get the “depth” of the child ... All this may be necessary, for example, to build some complex selection according to the list of IDs of these employees.

Everything would be fine if there are only a couple of levels of these descendants and quantitatively within a dozen, but if there are more than 5 levels, and there are already dozens of descendants, there may be problems. Let's take a look at how traditional "down the tree" search options are written (and work). But first, let's determine which of the nodes will be the most interesting for our research.

Most "deep" subtrees:

WITH RECURSIVE T AS (
  SELECT
    id
  , pid
  , ARRAY[id] path
  FROM
    hier
  WHERE
    pid IS NULL
UNION ALL
  SELECT
    hier.id
  , hier.pid
  , T.path || hier.id
  FROM
    T
  JOIN
    hier
      ON hier.pid = T.id
)
TABLE T ORDER BY array_length(path, 1) DESC;

 id  | pid  | path
---------------------------------------------
7624 | 7623 | {7615,7620,7621,7622,7623,7624}
4995 | 4994 | {4983,4985,4988,4993,4994,4995}
4991 | 4990 | {4983,4985,4988,4989,4990,4991}
...

Most "wide" subtrees:

...
SELECT
  path[1] id
, count(*)
FROM
  T
GROUP BY
  1
ORDER BY
  2 DESC;

id   | count
------------
5300 |   30
 450 |   28
1239 |   27
1573 |   25

For these queries, we used a typical recursive JOIN:
PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

Obviously, with such a query model the number of iterations will be the same as the total number of children (and there are several dozen of them), and it can take quite significant resources, and, as a result, time.

Let's check on the "widest" subtree:

WITH RECURSIVE T AS (
  SELECT
    id
  FROM
    hier
  WHERE
    id = 5300
UNION ALL
  SELECT
    hier.id
  FROM
    T
  JOIN
    hier
      ON hier.pid = T.id
)
TABLE T;

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
[look at explain.tensor.ru]

As expected, we found all 30 entries. But we spent 60% of the time on this - because we did 30 index searches at the same time. Is it possible to do less?

Bulk subtraction by index

Do we need to make a separate query to the index for each node? It turns out not - we can read from the index for several keys at once in one call through = ANY(array).

And in each such group of identifiers, we can take all the IDs found in the previous step by “nodes”. That is, at each next step we will search for all descendants of a certain level at once.

Only, here's the problem in a recursive selection, you cannot refer to itself in a subquery, but we need to somehow select only what was found at the previous level ... It turns out that it is impossible to make a nested query to the entire selection, but to its specific field - it is possible. And this field can be an array - which is what we need to use ANY.

It sounds a little wild, but on the diagram - everything is simple.

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

WITH RECURSIVE T AS (
  SELECT
    ARRAY[id] id$
  FROM
    hier
  WHERE
    id = 5300
UNION ALL
  SELECT
    ARRAY(
      SELECT
        id
      FROM
        hier
      WHERE
        pid = ANY(T.id$)
    ) id$
  FROM
    T
  WHERE
    coalesce(id$, '{}') <> '{}' -- условие выхода из цикла - пустой массив
)
SELECT
  unnest(id$) id
FROM
  T;

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
[look at explain.tensor.ru]

And here the most important thing is not even win 1.5 times over time, and that we subtracted fewer buffers, since we have only 5 accesses to the index instead of 30!

An additional bonus is the fact that after the final unnest, the identifiers will remain ordered by "levels".

Node sign

The following consideration which will help improve performance − "leaves" cannot have children, that is, they do not need to look for “down” at all. In the formulation of our task, this means that if we walked along the chain of departments and reached the employee, then there is no need to search further along this branch.

Let's put in our table additional boolean-field, which will immediately tell us whether this particular entry in our tree is a “node” - that is, whether it can have descendants at all.

ALTER TABLE hier
  ADD COLUMN branch boolean;

UPDATE
  hier T
SET
  branch = TRUE
WHERE
  EXISTS(
    SELECT
      NULL
    FROM
      hier
    WHERE
      pid = T.id
    LIMIT 1
);
-- Запрос успешно выполнен: 3033 строк изменено за 42 мс.

Great! It turns out that we have only a little more than 30% of all elements of the tree have descendants.

Now let's apply a slightly different mechanics - connections to the recursive part through LATERAL, which will allow us to immediately access the fields of the recursive "table", and use the aggregate function with the filtering condition based on the node to reduce the set of keys:

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy

WITH RECURSIVE T AS (
  SELECT
    array_agg(id) id$
  , array_agg(id) FILTER(WHERE branch) ns$
  FROM
    hier
  WHERE
    id = 5300
UNION ALL
  SELECT
    X.*
  FROM
    T
  JOIN LATERAL (
    SELECT
      array_agg(id) id$
    , array_agg(id) FILTER(WHERE branch) ns$
    FROM
      hier
    WHERE
      pid = ANY(T.ns$)
  ) X
    ON coalesce(T.ns$, '{}') <> '{}'
)
SELECT
  unnest(id$) id
FROM
  T;

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
[look at explain.tensor.ru]

We were able to shorten one more call to the index and won more than 2 times in terms of volume readable.

#2. Back to the roots

This algorithm will be useful if you need to collect records for all elements "up the tree", while retaining information about which source sheet (and with what indicators) caused it to fall into the selection - for example, to generate a summary report with aggregation into nodes.

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
What follows should be taken solely as a proof-of-concept, since the request turns out to be very cumbersome. But if it dominates your database, you should think about using such techniques.

Let's start with a couple of simple statements:

  • The same record from the database best read only once.
  • Records from the database it is more efficient to read in bulkthan singly.

Now let's try to construct the query we need.

Step 1

Obviously, at the initialization of the recursion (where without it!) we will have to subtract the records of the leaves themselves according to the set of initial identifiers:

WITH RECURSIVE tree AS (
  SELECT
    rec -- это цельная запись таблицы
  , id::text chld -- это "набор" приведших сюда исходных листьев
  FROM
    hier rec
  WHERE
    id = ANY('{1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192}'::integer[])
UNION ALL
  ...

If it seemed strange to someone that the "set" is stored as a string and not an array, then there is a simple explanation for this. For strings, there is a built-in aggregating "gluing" function string_agg, but not for arrays. Although her easy to implement yourself.

Step 2

Now we would like to get a set of section IDs that will need to be subtracted further. Almost always they will be duplicated in different records of the original set - so we would group them, while retaining information about the source leaves.

But here we have three troubles:

  1. The "sub-recursive" part of the query cannot contain aggregate functions with GROUP BY.
  2. A call to a recursive "table" cannot be in a nested subquery.
  3. The query in the recursive part cannot contain a CTE.

Fortunately, all these problems are fairly easy to bypass. Let's start from the end.

CTE in recursive part

Like this not works:

WITH RECURSIVE tree AS (
  ...
UNION ALL
  WITH T (...)
  SELECT ...
)

And so - it works, brackets decide!

WITH RECURSIVE tree AS (
  ...
UNION ALL
  (
    WITH T (...)
    SELECT ...
  )
)

Nested query to recursive "table"

Hmm… A recursive CTE call cannot be in a subquery. But it can be inside a CTE! And a nested query can already refer to this CTE!

GROUP BY inside recursion

Unpleasant, but ... We also have an easy way to emulate GROUP BY using DISTINCT ON and window functions!

SELECT
  (rec).pid id
, string_agg(chld::text, ',') chld
FROM
  tree
WHERE
  (rec).pid IS NOT NULL
GROUP BY 1 -- не работает!

And this is how it works!

SELECT DISTINCT ON((rec).pid)
  (rec).pid id
, string_agg(chld::text, ',') OVER(PARTITION BY (rec).pid) chld
FROM
  tree
WHERE
  (rec).pid IS NOT NULL

Now we see why the numeric ID turned into text - so that they can be glued together separated by commas!

Step 3

For the final, we have nothing left:

  • subtract "section" records by a set of grouped IDs
  • match subtracted sections with "sets" of original sheets
  • "expand" string-set with unnest(string_to_array(chld, ',')::integer[])

WITH RECURSIVE tree AS (
  SELECT
    rec
  , id::text chld
  FROM
    hier rec
  WHERE
    id = ANY('{1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192}'::integer[])
UNION ALL
  (
    WITH prnt AS (
      SELECT DISTINCT ON((rec).pid)
        (rec).pid id
      , string_agg(chld::text, ',') OVER(PARTITION BY (rec).pid) chld
      FROM
        tree
      WHERE
        (rec).pid IS NOT NULL
    )
    , nodes AS (
      SELECT
        rec
      FROM
        hier rec
      WHERE
        id = ANY(ARRAY(
          SELECT
            id
          FROM
            prnt
        ))
    )
    SELECT
      nodes.rec
    , prnt.chld
    FROM
      prnt
    JOIN
      nodes
        ON (nodes.rec).id = prnt.id
  )
)
SELECT
  unnest(string_to_array(chld, ',')::integer[]) leaf
, (rec).*
FROM
  tree;

PostgreSQL Antipatterns: how deep is the rabbit hole? let's go through the hierarchy
[look at explain.tensor.ru]

Source: habr.com

Add a comment