In complex ERP systems many entities are hierarchical in naturewhen homogeneous objects line up in parent-child relationship tree - this is the organizational structure of the enterprise (all these branches, departments and working groups), and the catalog of goods, and areas of work, and the geography of points of sale, ...
In fact, there is none
There are many ways to store such a tree in a DBMS, but today we will focus on only one option:
CREATE TABLE hier(
id
integer
PRIMARY KEY
, pid
integer
REFERENCES hier
, data
json
);
CREATE INDEX ON hier(pid); -- не забываем, что FK не подразумевает автосоздание индекса, в отличие от PK
And while you peer into the depths of the hierarchy, it patiently waits to see how [in]effective your “naive” ways of working with such a structure will turn out to be.
Let's take a look at the typical tasks that arise, their implementation in SQL, and try to improve their performance.
#1. How deep is the rabbit hole?
Let's, for definiteness, accept that this structure will reflect the subordination of departments in the structure of the organization: departments, divisions, sectors, branches, working groups, ... - whatever you call them.
First we generate our 'tree' of 10K elements
INSERT INTO hier
WITH RECURSIVE T AS (
SELECT
1::integer id
, '{1}'::integer[] pids
UNION ALL
SELECT
id + 1
, pids[1:(random() * array_length(pids, 1))::integer] || (id + 1)
FROM
T
WHERE
id < 10000
)
SELECT
pids[array_length(pids, 1)] id
, pids[array_length(pids, 1) - 1] pid
FROM
T;
Let's start with the simplest task - to find all employees who work within a specific sector, or in terms of hierarchy - find all descendants of a node. And also it would be nice to get the “depth” of the child ... All this may be necessary, for example, to build some
Everything would be fine if there are only a couple of levels of these descendants and quantitatively within a dozen, but if there are more than 5 levels, and there are already dozens of descendants, there may be problems. Let's take a look at how traditional "down the tree" search options are written (and work). But first, let's determine which of the nodes will be the most interesting for our research.
Most "deep" subtrees:
WITH RECURSIVE T AS (
SELECT
id
, pid
, ARRAY[id] path
FROM
hier
WHERE
pid IS NULL
UNION ALL
SELECT
hier.id
, hier.pid
, T.path || hier.id
FROM
T
JOIN
hier
ON hier.pid = T.id
)
TABLE T ORDER BY array_length(path, 1) DESC;
id | pid | path
---------------------------------------------
7624 | 7623 | {7615,7620,7621,7622,7623,7624}
4995 | 4994 | {4983,4985,4988,4993,4994,4995}
4991 | 4990 | {4983,4985,4988,4989,4990,4991}
...
Most "wide" subtrees:
...
SELECT
path[1] id
, count(*)
FROM
T
GROUP BY
1
ORDER BY
2 DESC;
id | count
------------
5300 | 30
450 | 28
1239 | 27
1573 | 25
For these queries, we used a typical recursive JOIN:
Obviously, with such a query model the number of iterations will be the same as the total number of children (and there are several dozen of them), and it can take quite significant resources, and, as a result, time.
Let's check on the "widest" subtree:
WITH RECURSIVE T AS (
SELECT
id
FROM
hier
WHERE
id = 5300
UNION ALL
SELECT
hier.id
FROM
T
JOIN
hier
ON hier.pid = T.id
)
TABLE T;
As expected, we found all 30 entries. But we spent 60% of the time on this - because we did 30 index searches at the same time. Is it possible to do less?
Bulk subtraction by index
Do we need to make a separate query to the index for each node? It turns out not - we can read from the index for several keys at once in one call through = ANY(array)
.
And in each such group of identifiers, we can take all the IDs found in the previous step by “nodes”. That is, at each next step we will search for all descendants of a certain level at once.
Only, here's the problem in a recursive selection, you cannot refer to itself in a subquery, but we need to somehow select only what was found at the previous level ... It turns out that it is impossible to make a nested query to the entire selection, but to its specific field - it is possible. And this field can be an array - which is what we need to use ANY
.
It sounds a little wild, but on the diagram - everything is simple.
WITH RECURSIVE T AS (
SELECT
ARRAY[id] id$
FROM
hier
WHERE
id = 5300
UNION ALL
SELECT
ARRAY(
SELECT
id
FROM
hier
WHERE
pid = ANY(T.id$)
) id$
FROM
T
WHERE
coalesce(id$, '{}') <> '{}' -- условие выхода из цикла - пустой массив
)
SELECT
unnest(id$) id
FROM
T;
And here the most important thing is not even win 1.5 times over time, and that we subtracted fewer buffers, since we have only 5 accesses to the index instead of 30!
An additional bonus is the fact that after the final unnest, the identifiers will remain ordered by "levels".
Node sign
The following consideration which will help improve performance − "leaves" cannot have children, that is, they do not need to look for “down” at all. In the formulation of our task, this means that if we walked along the chain of departments and reached the employee, then there is no need to search further along this branch.
Let's put in our table additional boolean
-field, which will immediately tell us whether this particular entry in our tree is a “node” - that is, whether it can have descendants at all.
ALTER TABLE hier
ADD COLUMN branch boolean;
UPDATE
hier T
SET
branch = TRUE
WHERE
EXISTS(
SELECT
NULL
FROM
hier
WHERE
pid = T.id
LIMIT 1
);
-- Запрос успешно выполнен: 3033 строк изменено за 42 мс.
Great! It turns out that we have only a little more than 30% of all elements of the tree have descendants.
Now let's apply a slightly different mechanics - connections to the recursive part through LATERAL
, which will allow us to immediately access the fields of the recursive "table", and use the aggregate function with the filtering condition based on the node to reduce the set of keys:
WITH RECURSIVE T AS (
SELECT
array_agg(id) id$
, array_agg(id) FILTER(WHERE branch) ns$
FROM
hier
WHERE
id = 5300
UNION ALL
SELECT
X.*
FROM
T
JOIN LATERAL (
SELECT
array_agg(id) id$
, array_agg(id) FILTER(WHERE branch) ns$
FROM
hier
WHERE
pid = ANY(T.ns$)
) X
ON coalesce(T.ns$, '{}') <> '{}'
)
SELECT
unnest(id$) id
FROM
T;
We were able to shorten one more call to the index and won more than 2 times in terms of volume readable.
#2. Back to the roots
This algorithm will be useful if you need to collect records for all elements "up the tree", while retaining information about which source sheet (and with what indicators) caused it to fall into the selection - for example, to generate a summary report with aggregation into nodes.
What follows should be taken solely as a proof-of-concept, since the request turns out to be very cumbersome. But if it dominates your database, you should think about using such techniques.
Let's start with a couple of simple statements:
- The same record from the database best read only once.
- Records from the database it is more efficient to read in bulkthan singly.
Now let's try to construct the query we need.
Step 1
Obviously, at the initialization of the recursion (where without it!) we will have to subtract the records of the leaves themselves according to the set of initial identifiers:
WITH RECURSIVE tree AS (
SELECT
rec -- это цельная запись таблицы
, id::text chld -- это "набор" приведших сюда исходных листьев
FROM
hier rec
WHERE
id = ANY('{1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192}'::integer[])
UNION ALL
...
If it seemed strange to someone that the "set" is stored as a string and not an array, then there is a simple explanation for this. For strings, there is a built-in aggregating "gluing" function string_agg
, but not for arrays. Although her
Step 2
Now we would like to get a set of section IDs that will need to be subtracted further. Almost always they will be duplicated in different records of the original set - so we would group them, while retaining information about the source leaves.
But here we have three troubles:
- The "sub-recursive" part of the query cannot contain aggregate functions with
GROUP BY
. - A call to a recursive "table" cannot be in a nested subquery.
- The query in the recursive part cannot contain a CTE.
Fortunately, all these problems are fairly easy to bypass. Let's start from the end.
CTE in recursive part
Like this not works:
WITH RECURSIVE tree AS (
...
UNION ALL
WITH T (...)
SELECT ...
)
And so - it works, brackets decide!
WITH RECURSIVE tree AS (
...
UNION ALL
(
WITH T (...)
SELECT ...
)
)
Nested query to recursive "table"
Hmm… A recursive CTE call cannot be in a subquery. But it can be inside a CTE! And a nested query can already refer to this CTE!
GROUP BY inside recursion
Unpleasant, but ... We also have an easy way to emulate GROUP BY using DISTINCT ON
and window functions!
SELECT
(rec).pid id
, string_agg(chld::text, ',') chld
FROM
tree
WHERE
(rec).pid IS NOT NULL
GROUP BY 1 -- не работает!
And this is how it works!
SELECT DISTINCT ON((rec).pid)
(rec).pid id
, string_agg(chld::text, ',') OVER(PARTITION BY (rec).pid) chld
FROM
tree
WHERE
(rec).pid IS NOT NULL
Now we see why the numeric ID turned into text - so that they can be glued together separated by commas!
Step 3
For the final, we have nothing left:
- subtract "section" records by a set of grouped IDs
- match subtracted sections with "sets" of original sheets
- "expand" string-set with
unnest(string_to_array(chld, ',')::integer[])
WITH RECURSIVE tree AS (
SELECT
rec
, id::text chld
FROM
hier rec
WHERE
id = ANY('{1,2,4,8,16,32,64,128,256,512,1024,2048,4096,8192}'::integer[])
UNION ALL
(
WITH prnt AS (
SELECT DISTINCT ON((rec).pid)
(rec).pid id
, string_agg(chld::text, ',') OVER(PARTITION BY (rec).pid) chld
FROM
tree
WHERE
(rec).pid IS NOT NULL
)
, nodes AS (
SELECT
rec
FROM
hier rec
WHERE
id = ANY(ARRAY(
SELECT
id
FROM
prnt
))
)
SELECT
nodes.rec
, prnt.chld
FROM
prnt
JOIN
nodes
ON (nodes.rec).id = prnt.id
)
)
SELECT
unnest(string_to_array(chld, ',')::integer[]) leaf
, (rec).*
FROM
tree;
Source: habr.com