PostgreSQL Antipatterns: A Tale of Iterative Refinement of Search by Name, or "Optimizing Back and Forth"

Thousands of managers from sales offices across the country record tens of thousands of contacts in our CRM system every day - facts of communication with potential or existing clients. To do that, the client must first be found, and preferably very quickly. And most often the search is by name.

Therefore, it is not surprising that, while once again analyzing "heavy" queries on one of our most loaded databases - our own VLIS corporate account - I found "in the top" the query for "quick" search by name among business cards.

Moreover, further investigation revealed an interesting example of the query being first optimized and then degraded in performance through successive refinement by several teams, each of which acted solely with the best of intentions.

0: what did the user want


What does the user usually mean when he talks about a "quick" search by name? It almost never turns out to be an "honest" substring search like ... LIKE '%роза%' - after all, then the result would include not only 'Розалия' and 'Магазин Роза' but also 'Гроза' and even 'Дом Деда Мороза'.

What the user means, in everyday terms, is that you will give him a search by the beginning of a word in the name and show first the most relevant matches that start with what he has typed. And do it almost instantly - as he types.

1: narrow down the task

Even more so, a person will not deliberately type 'роз магаз' so that you have to do a prefix search on every word. No, it is much easier for the user to react to a quick hint for the last word than to deliberately "under-type" the previous ones - look at how any search engine handles this.

In general, correctly formulating the requirements is more than half the solution. Sometimes a careful analysis of the use case can significantly affect the outcome.

What does an abstract developer do?

1.0: external search engine

Oh, search is hard, we don't want to deal with it at all - let's hand it over to the devops! Let them deploy a search engine external to the database for us: Sphinx, ElasticSearch, ...

A workable option, albeit labor-intensive in terms of synchronization and the speed of propagating changes. But not in our case, since each client searches only within the data of his own account. And the data changes quite often: if a manager has just entered the card 'Магазин Роза', he may remember 5-10 seconds later that he forgot to specify the email there and want to find the card and fix it.

Therefore, let's search "directly in the database". Fortunately, PostgreSQL lets us do this in more than one way - so let's go through the options.

1.1: "honest" substring

We latch onto the word "substring". And for indexed search by substring (and even by regular expressions!) there is an excellent module, pg_trgm! The only thing left afterwards is to sort correctly.

For simplicity of the model, let's take a table like this:

CREATE TABLE firms(
  id
    serial
      PRIMARY KEY
, name
    text
);

We load 7.8 million records of real organizations into it and index them:

CREATE EXTENSION pg_trgm;
CREATE INDEX ON firms USING gin(lower(name) gin_trgm_ops);

Let's fetch the first 10 records for the substring search:

SELECT
  *
FROM
  firms
WHERE
  lower(name) ~ ('(^|\s)' || 'роза')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC -- those "starting with" first
, lower(name) -- the rest alphabetically
LIMIT 10;

[look at explain.tensor.ru]

Well, hmm... 26ms, 31MB of data read and more than 1.7K records filtered out - all for 10 results. The overhead is too high; can we do something more efficient?

1.2: text search? it's FTS!

Indeed, PostgreSQL provides a very powerful full-text search engine (Full Text Search), including prefix search. A great option - you don't even need to install any extensions! Let's try:

CREATE INDEX ON firms USING gin(to_tsvector('simple'::regconfig, lower(name)));

SELECT
  *
FROM
  firms
WHERE
  to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*')
ORDER BY
  lower(name) ~ ('^' || 'роза') DESC
, lower(name)
LIMIT 10;

[look at explain.tensor.ru]

Parallel query execution helped us a little here, cutting the time in half to 11ms. And we had to read 1.5 times less data - 20MB in total. And here less is better, because the more data we have to read, the higher the chance of a cache miss, and every extra page of data read from disk is a potential "brake" for the query.

1.3: still LIKE?

The previous query is fine in every respect, except that if you run it a hundred thousand times a day, it adds up to 2TB of data read. At best from memory, but if you are unlucky, from disk. So let's try to make it smaller.

Remember that the user first wants to see names "that start with...". This is prefix search in its purest form, via text_pattern_ops! And only if we "don't have enough" to reach the required 10 records will we have to fetch the rest with the FTS search:

CREATE INDEX ON firms(lower(name) text_pattern_ops);

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
LIMIT 10;

[look at explain.tensor.ru]

Excellent performance - only 0.05ms and just over 100KB read! But we forgot to sort by name, so the user does not get lost in the results:

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name)
LIMIT 10;

[look at explain.tensor.ru]

Oh, that's not so pretty anymore - there is an index, it seems, but the sort flies right past it... Of course, this is already many times more efficient than the previous version, but...

1.4: "finish with a file"

But there is an index that supports range search and works naturally with sorting - a regular btree!

CREATE INDEX ON firms(lower(name));

Only the query for it will have to be "assembled by hand":

SELECT
  *
FROM
  firms
WHERE
  lower(name) >= 'роза' AND
  lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings - chr(255)
ORDER BY
   lower(name)
LIMIT 10;

[look at explain.tensor.ru]

Excellent - the sorting works, and resource consumption remains "microscopic", thousands of times more efficient than "pure" FTS! All that remains is to put it together in a single query:

(
  SELECT
    *
  FROM
    firms
  WHERE
    lower(name) >= 'роза' AND
    lower(name) <= ('роза' || chr(65535)) -- for UTF8; for single-byte encodings - chr(255)
  ORDER BY
     lower(name)
  LIMIT 10
)
UNION ALL
(
  SELECT
    *
  FROM
    firms
  WHERE
    to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*') AND
    lower(name) NOT LIKE ('роза' || '%') -- we have already found those "starting with" above
  ORDER BY
    lower(name) ~ ('^' || 'роза') DESC -- use the same sort order so as NOT to go through the btree index
  , lower(name)
  LIMIT 10
)
LIMIT 10;

Note that the second subquery is executed only if the first one returned fewer rows than the final LIMIT asks for. I have already written about this way of optimizing queries before.
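
If you want to see this lazy behavior for yourself, a tiny standalone illustration is enough (a sketch, not from the article, with a throwaway generate_series aggregate standing in for the "expensive" second branch): in the EXPLAIN ANALYZE output, the branch that was never pulled is marked "(never executed)".

-- a minimal sketch: LIMIT 1 is satisfied by the first branch,
-- so the second branch shows up as "(never executed)"
EXPLAIN (ANALYZE)
(SELECT 1 AS v)
UNION ALL
(SELECT count(*)::integer FROM generate_series(1, 1000000))
LIMIT 1;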

So yes, we now have both a btree and a gin index on the table, but statistically it turned out that less than 10% of requests ever reach execution of the second block. That is, with such typical limitations of the task known in advance, we were able to reduce the total consumption of server resources by almost a thousand times!

1.5*: do without a file

In the LIKE variant above, we were held back by the wrong sort order. But it can be "set on the right path" by specifying a USING operator:

The default is ASC. Alternatively, a specific sort operator name can be specified in the USING clause. The sort operator must be a less-than or greater-than member of some B-tree operator family. ASC is usually equivalent to USING < and DESC is usually equivalent to USING >.

In our case, "less" is ~<~:

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%')
ORDER BY
  lower(name) USING ~<~
LIMIT 10;

[look at explain.tensor.ru]

2: how queries turn sour

Now we leave our query to "stew" for six months or a year, and are then surprised to find it "in the top" again, with a total daily memory "churn" (buffers shared hit) of 5.5TB - that is, even more than originally.

No, of course our business has grown and the workload has increased, but not by that much! So something fishy is going on here - let's figure it out.
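
This "top" comes from our own monitoring; if you do not have anything similar, one standard way to spot the queries with the largest total buffer traffic is the pg_stat_statements extension - a minimal sketch, assuming it is installed:

SELECT
  queryid
, calls
, shared_blks_hit
, shared_blks_read
, left(query, 60) query_head
FROM
  pg_stat_statements
ORDER BY
  shared_blks_hit + shared_blks_read DESC -- total buffer traffic, hot + cold
LIMIT 10;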

2.1: the birth of paging

At some point, another development team wanted to make it possible to "jump" from the quick as-you-type search into a registry view with the same, but expanded, results. And what is a registry without pagination? Let's bolt it on!

( ... LIMIT <N> + 10)
UNION ALL
( ... LIMIT <N> + 10)
LIMIT 10 OFFSET <N>;

Now the developer could show the registry of search results with "page-by-page" loading without much effort.

Of course, in reality more and more is read for each next page of data (everything from the previous pass, which we throw away, plus the desired "tail") - an unambiguous antipattern. It would be more correct to start the next iteration of the search from a key stored in the interface, but more about that another time.
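
For completeness, a minimal sketch of that key-based ("keyset") approach for our prefix query - not from the article, and assuming the interface passes back the last name it has already shown as a parameter, written here as :last_name:

SELECT
  *
FROM
  firms
WHERE
  lower(name) LIKE ('роза' || '%') AND
  lower(name) ~>~ :last_name -- continue strictly after the stored key; ~>~ matches the ~<~ sort order
ORDER BY
  lower(name) USING ~<~
LIMIT 10;

If names can repeat, a unique tiebreaker (for example, id) has to be added to both the sort and the continuation condition.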

2.2: wanting something exotic

At some point, the developer wanted to enrich the resulting sample with data from another table, for which the entire previous query was wrapped in a CTE:

WITH q AS (
  ...
  LIMIT <N> + 10
)
SELECT
  *
, (SELECT ...) sub_query -- some query against a related table
FROM
  q
LIMIT 10 OFFSET <N>;

And even that is not bad, since the subquery is evaluated only for the 10 returned records - if only it weren't for...

2.3: DISTINCT, senseless and merciless

Somewhere in the course of this evolution, the NOT LIKE condition got lost from the 2nd subquery. Clearly, after that UNION ALL started returning some entries twice - first when found at the beginning of the string, and then again when found at the beginning of the first word of that string. In the limiting case, all records of the 2nd subquery could duplicate records of the first.

What does a developer do instead of looking for the cause? No problem!

  • double the size of the initial samples
  • slap a DISTINCT on top to get only a single instance of each row

WITH q AS (
  ( ... LIMIT <2 * N> + 10)
  UNION ALL
  ( ... LIMIT <2 * N> + 10)
  LIMIT <2 * N> + 10
)
SELECT DISTINCT
  *
, (SELECT ...) sub_query
FROM
  q
LIMIT 10 OFFSET <N>;

That is, the end result is, of course, exactly the same, but the chance of "falling through" into the 2nd subquery of the CTE has become much higher, and even without that we clearly read more.

But that is not the saddest part. Since the developer asked for DISTINCT not on specific fields but on all fields of the record at once, the sub_query field - the result of the subquery - was automatically included as well. Now, to execute DISTINCT, the database had to evaluate not 10 subqueries, but all <2 * N> + 10!
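
The cure follows directly from the diagnosis: restore the lost NOT LIKE, and if deduplication is still wanted, do it by the key alone, before the subquery column is attached. A sketch of that shape (not the original code; N = 1000 is taken as an example offset, and "contacts" is a hypothetical related table standing in for the real sub_query):

WITH u AS (
  (
    SELECT * FROM firms
    WHERE lower(name) >= 'роза'
      AND lower(name) <= ('роза' || chr(65535))
    ORDER BY lower(name)
    LIMIT 1010 -- <N> + 10
  )
  UNION ALL
  (
    SELECT * FROM firms
    WHERE to_tsvector('simple'::regconfig, lower(name)) @@ to_tsquery('simple', 'роза:*')
      AND lower(name) NOT LIKE ('роза' || '%') -- the lost condition, restored
    ORDER BY lower(name) ~ ('^' || 'роза') DESC, lower(name)
    LIMIT 1010
  )
  LIMIT 1010
)
, page AS (
  SELECT DISTINCT ON (id) * -- deduplicate by the key only, no subquery columns involved
  FROM u
  LIMIT 10 OFFSET 1000 -- OFFSET <N>
)
SELECT
  *
, (SELECT count(*) FROM contacts c WHERE c.firm_id = page.id) sub_query -- hypothetical related table
FROM
  page;

OFFSET pagination and its ordering problems are, of course, still there - that part is cured by a key-based approach like the one sketched in 2.1, not by DISTINCT.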

2.4: cooperation above all!

So the developers lived without a care, because the user clearly did not have the patience to "page" the registry out to any significant values of N, given the chronic slowdown of loading each next "page".

Until developers from another department came along and wanted to use this convenient method for an iterative search: take a chunk of the sample, filter it by additional conditions, draw the result, then take the next chunk (which in our case is achieved by increasing N), and so on until the screen is filled.

All in all, in the specimen we caught, N had reached almost 17K, and in just one day at least 4K such queries had been executed "along the chain". The last of them were already scanning about 1GB of memory per iteration...

Total


Source: habr.com
