Search results output and performance issues

One of the typical scenarios in almost any application is searching for data by certain criteria and displaying the results in an easy-to-read form, often with additional options for sorting, grouping, and pagination. In theory the task is trivial, but in practice many developers make mistakes here that later hurt performance. Let's consider various ways of solving this problem and formulate recommendations for choosing the most efficient implementation.


Paging option #1

The simplest option that comes to mind is the pagination of search results in its most classic form.

[Screenshot: search results with classic page-number navigation]
Let's say your application uses a relational database. In this case, to display information in this form, you will need to execute two SQL queries:

  • Get the rows for the current page.
  • Count the total number of rows matching the search criteria - this is needed to render the page links.

Let's consider the first query using the AdventureWorks test database for Microsoft SQL Server 2016. We will use the Sales.SalesOrderHeader table for this purpose:

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

The above query will return the first 50 orders from a list sorted by date added in descending order, in other words, the last 50 orders.

It runs quickly on the test database, but let's look at the execution plan and the I/O statistics:

[Execution plan: the Sort operator dominates the cost]

Table 'SalesOrderHeader'. Scan count 1, logical reads 698, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

You can get I/O statistics for any query by issuing the SET STATISTICS IO ON command in the same session before running it.
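
For example:

SET STATISTICS IO ON;

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY;

The statistics are printed alongside the query results (in SQL Server Management Studio - on the "Messages" tab).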

As you can see from the execution plan, sorting all rows of the source table by the date added is the most resource-intensive part. And the problem is that the more rows there are in the table, the "heavier" this sort becomes. In practice, such situations should be avoided, so let's add an index on the date of addition and see whether the resource consumption changes.
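Such an index might look like this (a minimal sketch; the index name is arbitrary):

CREATE INDEX IX_SalesOrderHeader_OrderDate ON Sales.SalesOrderHeader(OrderDate);

With this index in place, the same query gives: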

[Execution plan using the index on OrderDate]

Table 'SalesOrderHeader'. Scan count 1, logical reads 165, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Obviously it's gotten a lot better. But have all the problems been solved? Let's change the query to search for orders where the total cost of goods exceeds $100:

SELECT * FROM Sales.SalesOrderHeader
WHERE SubTotal > 100
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

[Execution plan for the filtered query]

Table 'SalesOrderHeader'. Scan count 1, logical reads 1081, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

We end up with a curious situation: the query plan is not much worse than the previous one, yet the actual number of logical reads is almost twice as high as with a full table scan. There is a way out: if we turn the existing index into a composite one by adding the order subtotal as a second field, we again get 165 logical reads:

CREATE INDEX IX_SalesOrderHeader_OrderDate_SubTotal on Sales.SalesOrderHeader(OrderDate, SubTotal);

This series of examples could go on and on, but the two main points I want to make here are:

  • Adding any new search criterion or sort order can significantly affect the execution speed of the query.
  • But if we need to read only a portion of the data rather than all the results matching the search conditions, there are many ways to optimize such a query.

Now let's move on to the second query mentioned at the very beginning - the one that counts the number of records that satisfy the search criteria. Let's take the same example - searching for orders that are more expensive than $100:

SELECT COUNT(1) FROM Sales.SalesOrderHeader
WHERE SubTotal > 100

Given the composite index above, we get:

[Execution plan: full scan of the composite index]

Table 'SalesOrderHeader'. Scan count 1, logical reads 698, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The fact that the query scans the entire index is not surprising: since the SubTotal field is not in the first position of the index, the query cannot seek on it. The problem is solved by adding one more index on the SubTotal field; with it, the count takes only 48 logical reads.
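
Again, a sketch (the index name is arbitrary):

CREATE INDEX IX_SalesOrderHeader_SubTotal ON Sales.SalesOrderHeader(SubTotal);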

A few more examples of count queries could be given, but the essence remains the same: fetching a portion of the data and counting the total number of matches are two fundamentally different queries, and each requires its own optimization measures. In general, you will not find a combination of indexes that works equally well for both.

Accordingly, one of the important requirements to clarify when designing such a search solution is whether the business really needs to see the total number of objects found. It often turns out that it doesn't. And navigation by specific page numbers is, in my opinion, a solution with a very narrow scope, since most paging scenarios boil down to "go to the next page".

Paging option #2

Suppose users don't care about knowing the total number of objects found. Let's try to simplify the search page:

[Screenshot: search results with only Prev/Next navigation]
In fact, the only thing that has changed is that there is no way to jump to a specific page number, and the table no longer needs to know the total number of rows in order to render itself. But a question arises: how does the table know whether there is data for the next page (so it can correctly display the "Next" link)?

The answer is very simple: read one record more from the database than you need to display, and the presence of this "extra" record tells you whether there is a next portion. Thus, getting one page of data requires only one query, which significantly improves performance and makes the functionality easier to support. In my practice, there was a case where refusing to count the total number of records sped up the delivery of results by a factor of 4-5.
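
In T-SQL, this might look like the following sketch (@startIndex and @pageSize are illustrative parameters; OFFSET is zero-based):

DECLARE @startIndex int = 0, @pageSize int = 50;

-- Read @pageSize + 1 rows: if row 51 comes back, there is a next page.
-- The extra row itself is not displayed.
SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET @startIndex ROWS
FETCH NEXT @pageSize + 1 ROWS ONLY;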

There are several user interface options for this approach: "back" and "forward" commands as in the example above; a "load more" button that simply appends a new portion to the displayed results; "infinite scroll", which works like "load more" except that the signal to fetch the next portion is the user scrolling to the end of the displayed results. Whatever the visual solution, the data sampling principle remains the same.

The nuances of implementing paging

All the query examples above use the "offset + count" approach, where the query itself specifies the ordinal position of the first result row and the number of rows to return. First, let's look at how best to organize the passing of parameters in this case. In practice, I have come across several ways:

  • The sequence number of the requested page (pageIndex) and the page size (pageSize).
  • The sequence number of the first record to return (startIndex) and the maximum number of records in the result (count).
  • The sequence number of the first record to return (startIndex) and the sequence number of the last record to return (endIndex).

At first glance, this may seem so elementary that there is no difference. But that is not the case: the second variant (startIndex, count) is the most convenient and universal. There are several reasons for this:

  • For the "+1 record" approach described above, the first option (pageIndex and pageSize) is extremely inconvenient. For example, we want to display 50 records per page. Following the algorithm above, we need to read one record more than we display. If this "+1" is not built into the server, then for the first page we must request records 1 to 51, for the second - 51 to 101, and so on. If instead we specify a page size of 51 and simply increment pageIndex, the second page will return records 52 to 102, skipping record 51 entirely. Accordingly, with the first option, the only way to properly implement a "next page" button is to have the server read the "extra" row itself, which is a very non-obvious implementation detail (see the sketch after this list).
  • The third option makes no sense at all, since to run queries in most databases you will still need to pass the count rather than the index of the last record. Granted, subtracting startIndex from endIndex is an elementary arithmetic operation, but it is superfluous here.
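
To make the point concrete, here is a hypothetical server-side mapping of option 1 onto option 2 (variable names are illustrative):

-- Option 1 (pageIndex, pageSize) reduces to option 2 on the server anyway:
DECLARE @pageIndex int = 2, @pageSize int = 50;
DECLARE @startIndex int = (@pageIndex - 1) * @pageSize; -- zero-based offset: 50
DECLARE @count int = @pageSize + 1;                     -- the "+1" probe row fits naturally

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET @startIndex ROWS
FETCH NEXT @count ROWS ONLY;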

Now let's describe the shortcomings of implementing paging through "offset + count":

  • Getting each next page is more expensive and slower than the previous one, because the database still has to walk through all the preceding records matching the search and sort criteria before stopping at the desired fragment.
  • Not all DBMSs support this approach.

There are alternatives, but they are not ideal either. The first of them is called "keyset paging", or the "seek method", and works as follows: after receiving a portion, you remember the field values of the last record on the page and then use them to get the next portion. For example, we ran the following query:

SELECT * FROM Sales.SalesOrderHeader
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

And in the last record we got the order date value '2014-06-29'. Then, to get the next page, you can try to do this:

SELECT * FROM Sales.SalesOrderHeader
WHERE OrderDate < '2014-06-29'
ORDER BY OrderDate DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

The problem is that OrderDate is a non-unique field, so the condition above is very likely to miss some of the needed rows. To make this query unambiguous, you need to add a unique field to the condition (let's assume that 75074 is the last primary key value from the first portion):

SELECT * FROM Sales.SalesOrderHeader
WHERE (OrderDate = '2014-06-29' AND SalesOrderID < 75074)
   OR (OrderDate < '2014-06-29')
ORDER BY OrderDate DESC, SalesOrderID DESC
OFFSET 0 ROWS
FETCH NEXT 50 ROWS ONLY

This option will work correctly, but in general it is difficult to optimize, since the condition contains an OR operator. If the value of the primary key grows along with OrderDate, the condition can be simplified to a filter on SalesOrderID alone. But if there is no strict correlation between the primary key values and the field by which the result is sorted, this OR cannot be avoided in most DBMSs. The exception I know of is PostgreSQL, which fully supports tuple comparison, so the above condition can be written as "WHERE (OrderDate, SalesOrderID) < ('2014-06-29', 75074)". With a composite index on these two fields, such a query should be quite fast.
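
For illustration, the PostgreSQL variant might look like this (a sketch assuming the same data ported to PostgreSQL, where unquoted identifiers are folded to lower case):

CREATE INDEX ix_salesorderheader_orderdate_id
  ON sales.salesorderheader (orderdate, salesorderid);

-- The row-value comparison replaces the OR condition from the query above;
-- a backward scan of this index satisfies the descending sort.
SELECT * FROM sales.salesorderheader
WHERE (orderdate, salesorderid) < ('2014-06-29', 75074)
ORDER BY orderdate DESC, salesorderid DESC
LIMIT 50;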

A second alternative approach can be found, for example, in the Elasticsearch scroll API or in Cosmos DB: there, a request returns, in addition to the data, a special identifier that can be used to get the next portion. If this identifier has an unlimited lifetime (as in Cosmos DB), this is a great way to implement paging with sequential page transitions (option #2 mentioned above). Possible disadvantages: not all DBMSs support it, and the returned identifier of the next portion may have a limited lifetime, which is generally unsuitable for implementing user interaction (as with the Elasticsearch scroll API).

Complex filtering

Let's complicate the task further. Suppose there is a requirement to implement so-called faceted search, which everyone knows well from online stores. The examples above based on the orders table are not very illustrative here, so let's switch to the Production.Product table of the AdventureWorks database:

[Screenshot: product search page with faceted filters]
What is the idea of faceted search? For each filter value, it shows the number of records that match this value, taking into account the filters already selected in all the other categories.

For example, if we select the Bikes category and the Black color in this example, the table will display only black bikes, but:

  • For each criterion in the "Categories" group, the number of black products in that category will be shown.
  • For each criterion in the "Colors" group, the number of bikes of that color will be shown.

Here is an example output for such conditions:

[Screenshot: facet counts with Bikes and Black selected]
If, in addition, we check the "Clothing" category, the table will also show the black clothing that is available. The counts in the "Colors" section will also be recalculated according to the new conditions; only in the "Categories" section will nothing change. I hope these examples are enough to understand the usual faceted search algorithm.

Now let's imagine how this can be implemented with a relational database. Each group of criteria, such as Categories and Colors, requires a separate query:

SELECT pc.ProductCategoryID, pc.Name, COUNT(1) FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
  INNER JOIN Production.ProductCategory pc ON ps.ProductCategoryID = pc.ProductCategoryID
WHERE p.Color = 'Black'
GROUP BY pc.ProductCategoryID, pc.Name
ORDER BY COUNT(1) DESC

[Query results: product counts per category]

SELECT Color, COUNT(1) FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
WHERE ps.ProductCategoryID = 1 --Bikes
GROUP BY Color
ORDER BY COUNT(1) DESC

[Query results: bike counts per color]
What is wrong with this solution? Very simple: it does not scale. Each filter section requires a separate query to compute the counts, and these queries are not the cheapest. In online stores, some categories may have several dozen filter sections, which can become a serious performance problem.

Usually, after these statements, people offer me one of the following solutions:

  • Combine all the count queries into one. Technically this is possible using the UNION keyword (see the sketch after this list), but it will not help performance much - the database will still have to execute each of the fragments from scratch.
  • Cache the counts. This is suggested to me almost every time I describe the problem. The nuance is that it is generally impossible. Suppose we have 10 "facets", each with 5 values. This is a very "modest" situation compared to what can be seen in the same online stores. Selecting one facet value affects the counts in the 9 others; in other words, for each combination of criteria the counts may differ. In total, in our example there are 50 criteria the user can select, which gives 2^50 possible combinations. There is neither enough memory nor enough time to fill such an array of data. You might object that not all combinations are real and that a user rarely selects more than 5-10 criteria. Yes, it is possible to lazily compute and cache the counts only for the combinations that have actually been selected, but the more possible selections there are, the less efficient such a cache will be and the more noticeable the response time problems will become (especially if the dataset changes regularly).
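
For reference, here is a sketch of combining the two facet queries above with UNION ALL (the labels in the first column are illustrative):

SELECT 'category' AS FacetName, pc.Name AS FilterValue, COUNT(1) AS Cnt
FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
  INNER JOIN Production.ProductCategory pc ON ps.ProductCategoryID = pc.ProductCategoryID
WHERE p.Color = 'Black'
GROUP BY pc.Name
UNION ALL
SELECT 'color', p.Color, COUNT(1)
FROM Production.Product p
  INNER JOIN Production.ProductSubcategory ps ON p.ProductSubcategoryID = ps.ProductSubcategoryID
WHERE ps.ProductCategoryID = 1 --Bikes
GROUP BY p.Color;

This saves round trips to the server, but each aggregate is still computed from scratch.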

Fortunately, this problem has long had reasonably effective solutions that work predictably on large volumes of data. For either of the options below, it makes sense to split the facet recalculation and the retrieval of the results page into two parallel server calls, and to organize the user interface so that loading the facet data "does not interfere" with displaying the search results.

  • Run the full facet recalculation as rarely as possible. For example, do not recalculate everything on every change of the search criteria; instead, find the total number of results matching the current conditions and prompt the user to show them - "1425 records found, show them?". The user can either keep changing the search criteria or click the "show" button. Only in the second case will all the queries for fetching results and recalculating the counts on all the "facets" be executed. In this case, as is easy to see, you will have to deal with the query for the total number of results and its optimization. This method can be found in many small online stores. Obviously, it is not a panacea, but in simple cases it can be a good compromise.
  • Use a search engine for searching results and counting facets: Solr, Elasticsearch, Sphinx, and others. All of them are designed for building "facets" and do it quite efficiently thanks to the inverted index. How search engines work, why in such cases they are more effective than general-purpose databases, and which practices and pitfalls exist - that is a topic for a separate article. Here I want to point out that a search engine cannot replace the main data store; it is used as an addition: any changes in the main database that are relevant for search are synchronized into the search index, and search queries usually go only to the search engine without touching the main database. One of the most important questions here is how to organize this synchronization reliably. It all depends on the "reaction time" requirements. If the time between a change in the main database and its "appearance" in search results is not critical, you can build a service that finds recently changed records every few minutes and indexes them. If the minimum possible delay is required, you can implement something like a transactional outbox to push updates to the search service (a minimal sketch follows this list).
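
A minimal sketch of such an outbox table (the table and column names are hypothetical):

CREATE TABLE SearchOutbox (
    Id bigint IDENTITY PRIMARY KEY,
    EntityType nvarchar(50) NOT NULL,  -- e.g. 'Product'
    EntityId int NOT NULL,             -- primary key of the changed record
    ChangedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
);

Rows are inserted into it in the same transaction as the change to the main tables; a background worker then reads the outbox, pushes the changes to the search index, and deletes the processed rows.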

Conclusions

  1. Implementing server-side paging complicates a system considerably and only makes sense for fast-growing or simply large datasets. There is no absolutely precise recipe for judging "large" or "fast-growing", but I would take this approach:
    • If fetching the complete collection of data, including server processing time and network transfer, comfortably fits within the performance requirements, there is no point in implementing paging on the server side.
    • There may be a situation where no performance problems are expected in the near future because there is little data, but the collection keeps growing. If some dataset may stop satisfying the previous point in the future, it is better to introduce paging right away.
  2. If there is no strict requirement from the business to show the total number of results or to display page numbers, and your system has no search engine, it is better not to implement these features and to use option #2.
  3. If there is a clear requirement for faceted search, you have two options without sacrificing performance:
    • Do not recalculate all the facet counts on every change of the search criteria.
    • Use search engines such as Solr, Elasticsearch, Sphinx, and others. But it should be understood that a search engine cannot replace the main database and should be used as an addition to the main storage for solving search problems.
  4. Also, in the case of faceted search, it makes sense to split fetching the search results page and counting the facet numbers into two parallel requests. Counting the numbers may take longer than fetching the results, while the results are more important to the user.
  5. If you use a SQL database for search, every code change affecting this part should be performance-tested on an appropriate amount of data (exceeding the volume in the "live" database). It is also desirable to monitor query execution times on all database instances, especially the "live" one. Even if everything was fine with the query plans at the development stage, the situation can change noticeably as the data volume grows.

Source: habr.com
