Don't use OFFSET and LIMIT in paginated queries

Gone are the days when you didn't have to worry about database performance. Time does not stand still: every new tech entrepreneur wants to build the next Facebook and collect every bit of data they can get their hands on, and businesses need that data to train the models that earn them money. In such conditions, programmers have to build APIs that work quickly and reliably with huge amounts of information.


If you've been designing back-end applications or databases for some time, you've probably written code to perform paginated queries. For example, something like this:

SELECT * FROM table_name LIMIT 10 OFFSET 40

Sound familiar?

But if that's how you've been doing pagination, I'm sorry to say you haven't been doing it in the most efficient way.

Want to object? Don't bother: Slack, Shopify and Mixmax already use the techniques I want to talk about today.

Name at least one backend developer who has never used OFFSET and LIMIT to run paginated queries. In an MVP (Minimum Viable Product) and in projects that handle small amounts of data, this approach is perfectly acceptable. It, so to speak, "just works."

But if you need to build reliable and efficient systems from scratch, you should think in advance about the efficiency of the database queries those systems will run.

Today we'll talk about the problems that come with the widely used (unfortunately) implementations of paginated queries, and about how to achieve high performance when executing such queries.

What's wrong with OFFSET and LIMIT?

As mentioned, OFFSET and LIMIT do just fine in projects that don't need to work with large amounts of data.

The problem arises when the database grows so large that it no longer fits in the server's memory, yet you still need to run paginated queries against it.

For this problem to manifest itself, the DBMS has to resort to an inefficient Full Table Scan for every paginated query (meanwhile, inserts and deletes may be happening, and we don't want stale data!).

What is a "full table scan" (or "sequential table scan", Sequential Scan)? This is an operation during which the DBMS sequentially reads each row of the table, that is, the data contained in it, and checks them for compliance with the specified condition. This type of table scan is known to be the slowest. The fact is that when it is executed, many I / O operations are performed that involve the disk subsystem of the server. The situation is aggravated by the delays associated with working with data stored on disks, and the fact that transferring data from disk to memory is a resource-intensive operation.

For example, suppose you have 100,000,000 user records and you run a query with OFFSET 50000000. The DBMS has to load all of those records (which we don't even need!), place them in memory, and only after that take the, say, 20 results specified in LIMIT.

It might look like this: "select rows 50000 to 50020 out of 100000". That is, to execute the query, the system first has to load 50,000 rows. See how much unnecessary work it has to do?
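As a rough sketch, the query from the 100,000,000-user example might look like this (the users table and id column here are purely illustrative):

SELECT *
FROM users
ORDER BY id
LIMIT 20 OFFSET 50000000;
-- the DBMS has to walk past 50,000,000 rows before it can return the 20 we asked for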

If you don't believe me, take a look at the example I created on db-fiddle.com.

Example at db-fiddle.com

On the left, in the Schema SQL field, there is code that inserts 100,000 rows into the database; on the right, in the Query SQL field, there are two queries. The first, slow one, looks like this:

SELECT *
FROM `docs`
LIMIT 10 OFFSET 85000;

And the second one, an efficient solution to the same problem, looks like this:

SELECT *
FROM `docs`
WHERE id > 85000
LIMIT 10;

To run these queries, just click the Run button at the top of the page. Having done that, compare the reported execution times. The inefficient query takes at least 30 times longer than the second one (the exact times vary from run to run; for example, the system may report 37 ms for the first query and 1 ms for the second).

And with more data everything looks even worse (to convince yourself, take a look at my example with 10 million rows).

What we've just discussed should give you some insight into how database queries are actually handled.

Please note: the higher the OFFSET value, the longer the query takes to execute.

What should be used instead of a combination of OFFSET and LIMIT?

Instead of the OFFSET and LIMIT combination, it is worth using a construct built according to the following scheme:

SELECT * FROM table_name WHERE id > 10 LIMIT 20

This is cursor-based pagination (sometimes called keyset pagination).

Instead of storing the current OFFSET and LIMIT and passing them with each request, store the last received primary key (usually the ID) and the LIMIT; the resulting queries will look like the one above.

Why? Because by explicitly specifying the identifier of the last row read, you tell the DBMS where it needs to start looking for the data. And thanks to the key, the search is efficient: the system does not have to wade through rows outside the specified range.
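In practice, the pattern usually also includes an explicit ORDER BY on the key column, so that "the last received ID" is well defined. Here is a sketch of two consecutive page requests; the :last_seen_id placeholder stands for the value your application remembered from the previous page:

-- first page
SELECT * FROM table_name ORDER BY id LIMIT 20;

-- the application remembers the id of the last row it received,
-- then requests the next page starting right after it
SELECT * FROM table_name WHERE id > :last_seen_id ORDER BY id LIMIT 20;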

Let's take a look at the following performance comparison. Here is the inefficient query.

Slow query

And here is an optimized version of the same query.

Fast query

Both queries return exactly the same amount of data. But the first one takes 12.80 seconds to complete, while the second takes 0.01 seconds. Feel the difference?

Possible problems

For the proposed approach to work effectively, the table must have a column (or columns) with unique, sequential, indexed values, such as an integer identifier. In some cases, this is what determines whether such queries can actually speed up work with the database.
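If the column you naturally want to sort by is not unique on its own (for example, a creation timestamp), one common variant is to paginate on a combination of that column and the primary key, backed by a composite index. This is only a sketch with illustrative names; the tuple comparison syntax below is supported by PostgreSQL and recent MySQL versions:

CREATE INDEX idx_docs_created_at_id ON docs (created_at, id);

SELECT *
FROM docs
WHERE (created_at, id) > (:last_created_at, :last_id)
ORDER BY created_at, id
LIMIT 20;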

Naturally, when constructing queries you have to take the specifics of the table architecture into account and choose the mechanisms that will perform best on the existing tables. For example, if you need to work with large amounts of related data in your queries, you might find this article interesting.

If we face the situation of not having a primary key, for example in a table that implements a many-to-many relationship, the traditional OFFSET and LIMIT approach is always available to us. But using it can lead to potentially slow queries. In such cases I would recommend adding an auto-incrementing primary key, even if it is needed only to handle paginated queries.
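A sketch of what adding such a surrogate key might look like, assuming MySQL-style syntax and an illustrative junction table name:

-- add an auto-incrementing surrogate key to an existing many-to-many table
ALTER TABLE user_groups ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;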

If you are interested in this topic, here, here and here are some useful materials.

Results

The main conclusion is that, no matter what the size of the databases involved, you always need to analyze query execution speed. Nowadays, the scalability of solutions is extremely important, and if everything is designed correctly from the very beginning of work on a system, this can save the developer from many problems down the road.

How do you analyze and optimize database queries?


Source: habr.com
