Optimizing database queries: the experience of a B2B service for construction companies

How do you handle a tenfold increase in the number of database queries without moving to a more powerful server and keep the system running? I will describe how we dealt with the drop in our database's performance and how we optimized SQL queries to serve as many users as possible without increasing the cost of computing resources.

I am building a service for managing business processes in construction companies. About 3 thousand companies work with us. More than 10 thousand people use our system every day, for 4-10 hours a day. It handles various tasks of planning, notification, warning, validation and so on. We use PostgreSQL 9.6. We have about 300 tables in the database, and it receives up to 200 million queries per day (10 thousand distinct ones). On average we handle 3-4 thousand queries per second, and at the busiest moments more than 10 thousand queries per second. Most of the queries are OLAP. There are far fewer inserts, updates and deletes, so the OLTP load is relatively small. I give all these figures so that you can judge the scale of our project and decide how our experience can be useful to you.

Scene one. Lyrical

When we started development, we did not really think about what load would fall on the database or what we would do if the server could no longer keep up. When designing the database we followed the general guidelines and tried not to shoot ourselves in the foot, but we did not go beyond general advice like "don't use the Entity-Attribute-Value pattern". We designed according to the principles of normalization, avoided data redundancy and did not worry about speeding up particular queries. As soon as the first users arrived, we ran into a performance problem. As usual, we were completely unprepared for it. The first problems were simple; as a rule, everything was solved by adding a new index. But the time came when simple patches stopped working. Realizing that we lacked experience and that it was getting harder and harder to understand the cause of the problems, we hired specialists who helped us configure the server correctly, set up monitoring, and showed us where to look for statistics.

Scene two. Statistical

So, about 10 thousand distinct queries are executed against our database per day. Among these 10 thousand there are monsters that run 2-3 million times with an average execution time of 0.1-0.3 ms, and there are queries with an average execution time of 30 seconds that are called 100 times a day.

It was not possible to optimize all 10 thousand queries, so we decided to figure out where to direct our efforts in order to improve database performance in the right way. After several iterations we settled on dividing the queries into groups.

TOP queries

These are the heaviest queries, the ones the server spends the most time on (total time): queries that are either called very often or take a very long time to execute (long and frequent queries were optimized back in the first iterations of the fight for speed). It is also important to rank the top queries separately by total execution time and separately by I/O time, because the ways of optimizing them differ slightly.
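
For example, to rank statements by I/O time rather than total time, the same pg_stat_statements view can be sorted by its block timing columns. This is only a sketch; it assumes that track_io_timing is enabled, and uses the column names available in PostgreSQL 9.6.

SELECT query,
       blk_read_time + blk_write_time AS io_time_ms, -- time spent reading and writing blocks
       total_time AS total_time_ms
FROM pg_stat_statements
ORDER BY io_time_ms DESC
LIMIT 20;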

Working with TOP queries is the usual practice in every company. There are few of them, and optimizing even one query can free up 5-10% of resources. However, as the project matures, optimizing TOP queries becomes an increasingly non-trivial task. All the simple methods have already been exhausted, and the "heaviest" query takes "only" 3-5% of resources. If the TOP queries together take less than 30-40% of the time, then most likely you have already put in the effort to make them fast, and it is time to move on to optimizing the queries in the next group.
It remains to decide how many queries to include in this group. I usually take at least 10 but no more than 20, and I try to keep the times of the first and the last query in the TOP group within a factor of 10 of each other. That is, if execution time drops sharply from 1st place to 10th, I take the TOP-10; if the drop is smoother, I increase the group size to 15 or 20.
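
This heuristic can be checked directly against pg_stat_statements. A minimal sketch: count how many statements still take at least one tenth of the time of the heaviest one.

SELECT count(*) AS top_size
FROM pg_stat_statements
WHERE total_time >= (SELECT max(total_time) / 10 FROM pg_stat_statements);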

The middle group (MEDIUM)

These are all the queries that come right after the TOP, except for the last 5-10%. Usually it is the optimization of these queries that gives you the chance to greatly increase server performance. These queries can "weigh" up to 80%. But even if their share exceeds 50%, it is time to take a closer look at them.

The tail (TAIL)

As mentioned, these queries come last and take 5-10% of the time. You can forget about them, unless you use automatic query analysis tools, in which case optimizing them can be cheap as well.

How to evaluate each group?

I use an SQL query that helps make such an assessment for PostgreSQL (I am sure a similar query can be written for many other DBMSs).

SQL query to estimate the size of TOP-MEDIUM-TAIL groups

SELECT sum(time_top) AS sum_top, sum(time_medium) AS sum_medium, sum(time_tail) AS sum_tail
FROM
(
  SELECT CASE WHEN rn <= 20              THEN tt_percent ELSE 0 END AS time_top,
         CASE WHEN rn > 20 AND rn <= 800 THEN tt_percent ELSE 0 END AS time_medium,
         CASE WHEN rn > 800              THEN tt_percent ELSE 0 END AS time_tail
  FROM (
    SELECT total_time / (SELECT sum(total_time) FROM pg_stat_statements) * 100 AS tt_percent, query,
    ROW_NUMBER () OVER (ORDER BY total_time DESC) AS rn
    FROM pg_stat_statements
    ORDER BY total_time DESC
  ) AS t
)
AS ts

The result of the query is three columns, each containing the percentage of total time spent processing the queries of the corresponding group. Inside the query there are two numbers (in my case 20 and 800) that separate the queries of one group from the next.

This is roughly how the shares of the query groups compared when we started the optimization work and how they compare now.

[Diagram: time shares of the TOP, MEDIUM and TAIL query groups at the start of the optimization work and now]

The diagram shows that the share of TOP queries has dropped sharply, while the middle group has grown.
At first, outright blunders ended up among the TOP queries. Over time these childhood illnesses disappeared, the share of TOP queries shrank, and more and more effort was needed to speed up the heavy queries.

To get the text of these queries, use the following query:

SELECT * FROM (
  SELECT ROW_NUMBER () OVER (ORDER BY total_time DESC) AS rn, total_time / (SELECT sum(total_time) FROM pg_stat_statements) * 100 AS tt_percent, query
  FROM pg_stat_statements
  ORDER BY total_time DESC
) AS T
WHERE
rn <= 20 -- TOP
-- rn > 20 AND rn <= 800 -- MEDIUM
-- rn > 800  -- TAIL

Here is a list of the most commonly used tricks that helped us speed up TOP queries:

  • Redesigning parts of the system, for example replacing periodic polling of the database with notification logic built on a message broker
  • Adding or changing indexes
  • Rewriting ORM queries in pure SQL
  • Rewriting lazy data loading logic
  • Caching through data denormalization. For example, we have a chain of linked tables Delivery -> Invoice -> Request -> Application, that is, each delivery is linked to the application through the other tables. To avoid joining all these tables in every query, we duplicated a direct reference to the application in the Delivery table (see the sketch after this list).
  • Caching static dictionary tables and rarely changing tables in application memory.
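
A minimal sketch of the denormalization described in the caching item above; the table chain is the one from the example, but the column names (INVOICE_ID, REQUEST_ID, APPLICATION_ID) are hypothetical and only illustrate the idea.

-- Add a direct link from a delivery to its application
ALTER TABLE DELIVERY ADD COLUMN APPLICATION_ID bigint REFERENCES APPLICATION(ID);

-- One-off backfill through the original chain Delivery -> Invoice -> Request -> Application
UPDATE DELIVERY D
SET APPLICATION_ID = R.APPLICATION_ID
FROM INVOICE I
JOIN REQUEST R ON R.ID = I.REQUEST_ID
WHERE I.ID = D.INVOICE_ID;

-- After that, deliveries can be filtered by application without three joins
SELECT * FROM DELIVERY WHERE APPLICATION_ID = 123;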

Sometimes the changes grew into an impressive redesign, but they offloaded the system by 5-10% and were worth it. Over time the payoff got smaller, while the redesign required became more and more serious.

Then we turned our attention to the second group of queries, the MEDIUM group. It contains far more queries, and it seemed that analyzing the whole group would take a very long time. However, most of the queries turned out to be very easy to optimize, and many problems repeated themselves dozens of times in different variations. Here are examples of some typical optimizations that we applied to dozens of similar queries; each group of optimized queries offloaded the database by 3-5%.

  • Instead of checking for the presence of records with COUNT and a full table scan, EXISTS is now used (see the first sketch after this list)
  • We got rid of DISTINCT (there is no general recipe, but sometimes it can be removed easily, speeding up the query 10-100 times).

    For example, instead of a query selecting all the drivers from the large delivery table (DELIVERY)

    SELECT DISTINCT P.ID, P.FIRST_NAME, P.LAST_NAME
    FROM DELIVERY D JOIN PERSON P ON D.DRIVER_ID = P.ID
    

    we wrote a query against the relatively small PERSON table

    SELECT P.ID, P.FIRST_NAME, P.LAST_NAME
    FROM PERSON P
    WHERE EXISTS(SELECT D.ID FROM DELIVERY D WHERE D.DRIVER_ID = P.ID)
    

    It might seem odd that we used a correlated subquery, yet it gives a speedup of more than 10 times.

  • In many cases COUNT was abandoned altogether and replaced by the calculation of an approximate value (see the second sketch after this list)
  • Instead of

    UPPER(s) LIKE 'JOHN%'
    

    we now use

    s ILIKE 'John%'
    

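For the first item in the list, the COUNT to EXISTS replacement looks roughly like this (the filter on DRIVER_ID is just an illustrative example):

-- Before: counts every matching row only to check that at least one exists
SELECT count(*) FROM DELIVERY D WHERE D.DRIVER_ID = 42;

-- After: stops at the first matching row
SELECT EXISTS (SELECT 1 FROM DELIVERY D WHERE D.DRIVER_ID = 42);

As for replacing COUNT with an approximate value, one common PostgreSQL approach is to read the planner's row estimate from pg_class, which autovacuum and ANALYZE keep reasonably fresh (a sketch of the idea, not necessarily the exact method we used everywhere):

-- Exact count: scans the whole table
SELECT count(*) FROM DELIVERY;

-- Approximate count taken from planner statistics
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'delivery';
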
Each individual query could sometimes be accelerated by a factor of 3 to 1000. Despite these impressive numbers, at first it seemed pointless to optimize a query that takes 10 ms, sits in the third hundred of the heaviest queries, and accounts for hundredths of a percent of the total database load. But by applying the same recipe to a group of queries of the same type, we won back a few percent. In order not to waste time manually reviewing hundreds of queries, we wrote several simple scripts that used regular expressions to find queries of the same type. As a result, automatically searching for groups of similar queries allowed us to further improve performance with modest effort.
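
The scripts themselves are beyond the scope of this article, but the idea can be sketched even in plain SQL: normalize the statement text with a regular expression and group by the resulting pattern (replacing numeric literals, as below, is just one example of such a normalization).

-- Group statements that differ only in numeric literals
SELECT regexp_replace(query, '\d+', 'N', 'g') AS pattern,
       count(*)        AS statements,
       sum(total_time) AS total_time_ms
FROM pg_stat_statements
GROUP BY 1
ORDER BY total_time_ms DESC
LIMIT 50;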

As a result, we have been running on the same hardware for three years now. The average daily load is about 30%, and at peaks it reaches 70%. The number of queries, like the number of users, has grown roughly 10 times. And all this thanks to constant monitoring of those same TOP and MEDIUM query groups. As soon as a new query appears in the TOP group, we immediately analyze it and try to speed it up. We review the MEDIUM group once a week using the query analysis scripts. If we come across new queries that we already know how to optimize, we quickly change them. Sometimes we find new optimization methods that can be applied to several queries at once.
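
If you want each such review to reflect only the most recent period, you can start a new measurement window by clearing the accumulated statistics (keep in mind that this discards the history collected so far):

SELECT pg_stat_statements_reset();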

According to our forecasts, the current server will withstand a further 3-5x increase in the number of users. True, we still have one more ace up our sleeve: we have not yet moved SELECT queries to a read replica, as is commonly recommended. We are not doing this deliberately, because we want to exhaust the possibilities of "smart" optimization before bringing in the "heavy artillery".
A critical look at the work done might suggest vertical scaling instead: buy a more powerful server rather than spend specialists' time. The server might not even cost that much, especially since we have not yet hit the limits of vertical scaling. However, the number of queries is only part of the story: it grew 10 times, but over these years the system's functionality has also expanded and there are now more kinds of queries, while the old functionality, thanks to caching, is served by fewer and more efficient queries. So you can safely multiply by roughly another 5 to get the real acceleration factor. By the most conservative estimate, the speedup was 50 times or more. Scaling the server up 50 times vertically would cost more, especially considering that an optimization, once done, keeps working, while the bill for a rented server arrives every month.

Source: habr.com
