Mass optimization of PostgreSQL queries. Kirill Borovikov (Tensor)

This talk presents several approaches to monitoring the performance of SQL queries when there are millions of them per day and hundreds of PostgreSQL servers to keep an eye on.

It covers the technical solutions that let us process such a volume of information efficiently, and how this makes life easier for an ordinary developer.


If you are interested in the analysis of specific problems, various SQL query optimization techniques, and solving typical DBA tasks in PostgreSQL, you can also have a look at our series of articles on this theme.

My name is Kirill Borovikov and I represent the "Tensor" company. Within the company I specialize in working with databases.

Today I will talk about how we approach query optimization when the task is not to squeeze performance out of a single query, but to solve the problem en masse: when there are millions of queries, and you need approaches that scale to that volume.

In general, "Tensor" for a million of our customers is VLSI - our application: corporate social network, solutions for video communication, for internal and external document management, accounting systems for accounting and warehouse, ... That is, such a "megacombine" for integrated business management, in which there are more than 100 different internal projects.

To keep all of them working and evolving, we have 10 development centers across the country with more than 1000 developers.

We have been working with PostgreSQL since 2008 and have accumulated a huge amount of data to process: client data, statistics, analytics, data from external information systems - over 400TB in total. There are about 250 servers in production alone, and about 1000 database servers that we monitor overall.

SQL is a declarative language: you describe not "how" something should be done, but "what" you want to get. The DBMS knows better how to perform a JOIN - how to connect your tables, which conditions to apply, what will go through an index and what will not...

Some DBMSs accept hints: "No, join these two tables in this particular order," but PostgreSQL cannot do that. This is a deliberate position of the core developers: "Better to improve the query optimizer than to let developers rely on hints of any kind."

But even though PostgreSQL does not let you control it "from the outside", it does let you see perfectly well what is going on inside when you run your query and where its problems are.

What classic problems does a developer usually bring to a DBA? "We ran the query and it is slow, everything hangs, something is going on... Some kind of trouble!"

The reasons are almost always the same:

  • inefficient query algorithm
    The developer thinks: "I'll just JOIN ten tables in one SQL statement..." - and expects his conditions to miraculously "untangle" themselves and everything to come back fast. But miracles do not happen, and any system with that much variability (10 tables in a single FROM) sooner or later makes a planning mistake. [article]
  • outdated statistics
    This one is especially relevant for PostgreSQL: you "pour" a large dataset onto the server, run a query - and it seq-scans the whole table. Yesterday the table held 10 records, today it holds 10 million, but PostgreSQL does not know that yet and has to be prompted (for example, by running ANALYZE on the table). [article]
  • hitting a resource ceiling
    You put a large, heavily loaded database on a weak server that does not have enough disk, memory, or raw CPU performance. And that is it... somewhere there is a performance ceiling above which you can no longer jump.
  • locking
    A tricky topic, most relevant for modifying queries (INSERT, UPDATE, DELETE) - and a separate big subject of its own.

Get a plan

… And for everything else, we need a plan! We need to see what's going on inside the server.

In PostgreSQL, a query execution plan is a tree of the query execution algorithm in textual form - the very algorithm that the planner judged to be the most efficient.

Each tree node is an operation: fetching data from a table or an index, building a bitmap, joining two tables, or uniting, intersecting, or subtracting result sets. Executing a query is a walk through the nodes of this tree.

The easiest way to get a query plan is to run the EXPLAIN statement. To get it with all the real attributes - that is, to actually execute the query against the database - use EXPLAIN (ANALYZE, BUFFERS) SELECT ....
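
For instance, a minimal self-contained example (the statement below is just an illustration, not a query from the talk): ANALYZE actually executes the statement, and BUFFERS adds page-access counters to every node of the reported plan.

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT relname
      FROM pg_class
     WHERE relkind = 'r';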

The downside is that it runs "here and now", so it is only suitable for local debugging. Take a heavily loaded server under a constant stream of data changes: you notice "Hey, this query ran slowly" - but that was half an hour or an hour ago, and while you were fishing the query out of the logs and carrying it back to the server, the whole dataset and its statistics have changed. You rerun it to debug - and it runs fast! And you cannot understand why it was slow.

To capture what was happening at the exact moment the query was executed on the server, smart people wrote the auto_explain module. It is present in nearly all common PostgreSQL distributions and can simply be activated in the config file.

If it sees that a query has been running longer than the threshold you set, it takes a "snapshot" of the plan of that query and writes both to the log.
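
A minimal per-session sketch of what that activation looks like; in production it goes into the config file instead (shared_preload_libraries / session_preload_libraries), and the threshold value here is purely illustrative:

    LOAD 'auto_explain';                          -- loading the module requires superuser
    SET auto_explain.log_min_duration = '100ms';  -- snapshot plans of anything slower than this
    SET auto_explain.log_analyze = on;            -- include actual timings and row counts
    SET auto_explain.log_buffers = on;            -- include buffer usage per node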

Everything seems fine now: we go to the log and see... a huge wall of text. Yet we cannot say anything about it beyond the fact that it is apparently a fine plan, since it completed in 11ms.

Everything looks fine, but it is not at all clear what actually happened. Apart from the total time, we do not really see anything, because staring at such a lump of plain text is not much fun.

But beyond being non-obvious and inconvenient, there are more fundamental problems:

  • Each node reports the sum of the resources of the entire subtree beneath it. That is, you cannot simply find out how much time was spent on this particular Index Scan if there is anything nested under it. You have to check dynamically whether there are "children" inside - subplans, CTEs - and subtract all of that in your head.
  • Second, the time shown on a node is the time of a single execution of that node. If the node was executed several times - say, inside a loop over table records - the plan increases the loops counter of that node, but the per-iteration time stays the same. So to understand how long the node ran in total, you have to multiply one by the other - again in your head (for example, 0.011ms per iteration × 1000 loops ≈ 11ms in total).

In such a setup, answering "who is the weakest link?" is practically impossible. Even the PostgreSQL developers themselves write in the manual that "plan-reading is an art that takes some experience to learn...".

But we have 1000 developers, and you cannot personally hand this experience to every one of them. You, I, that guy over there may know how, but someone further away does not. Maybe he will learn eventually, maybe not - but he has to work right now, and where would he get that experience from?

Plan visualization

So we realized that to deal with these problems we need a good visualization of the plan. [article]

First we went "shopping": we looked around the Internet for what exists at all.

It turned out there are very few reasonably "live", more or less maintained solutions - literally one: explain.depesz.com by Hubert Lubaczewski. You "feed" the text representation of the plan into the input field, and it shows you a table with the parsed data:

  • the node's own processing time
  • the total time over the entire subtree
  • the number of rows actually retrieved versus statistically expected
  • the node body itself

The service also keeps an archive of shared links: you throw your plan in and say, "Hey, Vasya, here is a link for you - something is wrong in there."

But there are also small problems.

First, an enormous amount of copy-paste: you take a piece of the log, paste it in there, again and again.

Second, there is no analysis of the amount of data read - the buffers that EXPLAIN (ANALYZE, BUFFERS) outputs are simply not shown here. The service does not know how to parse, understand, and work with them. Yet when you read a lot of data and may be splitting it incorrectly between disk and the in-memory cache, this information is very important.

Third, the project is barely developed. Commits are tiny and come maybe once every six months, and the code is in Perl.

But all that is just "lyrics" one could live with; what really turned us away from this service are its errors in parsing Common Table Expressions (CTE) and various dynamic nodes such as InitPlan/SubPlan.

If you believe this picture, the total execution time of an individual node comes out greater than the execution time of the whole query. The reason is simple: the time spent generating the CTE was not subtracted from the CTE Scan node. So we no longer have the correct answer to how long the CTE Scan itself took.

Then we realized it was time to write our own - hooray! Every developer says it: "Now we'll write our own, it will be super easy!"

We took a stack typical for web services: a core on Node.js + Express, with Bootstrap and D3.js pulled in for nice diagrams. And our expectations were fully justified: we had the first prototype in 2 weeks:

  • our own plan parser
    That is, we can now parse any plan that PostgreSQL generates at all.
  • correct handling of dynamic nodes - CTE Scan, InitPlan, SubPlan
  • analysis of the buffers distribution - which data pages are read from shared memory, which from the local cache, which from disk
  • visual clarity
    So that you do not have to dig through the log, but see the "weakest link" right away in the picture.

We ended up with a picture like this, complete with syntax highlighting. But in practice our developers no longer work with the full representation of the plan, only with a shorter one: we have already parsed out all the numbers and moved them off to the left and right, leaving in the middle only the first line of each node - what kind of node it is: a CTE Scan, a CTE generation, or a Seq Scan on some table.

We call this abbreviated representation a plan template.

What else would be handy? It would be handy to see what share of the total time falls on which node - so we simply "glued" a pie chart onto the side.

We point at a node and see that the Seq Scan took less than a quarter of the total time, while the remaining 3/4 went to the CTE Scan. Horror! This is a small remark about the "rate of fire" of CTE Scan if you use CTEs heavily in your queries: they are not very fast and lose even to a regular table scan. [article] [article]

Usually, though, such diagrams are more intricate and more interesting: we point at a segment and see, for example, that some Seq Scan "ate" more than half of the time, and that there was a Filter on it that discarded a lot of rows... You can throw this picture straight at the developer and say: "Vasya, everything is bad here! Figure it out, look - something is wrong!"

Naturally, we could not do without stepping on a few "rakes".

The first rake we stepped on was rounding. The time of each individual node in the plan is reported with 1 µs precision. And when the number of loops of a node exceeds, say, 1000, PostgreSQL has already divided the total down to that precision, so when we multiply back we get a total time "somewhere between 0.95ms and 1.05ms". While the numbers stay in microseconds this does not matter much, but once they reach [milli]seconds you have to take this uncertainty into account when "unwinding" resources across the plan nodes - who consumed how much from whom.

The second, more complicated point is the distribution of resources (those buffers) among dynamic nodes. It cost us another 2 weeks on top of the first 2 - about 4 weeks for the prototype in total.

It is quite easy to run into this problem: we declare a CTE and supposedly read something inside it. In fact PostgreSQL is "smart" and does not read anything right there. Then we take the first record from it, and then the one hundred and first from the same CTE.
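
A sketch along the lines of the pg_class example discussed below; the exact query from the slide is not shown in the text, so this is an approximation of the same situation:

    EXPLAIN (ANALYZE, BUFFERS)
    WITH q AS (
      SELECT * FROM pg_class
    )
    SELECT
      (SELECT relname FROM q LIMIT 1)                -- the 1st record from the CTE
    , (SELECT relname FROM q OFFSET 100 LIMIT 1);    -- the 101st record from the same CTE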

We look at the plan and notice something strange: 3 buffers (data pages) were "consumed" in the Seq Scan, 1 more in the CTE Scan, and 2 more in the second CTE Scan. That is, if we simply sum everything up we get 6, yet we read only 3 pages from the table! A CTE Scan does not read anything from anywhere - it works directly with process memory. So something is clearly wrong here!

In reality, all 3 data pages were requested through the Seq Scan: the 1st CTE Scan asked for 1 page, and then the 2nd one asked for it plus 2 more. So a total of 3 pages of data were read, not 6.

This picture made us realize that plan execution is no longer a tree but rather an acyclic graph. So we built a diagram like this, to make it clear "what came from where". Here we created a CTE from pg_class and queried it twice, and almost all the time was spent on the branch where we asked for it the 2nd time - obviously, reading up to the 101st record is much more expensive than reading just the 1st from the table.

We paused for breath and said: "Now, Neo, you know kung fu! Our experience is right there on your screen. Now you can use it." [article]

Log Consolidation

Our 1000 developers breathed a sigh of relief. But we also understood that we have hundreds of production servers, and all this copy-paste on the developers' side is not convenient at all. We realized we had to collect the logs ourselves.

There is a standard module that collects statistics - pg_stat_statements; it also has to be activated in the config. But it did not suit us.

First, it assigns different QueryIds to the same query run under different schemas within the same database. That is, if you first do SET search_path = '01'; SELECT * FROM user LIMIT 1; and then SET search_path = '02'; followed by the same query, the module's statistics will contain separate records, and I cannot gather aggregate statistics for this query profile irrespective of the schema.
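
Roughly, the effect can be illustrated like this; the schemas and the table are hypothetical (the table name is quoted here only because user is a reserved word), and the verification query is an assumption about how one would check it, not something shown in the talk:

    SET search_path = '01'; SELECT * FROM "user" LIMIT 1;
    SET search_path = '02'; SELECT * FROM "user" LIMIT 1;

    -- pg_stat_statements now holds two separate entries with different queryid values,
    -- because the identical text resolves to tables in different schemas:
    SELECT queryid, calls, query
      FROM pg_stat_statements
     WHERE query LIKE '%FROM "user"%';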

The second thing that stopped us is the lack of plans: there is no plan, only the query text itself. We see what got slow but do not understand why, and we are back to the problem of a rapidly changing dataset.

And the last point - the lack of "facts": you cannot refer to a specific instance of query execution, it simply does not exist, there is only aggregated statistics. It is possible to work with that, but very hard.

So we decided to fight the copy-paste and started writing a collector.

The collector connects over SSH, establishing a secure connection to the database server with a certificate, and "latches onto" the log file with tail -F. In that session we get a complete mirror of the entire log file the server produces. The load on the server itself is minimal, because we do not parse anything there - we just mirror the traffic.

Since we had already started writing the interface in Node.js, we wrote the collector in it too. And the technology paid off: JavaScript is very convenient for working with weakly formatted text data, which is exactly what a log is, and the Node.js infrastructure as a backend platform makes it easy to work with network connections and with data streams in general.

We hold two connections: one to "listen" to the log itself and pull it to us, the other to periodically query the database. Say the log reports that the table with oid 123 got locked - that means nothing to the developer, so it would be nice to ask the database "what is OID = 123, anyway?" And so we periodically ask the database about whatever we do not know yet.
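
A hedged sketch of the kind of lookup the collector makes over that second connection; the exact queries it runs are not shown in the talk:

    -- resolve an unknown relation OID seen in the log to a human-readable name
    SELECT relname, relnamespace::regnamespace AS schema_name
      FROM pg_class
     WHERE oid = 123;

    -- or, shorter, let PostgreSQL print the qualified name itself
    SELECT 123::regclass;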

"There is just one thing you did not account for: there exists a breed of elephant-like bees!.." We started building this system to monitor 10 servers - the most critical ones in our view, where problems kept arising that were hard to deal with. But within the first quarter we had a hundred servers under monitoring, because the system "clicked": everyone wanted it, it was convenient for everyone.

All of this has to be stored somewhere, and the data stream is large and very active. Naturally we use what we monitor and know how to handle: PostgreSQL serves as the data store as well. And nothing pours data into it faster than the COPY command - not yet, anyway.

But simply "pouring" data in is not the whole story: about 50k queries per second across a hundred servers generate 100-150GB of logs per day. So we had to carefully carve up the database.

First, we partitioned by day, because by and large nobody cares about correlations between days. What does it matter what happened yesterday, if tonight you rolled out a new version of the application and the statistics are already different.
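
A hypothetical sketch of such daily partitioning using PostgreSQL's declarative partitioning; the talk does not show the actual schema or mechanism, so every name and column here is illustrative:

    CREATE TABLE query_log (
      logged_at  timestamptz      NOT NULL
    , host_id    integer          NOT NULL
    , duration   double precision
    , query_md5  char(32)         NOT NULL   -- ID computed on the collector side
    ) PARTITION BY RANGE (logged_at);

    -- one partition per day; old days can simply be detached and dropped
    CREATE TABLE query_log_2020_01_01 PARTITION OF query_log
      FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');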

Second, we learned (were forced to learn) to write with COPY very, very fast - that is, not just to use COPY because it is faster than INSERT, but to make it faster still.

Third, we had to give up triggers and, along with them, Foreign Keys - we have no referential integrity at all. Because if a table carries a couple of FKs, and the schema says "this log entry references, say, a group of entries via FK", then on every insert PostgreSQL has no choice but to honestly execute SELECT 1 FROM master_fk1_table WHERE ... with the identifier you are inserting - just to verify that the referenced record exists and your insert does not "break" the Foreign Key.

So instead of a single write into the target table and its indexes, we also get reads from every table it references. And we do not need that at all - our task is to write as much as possible, as fast as possible, with the least load. So down with FKs!

The next point is aggregation and hashing. At first we implemented them inside the database - after all, it is convenient to do "plus one" in some counter table right in a trigger as each record arrives. Convenient, yes, but bad for the same reason: you insert one record yet are forced to read and write something in another table, and to do so on every single insert.

Now imagine a table that merely counts the queries that passed through a particular host: +1, +1, +1, ..., +1. You do not need that at all - it can all be summed up in memory on the collector and sent to the database in one go as +10.
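
For example, something along these lines - one aggregated statement from the collector instead of ten trigger-driven "+1" updates; the table and column names are hypothetical, and a unique key on (host_id, dt) is assumed:

    INSERT INTO host_stats AS s (host_id, dt, calls)
    VALUES (42, current_date, 10)              -- the collector summed 10 increments in memory
    ON CONFLICT (host_id, dt)
    DO UPDATE SET calls = s.calls + EXCLUDED.calls;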

Yes, in case of some failure your logical integrity may "fall apart", but that is an almost unrealistic scenario: you have a decent server, its controller has a battery, there is the transaction log, a journaling file system... In short, the performance you lose by keeping triggers and FKs running is simply not worth it.

The same goes for hashing. Some query arrives, you compute an identifier from it inside the database, write it, and then hand it out to everyone. All is fine until, right at the moment of writing, a second session arrives wanting to write the same thing - and now you have lock contention, which is already bad. So if you can move the generation of such IDs to the client (relative to the database), it is better to do so.

MD5 of the text - of a query, a plan, a template, ... - suited us perfectly. We compute it on the collector side and pour a ready-made ID into the database. The MD5 length together with daily partitioning lets us not worry about collisions.

But to write all of this quickly, we had to modify the writing procedure itself.

How is data usually written? We have some dataset, split it across several tables, and then COPY it - first into the first table, then the second, then the third... Inconvenient, because we are effectively writing one data stream in three sequential steps. Unpleasant. Can it be done faster? It can!

To do that, it is enough to split these flows and run them in parallel: errors, queries, templates, locks, ... fly in separate streams - and we write them all concurrently. For this it is enough to keep a permanently open COPY channel to each individual target table.

That is, the collector always has a stream into which it can write the data it needs. But for the database to actually see that data, and for nobody to hang in locks waiting for it, the COPY has to be closed off at regular intervals. For us a period of about 100ms turned out to be the most effective: we close the COPY and immediately reopen it on the same table. And if one stream is not enough at some peaks, we pool connections up to a certain limit.

We also found that for such a load profile any batching, where records are collected into packets, is evil. The classic evil is INSERT ... VALUES followed by 1000 rows: at that moment you get a write spike on the media, and everyone else trying to write to disk has to wait.

To avoid such anomalies, simply do not aggregate anything and do not buffer at all. And if buffering to disk does happen (fortunately the Node.js Stream API lets you detect it), set that connection aside. When an event arrives saying it is free again, write to it from the accumulated queue; while it is busy, take the next free connection from the pool and write to that one.

Before adopting this approach we had about 4K write ops/s, and this way we cut the load by a factor of 4. Since then it has grown another 6 times thanks to newly monitored databases - up to 100MB/s. We now keep logs for the last 3 months, roughly 10-15TB, trusting that three months is enough for any developer to sort out any problem.

Making sense of the problems

Simply collecting all this data is good and useful, but not enough - it has to be understood, because we are talking about millions of different plans per day.

Millions are unmanageable; we first have to make things "smaller" - and decide how exactly to organize that "smaller" view.

We have identified three key points for ourselves:

  • who sent the query
    That is, which application it "arrived" from: the web interface, the backend, the payment system, or something else.
  • where it happened
    On which specific server. If several servers sit behind one application and one of them suddenly "goes dumb" (a dying disk, a memory leak, some other trouble), you need to be able to point at that exact server.
  • how the problem manifested itself

To understand "who" sent us a request, we use a regular tool - setting a session variable: SET application_name = '{bl-host}:{bl-method}'; - we send the name of the business logic host from which the request comes, and the name of the method or application that initiated it.

Once we have passed along the "owner" of the query, it has to show up in the log - for that we configure log_line_prefix = ' %m [%p:%v] [%d] %r %a' (whoever is curious can look up in the manual what each placeholder means; a sketch of applying it follows this list). As a result, we see in the log:

  • time
  • process and transaction IDs
  • the database name
  • the remote address (IP) the query came from
  • the application and method name
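
A sketch of applying that prefix with plain SQL; the talk sets it in the config file, but ALTER SYSTEM writes the same setting (it requires superuser, and log_line_prefix takes effect on configuration reload):

    ALTER SYSTEM SET log_line_prefix = ' %m [%p:%v] [%d] %r %a';
    SELECT pg_reload_conf();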

Next, we realized there is little point in watching how a single query correlates across different servers. It is rare for one application to misbehave equally "here and there" - and even if it does, it is enough to look at any one of those servers.

So, cut "one server - one day" we found enough for any analysis.

The first analytical dimension is the template itself - the abbreviated form of the plan, stripped of all numeric indicators. The second is the application or method, and the third is the specific plan node that caused the problems.

When we moved from concrete instances to templates, we got two advantages at once:

  • a many-fold reduction in the number of objects to analyze
    We no longer have to analyze a problem across thousands of queries or plans, but across dozens of templates.
  • a timeline
    By summarizing the "facts" within a given slice, we can show when they occurred during the day. And then it becomes clear that if some template fires, say, once an hour but should fire once a day, it is time to think about what went wrong - who called it, why, and whether it should be happening at all. This is another non-numeric, purely visual way of analysis.

The remaining methods are based on the indicators we extract from the plans: how many times a given template occurred, its total and average time, how much data was read from disk and how much from memory...

For example, you open the analytics page for a host and see that something is reading far too much from disk. The server's disk cannot keep up - so who is doing all that reading?

You can sort by any column and decide what to tackle right now: CPU load, disk load, or the total number of queries... You sort, look at the "top" offenders, fix them - and roll out a new version of the application.
[video lecture]

You can also immediately see the different applications that issue the same template, say a query like SELECT * FROM users WHERE login = 'Vasya': frontend, backend, processing... And you start wondering why the processing service should be reading the user at all if it never interacts with one.

The reverse path is to start from the application and see right away what it is doing. For example, the frontend does this, this, this - and also this, once an hour (the timeline helps here). And the question immediately arises: it is hardly the frontend's business to do anything once an hour...

Some time later we realized we were missing aggregated statistics by plan node. We extracted from the plans only those nodes that actually do something with table data (reading or writing it, with or without an index). Compared to the previous picture, only one aspect is added: how many records the node returned, and how many it threw away (Rows Removed by Filter).

You do not have a suitable index on a table, you run a query against it, it flies past the indexes and lands in a Seq Scan... and filters out every record except one. So why do you need 100M filtered-out records per day - would it not be better to just add the index?
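
A hedged illustration of the symptom and the obvious fix; the table, column, and plan shape here are hypothetical:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM orders WHERE customer_id = 123;
    -- with no suitable index the plan shows a Seq Scan with
    --   Filter: (customer_id = 123)
    --   Rows Removed by Filter: <almost everything that was read>

    CREATE INDEX ON orders (customer_id);  -- after this, the same query can use an Index Scan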

Having analyzed the plans node by node, we noticed that there are typical structures in plans that almost always look suspicious. And it would be nice to tell the developer: "Friend, here you first read by index, then sort everything, and then cut it off" - usually down to a single record.

Anyone who has written queries has surely met this pattern: "Give me Vasya's latest order and its date." And if you have no index by date, or the index you used does not include the date, you will step on exactly that "rake".
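
A hedged sketch of that pattern and of an index that serves it (the schema is hypothetical):

    SELECT *
      FROM orders
     WHERE customer = 'Vasya'
     ORDER BY order_date DESC
     LIMIT 1;

    -- with only an index on (customer), the plan reads all of Vasya's orders, sorts them,
    -- and keeps one row; a composite index lets the same query finish with one index probe:
    CREATE INDEX ON orders (customer, order_date DESC);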

But since we know it is a "rake", why not tell the developer right away what to do? So now, opening a plan, our developer immediately sees a nice picture with hints that say: "You have problems here and here, and they are solved this way and that."

As a result, the amount of experience needed to solve these problems has dropped significantly compared to where we started. That is the tool we ended up with.

Source: habr.com
