PostgreSQL and JDBC squeeze out all the juice. Vladimir Sitnikov

I suggest you read this transcript of Vladimir Sitnikov's early-2016 talk, "PostgreSQL and JDBC: squeezing out all the juice".

Good afternoon! My name is Vladimir Sitnikov. I have been working at NetCracker for 10 years, mostly on performance. Everything related to Java and everything related to SQL is what I love.

Today I will talk about what we ran into when we started using PostgreSQL as a database server. We mostly work with Java, but what I will tell you is not Java-specific: as practice has shown, the same issues come up in other languages too.

We will talk about:

  • retrieving data,
  • saving data,
  • performance,
  • and the hidden rakes (pitfalls) buried along the way.

Let's start with a simple case: we select one row from a table by primary key. A minimal sketch of it follows.
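
Here is a minimal JDBC sketch of such a lookup; the table, column names, and connection details are made up for illustration.

    // Minimal sketch: time a single-row lookup by primary key.
    import java.sql.*;

    public class SingleRowSelect {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "secret");
                 PreparedStatement ps = con.prepareStatement(
                    "SELECT name FROM users WHERE id = ?")) {
                ps.setInt(1, 42);
                long start = System.nanoTime();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) { /* read the row */ }
                }
                System.out.printf("took %.1f ms%n",
                        (System.nanoTime() - start) / 1e6);
            }
        }
    }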

The database is on the same host, and the whole affair takes 20 milliseconds.

Those 20 milliseconds are a lot. A hundred such queries add up to two full seconds spent just churning through them. In other words, we are wasting time.

We don't like that, so let's look at what the database offers us here. The database offers two ways of executing queries.

The first option is the simple query. Why is it good? Because we just take the query, send it, and that's all.

https://github.com/pgjdbc/pgjdbc/pull/478

The database also has the extended query protocol, which is trickier but more capable: you can separately send the query for parsing, bind variables, execute, and so on.

A "super extended" query is something we won't cover in this talk. There are things we would like from the database, and a wishlist of sorts has formed: things we want that are not possible now or in the coming year. So we simply wrote them down, and we will keep going around nudging the core developers.

What we can actually use today are the simple query and the extended query.

What is special about each approach?

A simple query is good for one-off execution: run it once and forget it. The problem is that it does not support the binary data format, so it is not well suited for high-performance systems.

The extended query lets you save time on parsing. This is what we implemented and started using, and it helped us tremendously. And the savings are not only on parsing: there are savings on data transfer too, because transferring data in binary format is much more efficient.

Let's move on to practice. This is what a typical application looks like; it could be Java or any other language.

We create a statement, execute the command, and call close. Where is the error here? What is the problem? There is none. That's how it is in all the books; that's how you are told to write it. If you want maximum performance, write like this; a sketch follows below.
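
A sketch of that textbook pattern, assuming an open Connection con; the query and names are illustrative.

    // The "by the book" pattern: create, execute, close.
    try (PreparedStatement ps = con.prepareStatement(
            "SELECT name FROM users WHERE id = ?")) {
        ps.setInt(1, 42);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) { /* process the row */ }
        }
    } // this close() is what turns into the extra DEALLOCATE below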

But practice has shown that this does not work well. Why? Because of that close() method. When we call it, from the database's point of view it is the "smoker's" way of working with a database: we said PARSE, EXECUTE, DEALLOCATE.

Why all these extra creations and deallocations of statements? Nobody needs them. Yet with PreparedStatement it usually turns out that when we close it, it closes everything on the database side as well. That is not what we want.

We want to work with the database like healthy people: prepare the statement once, then execute it many times. In fact, "many times" means parsing once in the application's lifetime, and across different REST calls we use the same statement id. That is our goal.

How can we achieve this?

Very simple: don't close the statements. We write it like this: prepare, then execute. A sketch follows.
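
A sketch of the "healthy" pattern, assuming an open Connection con; the query is illustrative.

    // Parse once, execute many times.
    PreparedStatement ps = con.prepareStatement(
            "SELECT name FROM users WHERE id = ?"); // PARSE happens once
    for (int id : new int[] {1, 2, 3}) {
        ps.setInt(1, id);                           // BIND
        try (ResultSet rs = ps.executeQuery()) {    // EXECUTE
            while (rs.next()) { /* process the row */ }
        }
    }
    // Note: ps is deliberately never closed here, and that is
    // exactly what overflows if nobody manages the statements.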

If we run this, it is clear that something will overflow somewhere. If it is not clear, measure it. Let's write a benchmark with this simple method: we keep creating statements. Run it on some versions of the driver and it crashes fairly quickly with an OutOfMemory error.

Errors like this are easy to fix, and I won't dwell on them. I will just say that the new version is much faster. The benchmark is silly, but still.

So how do we do it correctly? What do we need for that?

In reality, applications always close their statements: all the books say to close them, otherwise memory will leak.

And PostgreSQL cannot share a query cache between sessions: each session has to build that cache for itself.

And we don’t want to waste time on parsing either.

And as usual we have two options.

The first option: wrap everything in PL/pgSQL. It has a cache, it caches everything, it should work out great. We looked at this. We have a zillion queries. It does not work: we are not willing to turn all those queries into stored procedures by hand. No, no.

The second option: do it ourselves. Open the driver's source code and start hacking away. We hacked and hacked, and it turned out not to be that hard.

https://github.com/pgjdbc/pgjdbc/pull/319

It appeared in August 2015; there is a more modern version now. And everything is great: it works so well that we don't change anything in the application. We even stopped thinking about PL/pgSQL; this was quite enough for us to reduce all the overhead to almost zero.

Accordingly, server-prepared statements are activated on the 5th execution, so as not to waste database memory on every one-off query. A configuration sketch follows.
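
prepareThreshold is the pgjdbc knob for this; a sketch of both ways to set it (connection details are illustrative).

    // Connection-wide: switch to server-prepared statements
    // after 5 executions (5 is the driver default).
    String url = "jdbc:postgresql://localhost/test?prepareThreshold=5";

    // Or per statement, via the pgjdbc-specific PGStatement API
    // (ps is an existing PreparedStatement):
    org.postgresql.PGStatement pgStmt =
            ps.unwrap(org.postgresql.PGStatement.class);
    pgStmt.setPrepareThreshold(5);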

You may ask: where are the numbers? What do you gain? I won't give universal numbers here, because every query has its own.

In our case we spent about 20 milliseconds on parsing OLTP queries: 0.5 milliseconds to execute, 20 milliseconds to parse. The query was 10 KiB of text with a 170-line plan. This is an OLTP query: it fetches 1, 5, 10 rows, sometimes more.

We didn't want to spend those 20 milliseconds at all, and we reduced them to zero. Everything is great.

What's the takeaway? If you are on Java, take a modern version of the driver and rejoice.

If you use some other language, think about whether you need this too. From that language's point of view, say in PL/pgSQL or via libpq, it is not obvious that you are wasting time on parsing rather than execution, and that is worth checking. How? The check costs you nothing.

Except that there are bugs and quirks, and we are going to talk about them right now. Most of this is industrial archeology: what we found and what we ran into.

Suppose the query is generated dynamically. It happens: someone glues strings together, and out comes a SQL query.

Why is that bad? Because every time we end up with a different string.

And for that different string, the hashCode has to be recomputed, and that is actual CPU work: even looking up a long query text in an existing hash map is not cheap. So the conclusion is simple: do not generate queries. Store each one in a single variable and rejoice. A sketch follows.
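
A sketch of the difference; the class and query names are made up.

    public final class UserQueries {
        // Good: one constant string. The statement-cache lookup sees the
        // same key (same hashCode) on every execution.
        static final String FIND_USER = "SELECT name FROM users WHERE id = ?";

        // Anti-pattern: gluing the value in yields a different string,
        // and a different cache entry, on every call.
        static String findUserBad(int id) {
            return "SELECT name FROM users WHERE id = " + id;
        }
    }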

Next problem: data types matter. There are ORMs that treat NULL as "whatever": if the value is an int, they call setInt, but if it is NULL, they always send VARCHAR. What difference does it make what type a NULL is? The database will figure it out. But this approach does not work.

In practice, the database cares very much. If the first time you said the parameter was a number, and the second time you said it was a VARCHAR, then the server-prepared statement cannot be reused, and the driver has to re-create the statement.

If you are running the same query, make sure the parameter data types stay consistent, and watch out for NULLs. This was a common mistake for us after we started using PreparedStatements. A sketch follows.
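
A sketch of keeping the type stable across executions; assumes ps was prepared from something like "UPDATE users SET age = ? WHERE id = ?".

    ps.setInt(1, 42);                      // execution 1: parameter is an int
    // ...
    ps.setNull(1, java.sql.Types.INTEGER); // execution 2: NULL, but still an int,
                                           // so the server statement is reused
    // ps.setNull(1, java.sql.Types.VARCHAR) would change the parameter type
    // and force the driver to re-prepare the statement on the server.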

Okay, so you turned it on, maybe took the new driver, and performance dropped. Everything got worse.

How does that happen? Is it a bug or a feature? Unfortunately, we never managed to figure out which. But there is a very simple scenario for reproducing the problem; it ambushed us quite unexpectedly. It involves selecting from literally one table. Our real queries were bigger, usually joining two or three tables, but here is a reproduction scenario you can run on a database of any version.

https://gist.github.com/vlsi/df08cbef370b2e86a5c1

The point is that we have two indexed columns. In one column, a million rows hold a NULL value; in the second column there are only 20 rows. When we execute without bind variables, everything works fine.

But if we start executing with bind variables, i.e. with "?" or "$1" placeholders in the query, what do we get?

https://gist.github.com/vlsi/df08cbef370b2e86a5c1

The first run is as expected. The second is a little faster; something got cached. Third, fourth, fifth... then bang, and it looks like this. And the worst part is that it happens on the sixth execution. Who knew you have to execute exactly six times to see what the actual execution plan will be?

Who is to blame? What happened? The database has an optimization: it tries to optimize for the generic case, and starting from a certain point it switches to a generic plan, which, unfortunately, may turn out to be a different one. It may be the same, or it may be different, and there is a threshold that triggers this behavior.

What can be done about it? Here, of course, it is harder to suggest anything. There is a simple workaround that we use: +0, OFFSET 0. You probably know these tricks. We just add "+0" to the query and everything is fine. I'll show it in a moment.

Another option is to look at the plans more carefully: the developer must not only write the query, but also run EXPLAIN ANALYZE six times. Five won't do.

And the third option is to write to pgsql-hackers. I did; however, it is still not clear whether this is a bug or a feature.

https://gist.github.com/vlsi/df08cbef370b2e86a5c1

While we ponder whether it is a bug or a feature, let's just fix it. Take our query and add "+0". Done. Two characters, and you don't even have to think about what is going on in there. Very simple: we have effectively forbidden the database to use the index on this column. There is no index on the expression "column + 0", so the database skips the index, and everything is fine. A sketch follows.
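
A sketch of the trick; the table and column names are illustrative, not the exact ones from the gist.

    // Before: the planner may pick a generic plan using the index on num.
    PreparedStatement bad = con.prepareStatement(
            "SELECT count(*) FROM test WHERE num = ? AND value = ?");

    // After: "num + 0" is an expression with no index on it,
    // so the index on num cannot be used for this predicate.
    PreparedStatement good = con.prepareStatement(
            "SELECT count(*) FROM test WHERE num + 0 = ? AND value = ?");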

So, the rule of six EXPLAINs: in current versions you need to run the query six times if you have bind variables; without bind variables, one run is enough. And in the end, this is exactly the query that was failing for us. Nothing tricky about it.

You might ask: how many can there be? A bug here, a bug there; really, a bug everywhere?

Let's look at another one. Say we have two schemas: schema A with table Y and schema B with table Y. The query selects data from the table. What do we get? An error. We get all of the above: the "bug everywhere" rule holds.

Now the question is: why? The documentation says that when you have schemas, the search_path variable tells the database where to look for the table. There is a variable for it, it would seem.

What is the problem? The problem is that server-prepared statements do not expect that someone might change search_path. For them the value is effectively constant, and some parts may not pick up the new value.

Of course, this depends on the version you test on and on how much your tables differ. Version 9.1 will simply keep executing the old queries. Newer versions will detect the mismatch and report an error.

Set search_path + server-prepared statements =
cached plan must not change result type

How do you treat it? The simple recipe: don't do it. Do not change search_path while the application is running. If you must change it, it is better to create a new connection; a sketch follows.
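
A sketch of the "new connection" recipe; url, user, and password are assumed to be defined, and the schema name is illustrative.

    // If another schema is needed, take a dedicated connection for it
    // instead of mutating a long-lived one with cached statements.
    try (Connection con2 = DriverManager.getConnection(url, user, password);
         Statement st = con2.createStatement()) {
        st.execute("SET search_path TO b"); // affects only con2
        // ... run the queries that need schema b ...
    } // con2 and its server-prepared statements are discarded here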

We can keep discussing it: open a thread, argue, propose patches. Maybe we will convince the database developers that when someone changes the value, the database should tell the client: "Look, your value has been updated here; maybe you need to reset and re-create your statements." For now the database stays silent and does not report in any way that the statements have changed somewhere inside.

And I will emphasize again: this is not something specific to Java. The very same thing reproduces in PL/pgSQL.

Let's try to select a bit more data. We have a table with a million rows, each row about a kilobyte: roughly a gigabyte of data. And the Java machine has 128 megabytes of heap.

As all the books recommend, we use stream processing: open a ResultSet and read data from it bit by bit. Will it work? Will it run out of memory, or read a little at a time? Shall we trust the database, trust Postgres? We don't. Will we get an OutOfMemory? Who here has hit OutOfMemory? And who managed to fix it afterwards? Some have.

With a million rows you can't just select them all. Do you need OFFSET/LIMIT? Who votes for that option? And who votes for playing with autoCommit?

As usual, the most unexpected option turns out to be the correct one: if you turn off autoCommit, it helps. Why? Science does not know.

By default, all clients connecting to a Postgres database fetch the entire result set. PgJDBC is no exception: it fetches all rows.

There is a variation on the fetchSize theme: you can say at the level of an individual statement, "please fetch data 10 or 50 rows at a time." But this does not work until you turn off autoCommit. Turn off autoCommit, and it starts working.

But walking through the code and calling setFetchSize everywhere is inconvenient, so we added a setting that specifies the default value for the whole connection. A sketch follows.
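
defaultRowFetchSize is the pgjdbc connection parameter for this; a sketch (the value 100 anticipates the advice below, and the table name is made up).

    String url = "jdbc:postgresql://localhost/test?defaultRowFetchSize=100";
    try (Connection con = DriverManager.getConnection(url, "user", "secret")) {
        con.setAutoCommit(false); // portion-wise fetch needs autoCommit off
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM big_table")) {
            while (rs.next()) { /* rows arrive 100 at a time */ }
        }
        con.commit();
    }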

So we set the setting. What did we get? If we fetch in small portions, say 10 rows at a time, the overhead is very large. So the value should be around a hundred.

Ideally, of course, we should learn to limit by bytes, but the recipe is this: set defaultRowFetchSize to a hundred or more and rejoice.

Let's move on to inserting data. Insertion is simpler, and there are different options. For example, INSERT ... VALUES: a good option. You can also say INSERT ... SELECT; in practice they are one and the same, there is no performance difference.

The books say to execute batch statements, and they say you can execute more complex commands with several groups of values in parentheses. And Postgres has a wonderful feature: COPY, which is faster still. A batch sketch follows.
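
A sketch of a plain JDBC batch insert; the table and data are made up, and con is an open Connection.

    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO users(id, name) VALUES (?, ?)")) {
        for (int i = 0; i < 1000; i++) {
            ps.setInt(1, i);
            ps.setString(2, "user-" + i);
            ps.addBatch();      // queue locally, nothing sent yet
        }
        ps.executeBatch();      // send the whole pack in few round-trips
    }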

If you measure it, you can again make several interesting discoveries. How do we want it to work? We want to avoid parsing and avoid executing unnecessary commands.

In practice, TCP does not let us do that. If the client is busy sending a request, the database does not read further requests while trying to send us its answers. As a result, the client waits for the database to read the request, and the database waits for the client to read the response.

So the client is forced to send a synchronization packet periodically. That means extra network round-trips and extra wasted time.

And the more of them we add, the worse it gets. The driver is very pessimistic and adds them quite often, roughly once every 200 rows, depending on row size and so on.

https://github.com/pgjdbc/pgjdbc/pull/380

Sometimes you fix one line and everything speeds up 10 times. It happens. Why? As usual, a constant was being used somewhere, and the value 128 happened to mean "do not use batching".

Java Microbenchmark Harness (JMH)

The good news is that this never made it into an official release: it was discovered before the release went out. All the numbers I give are from modern versions.

Let's measure. We measure a simple InsertBatch, and a "multi-values" InsertBatch: the same thing, but with many value groups per statement. A tricky move: not everyone knows it, yet it is simple, much easier than COPY. A sketch follows.
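
A sketch of the multi-values variant, here with four rows per statement (the benchmark varies this count); the table is made up.

    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO users(id, name) VALUES (?,?),(?,?),(?,?),(?,?)")) {
        for (int i = 0; i < 1000; i += 4) {
            for (int j = 0; j < 4; j++) {
                ps.setInt(2 * j + 1, i + j);
                ps.setString(2 * j + 2, "user-" + (i + j));
            }
            ps.addBatch();
        }
        ps.executeBatch(); // 4x fewer EXECUTEs than one row per statement
    }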

You can also do COPY; a sketch via the driver's CopyManager follows.
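
Through pgjdbc this goes via the driver-specific CopyManager API; a sketch with made-up table and data.

    import java.io.IOException;
    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.SQLException;
    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    class CopyExample {
        static long copyUsers(Connection con) throws SQLException, IOException {
            CopyManager cm = con.unwrap(PGConnection.class).getCopyAPI();
            String csv = "1,alice\n2,bob\n"; // illustrative rows
            return cm.copyIn(
                    "COPY users(id, name) FROM STDIN WITH (FORMAT csv)",
                    new StringReader(csv));
        }
    }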

And you can do it with structures: declare a user-defined type, pass an array of them, and INSERT directly into the table.

If you open the link pgjdbc/ubenchmark/InsertBatch.java, this code is on GitHub, and you can see exactly what queries it generates. But never mind the details.

We ran it. And the first thing we realized is that not using batch is simply not an option. All the batching variants take almost zero time compared to executing one statement at a time.

We insert data into a very simple table with three columns. What do we see? All three of these options are roughly comparable, and COPY is, of course, the best.

This chart shows inserting in chunks: one row per VALUES, two rows, three rows, or ten of them separated by commas. That is the horizontal axis: 1, 2, 4, ... 128. You can see that for the batch insert, drawn in blue, this helps a lot. Even going from one row at a time to four rows at a time makes it about twice as fast, simply because we packed a bit more into VALUES: fewer EXECUTE operations.

Using COPY on small volumes is hopeless. I didn't even plot the first two points: the green COPY numbers go off the chart.

COPY should be used when you have at least a hundred rows of data: the overhead of setting up a COPY operation is large. And, to be honest, I did not dig in this direction; I optimized batch, not COPY.

What do we do next? We measured, and now we know: use either structures or a clever batch that combines several values per statement.

What should be taken away from today's report?

  • PreparedStatement is our everything. It gives a lot of performance, but it also brings a barrel of tar (the pitfalls above).
  • You need to run EXPLAIN ANALYZE 6 times.
  • And sprinkle in OFFSET 0 and tricks like +0 to fix the share of problem queries that remain.

Source: habr.com
