Writing to PostgreSQL at sublight speed: 1 host, 1 day, 1TB

Recently I described how to use typical recipes to increase the performance of "read" SQL queries against a PostgreSQL database. Today we will talk about how writing to the database can be made more efficient without any "tweaks" in the config, simply by organizing the data flows correctly.


#1. Partitioning

There has already been an article about how and why it is worth organizing application-level partitioning "in theory"; here we will talk about the practice of applying some of these approaches within our monitoring service for hundreds of PostgreSQL servers.

"Deeds of bygone days..."

Initially, like any MVP, our project started under a fairly low load: monitoring covered only a dozen of the most critical servers, and all tables were relatively compact... But time passed, the number of monitored hosts kept growing, and after yet another attempt to do something with one of the 1.5TB tables, we realized that although it was possible to live like this, it was very inconvenient.

Those were almost epic times: various versions of PostgreSQL 9.x were current, so all partitioning had to be done "manually", through table inheritance and routing triggers with dynamic EXECUTE.

The resulting solution turned out to be universal enough to roll out to all our tables:

  • An empty "header" parent table was declared, describing all the desired indexes and triggers.
  • From the client's point of view, records were written into this "root" table, while inside, a BEFORE INSERT routing trigger "physically" inserted each record into the desired partition. If that partition did not exist yet, we caught the exception and...
  • ... used CREATE TABLE ... (LIKE ... INCLUDING ...) to create a partition from the parent table's template, with a constraint on the desired date, so that when data is retrieved only that partition gets read (a rough sketch of the scheme follows right after this list).
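
A rough sketch of that scheme, with purely illustrative table and column names rather than our real schema, looks something like this:

-- empty "header" parent table: describes the structure and the indexes
CREATE TABLE measurements (
    dt      date    NOT NULL,
    host    integer NOT NULL,
    payload jsonb
);
CREATE INDEX ON measurements (host, dt);

-- routing trigger: writes the row into the per-day child,
-- creating the child on first use
CREATE OR REPLACE FUNCTION measurements_route() RETURNS trigger AS $$
DECLARE
    part text := 'measurements_' || to_char(NEW.dt, 'YYYYMMDD');
BEGIN
    BEGIN
        EXECUTE format('INSERT INTO %I SELECT ($1).*', part) USING NEW;
    EXCEPTION WHEN undefined_table THEN
        -- the partition does not exist yet: create it from the parent template
        -- (concurrent creation by two sessions is not handled in this sketch)
        EXECUTE format(
            'CREATE TABLE %I (LIKE measurements INCLUDING ALL,
                CHECK (dt = %L::date)) INHERITS (measurements)',
            part, NEW.dt);
        EXECUTE format('INSERT INTO %I SELECT ($1).*', part) USING NEW;
    END;
    RETURN NULL; -- nothing is stored in the parent table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_route_bi
    BEFORE INSERT ON measurements
    FOR EACH ROW EXECUTE PROCEDURE measurements_route();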

PG10: first try

But partitioning through inheritance has historically been poorly suited to an active write stream or a large number of child partitions. For example, the algorithm for choosing the target partition had quadratic complexity, so with 100+ partitions you can imagine for yourself how it behaved...

PG10 greatly improved this situation by implementing support for native partitioning, so we tried to apply it right after migrating the storage, but...

As it turned out after digging through the manual, a natively partitioned table in this version:

  • does not allow declaring indexes on it
  • does not support triggers on it
  • cannot be anyone's "descendant"
  • does not support INSERT ... ON CONFLICT
  • cannot create partitions automatically

Having painfully stepped on this rake, we realized that we could not avoid modifying the application, and postponed further research for six months.

PG10: second chance

So we set about solving the problems one at a time:

  1. Since we still needed triggers and INSERT ... ON CONFLICT in some places, we made an intermediate proxy table to handle them.
  2. We got rid of "routing" in triggers, that is, of dynamic EXECUTE.
  3. We moved all the index definitions out into a separate template table, so that they are not even present on the proxy table.

Finally, after all this, the main table could be partitioned natively. Creating a new partition has so far remained the application's responsibility.
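
A sketch of what such a layout can look like in PG10 (the names are illustrative, and the proxy's forwarding trigger is omitted for brevity):

-- the main table is now natively partitioned by day; in PG10 it cannot
-- carry its own indexes, triggers or INSERT ... ON CONFLICT
CREATE TABLE measurements (
    dt            date    NOT NULL,
    host          integer NOT NULL,
    plan_template integer,
    payload       jsonb
) PARTITION BY RANGE (dt);

-- the template table holds all the index definitions
CREATE TABLE measurements_template (LIKE measurements);
CREATE INDEX ON measurements_template (host, dt);

-- a thin proxy that clients actually write to: triggers and
-- INSERT ... ON CONFLICT still work here
CREATE TABLE measurements_proxy (LIKE measurements);

-- the application creates each day's partition from the template
CREATE TABLE measurements_20181012 (LIKE measurements_template INCLUDING ALL);
ALTER TABLE measurements ATTACH PARTITION measurements_20181012
    FOR VALUES FROM ('2018-10-12') TO ('2018-10-13');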

"Saw" dictionaries

As in any analytical system, we had "facts" and "dimensions" (dictionaries). In our case these were, for example, the template bodies of similar slow queries or the query texts themselves.

Our "facts" had long been partitioned by day, so we calmly dropped obsolete partitions and they did not bother us (logs!). But with the dictionaries there was a problem...

Not that there were many of them, but the roughly 100TB of "facts" came with a 2.5TB dictionary. You can't conveniently delete anything from a table like that, you can't compact it in any reasonable time, and writing to it gradually became slower and slower.

It is a dictionary, after all... every entry in it should appear exactly once... and that's right, but!.. nothing prevents us from having a separate dictionary for each day! Yes, this introduces a certain redundancy, but it allows us to:

  • write/read faster thanks to the smaller partition size
  • use less memory by working with more compact indexes
  • store less data thanks to being able to quickly drop obsolete days (a sketch of such a per-day dictionary follows this list)
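
A minimal sketch of such a per-day dictionary, with illustrative names and an md5-style hash assumed as the key:

-- the dictionary is partitioned by day just like the "facts"
CREATE TABLE query_dict (
    dt         date NOT NULL,
    query_hash text NOT NULL,   -- e.g. md5 of the query/plan text
    body       text
) PARTITION BY RANGE (dt);

-- each day gets its own compact partition and its own small unique index;
-- an obsolete day is removed by a single cheap DROP TABLE
CREATE TABLE query_dict_20181012 PARTITION OF query_dict
    FOR VALUES FROM ('2018-10-12') TO ('2018-10-13');
CREATE UNIQUE INDEX ON query_dict_20181012 (query_hash);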

As a result of this whole set of measures, CPU load dropped by ~30% and disk load by ~50%.

At the same time, we continued to write exactly the same thing to the database, just with a lower load.

#2. Evolution and refactoring of the database

So, we ended up with a separate partition for each day of data. In effect, CHECK (dt = '2018-10-12'::date) is both the partitioning key and the condition for a record to land in a specific partition.

Since all reports in our service are built in the context of a specific date, the indexes left over from the "unpartitioned times" were all of the form (Server, Date, Plan template), (Server, Date, Plan node), (Date, Error class, Server), ...

But now every partition has its own copy of each such index... and within each partition the date is a constant... It turns out that in every such index we simply store a constant as one of the fields, which makes it bigger and searches slower, while bringing no benefit at all. We left this rake lying around for ourselves, oops...

The direction of optimization is obvious: simply remove the date field from all indexes on the partitioned tables. With our volumes, the gain is about 1TB/week!
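
In DDL terms the change is trivial. Continuing the illustrative schema from the sketches above, an index of the (Server, Date, Plan template) shape on a daily partition turns into one without the date:

-- before: the constant "dt" sits in every entry of every partition's index
CREATE INDEX ON measurements_20181012 (host, dt, plan_template);

-- after: partition pruning already fixes the day, so the same queries
-- are served by a noticeably smaller index
CREATE INDEX ON measurements_20181012 (host, plan_template);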

And note that this terabyte previously had to be written somehow, so the disks should now be loaded less as well! The effect of this cleanup, to which we devoted a whole week, was clearly visible in the disk load graph.


#3. "Spreading out" the peak load

One of the big troubles of highly loaded systems is excessive synchronization of operations that do not require it. Sometimes "because nobody noticed", sometimes "it was easier that way", but sooner or later you have to get rid of it.

Zooming in on the previous graph, we could see that the disk load "swings" with up to double the amplitude between adjacent samples, which clearly should not happen, statistically speaking, with this number of operations.


Achieving this is quite simple. By then we were already monitoring almost 1000 servers, each handled by a separate logical thread, and each thread flushes its accumulated information to the database at a certain interval, something like this:

setInterval(sendToDB, interval)

The problem is precisely that all the threads start at about the same time, so their send moments almost always coincide "to the second". Oops #2...

Luckily, this is easy enough to fix by adding a "random" jitter to the timing:

setInterval(sendToDB, interval * (1 + 0.1 * (Math.random() - 0.5)))

#4. Caching what needs caching

The third traditional highload problem is the absence of a cache where there could be one.

For example, we made it possible to analyze plans in terms of plan nodes (all those Seq Scan on users), but immediately forgot that they are, for the most part, identical.

No, of course nothing gets written to the database twice: that is cut off by the trigger with INSERT ... ON CONFLICT DO NOTHING. But the data still reaches the database, and the server has to do an extra read just to check for the conflict. Oops #3...
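
For illustration, against the query_dict_20181012 partition sketched earlier: the statement itself writes nothing on a duplicate, yet the row still travels over the wire and the unique index still gets probed before DO NOTHING fires:

-- in PG10 the ON CONFLICT insert targets the proxy/partition directly,
-- since the partitioned parent does not accept it
INSERT INTO query_dict_20181012 (dt, query_hash, body)
VALUES ('2018-10-12', md5('Seq Scan on users'), 'Seq Scan on users')
ON CONFLICT DO NOTHING;   -- duplicate: nothing is written, but it was not free

An application-side cache that remembers which entries were already sent for the current day skips such statements entirely.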

The difference in the number of records sent to the database before and after enabling caching was obvious.


And with it came a corresponding drop in storage load.


Summing up

"Terabyte-per-day" only sounds scary. If you do everything right, then this is just 2^40 bytes / 86400 seconds = ~12.5MB/sthat even desktop IDE screws held. 🙂

But seriously, even with a tenfold "skew" of the load over the course of the day, you comfortably fit within the capabilities of modern SSDs.


Source: habr.com
