Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Sometime in the distant future, the automatic removal of unnecessary data will be one of the important tasks of the DBMS [1]. In the meantime, we ourselves need to take care of deleting or moving unnecessary data to less expensive storage systems. Let's say you decide to delete a few million rows. A fairly simple task, especially if the condition is known and there is a suitable index. "DELETE FROM table1 WHERE col1 = :value" - what could be simpler, right?

Video:

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

  • I have been on the Highload program committee since the first year, i.e. since 2007.

  • And I've been with Postgres since 2005. Used it in many projects.

  • I have also been running the RuPostgres group since 2007.

  • We have grown to 2100+ members on Meetup. It is the second largest in the world after New York; San Francisco was overtaken long ago.

  • I have lived in California for several years. I deal more with American companies, including large ones. They are active users of Postgres. And there are all sorts of interesting things.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://postgres.ai/ is my company. We are in the business of automating tasks that eliminate development slowdowns.

When you work around Postgres, there are sometimes bottlenecks. Let's say you need to wait for an admin to set up a test environment for you, or you need to wait for the DBA to respond to you. We find such bottlenecks in the development, testing, and administration process and try to eliminate them with automation and new approaches.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

I was recently at VLDB in Los Angeles. This is the largest conference on databases. And there was a report that in the future DBMS will not only store, but also automatically delete data. This is a new topic.

There is more and more data in the world, and it is measured in zettabytes; a zettabyte is a million petabytes. It is already estimated that tens of zettabytes of data are stored in the world, and the amount keeps growing.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://vldb2019.github.io/files/VLDB19-keynote-2-slides.pdf

And what to do with it? Obviously it needs to be removed. Here is a link to this interesting report. But so far this has not been implemented in the DBMS.

And those who count the money want us to delete this data, so technically we should be able to do it well.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

What I will talk about next is an abstract situation that bundles together a bunch of real ones - a kind of compilation of what has actually happened to me and the databases around me many times over the years. These pitfalls are everywhere, and everyone keeps stepping on them.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Let's say we have a database, or several, that keeps growing. And some records are obviously garbage. For example, a user started doing something and did not finish it. And after some time we know that this unfinished work no longer needs to be stored. That is, we would like to clean up these garbage rows to save space, improve performance, and so on.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

In general, the task is to automate the removal of specific things - specific rows - in some table.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

And we have a query like this, which is what we will talk about today - garbage removal.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

We asked an experienced developer to do it. He took this query, checked it on his machine - everything works. Tested it on staging - everything is fine. Rolled it out - it works. We run it once a day - everything is fine.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

The database grows and grows. Daily DELETE starts to work a little more slowly.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Then we realize that we have a marketing campaign coming and traffic will be several times higher, so we decide to temporarily pause non-essential things. And we forget to turn them back on.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

A few months later we remember. That developer has quit or is busy with something else, so another one is told to bring the job back.

He checked on dev and on staging - everything is OK. Naturally, you still need to clean up what has accumulated. He checked - everything works.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

What happens next? Then everything falls apart. Performance degrades until, at some point, everything goes down. Everyone is in shock, nobody understands what is happening. And then it turns out that this DELETE was the culprit.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

So what went wrong? Here is a list of what could have gone wrong. Which of these is the most important?

  • For example, there was no review, i.e. a DBA expert did not look at it. With an experienced eye he would have spotted the problem immediately, and besides, he has access to prod, where several million rows have accumulated.

  • Maybe they tested it in the wrong place.

  • Maybe the hardware is outdated and this database server needs an upgrade.

  • Or something is wrong with the database itself, and we need to move from Postgres to MySQL.

  • Or maybe something is wrong with the operation itself.

  • Maybe there are mistakes in how the work is organized, and we need to fire someone and hire better people?

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

There was no DBA review. If there had been a DBA, he would have seen these several million rows and, even without any experiments, would have said: "You don't do it like that." If this code lived in GitLab or GitHub, with a code review process in which this operation could not reach prod without the DBA's approval, then obviously the DBA would have said: "This cannot be done."

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

And he would say that you will have problems with disk IO, all processes will go crazy, there may be locks, and you will also block autovacuum for a good many minutes - so this is not good.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

http://bit.ly/nancy-hl2018-2

The second mistake - we checked in the wrong place. We saw after the fact that a lot of junk data had accumulated on prod, but the developer did not have that accumulated data in his database, and nobody created this junk on staging. Accordingly, there was only a handful of rows, which were deleted quickly.

We understand that our tests are weak, i.e. the process we have built does not catch problems. An adequate database experiment was not performed.

An ideal experiment is preferably carried out on the same hardware. It is not always possible to do it on the same hardware, but it is very important that it be a full-size copy of the database. This is what I have been preaching for several years now. And a year ago I gave a talk about this - you can watch it all on YouTube.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Maybe our hardware is bad? If you look, latency jumped. We saw that utilization was 100%. Of course, if these were modern NVMe drives, it would probably be much easier for us. And maybe we wouldn't have gone down because of it.

If you are in the cloud, an upgrade is easy there: bring up new replicas on the new hardware, do a switchover, and everything is fine. Pretty easy.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

But can we somehow get by with more modest disks? And here, with the help of the DBA, we dive into a topic called checkpoint tuning. It turns out that we had no checkpoint tuning at all.

What is a checkpoint? It exists in any DBMS. When you have data in memory that changes, it is not immediately written to disk. The information that the data has changed is first written to the write-ahead log. And at some point the DBMS decides that it is time to flush the real pages to disk, so that if we have a failure, we can do less REDO. It's like in a video game: if we are killed, we restart from the last checkpoint. And every DBMS implements it.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

The default settings in Postgres are lagging behind. They are designed for the data and transaction volumes of 10-15 years ago. And checkpoints are no exception.

Here is information from our Postgres checkup report, i.e. an automated health check, for a database of several terabytes. And you can see clearly that almost 90% of checkpoints are forced (requested) ones.

What does that mean? There are two settings. A checkpoint can be triggered by timeout, for example every 10 minutes. Or it can be triggered when quite a lot of WAL has been written.

By default max_wal_size is set to 1 gigabyte. In practice a forced checkpoint then actually happens in Postgres after roughly 300-400 megabytes: you have changed that much data and your checkpoint fires.

And if nobody has tuned it, while the service has grown and the company earns a lot of money and has a lot of transactions, then checkpoints come once a minute, sometimes every 30 seconds, and sometimes they even overlap. This is quite bad.
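
A quick way to see this on your own system is to compare timed and requested (forced) checkpoints and look at the current settings. A minimal sketch; on Postgres up to version 16 these counters live in pg_stat_bgwriter (newer versions moved them to pg_stat_checkpointer):

    SHOW checkpoint_timeout;
    SHOW max_wal_size;

    -- share of forced (requested) checkpoints since the last stats reset
    SELECT checkpoints_timed,
           checkpoints_req,
           round(100.0 * checkpoints_req
                 / greatest(checkpoints_timed + checkpoints_req, 1), 1) AS forced_pct
    FROM pg_stat_bgwriter;

If forced_pct is close to 90-100%, checkpoints are driven by max_wal_size rather than by the timeout.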

And we need to make sure that it comes less often. That is, we can raise max_wal_size. And it will come less frequently.

But we have developed a whole methodology for how to do it more correctly, that is, how to make a decision about choosing settings, clearly based on specific data.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Accordingly, we are doing two series of experiments on databases.

The first series - we vary max_wal_size. And we run a massive operation. First we do it with the default setting of 1 gigabyte. And we run a massive DELETE of many millions of rows.

You can see how hard it is for us. We see that disk IO is very bad. We look at how much WAL we have generated, because this is very important. We see how many times checkpoints happened. And we see that it is not good.

Next we increase max_wal_size. We repeat. We increase, we repeat. And so on many times. In principle, around 10 data points are good: 1, 2, 4, 8 gigabytes and so on. And we look at the behavior of the particular system. It is clear that the hardware here should be like on prod. You must have the same disks, the same amount of memory, and the same Postgres settings.

And in this way we benchmark our system, and we know how the DBMS will behave in case of a bad mass DELETE and how it will checkpoint.
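
To compare runs at different max_wal_size values, it helps to record how much WAL each run generates. A minimal sketch (the table and the garbage condition are made up for illustration):

    -- remember the WAL position right before the mass operation
    CREATE TEMP TABLE wal_before AS SELECT pg_current_wal_lsn() AS lsn;

    DELETE FROM table1 WHERE is_garbage;   -- the mass operation under test

    -- how much WAL did it generate?
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), lsn)) AS wal_generated
    FROM wal_before;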


Example: DELETE several million rows by index, rows are "scattered" across pages.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Here is an example from one database. With the default max_wal_size of 1 gigabyte, it is very clear that our disks are maxed out on writes. This picture is a typical symptom of a very sick patient; it really felt bad. And there was one single operation going on - just a DELETE of several million rows.

If such an operation is allowed on prod, we will simply go down, because it is clear that this one DELETE alone pins the disks to the ceiling.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Further, with 16 gigabytes, you can see that a sawtooth pattern has appeared. The teeth are already better: we are still knocking on the ceiling, but not as badly - there is some headroom. On the right is the write throughput, and the second graph is the number of operations. And it is clear that with 16 gigabytes we already breathe a little easier.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

And with 64 gigabytes you can see that things have become much better. The teeth are clearly pronounced, and there is more room to survive other operations and do something else with the disk.

Why is that?

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

I will dive into the details a little, but this topic - how to do checkpoint tuning - could fill a whole talk on its own, so I will not go too deep; I will just outline the difficulties a little.

If checkpoints happen too often, and we update our rows not sequentially but find them by index - which is good, because we do not delete the whole table - then it may happen that we first touched the first page, then the thousandth, and then came back to the first. And if between these visits to the first page a checkpoint has already flushed it to disk, it will flush it again, because we dirtied it a second time.

And we force the checkpointer to save it many times over. Redundant work for it, so to speak.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

But that is not all. Pages are 8 kilobytes in Postgres and 4 kilobytes in Linux. And there is the full_page_writes setting. It is enabled by default. And that is correct, because if we turn it off, there is a danger that only half of a page will be saved in case of a crash.

The way writing to WAL, the write-ahead log, behaves is that when a checkpoint has happened and we then change a page for the first time, the entire page, i.e. all 8 kilobytes, goes into the WAL, even though we only changed a row that weighs maybe 100 bytes. We have to write down the entire page.

For subsequent changes only the specific tuple is written, but the first time we write everything.

And, accordingly, if another checkpoint has happened, then we have to start from scratch again and push the whole page once more. With frequent checkpoints, when we keep walking over the same pages, full_page_writes = on costs more than it otherwise would, i.e. we generate more WAL. More is sent to the replicas, to the archive, to disk.

And, accordingly, we get two kinds of redundancy.
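
The effect is easy to observe directly from psql. A sketch (it needs the right to run CHECKPOINT; the table and column names are illustrative, and the numbers are approximate): the first change of a row after a checkpoint writes roughly a full 8-kilobyte page image into WAL, while the second change of the same row writes only a small record.

    CHECKPOINT;

    SELECT pg_current_wal_lsn() AS lsn0 \gset
    UPDATE table1 SET col1 = col1 WHERE id = 1;   -- first change of this page after the checkpoint
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn0'));   -- roughly 8 kB: a full page image

    SELECT pg_current_wal_lsn() AS lsn1 \gset
    UPDATE table1 SET col1 = col1 WHERE id = 1;   -- same row again, the page is already logged
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn1'));   -- only a small record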

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

If we increase max_wal_size, it turns out that we make life easier for both the checkpointer and the WAL writer. And that's great.

So why not set it to a terabyte and live with that? What's bad about it? It's bad because, in case of a failure, recovery will take hours: the last checkpoint was a long time ago and a lot has changed since then. And all of that has to be REDOne. So we run the second series of experiments.

We run an operation and, just when a checkpoint is about to complete, we deliberately do kill -9 on Postgres.

After that we start it again and watch how long it takes to come up on this hardware, i.e. how much REDO it has to do in this bad situation.

I will note that the situation is bad twice over. First, we crashed right before the checkpoint finished, so we have a lot to replay. And second, we had a massive operation running. If checkpoints had been firing on timeout, most likely less WAL would have been generated since the last checkpoint. So it is a doubly worst case.

We measure this situation for different max_wal_size values and understand that with max_wal_size of 64 gigabytes, in this doubly worst case, recovery will take 10 minutes. And we think about whether that suits us or not. This is a business question. We need to show this picture to those responsible for business decisions and ask: "How long can we afford to be down at most in case of a problem? Can we be down in the worst situation for 3-5 minutes?" And you make a decision.

And here is an interesting point. We have a couple of talks about Patroni at this conference. And maybe you are using it. It is an autofailover for Postgres. GitLab and Data Egret talked about it.

And if you have an autofailover that kicks in within 30 seconds, then maybe 10 minutes of recovery does not matter? Because we will have switched to the replica by then, and everything will be fine. This is a moot point. I don't know a clear answer. I just feel that this topic is not only about crash recovery.

If recovery after a failure takes long, we will be uncomfortable in many other situations too. For example, in the same experiments, when we do something and sometimes have to wait 10 minutes for the database to come up.

I still wouldn't go too far, even if we have an autofailover. As a rule, values such as 64 or 100 gigabytes are good. Sometimes it is even worth choosing less. In general, this is a subtle science.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

To iterate, for example over max_wal_size = 1, 2, 4, 8 gigabytes and so on, you need to repeat the mass operation many times. You ran it once. And on the same database you want to run it again - but you have already deleted everything. What to do?

I'll talk later about our solution and about what we do in order to iterate in such situations. That is the most correct approach.

But in this case we got lucky. If, as it says here, we do "BEGIN, DELETE, ROLLBACK", then we can repeat the DELETE. That is, if we cancel it ourselves, we can repeat it. And physically the data will stay in the same place. You don't even get any bloat. You can iterate over such DELETEs.

Such a DELETE with ROLLBACK is ideal for checkpoint tuning, even if you don't have a properly deployed database lab.
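
A minimal sketch of one such iteration (note that EXPLAIN ANALYZE really executes the DELETE, which is exactly why the ROLLBACK matters; the table and condition are illustrative):

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
      DELETE FROM table1 WHERE is_garbage;   -- the operation really runs and dirties pages
    ROLLBACK;                                -- the rows "come back", physically staying in place

    -- change the setting and repeat the block above:
    -- ALTER SYSTEM SET max_wal_size = '8GB'; SELECT pg_reload_conf();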

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

We created a small table with a single column "i". Postgres has system columns. They are invisible unless you ask for them explicitly. They are: ctid, xmin, xmax.

ctid is the physical address: here it is the zero page, the first tuple in that page.

You can see that after ROLLBACK the tuple remained in the same place. That is, we can try again and it will behave the same way. This is the main thing.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

xmax is the ID of the transaction that deleted the tuple. It got stamped, but Postgres knows that this transaction was rolled back, so it does not matter whether it is 0 or a rolled-back transaction ID. This means that it is possible to iterate over the DELETE and check how the system behaves under bulk operations. You can build database labs for the poor this way.
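
A small sketch of that demonstration (the table here is made up):

    CREATE TABLE t1 AS SELECT generate_series(1, 5) AS i;

    SELECT ctid, xmin, xmax, i FROM t1;   -- xmax = 0: nobody has deleted these tuples

    BEGIN;
    DELETE FROM t1;
    ROLLBACK;

    SELECT ctid, xmin, xmax, i FROM t1;   -- same ctid values; xmax now holds the xid of the aborted transaction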

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Now about programmers. DBAs always scold programmers for this: "Why do you run such long and heavy operations?". This is a completely different, perpendicular topic. There was administration; now there will be development.

Obviously, the DELETE was not split into parts. That much is clear. You simply cannot avoid breaking such a DELETE of millions of rows into parts. It will run for 20 minutes, and everything will go down. But, unfortunately, even experienced developers make this mistake, even in very large companies.

Why is it important to split it up?

  • If we see that the disk is struggling, let's slow down. If the operation is split into parts, we can add pauses, we can throttle.

  • And we will not block others for a long time. In some cases it does not matter: if you are deleting real garbage that nobody works with, then most likely you will not block anyone except autovacuum, because it will wait for the transaction to complete. But if you delete something that somebody else might request, they will be blocked, and there will be some kind of chain reaction. Long transactions should be avoided on websites and in mobile applications.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://postgres.ai/products/joe/

This is interesting. I often see developers ask: "What batch size should I choose?".

It is clear that the larger the batch size, the smaller the transaction overhead, i.e. the extra cost per transaction. But at the same time the duration of that transaction grows.

I have a very simple rule: take as much as you can, but keep one batch under a second of execution.

Why a second? The explanation is very simple and understandable to everyone, even non-technical people. We perceive reaction time. Take 50 milliseconds: if something changes that fast, the eye registers it. Anything faster is harder to notice. If something responds after 100 milliseconds - for example, you clicked the mouse and it answered you after 100 milliseconds - you already feel this slight delay. A second is already perceived as sluggish.

Accordingly, if we break our mass operation into bursts of 10 seconds each, we risk blocking someone for several seconds at a time, and people will notice. Therefore, I prefer not to go over a second. But at the same time, do not split it up too finely, because the transaction overhead will become noticeable. The database will have to work harder, and other problems may arise.

We choose the batch size. In each case we can do it differently. It can be automated. And we convince ourselves that processing one batch is efficient - that is, we do a DELETE or UPDATE of one batch.

By the way, everything I am talking about is not only about DELETE. As you have guessed, it applies to any bulk operation on data.

And we see that the plan is excellent. You can see an index scan - an index-only scan is even better. A small amount of data is involved, and it completes in under a second. Super.

And we still need to make sure there is no degradation. It happens that the first batches run fast, and then it gets worse and worse. The process is such that you need to test a lot. This is exactly what database labs are for.

And we still have to prepare something that will allow us to follow this correctly in production. For example, we can write timestamps to a log, we can record where we are now and whom we have just deleted. That will allow us to understand later what is happening and, in case something goes wrong, to find the problem quickly.
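
A sketch of such a batched delete with progress logging and throttling, assuming Postgres 11+ (where COMMIT is allowed inside a DO block run outside an explicit transaction); the table, condition, and batch size are illustrative:

    DO $$
    DECLARE
      deleted int;
    BEGIN
      LOOP
        DELETE FROM table1
        WHERE id IN (
          SELECT id FROM table1
          WHERE is_garbage       -- your garbage condition
          LIMIT 5000             -- batch sized so that one batch runs well under a second
        );
        GET DIAGNOSTICS deleted = ROW_COUNT;
        RAISE NOTICE '% deleted % rows', clock_timestamp(), deleted;   -- progress in the log
        EXIT WHEN deleted = 0;
        COMMIT;                  -- keep each transaction short
        PERFORM pg_sleep(0.1);   -- a small pause to go easy on disks and replicas
      END LOOP;
    END $$;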

If we need to check query efficiency and we need to iterate many times, there is such a thing as the Joe bot. It is already available and is used by dozens of developers daily. And it can hand out a copy of a huge terabyte-scale database on request in 30 seconds - your own copy. You can delete something there, say RESET, and delete it again. You can experiment with it this way. I see a future for this thing. And we are already doing it.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://docs.gitlab.com/ee/development/background_migrations.html

What batching strategies are there? I see 3 different strategies that developers use for batches.

The first one is very simple. We have a numeric ID. We break it into ranges and work with those. The downside is clear. In the first range we may have 100 rows of real garbage, in the second 5 rows or none at all, or every row may turn out to be garbage. The work is very uneven, but it is easy to split: take the maximum ID and slice it up. This is the naive approach.

The second strategy is a balanced approach. It is used in GitLab. They scanned the table and found ID boundaries so that each batch contains exactly the same number of records. Then they put the batches in a queue and process them. You can do this in multiple threads.

With the first strategy, too, by the way, you can work in several threads. It is not difficult.
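
A sketch of how such balanced batch boundaries could be computed (the names and the batch size are illustrative):

    -- every 10,000th garbage row becomes a batch boundary
    SELECT id
    FROM (
      SELECT id, row_number() OVER (ORDER BY id) AS rn
      FROM table1
      WHERE is_garbage
    ) numbered
    WHERE rn % 10000 = 1
    ORDER BY id;

The resulting IDs can be stored in a queue table and handed out to several workers.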

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

https://medium.com/@samokhvalov/how-partial-indexes-affect-update-performance-in-postgres-d05e0052abc

But there is a cooler and better approach - the third strategy. And when possible, it is better to choose it. We do it on the basis of a special index. In this case it will most likely be an index on our garbage condition and the ID. We include the ID so that it becomes an index-only scan and we do not go to the heap.

Generally, an index-only scan is faster than an index scan.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

And we quickly find the IDs that we want to delete. BATCH_SIZE we choose in advance. And we not only fetch the rows, we fetch them in a special way and immediately lock them. But we lock them so that if they are already locked, we do not wait for them - we skip them and take the next ones. This is FOR UPDATE SKIP LOCKED. This super feature of Postgres allows us to work in several threads if we want. It works in a single thread too. And there is a CTE here - it is a single query. And the real deletion happens in the second part of this CTE - with RETURNING *. You can return just the id, but * is better if you don't have much data in each row.
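
A sketch of one such batch as a single query, roughly in the spirit described above (the table, the condition, the id and created_at columns, and the batch size are all made up; the supporting index would be something like a partial index on (id) WHERE is_garbage):

    WITH batch AS (
      SELECT id
      FROM table1
      WHERE is_garbage            -- the condition covered by the partial index
      ORDER BY id
      LIMIT 5000                  -- BATCH_SIZE
      FOR UPDATE SKIP LOCKED      -- skip rows already locked by parallel workers
    ), deleted AS (
      DELETE FROM table1
      USING batch
      WHERE table1.id = batch.id
      RETURNING table1.id, table1.created_at
    )
    SELECT count(*)        AS rows_deleted,
           min(id)         AS min_id,
           max(id)         AS max_id,
           max(created_at) AS max_created_at
    FROM deleted;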

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Why do we need that? So that we can report back: we have actually deleted this many rows just now, and here are the boundaries - by ID or by created_at. You can do min, max. Something else can be computed as well; you can stuff a lot in here. And it is very convenient for monitoring.

There is one more note about the index. If we decide that we need a special index for this task, then we need to make sure that it does not spoil heap-only tuple (HOT) updates. Postgres has statistics for that. You can see them in pg_stat_user_tables for your table: whether HOT updates are being used or not.

There are situations when your new index simply cuts them off. And then all the other updates that were already working slow down. Not just because a new index appeared (each index slows updates down a little, but only a little), but here it also ruins HOT, and this special optimization can no longer be applied to the table. This happens sometimes. It is a subtlety that few people remember, and this rake is easy to step on. Sometimes you have to approach it from the other side and manage without this new index, or make a different index, or use some other method - for example, the second strategy.
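
A quick way to check whether HOT updates are still being used for a table (a sketch; the table name is illustrative):

    SELECT relname,
           n_tup_upd,
           n_tup_hot_upd,
           round(100.0 * n_tup_hot_upd / greatest(n_tup_upd, 1), 1) AS hot_pct
    FROM pg_stat_user_tables
    WHERE relname = 'table1';

If hot_pct drops sharply after a new index is added, that index is cutting off HOT updates.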

But this is the most optimal strategy: split the work into batches and fire at the batches with a single query, deleting a little at a time, and so on.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Long transactions https://gitlab.com/snippets/1890447

Blocked autovacuum - https://gitlab.com/snippets/1889668

blocking issue - https://gitlab.com/snippets/1890428

Mistake #5 is a big one. Nikolai from Okmeter talked about Postgres monitoring. Ideal Postgres monitoring, unfortunately, does not exist. Some tools are closer to it, some farther. Okmeter is close enough to perfect, but a lot is missing and needs to be added. You need to be ready for this.

For example, dead tuples should definitely be monitored. If you have a lot of dead tuples in a table, then something is wrong. It is better to react now, otherwise there may be degradation and we may go down. It happens.
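
A minimal query to keep an eye on dead tuples (the thresholds for alerting are up to you):

    SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;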

If IO is high, then clearly that is not good either.

Long transactions too. Long transactions should not be allowed in OLTP. And here is a link to a snippet that lets you take it and set up tracking of long transactions.

Why are long transactions bad? Because all the locks are released only at the end, and we hold everyone up. Plus, we block autovacuum for all tables. This is not good at all. Even if you have hot standby enabled on the replica, it is still bad. In general, it is best to avoid long transactions everywhere.
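
The linked snippets are more complete, but a minimal check for long transactions can look like this (the one-minute threshold is arbitrary):

    SELECT pid,
           now() - xact_start AS xact_age,
           state,
           left(query, 80) AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '1 minute'
    ORDER BY xact_age DESC;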

If we have many tables that are not being vacuumed, then we need an alert. Such a situation is possible: we can indirectly affect the operation of autovacuum. This is a snippet from Avito, which I slightly improved. And it turned out to be an interesting tool to see what is going on with autovacuum - for example, which tables are waiting there and will never get their turn. You should also put it in monitoring and have an alert.

And lock issues - the forest of lock trees. I like to take something from someone and improve it. Here, Data Egret wrote a cool recursive CTE that shows the forest of lock trees. It is a good diagnostic tool. And on its basis you can also build monitoring. But this must be done carefully: set a small statement_timeout for yourself, and a lock_timeout is desirable too.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Sometimes all these mistakes add up.

In my opinion, the main mistake here is organizational. It is organizational, because it is not really the technology that failed. It is number 2 - we checked in the wrong place.

We checked in the wrong place because we did not have a production clone that is easy to check on. A developer may not have access to production at all.

And so we checked in the wrong place. If we had checked there, we would have seen it ourselves. The developer would have seen it all even without a DBA, had he checked it in a good environment with the same amount of data and the same data layout. He would have seen all this degradation, and he would have been ashamed.

More about autovacuum. After we have done a massive sweep of several million rows, we still need to do a REPACK. This is especially important for the indexes - they will be in bad shape after we have cleaned everything out.

And if you want to bring back the daily cleanup job, then I would suggest doing it more often but in smaller pieces. It can run once a minute or even more often, a little bit at a time. And you need to monitor two things: that this job has no errors and that it does not fall behind. The trick that I showed solves exactly that.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

What we do is open source. It is published on GitLab. And we make it possible for people to check things even without a DBA. We are building Database Lab, that is, the base component that Joe currently runs on. And you can grab a copy of production. There is now a Slack implementation of Joe: you can say there "explain such-and-such query" and immediately get the result on your own copy of the database. You can even run a DELETE there, and nobody will notice.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

Let's say you have 10 terabytes - we make the database lab 10 terabytes as well. And 10 developers can work simultaneously, each with their own 10-terabyte copy. Everyone can do what they want: delete, drop, and so on. It sounds like fantasy. We will talk about this tomorrow.

Dear DELETE. Nikolay Samokhvalov (Postgres.ai)

This is called thin provisioning. It is a kind of magic that greatly reduces delays in development and testing and makes the world a better place in this regard. That is, it simply lets you avoid problems with bulk operations.

Example: 5 terabyte database, getting a copy in less than 30 seconds. And it does not even depend on the size, that is, it does not matter how many terabytes.

Today you can go to postgres.ai and dig into our tools. You can register to see what is there. You can install the bot. It's free. Write to us.

Questions

Very often in real situations it turns out that the data which should remain in the table is much smaller than what needs to be deleted. That is, in such a situation it is often easier to take a different approach: create a new table, copy only the needed data into it, and truncate the old table. It is clear that a programmatic approach is needed for the moment while you are switching over. What do you think of this approach?

This is a very good approach and a very good task. It is very similar to what pg_repack does, and very similar to what you have to do when you convert IDs from 4 bytes to 8 bytes. Many frameworks did this a few years ago: the tables simply grew, and they had to be converted to 8-byte IDs.

This task is quite difficult. We have done it. And you have to be very careful. There are locks and so on. But it can be done. That is, the standard approach is to go the pg_repack way. You create such a new table. And before you start copying snapshot data into it, you also create a table that tracks all changes. There is a trick where you may even skip tracking some changes. There are subtleties. And then you switch over by replaying the changes. There will be a short pause while we stop everyone, but in general it can be done.

If you look at pg_repack on GitHub, then, when there was a task to convert an ID from int4 to int8, there was an idea to use pg_repack itself. This is also possible - it is a bit of a hack, but it works for this too. You can intervene in the trigger that pg_repack uses and say there: "We don't need this data", i.e. we transfer only what we need. And then it just switches over and that's it.

With this approach, we also get a second copy of the table, in which the data is already re-indexed and packed very evenly, with nice fresh indexes.

There is no bloat, so it is a good approach. But I know that there are attempts to build automation for this, i.e. to make a universal solution. I can put you in touch with that automation. It is written in Python, which is a good thing.
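
For the simple case, when a short full lock is acceptable and no change tracking is needed, the skeleton of this approach looks roughly like this (a deliberately simplified sketch with made-up names; the real pg_repack-style procedure copies from a snapshot first, replays the changes collected by a trigger, and only takes a short lock for the final switch):

    BEGIN;
    LOCK TABLE table1 IN ACCESS EXCLUSIVE MODE;             -- everyone waits from here on

    CREATE TABLE table1_new (LIKE table1 INCLUDING ALL);     -- same columns, defaults, indexes
    INSERT INTO table1_new SELECT * FROM table1 WHERE NOT is_garbage;  -- keep only what we need

    ALTER TABLE table1     RENAME TO table1_old;
    ALTER TABLE table1_new RENAME TO table1;
    COMMIT;

    -- later, once nothing references it anymore:
    -- DROP TABLE table1_old;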

I'm actually from the MySQL world, so I just came to listen. And we use this approach.

But it only works if, say, 90% has to go. If it's 5%, then it is not a very good fit.

Thanks for the report! If there are no resources to make a complete copy of prod, is there any algorithm or formula to calculate the load or size?

Good question. So far we manage to find room for multi-terabyte copies. Even if the hardware is not the same - for example, less memory, fewer CPUs, and not exactly the same disks - we still do it. If there is absolutely nowhere to run it, then you need to think. Let me think about it until tomorrow; you came, we will talk - it's a good question.

Thanks for the talk! You started by saying that there is this cool Postgres which has such and such limitations but is developing. And all of this is, by and large, a workaround. Doesn't all of this conflict with the development of Postgres itself, in which some kind of deferred DELETE will appear, or something else that handles at a low level what we are trying to patch over here with our strange means?

If we tell SQL to delete or update many records in one transaction, how can Postgres spread that out? We are physically limited in such operations. It will still take a long time, and we will be holding locks all that time, and so on.

With indexes it has been done.

I can assume that the same checkpoint tuning could be automated. Someday it might be. But then I don't really understand the question.

The question is whether there is a development vector that goes in that direction, while yours runs in parallel. That is, haven't they thought about it yet?

I talked about the principles that can be used right now. There is also the Nancy bot; with it you can do automated checkpoint tuning. Will it someday be in Postgres? I don't know, it hasn't even been discussed yet. We are still far from that. But there are scientists who build new systems. And they push us toward automatic indexes; there are developments there. For example, you can look at auto-tuning: it selects parameters automatically. But it will not do checkpoint tuning for you yet. That is, it will tune for performance - shared_buffers and so on.

And for checkpoint tuning you can do this: if you have a thousand clusters and different hardware, different virtual machines in the cloud, you can use our Nancy bot to automate it. And max_wal_size will be selected for your target settings automatically. But so far, unfortunately, this is not even close to being in core.

Good afternoon. You talked about the dangers of long transactions. You said that autovacuum is blocked in the case of deletions. How else does it harm us? Because we are mostly talking about freeing up space and being able to reuse it. What else are we missing?

Autovacuum is maybe not the biggest problem here. The fact that a long transaction can block other transactions is the more dangerous possibility. It may or may not happen, but if it does, it can be very bad. And autovacuum is also a problem. There are two problems with long transactions in OLTP: locks and autovacuum. And if you have hot standby feedback enabled on the replica, then the autovacuum hold will still reach the master - it arrives from the replica. But at least there will be no locks from that side. Locks, however, will be there. We are talking about data changes, so locks are the important point here. And if this all goes on for a long, long time, then more and more transactions get blocked. They in turn block others, and lock trees appear. I gave a link to the snippet. And this problem becomes noticeable faster than the problem with autovacuum, which can only accumulate.

Thanks for the talk! You started your talk by saying that we tested in the wrong place. We continued that idea: we need to take the same hardware and the same database. Suppose we gave the developer such a database, and he ran the query, and it seems fine. But he tests without live load, whereas on live we have, say, 60-70% load. And even with this tuning, it doesn't work out very well.

Having an expert on the team, and relying on DBA experts who can predict what will happen under a real background load, is important. When we simply ran our changes on a clean copy, we see one picture. A more advanced approach is to do the same thing again, but with a load simulated from production. That is quite cool. You have to grow up to that point; it is the grown-up way. For now we just looked at what we have and also checked whether we have enough resources. That's a good question.

When we are already selecting the garbage and we have, for example, a deleted flag...

This is what autovacuum does automatically in Postgres.

Oh, does it do that?

Autovacuum is the garbage collector.

Thank you!

Thanks for the talk! Is there an option to design the database with partitioning right away, in such a way that all the garbage lands somewhere off to the side of the main table?

Of course, there is.

Is it possible then to protect ourselves if we have locked a table that should not be used?

Of course there is. But it is a chicken-and-egg question. If we all knew what would happen in the future, then of course we would do everything nicely. But the business changes, new columns appear, new queries appear. And then - oops, we want to delete this. But this ideal situation does occur in real life, just not always. Overall, though, it is a good idea. Just truncate and that's it.

Source: habr.com
