Backups with WAL-G. What's there in 2019? Andrey Borodin

This is a transcript of Andrey Borodin's early-2019 talk "Backups with WAL-G. What's there in 2019?".


Hi all! My name is Andrey Borodin. I am a developer at Yandex. I have been interested in PostgreSQL since 2016, after I talked to the developers and they said that everything is simple: you take the source code, build it, and everything works out. Since then I haven't been able to stop; I keep writing all sorts of things.

One of the things I work on is a backup system, WAL-G. At Yandex we have been dealing with backup systems for PostgreSQL for a very long time, and you can find a series of six talks online about how we build them. Every year they evolve a little, develop a little, and become more reliable.

But today's talk is not only about what we have done; it is also about how simple everything is. How many of you have already seen my talks about WAL-G? It is good that quite a few people haven't, because I will start with the simplest part.


If you happen to have a PostgreSQL cluster, and I suspect everyone here has a couple, and you do not yet have a backup system, then you need to get any S3-compatible storage or Google Cloud storage.


For example, you can come to our booth and take a promotional code for Yandex Object Storage, which is S3 compatible.


Then create a bucket. It is just a container for your data.


Create a service user.


Create an access key (aws-s3-key) for the service user.


Download the latest stable release of WAL-G.

How do our pre-releases differ from releases? I am often asked to cut releases sooner. If a version goes without a known bug for long enough, for example a month, I promote it to a release. The latest release here is from November, which means that every month since then we have found some bug, usually in non-critical functionality, and so have not cut a new release. The November release has no bugs known to us; the bugs were introduced later, during development.


Once you have downloaded WAL-G, you can run the simple "backup-list" command, passing it environment variables. It will connect to Object Storage and tell you what backups you have. At first, of course, you should have no backups. The point of this slide is to show that everything is quite simple: this is a console tool that accepts environment variables and executes subcommands.
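For example, a minimal sketch of that first invocation (the bucket name, endpoint, and key values here are placeholders; check the exact environment variable names against the WAL-G documentation for your version):

```shell
# Point WAL-G at the S3-compatible storage (all values are placeholders).
export WALG_S3_PREFIX="s3://my-backup-bucket"
export AWS_ENDPOINT="https://storage.yandexcloud.net"
export AWS_ACCESS_KEY_ID="..."          # service user key id
export AWS_SECRET_ACCESS_KEY="..."      # service user secret

# List existing backups; on a fresh bucket the list is empty.
wal-g backup-list
```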


After that, you can make your first backup: run "backup-push" and point WAL-G at the location of your cluster's pgdata. If you do not yet have a backup system, PostgreSQL will most likely tell you that you need to enable archive_mode.


This means going into the settings, turning on "archive_mode=on", and setting "archive_command" to the corresponding WAL-G subcommand. On this topic people often, for some reason, use bash scripts and build wrappers around WAL-G. Please do not do this. Use the functionality that is in WAL-G, and if something is missing, write to us on GitHub. WAL-G assumes that it is the only program running in archive_command.
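A minimal postgresql.conf fragment for this setup might look as follows (a sketch; the binary path is illustrative, and the recommended archive_command should be checked against the WAL-G documentation):

```
# postgresql.conf (sketch)
archive_mode = on
archive_command = '/usr/local/bin/wal-g wal-push %p'
```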


We use WAL-G mainly to build High Availability clusters in our managed database service at Yandex.


It is usually used in a topology of one master and several replicas, while backing up to Yandex Object Storage.


The most common scenario is creating cluster copies using point-in-time recovery. In that case the performance of the backup system is not so important for us; we just need to restore a new cluster from the backup.


Usually we need backup system performance when adding a new node. Why does that matter? Usually people add a new node because the existing cluster cannot handle the read load; they need a new replica. If we add the load of pg_basebackup onto the master, the master can fall over. So it was very important for us to be able to provision a new node quickly from the archive while creating minimal load on the master.


And another similar situation: the need to re-provision the old master after the cluster's master role was switched away from a data center that lost connectivity.


  • As a result, formulating the requirements for the backup system, we realized that pg_basebackup does not suit us for operating in the cloud.
  • We wanted data compression. Almost any backup system provides compression, except the built-in tools.
  • We wanted to parallelize everything, because a cloud user buys a large number of processor cores, and if some operation has no parallelism, those cores become useless.
  • We need encryption, because the data is often not ours and cannot be stored in the clear. By the way, our contribution to WAL-G began with encryption: we finished the encryption support, after which we were asked, "Maybe one of you will maintain the project?". And I have been working on WAL-G for over a year since.
  • We also needed resource throttling, because over time operating the cloud we found that some people have an important production load at night, and backups must not interfere with it. So we added resource throttling.
  • As well as listing and management.
  • And verification.


We looked at many different tools. Fortunately, PostgreSQL offers a huge selection. And in every one of them we were missing something, some one small feature.


Having considered the existing systems, we settled on developing WAL-G. It was a new project at the time, and it was quite easy to steer its development toward a cloud backup infrastructure.


The main ideology that we adhere to is that WAL-G should be as simple as a balalaika.


There are four commands in WAL-G:

WAL-PUSH - archive a WAL segment.

WAL-FETCH - fetch a WAL segment.

BACKUP-PUSH - make a backup.

BACKUP-FETCH - fetch a backup from the backup storage.
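Put together, a typical setup uses these four commands roughly like this (a sketch; the paths, segment names, and the "LATEST" selector are illustrative and should be checked against the WAL-G documentation):

```shell
# Archiving side (run via archive_command on the master):
wal-g wal-push /var/lib/postgresql/data/pg_wal/000000010000000000000001

# Restore side (run via restore_command on a recovering node):
wal-g wal-fetch 000000010000000000000002 /var/lib/postgresql/data/pg_wal/RECOVERYXLOG

# Make a base backup of the data directory:
wal-g backup-push /var/lib/postgresql/data

# Fetch the latest backup into an empty data directory:
wal-g backup-fetch /var/lib/postgresql/data LATEST
```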


In fact, WAL-G also manages these backups: it can list WAL segments and backups and delete those in the history that are no longer needed.


One of the important functions for us is the function of creating delta copies.

Delta copies mean that we do not create a full backup of the entire cluster; we back up only the modified pages of the modified files in the cluster. Functionally this looks very similar to recovering via WAL, but WAL replay is single-threaded, while a delta backup can be applied in parallel. So if we have a full backup made on Saturday, daily delta backups, and a failure on Thursday, we need to apply 4 delta backups and 10 hours of WAL. It takes about the same time, because the delta backups are applied in parallel.


LSN-based deltas mean that when creating a backup we read each page and compare its LSN with the LSN of the previous backup to understand whether it has changed. Any page that could potentially contain modified data must be present in the delta backup.


As I said, quite a lot of attention has been paid to parallelism.


But the archive API in PostgreSQL is sequential. PostgreSQL archives one WAL file at a time and requests one WAL file at a time when restoring. So when the database has requested one WAL file with the WAL-FETCH command, we call the WAL-PREFETCH command, which prepares 8 more WAL segments, fetching the data from the object store in parallel.
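On the restore side this typically means pointing restore_command at WAL-G (a sketch; %f and %p are standard PostgreSQL recovery placeholders, while the prefetching itself is a WAL-G implementation detail):

```
# recovery settings (sketch): PostgreSQL asks for one segment,
# and WAL-G prefetches subsequent segments in parallel behind it.
restore_command = 'wal-g wal-fetch %f %p'
```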

And when the database asks us to archive one WAL segment, we look into archive_status and check whether there are other ready WAL files, and we try to upload those in parallel as well. This gives a significant performance gain, substantially reducing the backlog of unarchived WAL. Many backup system developers feel this is a risky approach, because we rely on knowledge of internals that are not a PostgreSQL API: PostgreSQL guarantees neither the presence of the archive_status folder nor the semantics of the ready markers for WAL files there. Nevertheless, we study the source code, see that it works this way, and try to exploit it. And we watch where PostgreSQL development is heading; if this mechanism is ever broken, we will stop using it.


In its purest form, an LSN-based delta requires reading every cluster file whose modification time in the file system has changed since the previous backup. We lived with that for a long time, almost a year. In the end we arrived at WAL-deltas.

This means that every time we archive WAL on the master, we not only compress it, encrypt it, and send it over the network, but we also read it at the same time. We analyze the records in it, understand which blocks have changed, and collect delta files.

A delta file describes a range of WAL files: it records which blocks were changed in that WAL range. These delta files are then archived as well.


Here we ran into the fact that although we parallelized everything quite quickly, we cannot read a sequential history in parallel: in a given segment we may encounter the tail of a previous WAL record that we cannot yet match to anything, because parallel reading means we sometimes analyze the future before we have its past.


As a result, we put the unintelligible fragments into _delta_partial files. When we later get back to the past, we glue the pieces of the WAL record together, parse it, and understand what changed in it.

If even one point appears in our WAL parsing history where we do not understand what happened, then at the next backup we have to read the entire cluster again, just as with the plain LSN-based delta.


In the end, all this suffering led us to open-source the WAL parsing library used in WAL-G. As far as I know, no one is using it yet, but if anyone wants to, write to us and use it; it is publicly available. (Updated link: https://github.com/wal-g/wal-g/tree/master/internal/walparser)


As a result, the overall flow of information looks rather complicated. The master archives WAL and archives the delta files. A replica that makes a backup must receive the delta files for the interval between backups. Some parts of the history it has to fetch as whole WAL segments and parse itself, because not all of the history fits into the delta files. Only after that can the replica archive a full delta backup.


On the charts everything looks much simpler. This is the load on one of our real clusters, measured in processor cores. The LSN-based delta backup, made once a day, ran from three in the morning until five. A WAL-delta took us about 20 minutes here. It became much faster, at the cost of more intensive network traffic.


Since we have information about which blocks changed at which point in the database's history, we went further and decided to integrate functionality similar to the pg_prefaulter tool.


This means that when a standby runs its restore command, it tells WAL-G to fetch the next WAL file. We can estimate which data blocks the WAL replay process will touch in the near future, and we initiate reads of those blocks. This is done to make better use of SSD controllers. WAL replay will eventually reach a page that needs to be changed; that page is on disk and not in the page cache, and replay will wait synchronously for it to arrive. But next to it runs WAL-G, which knows that the next few hundred megabytes of WAL will need certain pages, and starts warming them up in parallel, initiating many disk reads so that they execute concurrently. This works well on SSDs, but unfortunately it is completely inapplicable to a hard drive, where our hints only get in the way.

This is what is in the code now.


There are features that we would like to add.


In this picture you can see that a WAL-delta takes relatively little time, and that is reading the changes that occurred in the database over a whole day. We could run WAL-delta not only at night, because it is no longer a significant source of load. We could read WAL-delta every minute, because it is cheap: in one minute we can scan all the changes that have happened to the cluster. This could be called an "instant WAL-delta".


The point is to reduce, when we restore the cluster, the amount of history that has to be replayed sequentially. That is, the volume of WAL that PostgreSQL replays should shrink, because replay takes a significant amount of time.

But that is not all. If we know that some block will be changed before the backup consistency point, we can avoid changing it in the past. For now we have file-level WAL-delta roll-forward optimization: if, for example, on Tuesday some table was dropped entirely, or some of its files were deleted entirely, then when we apply Monday's delta on top of Saturday's base backup, we will not even create that data.

We want to extend this technique to the page level. That is, if some part of a file changes on Monday but will be overwritten on Wednesday, then when restoring to a point on Thursday we do not need to write the first few versions of those pages to disk.

But this is still an idea under active discussion internally; it has not made it into the code yet.


We want to make one more feature in WAL-G. We want to make it extensible, because we need to support different databases and would like a uniform approach to backup management. The problem is that the MySQL APIs are radically different. MySQL's PITR is based not on a physical WAL log but on the binlog. And MySQL has no archiving hook that would tell an external system that a binlog is finished and needs to be archived; we would have to sit in cron next to the database and poll whether something is ready.

Likewise, during a MySQL restore there is no restore command that could tell the system which files are needed. Before you start restoring a cluster, you have to know which files you need; you have to guess yourself. But these problems can probably be worked around somehow. (Clarification: MySQL is now supported.)


In the report, I also wanted to talk about those cases when WAL-G is not suitable for you.


If you do not have a synchronous replica, WAL-G does not guarantee that the last data will be preserved: if archiving lags, the last few segments of history are at risk. In the absence of a synchronous replica I would not recommend using WAL-G. It is designed primarily for cloud installations, which imply a High Availability setup with a synchronous replica responsible for the durability of the last committed bytes.


I often see people trying to run both WAL-G and WAL-E at the same time. We maintain backward compatibility in the sense that WAL-G can fetch WAL archived by WAL-E and can restore a backup made by WAL-E. But since both systems use parallel wal-push, they start stealing files from each other. Even if we fix this in WAL-G, it remains in WAL-E: WAL-E looks at archive_status, sees finished files, and archives them, while the other system never learns that the WAL file existed, because PostgreSQL will not try to archive it a second time.

What are we fixing on the WAL-G side? We do not tell PostgreSQL that a file was picked up in parallel; when PostgreSQL asks us to archive it, we already know that a file with this modification time and this md5 has been archived, and we simply tell PostgreSQL that everything is done, without actually doing anything.

But on the WAL-E side, this problem is unlikely to be fixed, so it is impossible to make an archive command that will archive the file in both WAL-G and WAL-E.

In addition, there are cases when WAL-G is not suitable for you now, but we will definitely fix it.

Firstly, we currently have no built-in backup verification, neither during backup nor during recovery. Of course this is implemented in the cloud, but only as a separate pre-check that simply restores the cluster. It would be good to give such functionality to users. By verification I mean that WAL-G should be able to restore the cluster, start it, and run smoke tests: pg_dumpall to /dev/null and amcheck to verify the indexes.


There is currently no way in WAL-G to pin one backup outside the retention window. We support a window: keep the last seven days, keep the last ten backups, keep the last three full backups. Quite often people come and say: "We need the backup of what the data looked like on New Year's Day, and we want to keep it forever." WAL-G cannot do this yet. (Note: this has since been added; see the backup-mark option in https://github.com/wal-g/wal-g/blob/master/PostgreSQL.md)
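The retention-window cleanup itself does exist; a sketch of how it is typically invoked (the subcommand spelling and the --confirm flag follow current WAL-G documentation, but verify them for your version):

```shell
# Dry run: show what would be deleted, keeping the 3 most recent full backups.
wal-g delete retain FULL 3

# Actually delete; WAL-G requires --confirm for destructive actions.
wal-g delete retain FULL 3 --confirm
```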


And we do not yet have page checksum verification or integrity checks of all WAL segments during PITR validation.


Out of all this I put together a project for Google Summer of Code. If you know bright students who would like to write some Go and receive several thousand dollars from a certain company starting with "G", recommend our project to them. I will mentor this project, and they can do it. If no students turn up, I will do it myself over the summer.


And we have many other small problems that we are gradually working on. And some pretty strange things happen.

For example, if you point WAL-G at an empty backup source, it simply crashes: if you tell it to back up an empty folder, there is no pg_control file, and it decides that it does not understand something. In theory it should print a proper message explaining to the user how to use the tool. But that is not so much a programming problem as a matter of good, intelligible wording.

We cannot do offline backups: if the database is shut down, we cannot back it up. Yet this is quite simple. We name backups by the LSN at which they started, and for a stopped database the LSN would have to be read from the control file. This remains an unimplemented feature. Many backup systems can back up a stopped database, and it is convenient.

We also do not yet handle running out of backup space properly, because internally we usually work with large buckets, and we never got around to it. But if someone wants to program some Go right now, add handling for the out-of-space error in the bucket. I will definitely review the pull request.

And the most important thing that worries us: we want as many Docker integration tests as possible, covering different scenarios. Right now we test only the basic scenarios on every commit, but we want every supported feature checked per commit. In particular, we still support PostgreSQL 9.4 and 9.5. We support them because the community supports them, but we do not verify per commit that they are still unbroken, and I think that is a fairly serious risk.


WAL-G runs for us on more than a thousand clusters in Yandex's managed database service, and every day it backs up several hundred terabytes of data.

We have a lot of TODO in our code. If you want to program, come, we are waiting for a pull request, we are waiting for questions.


Questions

Good evening! Thank you! My guess is that if you're using WAL-delta, then you're probably relying heavily on full-page writes. And if so, have you tested it? You showed a beautiful graph; how much better does it get if FPW is turned off?

We have full-page writes enabled and have not tried turning them off. That is, I as a developer have not tried; the system administrators have probably looked into this question. But we need FPW. Almost no one turns it off, because otherwise it is impossible to take a backup from a replica.

Thanks for the report! I have two questions. The first question is what will happen to tablespaces?

We are waiting for a pull request. Our databases live on SSD and NVMe disks, and we do not really need this feature, so right now I am not ready to spend serious time doing it well. I am all for supporting it. There are people who added support, but in a way that suits only them: they forked and do not send a pull request. (Added in version 0.2.13)

And the second question. You said at the very beginning that WAL-G assumes that it works alone and wrappers are not needed. I use wrappers myself. Why shouldn't they be used?

We want it to be as simple as a balalaika. This means that you don’t need anything at all, except for a balalaika. We want the system to be simple. If you have a functionality that you need to make in a script, then come and tell us - we will do it in Go.

Good evening! Thanks for the report! We were unable to get WAL-G to work with GPG decryption. Encrypts normally, does not want to decrypt. Did something not work out for us? The situation is depressing.

Create an issue on GitHub, let's figure it out.

That is, have you encountered this?

There is a known reporting quirk: when WAL-G does not understand what a file is, it asks "maybe it's encrypted?", so the problem may not be encryption at all. I want to fix the logging around this. It should decrypt. We are currently reworking this area, because we do not really like how obtaining the public and private keys is organized: we call an external GPG so that it gives us its keys, and then pass those keys to the OpenPGP implementation compiled into WAL-G, where the encryption happens. We want to improve this and support libsodium encryption (added in version 0.2.15). Of course decryption should work; let's figure it out, but I need more symptoms than a couple of words. We can meet in the speakers' room and look at your system. (PGP encryption without external GPG: v0.2.9)

Hello! Thanks for the talk! I have two questions. I have an odd desire: to keep pg_basebackup and the WAL log with two providers, that is, in one cloud and also in another. Is there some way to do this?

It doesn't exist now, but it's an interesting idea.

I just don’t trust one provider, I want to have another just in case too.

The idea is interesting. Technically, this is not difficult to implement. So that the idea is not lost, can I ask to make an issue on GitHub?

Yes, of course.

And then, when students come for Google Summer of Code, we will add this to the project so there is more work to give them, to get more out of them.

And the second question. There is an issue on GitHub, I think already closed, about a panic during restore. To work around it you made a separate build; it is right there in the issue. There is also a workaround via an environment variable that makes it run in a single thread, but then it works very slowly. We have hit this problem, and so far it has not been fixed.

The problem is that for some reason the storage (Ceph) resets connections when we come to it with a lot of parallelism. What can be done about it? The retry logic looks like this: we try to upload the file again. If some files fail to upload in one pass, we make a second pass for everything that did not get through, and as long as at least one file per iteration succeeds, we repeat again and again. We refined the retry logic with exponential backoff. But it is not entirely clear what to do when the connection is simply dropped by the storage side: when we upload in a single stream, it does not drop those connections. What can we improve here? We have network throttling and can limit each connection by the number of bytes it sends. Beyond that, I do not know how to deal with an object storage that will not let us upload to or download from it in parallel.

Is there no SLA? Doesn't it say how hard they allow themselves to be pushed?

The thing is, the people who come with this question usually have their own storage; nobody comes with this from Amazon, Google Cloud, or Yandex Object Storage.

Maybe the question is not for you?

In this case it does not really matter to whom. If there are ideas for dealing with this, let's implement them in WAL-G; so far I have no good ones. There are some object storages that handle listing differently: you ask them to list objects, and they add an extra folder. WAL-G then gets scared: there is some thing here that is not a file, I cannot restore it, so the backup was not restored. In reality you have a fully restored cluster, but WAL-G returns an error status, because the object storage returned some strange information it did not fully understand.

This is exactly what happens in the Mail.ru cloud.

If it is possible to build a reproducer...

It is consistently reproducible...

If there is a reproducer, then I think we will experiment with retry strategies, figure out how to retry, and understand what the cloud wants from us. Maybe it is stable for us on three connections and does not drop them; then we will carefully back off to three. Right now we reduce parallelism very quickly: if we started the restore in 16 threads, then after the first retry there are 8 threads, then 4, 2, and 1, and then it pulls the file in a single stream. If there are magic values, say 7.5 streams pump best, we will settle there and keep trying 7.5 streams. That is the idea.

Thanks for the talk! What does a complete WAL-G workflow look like? Take the simple case with no page deltas: we take an initial backup and then archive WAL until we're blue in the face. As I understand it, something has to break this up; at some point a page-level delta backup must be made. Does some external process drive this, or how does it happen?

The delta backup API is quite simple. There is a setting, max delta steps I believe it is called, which defaults to zero. That means every time you run backup-push, it pushes a full backup. If you change it to any positive number, for example 3, then the next time you run backup-push it looks at the history of previous backups, sees that you are not exceeding a chain of 3 deltas, and makes a delta.
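A sketch of this setting (the variable name WALG_DELTA_MAX_STEPS matches current WAL-G documentation, but verify it for your version; the data directory path is a placeholder):

```shell
# Allow chains of up to 3 delta backups on top of a full backup.
export WALG_DELTA_MAX_STEPS=3

# First run pushes a full backup; subsequent runs push deltas
# until the chain of 3 is exhausted, then a full backup again.
wal-g backup-push /var/lib/postgresql/data
```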

So every time we run WAL-G, does it try to make a full backup?

No, you run WAL-G and it tries to make a delta if your policies allow it.

Roughly speaking, if you run it from zero each time, will it behave like pg_basebackup?

No, it will still run faster, because it uses compression and parallelism. pg_basebackup will also store the WAL alongside the backup. WAL-G relies on you having archiving configured, and it will issue a warning if it is not.

pg_basebackup can be run without the WAL.

Yes, then they behave almost the same. pg_basebackup copies to the file system. By the way, we have a new feature I forgot to mention: like pg_basebackup, we can now back up to the file system. I do not know why it is needed, but it is there.

For example, on CephFS. Not everyone wants to configure Object Storage.

Yes, someone probably asked for this feature for exactly that reason, and we made it.

Thanks for the report! There is just a question about copying to the file system. Out of the box, do you now support copying to remote storage, for example, if there is some kind of shelf in the data center or something else?

In that formulation it is a difficult question. Yes, we support it, but this functionality is not in any release yet: all pre-releases support it, release versions do not. It was added in version 0.2 and will be in a release as soon as we fix all known bugs. Right now this is possible only in the pre-release, and the pre-release has two bugs: a problem with WAL-E restore that we have not fixed, and a delta-backup bug introduced in the latest pre-release. So for now we recommend everyone use release versions. As soon as the pre-release has no remaining bugs, we can say that we support Google Cloud, S3-compatible storage, and file storage.

Hello, thanks for the talk. As I understand it, WAL-G is not a centralized system like Barman? Are you planning to move in that direction?

The problem is that we moved away from that direction. WAL-G lives on the database host, on every cluster host. When we grew to several thousand clusters, we had many Barman installations, and every time something in them broke it was a big problem, because they had to be repaired and we had to work out which clusters were currently without backups. I do not plan to develop WAL-G toward dedicated backup servers, but if the community wants such functionality, I do not mind at all.

We have teams responsible for storage. And it feels good that it is not us: there are dedicated people who put our files where files are safe. They do all sorts of clever encoding there to withstand the loss of a certain number of files, and they are responsible for network bandwidth. With a Barman server you may suddenly discover that small databases with heavy traffic have gathered on one server: there seems to be plenty of space, but for some reason nothing fits through the network. It can also turn out the other way around: plenty of network and processor cores, but the disks have run out. We got tired of juggling all this and moved to a model where data storage is a separate service run by dedicated people.

P.S. A new version, 0.2.15, has been released; by default it can read a .walg.json configuration file from the postgres home directory, so you can drop the bash scripts. An example .walg.json is in this issue: https://github.com/wal-g/wal-g/issues/545
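Such a file might look roughly as follows (a sketch; the key names mirror the environment variables discussed above, and all values are placeholders, so check the linked issue and the WAL-G documentation for the exact supported keys):

```json
{
    "WALG_S3_PREFIX": "s3://my-backup-bucket",
    "AWS_ENDPOINT": "https://storage.yandexcloud.net",
    "AWS_ACCESS_KEY_ID": "...",
    "AWS_SECRET_ACCESS_KEY": "...",
    "WALG_DELTA_MAX_STEPS": "3"
}
```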


Source: habr.com
