WAL-G: new features and community expansion. Georgy Rylov

Below is a transcript of Georgy Rylov's talk from early 2020, "WAL-G: new features and community expansion".

Open-source maintainers run into many problems as their projects grow. How do you deliver more and more requested features, fix more issues, and still manage to review more and more pull requests? Using WAL-G (a backup tool for PostgreSQL) as an example, I will explain how we solved these problems by launching a university course on open-source development, what we have achieved, and where we are heading next.

Hi everyone again! I am a developer at Yandex from Yekaterinburg. And today I will talk about WAL-G.

The title of the talk does not say that it is about backups. Does anyone know what WAL-G is? Or does everyone know? Raise your hand if you don't. Wow, you came to the talk without knowing what it is about.

Let me tell you what is on the agenda today. Our team has been working on backups for a long time, and this is another talk in a series about how we store data safely, securely, conveniently, and efficiently.

Earlier in the series there were many talks by Andrey Borodin and Vladimir Leskov. There were many of us, and we have all been talking about WAL-G for years.

clck.ru/F8ioz - https://www.highload.ru/moscow/2018/abstracts/3964

clck.ru/Ln8Qw - https://www.highload.ru/moscow/2019/abstracts/5981

This talk differs slightly from the others: those were mostly about the technical side, while here I will talk about the problems we faced as the community grew, and about a small idea that helps us deal with them.

A few years ago, WAL-G was a fairly small project that we got from Citus Data. And we just took it. And it was developed by one person.

Back then, WAL-G lacked:

  • Backups from a replica.
  • Incremental backups.
  • WAL-delta backups.
  • And a whole lot more.

WAL-G has grown a lot in these few years.

By 2020, all of the above had appeared. On top of that, here is what we have now:

  • Over 1 stars on GitHub.
  • 150 forks.
  • About 15 open PRs.
  • Many more contributors.
  • A constant stream of open issues, even though we go in there literally every day and work on them.

And we came to the conclusion that this project requires more of our attention, even when we ourselves do not need to implement something for our Managed Databases service in Yandex.

Somewhere in the fall of 2018, an idea came to us. A team usually has several ways to ship features and fix bugs when it is short of hands. For example, you can hire another developer and pay them. Or you can take on an intern for a while and also pay them a salary. But there is also a fairly large group of people, some of whom already really know how to write code. You just don't always know what quality that code is.

We thought about it and decided to try to attract students. But the students do not take part in everything; they only do some of the work. For example, they write tests, fix bugs, and implement features that do not affect the core functionality. The core functionality is creating backups and restoring from them. If we introduce a bug in backup creation, we get data loss, and of course no one wants that. Everyone wants everything to be very reliable. So we do not want to let code that we trust less than our own into that part. In other words, any non-critical code is what we would like to get from our extra hands.

Under what conditions is a student's PR accepted?

  • They are required to cover their code with tests. Everything must pass in CI.
  • The PR goes through 2 reviews: one by Andrey Borodin and one by me.
  • Additionally, to check that it does not break anything in our service, I separately roll out a build with this commit, and we check in end-to-end tests that nothing breaks.

Special course on Open Source

A little about why this is needed and why, it seems to me, it is a cool idea.

For us, the benefit is obvious:

  • We get extra hands.
  • And we are looking for candidates for the team among smart students who write smart code.

What is the benefit for students?

It may be less obvious, because students, at a minimum, are not paid for the code they write; they only receive grades.

I asked them about it. And in their words:

  • Experience as an open-source contributor.
  • A line on the CV.
  • A chance to prove themselves and pass an interview at Yandex.
  • A chance to take part in GSoC.
  • One more elective course for those who want to write code.

I will not talk about how the course was arranged. I will only say that WAL-G was the main project. We also included projects such as Odyssey, PostgreSQL and ClickHouse in this course.

And we handed out tasks not only within this course, but also as topics for diploma theses and term papers.

What about user benefits?

Now let's move on to the part that probably interests you more. What is in it for you? The students fixed a lot of bugs and implemented the feature requests that you had asked us for.

And let me tell you about the things that you have long wanted and that have been implemented.

Support for tablespaces. Tablespaces have been expected in WAL-G since its first release, because WAL-G is the successor of another backup tool, WAL-E, which supported backups of databases with tablespaces.

Let me briefly remind you what they are and why they are needed. Typically, all your Postgres data lives in one directory on the filesystem, called base. This directory contains all the files and subdirectories that Postgres requires.

Tablespaces are directories where Postgres data resides, but they lie outside the base directory. The slide shows that the tablespaces are outside the base directory.

What does this look like for Postgres itself? The base directory has a separate pg_tblspc subdirectory. And it contains symlinks to directories that actually contain Postgres data outside the base directory.
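
The symlink layout described above can be inspected with a few lines of Python. This is a minimal sketch, not WAL-G code: the function name `list_tablespaces` and the demo directory names are hypothetical, and the only assumption about Postgres is the one stated above, that `pg_tblspc` holds symlinks pointing at the real tablespace directories.

```python
import os
import tempfile


def list_tablespaces(data_dir):
    """Map each symlink in <data_dir>/pg_tblspc to the directory it points at."""
    tblspc_dir = os.path.join(data_dir, "pg_tblspc")
    mapping = {}
    for name in sorted(os.listdir(tblspc_dir)):
        link = os.path.join(tblspc_dir, name)
        if os.path.islink(link):
            # realpath follows the symlink to the actual data location
            mapping[name] = os.path.realpath(link)
    return mapping


# Demo on a throwaway layout (all names below are made up for illustration).
demo_root = tempfile.mkdtemp()
demo_data_dir = os.path.join(demo_root, "data")
demo_external = os.path.join(demo_root, "fast_disk", "tblspc")
os.makedirs(os.path.join(demo_data_dir, "pg_tblspc"))
os.makedirs(demo_external)
os.symlink(demo_external, os.path.join(demo_data_dir, "pg_tblspc", "16384"))

demo_mapping = list_tablespaces(demo_data_dir)
```

A backup tool that supports tablespaces has to follow these links when archiving and re-create them when restoring, which is why the feature is more than copying one directory.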

When you use all this, the commands may look something like this. You create a table in a specified tablespace and check where it is currently stored. Those are the last two lines, the last two commands. You can see a path there, but it is not the real path. It is a path relative to the base directory that leads to the tablespace, and from there a symlink resolves it to your real data.

We do not use any of this in our team, but it was used by many WAL-E users who wrote to us that they wanted to move to WAL-G and this was holding them back. Now it is supported.

Another feature that our course brought us is catchup. Catchup is probably familiar to people who have worked with Oracle more than with Postgres.

Briefly, here is what it is. The cluster topology in our service usually looks something like this. We have a master. There is a replica that streams the write-ahead log from it and tells the master which LSN it is currently on. In parallel, the log can be archived. Besides the archived log, backups and delta backups are also sent to the cloud.

What could go wrong? When you have a fairly large database, a replica may start to lag far behind the master, so far behind that it can never catch up. This problem has to be solved somehow.

The easiest way is to remove the replica and re-create it from scratch, since it will never catch up and the problem has to be dealt with. But that takes quite a long time, because restoring a full backup of a 10 TB database is very slow. We want to handle such problems as quickly as possible, and that is what catchup is for.

Catchup lets you use the same delta backup mechanism for this. You find out which LSN the lagging replica is currently on and pass it to the catchup command, which creates a delta backup between that LSN and the LSN your cluster is currently on. After that, you restore this backup onto the lagging replica.
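
To make the LSN arithmetic concrete, here is a minimal sketch. It is not WAL-G's catchup implementation. It relies on PostgreSQL's textual LSN format (two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position) and on a deliberately simplified model in which each page is represented only by its block number and the LSN of its last change. The function names are hypothetical.

```python
def parse_lsn(text):
    """Convert a textual LSN such as '16/B374D848' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def format_lsn(value):
    """Convert a 64-bit integer back into the textual LSN form."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def lag_bytes(master_lsn, replica_lsn):
    """How many bytes of WAL the replica is behind the master."""
    return parse_lsn(master_lsn) - parse_lsn(replica_lsn)


def pages_newer_than(page_lsns, since_lsn):
    """Simplified delta selection: block numbers whose last change is after since_lsn.

    page_lsns maps a block number to the textual LSN of that page's last change.
    """
    threshold = parse_lsn(since_lsn)
    return sorted(
        block for block, lsn in page_lsns.items() if parse_lsn(lsn) > threshold
    )
```

For example, `pages_newer_than({0: "0/10", 1: "0/30"}, "0/20")` selects only block 1, which is the idea behind shipping a delta to the lagging replica instead of the whole database.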

Other databases

The students also brought us a lot of features in one go. Since we run not only Postgres at Yandex but also MySQL, MongoDB, Redis, and ClickHouse, at some point we needed to be able to make backups with point-in-time recovery for MySQL and upload them to the cloud.

And we wanted to do it in some similar way that WAL-G does. And we decided to experiment and see how it all looks.

At first the code was written in a fork, without separating this logic in any way. We saw that we had a working model and that it could fly. Then we reminded ourselves that our main community is Postgres users, and they are the ones who use WAL-G. So these parts had to be separated somehow: when we change the Postgres code, we must not break MySQL, and when we change the MySQL code, we must not break Postgres.

The first idea for separating them was to use the same approach as PostgreSQL extensions. That is, to make a MySQL backup, you would have to install some kind of dynamic library.

But the asymmetry of this approach is immediately visible. When you back up Postgres, you install a normal backup tool for Postgres and everything is fine. For MySQL, it turns out that you install a backup tool for Postgres and then also install a dynamic library for MySQL on top of it. That sounds rather odd. We thought so too and decided that this was not the solution we needed.

Different builds for Postgres, MySQL, MongoDB, Redis

But this led us to what we believe is the right decision: separate builds for different databases. This made it possible to isolate the logic tied to backing up each database, which calls the common API that WAL-G implements.

This is the part that we wrote ourselves, before handing tasks to the students. It is exactly the part where they could get something wrong, so we figured we had better do it ourselves and everything would be fine.

After that, we handed out the tasks. They were snapped up immediately. The students were asked to support three databases.

This is MySQL, which we've been backing up with WAL-G this way for over a year now.

MongoDB is now approaching production and is getting its final polish. In fact, we wrote the framework for all this, then the students wrote working implementations, and then we bring them to a state that we can accept in our production.

The tasks did not require the students to write complete backup tools for each of these databases. That was not our problem. Our problem was that we wanted point-in-time recovery and backups to the cloud, so we asked the students to write code that solves this. The students took existing backup tools that already produce backups and glued them to WAL-G, which sends everything to the cloud. Point-in-time recovery was added on top of that.
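
The "glue" approach described above can be sketched as follows: run an existing backup tool, read what it writes to standard output, and hand the stream to whatever uploads it. This is an illustration under assumptions, not WAL-G's code. The function name `stream_command_output` is hypothetical, the sink is any object with a `write` method standing in for a cloud uploader, and the command in the usage note is a placeholder rather than a real backup tool invocation.

```python
import io
import subprocess
import sys


def stream_command_output(argv, sink, chunk_size=64 * 1024):
    """Run argv and copy its stdout into sink chunk by chunk.

    Returns the number of bytes copied. Raises RuntimeError if the
    command exits with a non-zero code, so a failed backup is never
    mistaken for a complete one.
    """
    total = 0
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            sink.write(chunk)
            total += len(chunk)
    # Leaving the "with" block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"backup command exited with code {proc.returncode}")
    return total


# Usage with a placeholder command that just emits some bytes.
demo_sink = io.BytesIO()
demo_copied = stream_command_output(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000)"],
    demo_sink,
)
```

Streaming in chunks means the backup never has to fit in memory or on local disk, which matters when the data is much larger than the host's free space.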

What else did the students bring? They brought Libsodium encryption support to WAL-G.

We also got backup retention policies. Backups can now be marked as permanent, which makes it easier for your service to automate how backups are kept.
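
As an illustration of how a permanent mark interacts with automated cleanup, here is a minimal sketch. It is not WAL-G's retention logic: the `Backup` record, its fields, and the rule "keep the newest N non-permanent backups and never delete permanent ones" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    finished_at: int  # e.g. a Unix timestamp
    permanent: bool = False


def backups_to_delete(backups, retain_count):
    """Return the backups a cleanup job may remove.

    Permanent backups are never returned. Among the rest, the newest
    retain_count are kept and the older ones are returned, newest first.
    """
    ordered = sorted(backups, key=lambda b: b.finished_at, reverse=True)
    deletable = [b for b in ordered if not b.permanent]
    return deletable[retain_count:]
```

With this rule a service can run cleanup on a schedule and still keep, say, a backup taken before a major migration for as long as it likes.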

What was the result of this experiment?

More than 100 people signed up for the course. I did not mention earlier that the university in Yekaterinburg is the Ural Federal University. We announced everything there, and 100 people signed up. In reality, far fewer actually started doing something, about 30 people.

Even fewer completed the course, because that required writing tests for existing code and also fixing a bug or implementing a feature. Still, some of the students did complete it.

So far, during this course, students have fixed about 14 issues and implemented 10 features of various sizes. It seems to me that this fully replaces one or two developers.

We also offered topics for diploma theses and term papers. Twelve students took diploma theses; 6 of them have already defended with the top grade of 5. The rest have not had their defense yet, but I think they will be fine too.

Plans for the future

What plans do we have for the future?

At the very least, the feature requests that we have already heard from users and want to implement:

  • Verifying that timelines are tracked correctly in the backup archive of an HA cluster. This can be done in WAL-G, and I think we will find students to take it on.
  • Transferring backups and WAL between clouds. Someone is already assigned to this.
  • We recently published the idea that WAL-G can be sped up even more by unpacking incremental backups without rewriting pages and by optimizing the archives we send to the cloud.

You can share them here

What was this talk for? To tell you that, besides the 4 of us who maintain this project, we now have extra hands, and quite a lot of them. Especially if you message them directly. And if you back up your data with WAL-G, or would like to move to WAL-G, we can take your wishes into account quite easily.

Here are a QR code and a link. Follow them and write down your whole wishlist. For example, there is a bug we are not fixing. Or you really want some feature that for some reason no backup tool has yet, including ours. Be sure to write about it.

Questions

Hello! Thanks for the talk! The question is about WAL-G, but not about Postgres. WAL-G backs up MySQL by invoking xtrabackup. On a modern CentOS installation, yum install MySQL actually installs MariaDB. Since version 10.3, xtrabackup is not supported there; mariabackup is. How do you deal with this?

At the moment we have not tried to back up MariaDB. We have had support requests for FoundationDB, but in general, if there is such a request, we can find people who will do it. I don't think it would take that long or be that difficult.

Good afternoon! Thanks for the talk! A question about potential new features. Are you ready to make WAL-G work with tapes, so that one can back up to tape?

You mean backup to tape storage, apparently?

Yes.

Andrey Borodin is here, and he can answer this question better than I can.

(Andrey) Yes, thanks for the question! We had a request to transfer backups from cloud storage to tape, and transfer between clouds is being built for exactly that, because transfer between clouds is a generalized version of transfer to tape. In addition, we have an extensible architecture in terms of Storages. By the way, many Storages were written by students. If you write a Storage for tape, then of course it will be supported. We are ready to consider a pull request. You need to implement writing a file and reading a file. If you do these things in Go, you usually end up with about 50 lines of code. And then tape will be supported in WAL-G.
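
To show how small a backend built around "write a file, read a file" can be, here is a sketch in Python. It is not WAL-G's storage interface, which is written in Go. The class `DirectoryStorage` and its method names are hypothetical, and a local directory stands in for whatever medium the backend would really talk to.

```python
import os
import tempfile


class DirectoryStorage:
    """Hypothetical storage backend: each object is a file under a root directory."""

    def __init__(self, root):
        self.root = root

    def _path(self, name):
        # Object names use "/" as a separator regardless of the platform.
        return os.path.join(self.root, *name.split("/"))

    def put_object(self, name, data):
        """Write bytes under the given object name, creating folders as needed."""
        path = self._path(name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(data)
        # Rename into place so readers never see a half-written object.
        os.replace(tmp_path, path)

    def read_object(self, name):
        """Return the bytes stored under the given object name."""
        with open(self._path(name), "rb") as f:
            return f.read()


# Usage: store one object and read it back (the object name is made up).
demo_storage = DirectoryStorage(tempfile.mkdtemp())
demo_storage.put_object("basebackups/example/part_1.tar", b"payload")
demo_restored = demo_storage.read_object("basebackups/example/part_1.tar")
```

Keeping the backend contract this narrow is what lets new storage types be added without touching the backup logic itself.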

Thanks for the talk! Interesting development process. Backup is a serious piece of functionality that should be well covered by tests. When you implemented functionality for new databases, were the students also writing the tests, or did you write the tests yourself and then give the implementation to the students?

The tests were also written by students. For features like new databases, they mostly wrote integration tests, and they wrote unit tests as well. An integration test essentially follows the same script that you run manually or that cron runs for you, so the scenario is very clear.

Students don't have much experience. How long does it take to review?

Yes, reviews take a long time. Usually, when several contributors show up at once and say "I did this, I did that", you need to set aside about half a day to figure out what they wrote, because the code must be read carefully. They have not been through an interview with us, and we do not know them very well, so it takes a significant amount of time.

Thanks for the talk! Earlier Andrey Borodin stated that WAL-G should be called directly in archive_command. But in the case of a Patroni cluster, we need additional logic to determine the node from which to send WAL. How do you solve this problem yourselves?

What exactly is the problem here? Say, do you have a synchronous replica that you take backups from? Or something else?

(Andrey) The point is that WAL-G is really meant to be used without wrapping it in shell scripts. If something is missing, let's add that logic inside WAL-G. As for which node to archive from, we believe it should be the current master of the cluster. Archiving from a replica is a bad idea. There are various scenarios where it causes problems, in particular with archiving timelines and other additional information. Thanks for the question!

(Clarification: We got rid of wrapping with shell scripts in this issue)

Good evening! Thanks for the talk! I was interested in the catchup feature you mentioned. I have run into a lagging replica that could not catch up, but I did not find a description of this feature in the WAL-G documentation.

Catchup appeared literally on January 20, 2020. Perhaps the documentation needs to be improved. We write it ourselves, and not all that brilliantly. Perhaps we should start requiring the students to write it.

Is it released already?

The pull request is already merged; I reviewed it and tried it on a test cluster. So far we have not had a situation where we could test it on a production case.

When should we expect it?

I don't know. Wait a month, we'll check it out for sure.

Source: habr.com
