Postgres Tuesday #5: PostgreSQL and Kubernetes. CI/CD. Test Automation

At the end of last year, the Russian PostgreSQL community #RuPostgres held another live broadcast, during which its co-founder Nikolai Samokhvalov talked with Flant CTO Dmitry Stolyarov about this DBMS in the context of Kubernetes.

We are publishing a transcript of the main part of this discussion; the full video is available on the community's YouTube channel:

Databases and Kubernetes

NS: We won't talk about VACUUM and CHECKPOINTs today. We want to talk about Kubernetes. I know you have many years of experience. I watched your videos and even re-watched some of them... Let's get straight to the point: why Postgres or MySQL in K8s at all?

DS: There is no single answer to this question, and there can't be. But in general, it's about simplicity and convenience... potentially. After all, everyone wants managed services.

NS: Like RDS, only self-hosted?

DS: Yes: like RDS, just anywhere.

NS: “Anywhere” is a good point. In large companies, everything is spread across different places. Then why not take a ready-made solution, if it is a large company? For example, Nutanix has its own offerings, and other companies (VMware...) have the same “RDS, only self-hosted”.

DS: But those are separate implementations that will only work under certain conditions. Kubernetes, on the other hand, covers a huge variety of infrastructure. Essentially, it is a standard API to the cloud...

NS: It's also free!

DS: That's not so important. Being free matters only to a fairly small segment of the market. Something else matters... You probably remember my talk “Databases and Kubernetes”?

NS: Yes.

DS: I realized it was received very ambiguously. Some people thought I was saying, “Guys, let's move all databases into Kubernetes!”, while others decided it was all a bunch of terrible homegrown hacks. But I wanted to say something else entirely: “Look at what is happening, what problems exist, and how they can be solved. Should we move databases into Kubernetes now? In production? Well, only if you like... doing certain things. But for dev, I can say that I recommend it. For dev, the dynamism of creating and deleting environments is very important.”

NS: By dev, do you mean all environments that are not prod? Staging, QA…

DS: If we're talking about performance testing environments, then probably not, because the requirements there are specific. If we are talking about special cases where a very large database is needed for staging, then probably not either... If it is a static, long-lived environment, what is the benefit of having the database in K8s?

NS: None. But where do we actually see static environments? A static environment will become obsolete tomorrow.

DS: Staging can be static. We have clients...

NS: Yes, I have one too. It's a big problem when you have a 10 TB database in production and 200 GB on staging...

DS: I have a very cool case! On staging there is a product database that changes are made to, and there is a button: “roll out to production”. These changes (deltas) are then applied (it seems they are simply synchronized via an API) in production. It's a very exotic option.

NS: I have seen startups in the Valley sitting on RDS or even on Heroku (these are stories from 2-3 years ago) that download the dump to a laptop. Because the database is still only 80 GB and there is space on the laptop. Then they buy additional disks for everyone so that they have 3 databases for different development work. That happens too. I have also seen companies that are not afraid to copy prod into staging - it very much depends on the company. And I have seen companies that are very afraid, and often simply don't have enough time or people. But before we move on to that topic, I want to hear about Kubernetes. Do I understand correctly that no one is in prod yet?

DS: We have small databases in prod. We are talking about volumes of tens of gigabytes and non-critical services for which we were too lazy to set up replicas (and there is no such need) - and provided there is decent storage under Kubernetes. One such database used to run in a virtual machine, say in VMware, on top of a storage system. We placed it in a PV, and now we can move it from machine to machine.

NS: Databases of this size, up to 100 GB, can be rolled out in a few minutes on good disks and a good network, right? A speed of 1 GB per second is no longer exotic.

DS: Yes, for sequential operations that's not a problem.

NS: Okay, prod is something we still have to think about. But if we are considering Kubernetes for non-prod environments, what should we do? I see that Zalando has an operator, Crunchy is building one, and there are some other options. And there is OnGres - our good friend Álvaro from Spain: what they make is essentially not just an operator but a whole distribution (StackGres), into which, besides Postgres itself, they also decided to pack backups, the Envoy proxy...

DS: Envoy for what? Balancing Postgres traffic specifically?

NS: Yes. Their view is: if you compare it to a Linux distribution and its kernel, then vanilla PostgreSQL is the kernel, and they want to build a distribution around it that is cloud-friendly and runs on Kubernetes. They put the components together (backups, etc.) and debug them so that they work well together.

DS: Very cool! Essentially this is software to create your own managed Postgres.

NS: Linux distributions have the eternal problem of driver support so that all hardware works. Their idea is that everything will run in Kubernetes. I know that in the Zalando operator we recently saw a dependency on AWS, and that is not so good. There shouldn't be a tie to a specific infrastructure - otherwise, what's the point?

DS: I don't know exactly what situation Zalando got into, but storage in Kubernetes is currently built in such a way that it is impossible to take a disk backup in a generic way. Recently the standard (the latest version of the CSI specification) added the ability to take snapshots, but where is it actually implemented? Honestly, everything is still very raw... We are trying CSI on top of AWS, GCE, Azure, vSphere, but as soon as you start using it, you see it is not ready yet.

NS: So sometimes you still have to tie yourself to the infrastructure. I think this is still an early stage - growing pains. A question: what advice would you give to newcomers who want to try PgSQL in K8s? Which operator, maybe?

DS: The problem is that Postgres is only 3% of what we deal with. We also have a very long list of other software running in Kubernetes - I won't even list it all. Elasticsearch, for example. There are a lot of operators: some are actively developed, others are not. We have drawn up our own requirements for what an operator must have for us to take it seriously - an operator specifically for Kubernetes, not an “operator for doing something under Amazon's conditions”... In fact, there is only one operator we use widely (for almost all clients): the Redis operator (we will publish an article about it soon).

NS: And not for MySQL as well? I know that Percona... since they now work on MySQL, MongoDB, and Postgres, they will have to create some kind of universal solution: for all databases, for all cloud providers.

DS: We haven't had time to look at the operators for MySQL. It is not our main focus right now. MySQL works fine standalone. Why use an operator if you can just launch a database... You can launch a Docker container with Postgres, or you can launch it the simple way.

NS: There was a question about this too. No operator at all?

DS: Yes, for us 100% of PostgreSQL installations run without an operator. So far, at least. We actively use operators for Prometheus and Redis. We plan to find an operator for Elasticsearch - it is the most “burning” one, because we want to install it in Kubernetes in 100% of cases, just as we want MongoDB to always be installed in Kubernetes too. Certain wishes emerge there - a feeling that something can be done in those cases. But we haven't even looked at Postgres operators. Of course, we know different options exist, but in practice we run it standalone.

DB for testing in Kubernetes

NS: Let's move on to the topic of testing. How do you roll out changes to the database, from a DevOps perspective? There are microservices, many databases, something is changing somewhere all the time. How do you ensure normal CI/CD so that everything is in order from the DBMS point of view? What's your approach?

DS: There can't be one answer. There are several options. The first factor is the size of the database we want to roll out. You yourself mentioned that companies have different attitudes toward having a copy of the prod database on dev and staging.

NS: And with the GDPR, I think people are becoming more and more careful... I can say that in Europe fines have already started to be imposed.

DS: But often you can write software that takes a dump from production and obfuscates it. You get prod data (a snapshot, a dump, a binary copy...), but it is anonymized. Alternatively, there can be generation scripts: fixtures, or just a script that generates a large database. The question is: how long does it take to build that database image? And how long does it take to deploy it in the target environment?

We arrived at this scheme: if the client has a fixed data set (a minimal version of the database), we use it by default. If we are talking about review environments - a branch is created, an application instance is deployed - we roll out a small database there. But another option turned out well: once a day (at night) we take a dump from production and build a Docker image with PostgreSQL and MySQL with this data already loaded. If you then need to spin up the database 50 times from this image, it is done quite simply and quickly.

NS: By simple copying?

DS: The data is stored directly in the Docker image. That is, we have a ready-made image, even if it is 100 GB. Thanks to Docker layers, we can quickly deploy this image as many times as we need. The method is dumb, but it works well.
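To illustrate the approach DS describes, here is a minimal sketch of spinning up several disposable copies from such a data-baked image. The image name, port range, and the assumption that PGDATA was written into a non-volume path at build time are illustrative, not Flant's actual tooling:

import subprocess

# A prebuilt image with PGDATA already baked into its layers (assumed name);
# how it is built (nightly dump, anonymization, restore at build time) is out of scope here.
IMAGE = "registry.example.com/pg-with-data:nightly"

def run_clone(i: int) -> str:
    """Start one disposable Postgres container. Docker's copy-on-write layer
    means each container only stores the pages it actually changes."""
    name = f"pg-review-{i}"
    subprocess.run(
        ["docker", "run", "-d", "--rm",
         "--name", name,
         "-p", f"{15432 + i}:5432",   # each copy gets its own host port
         IMAGE],
        check=True,
    )
    return name

# Spin up five independent copies of the same ~100 GB image in seconds.
clones = [run_clone(i) for i in range(5)]
print("started:", clones)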

NS: And then, when you test, the changes happen right inside Docker, right? Copy-on-write inside Docker: throw it away and start over, and everything is fine. Awesome! And you already use this at full scale?

DS: For a long time.

NS: We do very similar things. Only we don't use Docker's copy-on-write but a different kind.

DS: It's not generic. And Docker works everywhere.

NS: In theory, yes. But we also have modules there: you can write different modules and work with different file systems. Here's the thing. On the Postgres side, we look at all this differently. Just now I looked at it from the Docker side and saw that it all works for you. But if the database is huge, say 1 TB, then it all takes a long time: the nightly operations, stuffing everything into Docker... And if 5 TB get stuffed into Docker... Or is that all fine?

DS: What's the difference: these are blobs, just bits and bytes.

NS: The difference is this: do you do it through dump/restore?

DS: Not necessarily. There are different ways to generate this image.

NS: For some clients we have set it up so that instead of regularly regenerating a base image, we constantly keep it up to date. It is essentially a replica, but it receives data not directly from the master but through an archive - a binary archive where WALs are shipped every day and where backups are taken... The WALs then reach this base image with a slight delay (literally 1-2 seconds). We clone from it however we like - by default we now use ZFS.

DS: But with ZFS you are limited to one node.

NS: Yes. But ZFS also has the magical send: you can send a snapshot with it, and even (I haven't really tested it yet, but...) you can send the delta between two PGDATA snapshots. In fact, we have another tool that we haven't really considered for such tasks: PostgreSQL's pg_rewind, which works like a “smart” rsync, skipping a lot of what doesn't need to be copied because nothing has changed there. We can do a quick synchronization between two servers in the same way and rewind.

So, coming from this more DBA-ish side, we are trying to create a tool that lets us do the same thing you described: we have one database, but we want to test something on it 50 times, almost simultaneously.
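As an aside, the thin-cloning flow just described might look roughly like this; the pool, dataset, and port names are assumptions, and the real tooling is more involved:

import subprocess

POOL = "dblab"                      # assumed ZFS dataset holding the continuously updated PGDATA
SNAP = f"{POOL}/pgdata@before-test"

def sh(*cmd):
    subprocess.run(cmd, check=True)

# Instant snapshot of the replica's PGDATA, then a writable copy-on-write clone.
sh("zfs", "snapshot", SNAP)
sh("zfs", "clone", SNAP, f"{POOL}/clone1")   # near-zero extra space until pages change

# Start a throwaway Postgres on the clone (default ZFS mountpoint; port is arbitrary).
# Postgres will run crash recovery on the first start of the clone.
sh("pg_ctl", "-D", f"/{POOL}/clone1", "-o", "-p 6001", "start")

# The incremental transfer mentioned above would use something like
# `zfs send -i <old-snap> <new-snap> | ssh other-host zfs receive ...`.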

DS: 50 times means you need to order 50 Spot instances.

NS: No, we do everything on one machine.

DS: But how will you spin it up 50 times if this one database is, say, a terabyte? Most likely it needs something like 256 GB of RAM?

NS: Yes, sometimes you need a lot of memory - that's normal. But here is a real-life example. The production machine has 96 cores and 600 GB of RAM, while only 32 cores (sometimes even 16 now) and 100-120 GB of memory are actually used for the database.

DS: And 50 copies fit in there?

NS: There is only one copy - copy-on-write (ZFS) does the rest... Let me explain in more detail.

For example, we have a 10 TB database. We set up a disk for it, and ZFS also compressed it by 30-40 percent. Since we don't do load testing, exact response times are not important to us: let it be up to 2 times slower - that's okay.

We give programmers, QA, DBAs, etc. the ability to run tests in 1-2 threads. For example, they might run some migration. It does not need 10 cores at once - it needs 1 Postgres backend, 1 core. The migration starts, maybe autovacuum kicks in too, then a second core gets used. We have 16-32 cores allocated, so 10 people can work at the same time, no problem.

Because PGDATA is physically the same, it turns out we are effectively deceiving Postgres. The trick is this: say 10 Postgres instances are launched simultaneously. What is usually the problem? shared_buffers is typically set to, say, 25% - here that would be 200 GB. You won't be able to launch more than three such instances, because the memory will run out.

But at some point we realized this is not necessary: we set shared_buffers to 2 GB. PostgreSQL has effective_cache_size, and in reality that is the only setting that influences plans. We set it to 0.5 TB. And it doesn't even matter that this memory doesn't actually exist: the planner builds plans as if it does.

So when we test some migration, we can collect all the plans and see how it will behave in production. The timings will be different (slower), but the data we actually read and the plans themselves (which JOINs are used, etc.) turn out exactly the same as in production. And many such checks can run in parallel on one machine.
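For reference, the planner trick boils down to a tiny shared_buffers plus a production-like effective_cache_size on each clone. A sketch of collecting a plan this way (connection details, table name, and GUC values are assumptions):

import psycopg2

# Connect to one of the thin clones; shared_buffers was set small (e.g. 2GB) at server start.
conn = psycopg2.connect("host=localhost port=6001 dbname=app user=postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # effective_cache_size only hints the planner, so it can be set per session
    # to the production value even though the memory does not physically exist.
    cur.execute("SET effective_cache_size = '500GB';")

    cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;")
    plan = "\n".join(row[0] for row in cur.fetchall())
    print(plan)   # the plan shape should match production; only the timings differ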

DS: Don't you think there are a couple of problems here? The first is that this solution only works for PostgreSQL: the approach is very specific, not generic. The second is that Kubernetes (and everything cloud technologies are moving toward now) involves many nodes, and those nodes are ephemeral. In your case it is a stateful, persistent node. These things make me uneasy.

NS: On the first point I agree: this is a purely Postgres story. I think that if a DBMS uses direct I/O and a buffer pool taking almost all of the memory, this approach will not work - the plans will be different. But for now we only work with Postgres; we don't think about the others.

As for Kubernetes: you yourself keep saying that the database is persistent - if the instance fails, the main thing is to save the disk. Our platform itself also runs in Kubernetes, while the Postgres component is separate (although one day it will be there too). So it works like this: the instance died, but we kept its PV and simply attached it to another (new) instance, as if nothing had happened.

DS: From my point of view, in Kubernetes we create pods. K8s is elastic: nodes are ordered as needed. The task is simply to create a pod and say it needs X amount of resources, and K8s figures out the rest on its own. But storage support in Kubernetes is still unstable: in 1.16 and 1.17 (the latter was released a week ago) these features are only reaching beta.

In six months to a year it will become more or less stable, or at least will be declared as such. Then snapshots and resize essentially solve your problem completely, because you have a base to clone from. Yes, it may not be very fast, but the speed depends on what is “under the hood”, because some implementations can copy and do copy-on-write at the disk subsystem level.
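For context, with a CSI driver that already supports snapshots, the cloning DS is referring to looks roughly like this today; class names, PVC names, and sizes are assumptions:

from kubernetes import client, config

config.load_kube_config()

# 1. Snapshot the live database PVC through the CSI snapshot CRD.
snapshot = {
    "apiVersion": "snapshot.storage.k8s.io/v1",
    "kind": "VolumeSnapshot",
    "metadata": {"name": "pgdata-snap"},
    "spec": {
        "volumeSnapshotClassName": "csi-snapclass",         # assumed snapshot class
        "source": {"persistentVolumeClaimName": "pgdata"},  # the database volume
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="snapshot.storage.k8s.io", version="v1",
    namespace="default", plural="volumesnapshots", body=snapshot,
)

# 2. Provision a new volume from that snapshot; how fast this is depends entirely
#    on what the storage backend does "under the hood" (full copy vs. copy-on-write).
clone_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "pgdata-clone-1"},
    "spec": {
        "storageClassName": "csi-fast",                     # assumed storage class
        "dataSource": {
            "name": "pgdata-snap",
            "kind": "VolumeSnapshot",
            "apiGroup": "snapshot.storage.k8s.io",
        },
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "1Ti"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", clone_pvc)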

NS: All the providers (Amazon, Google...) also need to start supporting this version - that takes time too.

DS: We don't use them yet. We use our own.

Local development for Kubernetes

NS: Have you come across the wish to run all the pods on one machine and do a small test like that - to quickly get a proof of concept and see that the application runs in Kubernetes, without dedicating a bunch of machines to it? There's Minikube, right?

DS: It seems to me that this case - everything deployed on one node - is exclusively about local development, or some manifestation of that pattern. There is Minikube, there is k3s, there is KIND. We are moving toward using KIND (Kubernetes in Docker). We have now started using it for tests.

NS: I used to think it was an attempt to wrap all the pods into one Docker image. But it turned out to be something completely different. There are still separate containers and separate pods - just running inside Docker.

DS: Yes. And it's implemented in a rather funny way, but the idea is this... We have a deployment utility - werf. We want to add a mode to it, conditionally werf up: “get me a local Kubernetes”. And then run a conditional werf follow there. The developer will then be able to edit code in the IDE, while a process running in the system watches the changes, rebuilds the images, and redeploys them to the local K8s. This is how we want to try to solve the problem of local development.

Snapshots and database cloning in K8s reality

NS: Let's return to copy-on-write. I've noticed that clouds also have snapshots, but they work differently. For example, in GCP: you have a multi-terabyte instance on the US East Coast. You take snapshots periodically. You spin up a copy of the disk on the West Coast from a snapshot: everything is ready in a few minutes, it works very fast, only the in-memory cache still needs to warm up. But those clones (snapshots) exist in order to provision a new volume. That's great when you need to create a lot of instances.

But for tests, it seems to me that the snapshots you talk about with Docker, or I talk about with ZFS, btrfs, and even LVM, allow you to avoid creating genuinely new data on a single machine. In the cloud you will still pay for them every time and wait not seconds but minutes (and with lazy loading, possibly hours).

Here, instead, you can get this data in a second or two, run the test, and throw it away. These snapshots solve different problems: in the first case, scaling out and getting new replicas; in the second, tests.
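The GCP workflow mentioned above, sketched with the gcloud CLI driven from Python; disk, snapshot, and zone names are illustrative:

import subprocess

def gcloud(*args):
    subprocess.run(["gcloud", "compute", *args], check=True)

# Periodic snapshot of the multi-terabyte disk on the East Coast.
gcloud("disks", "snapshot", "pg-prod-data",
       "--zone=us-east1-b", "--snapshot-names=pg-prod-snap")

# Provision a new volume from that snapshot on the West Coast: ready in minutes,
# with blocks pulled in lazily on first access (hence the cold cache mentioned above).
gcloud("disks", "create", "pg-west-copy",
       "--zone=us-west1-b", "--source-snapshot=pg-prod-snap")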

DS: I don't agree. Cloning volumes properly is the cloud's job. I haven't looked at their implementations, but I know how we do it on bare metal. We have Ceph: you can tell it to clone any physical volume (RBD) and get a second volume with the same characteristics - IOPS and so on - in tens of milliseconds. You have to understand that there is tricky copy-on-write inside. Why shouldn't the cloud do the same? I'm sure they are trying to, one way or another.
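The Ceph mechanism DS refers to is RBD's copy-on-write cloning; a sketch with assumed pool and image names:

import subprocess

def rbd(*args):
    subprocess.run(["rbd", *args], check=True)

POOL, SNAP, CLONE = "rbd", "pgdata@base", "pgdata-clone-1"

rbd("snap", "create", f"{POOL}/{SNAP}")            # instant snapshot of the source volume
rbd("snap", "protect", f"{POOL}/{SNAP}")           # clones require a protected snapshot
rbd("clone", f"{POOL}/{SNAP}", f"{POOL}/{CLONE}")  # copy-on-write clone, ready almost instantly
# The clone can then be mapped with `rbd map` or handed to a pod via the Ceph CSI driver.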

NS: But it will still take them seconds, tens of seconds, to bring up an instance, get Docker onto it, and so on.

DS: Why bring up an entire instance? We have an instance with 32 cores, or 16... and, say, four fit into it. When we order a fifth, an instance will already have been brought up, and later it gets deleted.

NS: Yes, interesting - with Kubernetes it turns out to be a different story. Our database is not in K8s, and we have a single instance. But cloning a multi-terabyte database takes no more than two seconds.

DS: This is great. But my initial point is that this is not a generic solution. Yes, it’s cool, but it’s only suitable for Postgres and only on one node.

NS: It is suitable not only for Postgres: the plans, as I described, will only work there. But if we don't care about plans and just need all the data for functional testing, then it is suitable for any DBMS.

DS: Many years ago we did something similar with LVM snapshots. It's a classic; this approach was used very actively. Stateful nodes are just a pain: you must not drop them, you always have to keep them in mind...

NS: Do you see any possibility of a hybrid here? Say, the stateful part is some pod that serves several people (many testers). We have one volume, but thanks to the file system the clones are local. If the pod dies but the disk survives, the pod comes back up, reads the information about all the clones, picks everything up again and says: “Here are your clones running on these ports; keep working with them.”

DS: Technically this means that within Kubernetes it is a single pod inside which we run many Postgres instances.

NS: Yes. It has a limit: say, no more than 10 people work with it at the same time. If you need 20, we launch a second such pod. We clone it fully, getting a second full volume, and it will have its own 10 “thin” clones. Don't you see that as an option?

DS: Security issues have to be added here. This kind of setup implies that the pod has high privileges (capabilities), because it performs non-standard operations on the file system... But I repeat: I believe that in the medium term they will fix storage in Kubernetes, and the clouds will fix the whole volume story - everything will “just work”. There will be resize, cloning... There is a volume, we say, “Create a new one based on it,” and a second and a half later we get what we need.

NS: I don't believe in a second and a half for many terabytes. With Ceph you do it yourselves, but you're talking about the clouds. Go to the cloud, clone a multi-terabyte EBS volume on EC2 and see what the performance is like. It won't take a couple of seconds. I'm very curious when they will reach that level. I understand what you are saying, but I beg to differ.

DS: OK, but I said in the medium term, not the short term. That's a matter of several years.

About the operator for PostgreSQL from Zalando

In the middle of this meeting, Alexey Klyukin, a former developer from Zalando, also joined in and spoke about the history of the PostgreSQL operator:

It's great that this topic is being touched on at all: both Postgres and Kubernetes. When we started doing this at Zalando in 2017, it was a topic everyone wanted to tackle, but no one did. Everyone already had Kubernetes, but when asked what to do about databases, even people like Kelsey Hightower, who preached K8s, said something like this:

“Go to managed services and use them, don’t run the database in Kubernetes. Otherwise, your K8s will decide, for example, to make an upgrade, turn off all the nodes, and your data will fly far, far away.”

We decided to make an operator that, contrary to this advice, would run a Postgres database in Kubernetes. And we had a good reason: Patroni. It is automatic failover for PostgreSQL done correctly, i.e. using etcd, Consul, or ZooKeeper as the store of information about the cluster - a store that gives everyone who asks (for example, who the current leader is) the same answer, even though everything is distributed, so that there is no split brain. Plus, we had a Docker image for it.

The company's need for auto failover arose after migrating from an in-house hardware data center to the cloud. The cloud was built on our own PaaS (Platform-as-a-Service) solution called STUPS. It is open source, but it took a lot of work to get it up and running.

Initially there was no Kubernetes. More precisely, by the time our own solution was deployed, K8s already existed, but it was so raw that it was not fit for production. That was, I think, 2015 or 2016. By 2017 Kubernetes had become more or less mature, and there was a need to migrate there.

And we already had a Docker container; there was a PaaS that used Docker. So why not try K8s? Why not write our own operator? Murat Kabilov, who came to us from Avito, started this as a project on his own initiative - “to play around” - and the project took off.

But mainly I wanted to talk about AWS and why there was historically AWS-related code there...

When you run something in Kubernetes, you have to understand that K8s is very much a work in progress. It is constantly evolving, improving, and even breaking from time to time. You need to keep a close eye on all the changes in Kubernetes and be ready to dive in when something happens and learn how it works in detail - perhaps in more detail than you would like. This applies, in principle, to any platform you run your databases on...

So, when we made the operator, we had Postgres running on an external volume (EBS in this case, since we were running on AWS). The database grows, and at some point it needs to be resized: for example, the initial EBS size was 100 TB, the database grew into it, and now we want to make the EBS 200 TB. How? You could do a dump/restore onto a new instance, but that takes a long time and means downtime.

So we wanted a resize that would enlarge the EBS volume and then tell the file system to use the new space. And we did it, but at the time Kubernetes had no API at all for the resize operation. Since we were running on AWS, we wrote code against its API.
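A sketch of that resize flow as it might be scripted today: grow the EBS volume through the AWS API, then grow the filesystem. The volume ID, region, device, and filesystem are assumptions:

import subprocess
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# Step 1: ask AWS to enlarge the volume in place (no detach, no downtime at this layer).
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=200)  # new size in GiB

# Step 2: once the modification is applied, grow the filesystem inside the instance
# (ext4 here; an XFS filesystem would use xfs_growfs instead).
subprocess.run(["resize2fs", "/dev/nvme1n1"], check=True)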

Nothing stops you from doing the same for other platforms. There is no implication in the operator that it can only be run on AWS and will not work anywhere else. It is an open-source project: if anyone wants to speed up adoption of the new API, you are welcome. There is GitHub, there are pull requests - the Zalando team tries to respond to them quite quickly and to move the operator forward. As far as I know, the project took part in Google Summer of Code and some other similar initiatives. Zalando is working on it very actively.

P.S. Bonus!

If you are interested in the topic of PostgreSQL and Kubernetes, please also note that the next Postgres Tuesday took place last week, in which Nikolai spoke with Alexander Kukushkin from Zalando. The video is available here.


Source: habr.com
