Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky


The main goal of Patroni is to provide high availability for PostgreSQL. But Patroni is just a template, not a ready-made tool (which, incidentally, is what the documentation itself says). At first glance, having set up Patroni in a test lab, you can see what a great tool it is and how easily it handles our attempts to break the cluster. In practice, however, in a production environment everything does not always happen as beautifully and elegantly as in the test lab.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I'll tell you a little about myself. I started out as a system administrator and worked in web development. I have been working at Data Egret since 2014. The company does consulting in the field of Postgres. We support Postgres exclusively and work with it every day, so we have accumulated a range of operational expertise.

At the end of 2018 we gradually began to use Patroni, and some experience has accumulated: we diagnosed it, tuned it, and arrived at our own best practices. In this report I will talk about them.

Apart from Postgres, I love Linux. I like to poke around in it and explore it, and I like to build kernels. I love virtualization, containers, Docker, Kubernetes. All of this interests me, because old admin habits die hard. I like to deal with monitoring. And I love the Postgres things related to administration, i.e. replication and backup. In my spare time I write in Go. I am not a software engineer, I just write in Go for myself, and it gives me pleasure.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

  • I think many of you know that Postgres does not have HA (High Availability) out of the box. To get HA, you need to install something, configure it, make an effort and get it.
  • There are several tools, and Patroni is one of them that solves HA very well. Set it all up in a test lab and run it, and you can see that it works: you can reproduce some problems and watch how Patroni handles them. And you will see that it all works great.
  • But in practice, we faced different problems. And I will talk about these problems.
  • I'll tell you how we diagnosed it, what we tweaked - whether it helped us or not.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

  • I will not tell you how to install Patroni, because you can google it: you can look at the configuration files to understand how it starts and how it is configured, and you can find the schemes and architectures described on the Internet.
  • I will not talk about someone else's experience. I will only talk about the problems that we faced.
  • And I will not talk about problems that lie outside of Patroni and PostgreSQL. If, for example, there are problems with load balancing after the cluster topology has changed, I will not talk about that.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And a small disclaimer before we start our report.

All these problems that we encountered, we had them in the first 6-7-8 months of operation. Over time, we came to our internal best practices. And our problems disappeared. Therefore, the report was announced about six months ago, when it was all fresh in my head and I remembered it all perfectly.

While preparing the report, I dug up the old postmortems and looked through the logs. Some of the details may have been forgotten, and some were never fully investigated during the analysis of the problems, so at some points it may seem that a problem has not been examined completely, or that some information is missing. I ask you to excuse me for that.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

What is Patroni?

  • This is a template for building HA. That's what it says in the documentation. And from my point of view, this is a very correct clarification. Patroni is not a silver bullet that will solve all your problems, that is, you need to make an effort to make it work and bring benefits.
  • This is an agent service that is installed on every database server and acts as a kind of init system for your Postgres. It starts Postgres, stops it, restarts it, reconfigures it, and changes the topology of your cluster.
  • Accordingly, in order to store the state of the cluster and its current representation, some kind of storage is needed. And here Patroni took the path of storing state in an external system: a distributed configuration store. It can be Etcd, Consul, ZooKeeper, or the Kubernetes API (backed by Etcd), i.e. one of these options.
  • And one of the features of Patroni is that you get automatic failover out of the box, just by configuring it. If we take Repmgr for comparison: with Repmgr we get switchover, but if we want automatic failover, we need to configure it additionally. Patroni has automatic failover out of the box.
  • And there are many other things: managing configuration, provisioning new replicas, backups, etc. But this is beyond the scope of the report, so I will not talk about it.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And to sum up briefly: the main task of Patroni is to perform automatic failover well and reliably, so that our cluster remains operational and the application does not notice changes in the cluster topology.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

But when we start using Patroni, our system gets a little more complicated. If earlier we just had Postgres, then with Patroni we get Patroni itself and we get the DCS where the state is stored. And it all has to work somehow. So what can go wrong?

What may break:

  • Postgres may break. It can be the master or a replica; one of them may fail.
  • Patroni itself may break.
  • The DCS where the state is stored may break.
  • And the network may break.

All these points I will consider in the report.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I will go through the cases in order of increasing complexity, not in terms of how many components a case involves, but in terms of subjective feeling: this case was difficult for me and hard to take apart, while another case was light and easy to analyze.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And the first case is the easiest. This is the case when we took a database cluster and deployed our DCS storage on the same cluster. This is the most common mistake. This is a mistake in building architectures, i.e., combining different components in one place.

So, there was a failover; let's go and figure out what happened.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And here we are interested in when the failover happened, i.e. the moment in time when the cluster state changed.

But a failover is not always instantaneous, i.e. it does not take zero time; it can drag on and last a while.

Therefore, it has a start time and an end time, i.e. it is an event with a duration. And we divide all events into three intervals: the time before the failover, during the failover, and after the failover. That is, we consider all events on this timeline.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And the first thing we do when a failover has happened is look for the cause, i.e. what led to the failover.

If we look at the logs, they will be classic Patroni logs. Patroni tells us in them that the server has become the master, i.e. the master role has passed to this node. Here it is highlighted.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Next, we need to understand why the failover happened, i.e. what events caused the master role to move from one node to another. And in this case, everything is simple: we have an error in interacting with the storage system. The master realized that it could not work with DCS, that there was some kind of problem with the interaction, so it says that it can no longer be the master and resigns. The line β€œdemoted self” says exactly that.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

If we look at the events that preceded the failover, we can see the very errors that prevented the master from continuing in its role.

If we look at the Patroni logs, we will see a lot of errors and timeouts, i.e. the Patroni agent cannot work with DCS. In this case it is the Consul agent, which communicates on port 8500.

And the problem here is that Patroni and the database are running on the same host, and the Consul servers were launched on the same node. By creating load on the server, we created problems for the Consul servers as well. They couldn't communicate properly.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

After some time, when the load subsided, our Patroni was able to communicate with the agents again and normal work resumed. The same Pgdb-2 server became the master again. That is, there was a small flip-flop: the node gave up the master role and then took it back, and everything returned to how it was.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

This can be regarded as a false alarm, or it can be regarded as Patroni doing everything right: it realized that it could not maintain the state of the cluster and stepped down as master.

And the problem here arose because the Consul servers are on the same hardware as the databases. Accordingly, any load, whether it is load on the disks or on the processors, also affects the interaction with the Consul cluster.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

So we decided that these things should not live together, and we allocated a separate cluster for Consul. Patroni then worked with a separate Consul, i.e. there was a separate Postgres cluster and a separate Consul cluster. This is the basic guidance on how to run and keep all these things: do not let them live together.

As an option, you can tune the parameters ttl, loop_wait and retry_timeout, i.e. try to ride out short-term load peaks by increasing them. But this is not the most suitable option, because the load can last a long time, we will simply go beyond the limits of these parameters, and it might not really help.
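To illustrate how these three parameters relate, here is a minimal sketch, assuming the commonly cited rule of thumb from the Patroni documentation that ttl should be at least loop_wait + 2 * retry_timeout; the concrete values below are hypothetical examples, not recommendations.

```python
# Hedged sketch: sanity-check Patroni timing parameters before raising them.
# Assumes the rule of thumb ttl >= loop_wait + 2 * retry_timeout; the values
# below are hypothetical examples, not recommendations.

def check_patroni_timings(ttl: int, loop_wait: int, retry_timeout: int) -> None:
    budget = loop_wait + 2 * retry_timeout
    print(f"ttl={ttl}s, loop_wait + 2*retry_timeout = {budget}s")
    if ttl < budget:
        print("WARNING: ttl is too small; the leader key may expire during a "
              "single HA loop with retries, causing spurious failovers")
    else:
        print(f"OK: roughly {ttl - budget}s of slack to survive short DCS hiccups")

# Defaults (ttl=30, loop_wait=10, retry_timeout=10) versus a 'stretched' variant
check_patroni_timings(30, 10, 10)
check_patroni_timings(60, 10, 20)
```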

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

The first problem, as you understand, is simple: we put the DCS on the same hosts as the database and got a problem.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

The second problem is similar to the first. It is similar in that we again have problems interacting with the DCS.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

If we look at the logs, we will see that we again have a communication error. And Patroni says: I cannot interact with DCS, so the current master goes into replica mode.

The old master becomes a replica, and here Patroni does its job as it should: it runs pg_rewind to rewind the transaction log and then connects to the new master to catch up with it.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Here we must find the place that preceded the failover, i.e. the errors that caused the failover. And in this respect Patroni logs are quite convenient to work with: Patroni writes the same messages at a regular interval, and if we scroll through these logs quickly, we will notice where they change, which means some problems have begun. We quickly return to that place and see what is happening.

In a normal situation, the logs look something like this: the owner of the lock is checked, and if the owner has, for example, changed, then some events may occur that Patroni must respond to. But in this case everything is fine. We are looking for the place where the errors started.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Having scrolled to the point where the errors began to appear, we see that an automatic failover has taken place. And since our errors were related to interaction with DCS, and in our case we used Consul, we also look at the Consul logs to see what happened there.

Roughly matching the time of the failover with the time in the Consul logs, we see that our neighbors in the Consul cluster began to doubt the existence of other members of the Consul cluster.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And if you look at the logs of other Consul agents, you can also see that some kind of network collapse is happening there: all members of the Consul cluster doubt each other's existence. And this was the trigger for the failover.

If you look at what happened before these errors, you can see all sorts of errors, for example deadline errors and RPC failed, i.e. there is clearly some kind of problem in the interaction of the Consul cluster members with each other.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

The simplest answer is to fix the network. But for me, standing on stage, it is easy to say that. The circumstances are such that the customer cannot always afford to fix the network: they may be hosted in a DC and may not be able to touch the network or the equipment. And so some other options are needed.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

There are options:

  • The simplest option, which is described, in my opinion, even in the documentation, is to disable the Consul checks, that is, simply pass an empty array. We tell the Consul agent not to use any checks. Without these checks we can ride out these network storms and not initiate a failover.
  • Another option is to tune raft_multiplier. This is a parameter of the Consul server itself. By default it is set to 5, and this value is what the documentation recommends for staging environments. In effect, it controls how frequently the members of the Consul cluster exchange service messages. For production it is recommended to lower it so that the nodes exchange messages more often.
  • Another option we came up with is to increase the priority of the Consul processes relative to other processes for the operating system's process scheduler. There is the "nice" value, which determines the process priority taken into account by the OS scheduler. We lowered the nice value for the Consul agents, i.e. raised their priority, so that the operating system gives the Consul processes more time to run and execute their code. In our case this solved the problem (a small sketch of this idea follows after the list).
  • Another option is not to use Consul. I have a friend who is a big proponent of Etcd, and we regularly argue about which is better, Etcd or Consul. On this point we usually agree that Consul has an agent that must run on each node with a database: Patroni's interaction with the Consul cluster goes through this agent, and the agent becomes a bottleneck. If something happens to the agent, Patroni can no longer work with the Consul cluster, and that is the problem. With Etcd there is no such agent: Patroni can work directly with a list of Etcd servers and communicate with them. In this regard, if you use Etcd in your company, then Etcd will probably be a better choice than Consul. But we are always limited by what our customers have chosen and use, and for the most part our clients use Consul.
  • And the last point is to revise the parameter values. We can raise these parameters in the hope that our network problems will be short and will not fall outside their range. This way we reduce how aggressively Patroni initiates an automatic failover when network problems occur.
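As a minimal sketch of the "nice" option above: roughly how one might raise the priority of Consul processes. The process name and the priority value are assumptions, and in practice this can just as well be done with systemd settings or plain renice by hand.

```python
# Hedged sketch: raise the scheduling priority of Consul processes so the OS
# scheduler still gives them CPU time when the database loads the host.
# Process name and niceness value are assumptions for illustration.
import subprocess

def renice_processes(name: str, niceness: int = -5) -> None:
    # pgrep returns one PID per line for processes with exactly this name
    result = subprocess.run(["pgrep", "-x", name], capture_output=True, text=True)
    pids = result.stdout.split()
    if not pids:
        print(f"no '{name}' processes found")
        return
    # setting a negative nice value requires appropriate privileges
    subprocess.run(["renice", "-n", str(niceness), "-p", *pids], check=True)
    print(f"set nice={niceness} for {name} pids: {', '.join(pids)}")

if __name__ == "__main__":
    renice_processes("consul", -5)
```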

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I think many who use Patroni are familiar with this command.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

This command shows the current state of the cluster. And at first glance, this picture may seem normal. We have a master, we have a replica, there is no replication lag. But this picture is normal exactly until we know that this cluster should have three nodes, not two.
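As a complement to patronictl list, here is a hedged sketch of how the same check could be scripted against the Patroni REST API, so that "two nodes instead of three" does not go unnoticed; the endpoint URL, port and expected member count are assumptions for illustration.

```python
# Hedged sketch: the same check as 'patronictl list', but scripted against the
# Patroni REST API. URL, port and the expected member count are assumptions.
import json
import urllib.request

PATRONI_URL = "http://127.0.0.1:8008/cluster"   # any cluster member will do
EXPECTED_MEMBERS = 3
MAX_LAG_BYTES = 16 * 1024 * 1024

def check_cluster() -> None:
    with urllib.request.urlopen(PATRONI_URL, timeout=5) as resp:
        cluster = json.load(resp)
    members = cluster.get("members", [])
    if len(members) != EXPECTED_MEMBERS:
        print(f"ALERT: {len(members)} members visible, expected {EXPECTED_MEMBERS}")
    for m in members:
        lag = m.get("lag", 0)           # leaders have no lag field
        if isinstance(lag, int) and lag > MAX_LAG_BYTES:
            print(f"ALERT: {m['name']} lags by {lag} bytes")
    print("roles:", {m["name"]: m["role"] for m in members})

if __name__ == "__main__":
    check_cluster()
```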

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Accordingly, there was an automatic failover, and after it one of our replicas disappeared. We need to find out why it disappeared and bring it back, restore it. So we again go to the logs and look at why we had the automatic failover.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

In this case, the second replica became the master. It's all right here.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And we need to look at the replica that fell off and is not in the cluster. We open the Patroni logs and see that we had a problem at the pg_rewind stage of rejoining the cluster. To rejoin the cluster, the node needs to rewind the transaction log, request the required transaction log from the master, and use it to catch up with the master.

In this case, the required transaction log is not there and the replica cannot start. Postgres stops with an error, and that is why the node is not in the cluster.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

We need to understand why it is not in the cluster and why there were no logs. We go to the new master and look at what is in its logs. It turns out that while pg_rewind was being done, a checkpoint occurred, and some of the old transaction logs were simply renamed. When the old master tried to connect to the new master and request these logs, they had already been renamed, they simply did not exist.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I compared the timestamps of these events, and the difference is literally about 150 milliseconds: the checkpoint completed at the .369 millisecond mark and the WAL segments were renamed, and at .517, roughly 150 milliseconds later, rewind started on the old master. That is, literally 150 milliseconds were enough for the replica to be unable to connect and start working.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

What are my options?

We initially used replication slots and thought that was good. Then, at the first stage of operation, we turned the slots off: it seemed to us that if the slots retain a lot of WAL segments, they could take the master down. It would simply fall over. We suffered for some time without slots, realized that we do need them, and brought the slots back.

But there is a problem here: when the master is demoted to a replica, it deletes the slots and deletes the WAL segments along with them. And to eliminate this problem, we decided to raise the wal_keep_segments parameter. It defaults to 8 segments. We raised it to 1000 and looked at how much free space we had; we sacrificed 16 gigabytes for wal_keep_segments. That is, when switching, we always have a reserve of 16 gigabytes of transaction logs on all nodes.
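A quick back-of-the-envelope check of why 1000 kept segments corresponds to roughly those 16 gigabytes, assuming the default 16 MB WAL segment size:

```python
# Back-of-the-envelope check: how much disk does wal_keep_segments reserve?
# Assumes the default WAL segment size of 16 MB (it can differ if the cluster
# was initialized with a non-default --wal-segsize).
wal_segment_size_mb = 16
wal_keep_segments = 1000

reserve_gb = wal_keep_segments * wal_segment_size_mb / 1024
print(f"{wal_keep_segments} segments * {wal_segment_size_mb} MB = ~{reserve_gb:.1f} GB")
# -> 1000 segments * 16 MB = ~15.6 GB, i.e. roughly the 16 GB mentioned above
```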

And besides, this is still relevant for long-term maintenance tasks. Let's say we need to update one of the replicas and we want to turn it off: we need to update the software, maybe the operating system, something else. And when we turn off a replica, the slot for that replica is also removed. If we use a small wal_keep_segments, then during a long absence of the replica the transaction logs it needs will be lost. We will bring the replica back up, it will request the transaction logs from where it stopped, but they may no longer be on the master, and then the replica will not be able to connect either. Therefore, we keep a large reserve of transaction logs.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky


We have a production database with projects already running on it.

There was a failover. We went in and looked: everything seems in order, the replicas are in place, there is no replication lag. There are no errors in the logs either, everything is in order.

But the product team says that certain data should be there: they see it in one of their sources, but they don't see it in the database. And we need to understand what happened to it.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

It is clear that pg_rewind wiped it out. We understood this immediately, but went to see what was actually happening.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

In the logs we can always find when the failover happened, who became the master, and we can determine who the old master was and when it wanted to become a replica; we need these logs to work out the amount of transaction log that was lost.

Our old master had rebooted, and Patroni was registered to start automatically. Patroni started, and it then started Postgres. More precisely, before starting Postgres and before making it a replica, Patroni launched the pg_rewind process. Accordingly, it erased part of the transaction logs, downloaded the new ones and connected. Here Patroni worked cleverly, i.e. as expected. The cluster was restored: we had 3 nodes, and after the failover we had 3 nodes. Everything is cool.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

But we have lost some data, and we need to understand how much. We look for the moment when the rewind happened; we can find it in log entries like these: rewind started, did something, and finished.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

We need to find the position in the transaction log where the old master left off. In this case it is this mark. And we need a second mark, on the new master, i.e. the distance by which the old master diverged from the new one.

We take the usual pg_wal_lsn_diff and compare these two marks. In this case we get 17 megabytes. Whether that is a lot or a little, everyone decides for themselves, because for some 17 megabytes is not much, while for others it is a lot and unacceptable. Each person determines this individually according to the needs of the business.
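As a hedged sketch of that comparison: roughly how the distance between the two marks can be computed with pg_wal_lsn_diff(). The two LSN values and the connection string are hypothetical, and psycopg2 is assumed to be available.

```python
# Hedged sketch: measure how far the old master diverged from the new one,
# using pg_wal_lsn_diff() as described above. The two LSNs are hypothetical
# values taken from the logs; psycopg2 is assumed to be installed.
import psycopg2

old_master_lsn = "2A/9C2A8D20"   # where the old master left off (from its log)
new_master_lsn = "2A/9D3B1F58"   # mark on the new master (from the rewind log)

conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)",
        (new_master_lsn, old_master_lsn),
    )
    diff_bytes = cur.fetchone()[0]
    print(f"divergence: {diff_bytes} bytes (~{diff_bytes / 1024 / 1024:.1f} MB)")
conn.close()
```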

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

But what have we found out for ourselves?

First, we must decide for ourselves: do we always want Patroni to start automatically after a system reboot? It often happens that we need to go to the old master first, see how far ahead it has gone, perhaps inspect segments of the transaction log and see what is there, in order to understand whether we can afford to lose this data or whether we need to run the old master in standalone mode to pull this data out.

And only after that we must decide whether we can discard this data or we can restore it, connect this node as a replica to our cluster.

In addition, there is a "maximum_lag_on_failover" parameter. By default, if my memory serves me, this parameter has a value of 1 megabyte.

How does it work? If our replica is behind the master by more than 1 megabyte of replication lag, it does not take part in the election. If suddenly there is a failover, Patroni looks at which replicas are lagging behind; those that are behind by a large amount of transaction log cannot become the master. This is a very good safety feature that prevents you from losing a lot of data.
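A purely conceptual sketch of that decision (this is not Patroni's actual code, just an illustration of the rule described above; the numbers are made up):

```python
# Conceptual sketch of the behaviour described above (not Patroni's code):
# a replica whose reported lag exceeds maximum_lag_on_failover does not take
# part in the leader election. Numbers are illustrative.
MAXIMUM_LAG_ON_FAILOVER = 1_048_576  # 1 MB, the default mentioned above

def may_run_for_leader(replica_lag_bytes: int) -> bool:
    """Return True if the replica is allowed to participate in the election."""
    return replica_lag_bytes <= MAXIMUM_LAG_ON_FAILOVER

candidates = {"pgsql01": 120_000, "pgsql02": 5_242_880}  # bytes behind the leader
for name, lag in candidates.items():
    verdict = "can participate" if may_run_for_leader(lag) else "excluded (lag too big)"
    print(f"{name}: lag={lag} bytes -> {verdict}")
```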

But there is a problem: the replication lag recorded for the cluster in DCS is updated only at a certain interval; I think 30 seconds is the default ttl value.

Accordingly, there may be a situation where DCS holds one replication lag value for a replica, while in fact the lag is completely different, or there is no lag at all. That is, this value is not realtime, it does not always reflect the real picture, and it is not worth building fancy logic on top of it.

And the risk of loss always remains: there is one formula for the worst case and another for the average case. That is, when we plan a Patroni deployment and estimate how much data we can lose, we have to rely on these formulas and have a rough idea of how much data we could lose.

And there is good news. When the old master has gone ahead, it may have gone ahead because of some background processes. For example, there was an autovacuum, it wrote some data and saved it to the transaction log. And we can easily ignore and lose that data; there is no problem in that.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And this is what the logs look like when maximum_lag_on_failover is set, a failover has occurred and a new master needs to be elected. The replica assesses itself as unfit to take part in the election, refuses to run for leader, and waits for a new master to be elected so that it can then connect to it. This is an additional measure against data loss.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky


Here we have a product team who wrote that their product is having problems with Postgres. At the same time, the master itself cannot be reached, because it is not available via SSH. And automatic failover does not happen either.

This host was forcibly rebooted. Because of the reboot an automatic failover happened, although, as I understand now, it would have been possible to do a manual failover. And after the reboot we went to see what had happened with the master.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

At the same time, we knew in advance that we had problems with disks, that is, we already knew from monitoring where to dig and what to look for.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

We got into the Postgres log and started looking at what was happening there. We saw commits lasting one, two, three seconds, which is not normal at all. We saw that autovacuum was starting very slowly and strangely. And we saw temporary files on disk. That is, these are all indicators of disk problems.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

We looked into the system dmesg (the kernel log) and saw that we had problems with one of the disks. The disk subsystem was software RAID. We looked at /proc/mdstat and saw that one drive was missing: a RAID of 8 disks, with one missing. If you look carefully at the slide, you can see in the output that sde is not there. The disk, so to speak, dropped out. This triggered the disk problems, and the applications also had problems working with the Postgres cluster.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And in this case, Patroni would not help us in any way, because Patroni does not have the task of monitoring the state of the server, the state of the disk. And we must monitor such situations by external monitoring. We quickly added disk monitoring to external monitoring.
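As a minimal sketch of the kind of external disk check we added, assuming Linux software RAID: parse /proc/mdstat and flag degraded arrays (an underscore in the [UU_U] status string means a missing member, like the dropped-out sde in this story).

```python
# Minimal sketch of an external disk check: parse /proc/mdstat and flag
# degraded software-RAID arrays. An underscore in the "[UU_U]" status string
# means a missing member.
import re

def degraded_md_arrays(mdstat_path: str = "/proc/mdstat") -> list:
    degraded = []
    with open(mdstat_path) as f:
        text = f.read()
    # status lines look like: "7130624 blocks ... [8/7] [UUUUU_UU]"
    for array, status in re.findall(r"^(md\d+).*?\[([U_]+)\]", text, re.S | re.M):
        if "_" in status:
            degraded.append(array)
    return degraded

if __name__ == "__main__":
    bad = degraded_md_arrays()
    print("degraded arrays:", bad or "none")
```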

And there was a thought: could fencing or watchdog software have helped us? We concluded that it would hardly have helped in this case, because during the problems Patroni continued to interact with the DCS cluster and saw no problem. That is, from the point of view of DCS and Patroni everything was fine with the cluster, although in fact there were problems with the disk and with the availability of the database.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

In my opinion, this is one of the strangest problems I have investigated: I spent a very long time on it, read a lot of logs, picked through it over and over, and ended up calling it "the cluster simulator".

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

The problem was that the old master could not become a normal replica: Patroni started it, Patroni showed that this node was present as a replica, but at the same time it was not a normal replica. Now you will see why. This is what I have kept from the analysis of that problem.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And how did it all start? It started, as in the previous problem, with the disks slowing down: commits were taking a second or two.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Connections were being dropped, i.e. clients were being disconnected.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

There were locks of varying severity.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And, accordingly, the disk subsystem is not very responsive.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And the most mysterious thing for me is the immediate shutdown request that arrived. Postgres has three shutdown modes (the signals behind these modes are sketched right after the list):

  • Smart, the graceful one, when we wait for all clients to disconnect on their own.
  • Fast, when we force the clients to disconnect because we are going to shut down.
  • And immediate. In this case, immediate doesn't even tell the clients to disconnect, it just shuts down without warning, and the operating system then sends all clients an RST packet (a TCP message saying that the connection has been terminated and there is nothing more for the client to receive).
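For reference, the three modes correspond to the signals the postmaster reacts to; this is standard PostgreSQL behaviour, and pg_ctl stop -m <mode> simply sends the corresponding signal to the postmaster process.

```python
# Short reference for the three shutdown modes listed above and the signals the
# postmaster reacts to (pg_ctl stop -m <mode> sends the corresponding signal).
import signal

SHUTDOWN_MODES = {
    "smart":     signal.SIGTERM,  # wait for clients to disconnect on their own
    "fast":      signal.SIGINT,   # abort client sessions, then shut down cleanly
    "immediate": signal.SIGQUIT,  # stop right away, recovery will run on next start
}

for mode, sig in SHUTDOWN_MODES.items():
    print(f"{mode:<9} -> {sig.name} ({int(sig)})")
```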

Who sent this request? Postgres background processes do not send such signals to each other; they only react to them. That is, this was an emergency shutdown of Postgres triggered from outside. Who sent it, I don't know.

I looked at the "last" command and saw one person who had also logged into this server with us, but I was too shy to ask them. Perhaps it was a kill -9, but I would expect to see kill -9 in the logs, because Postgres reports when it takes one, and I didn't see that in the logs.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Looking further, I saw that Patroni did not write to the log for quite a long time - 54 seconds. And if we compare two timestamps, there were no messages for about 54 seconds.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And during this time there was an automatic failover. Patroni did a great job here again: our old master was unavailable, something had happened to it, and the election of a new master began. Everything worked out well here. Our pgsql01 became the new leader.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

So we have one replica that became the master, and there is a second replica, and with that second replica there were problems. It tried to reconfigure itself: as I understand it, it tried to change recovery.conf, restart Postgres and connect to the new master. Every 10 seconds it writes messages that it is trying, but it is not succeeding.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And during these attempts, an immediate-shutdown signal arrives at the old master and the old master is restarted. Recovery also stops, because the old master goes into reboot, i.e. the replica cannot connect to it, because it is shutting down.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

At some point, it worked, but replication did not start.

My only guess is that recovery.conf still contained the old master's address, and when the new master appeared, the second replica kept trying to connect to the old master.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

When Patroni started up on the second replica, the node started up but could not replicate. And a replication lag was formed, which looked something like this. That is, all three nodes were in place, but the second node lagged behind.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

At the same time, if you look at the logs that were written, you could see that replication could not start because the transaction logs did not match: the transaction logs offered by the master specified in recovery.conf simply did not fit our current node.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And here I made a mistake. I should have gone and looked at what was in recovery.conf to test my hypothesis that we were connecting to the wrong master. But at that time I was busy dealing with the incident and it did not occur to me, or I saw that the replica was lagging and would have to be re-initialized anyway; in short, I worked carelessly. That was my screw-up.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

After 30 minutes I came at it again as an admin and restarted Patroni on the replica. I had already written it off, I thought it would have to be re-initialized, but I figured: I'll restart Patroni, maybe something good will come of it. Recovery started, and the database even opened and was ready to accept connections.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Replication started. But a minute later it broke off with an error saying that the transaction logs were not suitable for it.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I thought I'd restart again. I restarted Patroni again, and I did not restart Postgres, but restarted Patroni in the hope that it would magically start the database.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

The replication started again, but the marks in the transaction log were different, they were not the same as the previous start attempt. Replication stopped again. And the message was already slightly different. And it was not very informative for me.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

And then it occurred to me: what if I restart Postgres and, at that moment, make a checkpoint on the current master to move the point in the transaction log a little forward, so that recovery starts from another position? Plus, we still had reserves of WAL.
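A hedged sketch of that manual step: issue CHECKPOINT on the current master and see where the WAL position ends up afterwards. The host name and connection parameters are placeholders, and CHECKPOINT typically requires superuser privileges.

```python
# Hedged sketch of the manual step described above: issue CHECKPOINT on the
# current master so that the point in the WAL moves forward before the replica
# is restarted. Connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect("host=new-master.example.net dbname=postgres user=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CHECKPOINT;")                    # typically requires superuser
    cur.execute("SELECT pg_current_wal_lsn();")   # see where the WAL is now
    print("WAL position after checkpoint:", cur.fetchone()[0])
conn.close()
```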

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

I restarted Patroni, did a couple of checkpoints on the master and a couple of restart points on the replica when it opened. And it helped. I thought for a long time about why it helped and how it worked. But the replica started, and replication was no longer breaking.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

For me this is one of the more mysterious problems, and I still puzzle over what really happened there.

What are the implications here? Patroni can work as intended and without any errors. But at the same time, this is not a 100% guarantee that everything is fine with us. The replica may start, but it may be in a semi-working state, and the application cannot work with such a replica, because there will be old data.

And after a failover, you always need to check that everything is in order with the cluster: that the required number of replicas is there, and that there is no replication lag.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Having gone through these issues, let me give some recommendations. I tried to combine them into two slides; probably all the stories could have been compressed into these two slides alone.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

When you use Patroni, you must have monitoring. You should always know when an automatic failover has occurred, because if you don't know that you had one, you have no control over the cluster. And that is bad.

After each failover, we always have to check the cluster manually. We need to make sure that we have the right number of replicas, that there is no replication lag, and that there are no errors in the logs related to streaming replication, to Patroni or to the DCS.

Automation can work successfully; Patroni is a very good tool. It can do its job, but the cluster may still not end up in the desired state. And if we don't find out about it, we will be in trouble.

And Patroni is not a silver bullet. We still need to understand how Postgres works, how replication works and how Patroni works with Postgres, and how communication between nodes is provided. This is necessary in order to be able to fix problems with your hands.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

How do I approach diagnostics? It so happens that we work with different clients, none of them has an ELK stack, and we have to sort through the logs by opening 6 consoles and 2 tabs: in one tab the Patroni logs for each node, in the other tab the Consul logs, or the Postgres logs if needed. Diagnosing things this way is very difficult.

What approaches have I developed? First, I always look at when the failover arrived. For me it is a watershed: I look at what happened before the failover, during the failover, and after the failover. A failover has two marks: its start time and end time.

Next, I look in the logs for the events that preceded the failover, i.e. for the reasons the failover happened.

And this gives a picture of what happened and what can be done in the future so that the same circumstances do not occur (and, as a result, there is no failover).

And where do we usually look? I look:

  • First, to the Patroni logs.
  • Next, I look at the Postgres logs, or the DCS logs, depending on what was found in the Patroni logs.
  • And the system logs also sometimes help to understand what caused the failover.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

How do I feel about Patroni? I have a very good relationship with Patroni. In my opinion, this is the best there is today. I know many other products. These are Stolon, Repmgr, Pg_auto_failover, PAF. 4 tools. I tried them all. Patroni is my favorite.

If they ask me: "Do I recommend Patroni?". I will say yes, because I like Patroni. And I think I learned how to cook it.

If you are interested in seeing what other problems people hit with Patroni besides the ones I've mentioned, you can always check out the issues page on GitHub. There are many different stories and many interesting issues discussed there, and as a result some bugs were filed and fixed. It is an interesting read.

There are some interesting stories about people shooting themselves in the foot. Very informative. You read them and understand that you should not do that. I made a mental note for myself.

And I would like to say a big thank you to Zalando for developing this project, namely to Alexander Kukushkin and Alexey Klyukin. Aleksey Klyukin is one of the co-authors, he no longer works at Zalando, but these are two people who started working with this product.

And I think that Patroni is a very cool thing. I am happy that it exists, and it is interesting to work with. And a big thank you to all the contributors who send patches to Patroni. I hope that Patroni will become more mature, cool and efficient with age. It is already functional, but I hope it will get even better. So if you plan to use Patroni, do not be afraid. It is a good solution, it can be implemented and used.

That's all. If you have questions, ask.

Patroni Failure Stories or How to crash your PostgreSQL cluster. Alexey Lesovsky

Questions

Thanks for the report! If after a failover you still need to look at everything very carefully, then why do we need automatic failover at all?

Because it is still new to us: we have only been working with it for about a year. Better to be safe. We want to come in and see that everything really worked out the way it should. This is a healthy level of distrust: it is better to double-check and see.

For example, we went in the morning and looked, right?

Not in the morning; we usually learn about an automatic failover almost immediately. We receive notifications, we see that an automatic failover has occurred, and we go and look almost right away. But all these checks should be brought up to the monitoring level. If you access Patroni via the REST API, there is a history. From the history you can see the timestamps when failovers happened, and monitoring can be built on that: you look at the history and how many events there are, and if there are more events than before, an automatic failover has occurred. You can go and look. Or our monitoring automation has checked that all the replicas are in place, there is no lag, and everything is fine.
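A hedged sketch of that monitoring idea: poll the Patroni REST API /history endpoint and alert when new entries appear. The URL and the way the previous count is persisted are assumptions for illustration.

```python
# Hedged sketch: poll the Patroni REST API /history endpoint and raise an alert
# when new failover/switchover entries appear. URL and state file are assumptions.
import json
import urllib.request

HISTORY_URL = "http://127.0.0.1:8008/history"
STATE_FILE = "/tmp/patroni_history_count"

def check_failover_history() -> None:
    with urllib.request.urlopen(HISTORY_URL, timeout=5) as resp:
        history = json.load(resp)   # list of entries: [timeline, lsn, reason, timestamp, ...]
    try:
        with open(STATE_FILE) as f:
            previous = int(f.read().strip() or 0)
    except FileNotFoundError:
        previous = 0
    if len(history) > previous:
        print(f"ALERT: {len(history) - previous} new failover/switchover event(s):")
        for entry in history[previous:]:
            print("  ", entry)
    with open(STATE_FILE, "w") as f:
        f.write(str(len(history)))

if __name__ == "__main__":
    check_failover_history()
```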

Thank you!

Thanks a lot for the great story! If we moved the DCS cluster somewhere far from the Postgres cluster, then this cluster also needs to be serviced periodically? What are the best practices that some pieces of the DCS cluster need to be turned off, something to do with them, etc.? How does this whole structure survive? And how do you do these things?

For one company, it was necessary to make a matrix of problems, what happens if one of the components or several components fail. According to this matrix, we sequentially go through all the components and build scenarios in case of failure of these components. Accordingly, for each failure scenario, you can have an action plan for recovery. And in the case of DCS, it comes as part of the standard infrastructure. And the admin administers it, and we already rely on the admins who administer it and their ability to fix it in case of accidents. If there is no DCS at all, then we deploy it, but at the same time we don’t particularly monitor it, because we are not responsible for the infrastructure, but we give recommendations on how and what to monitor.

That is, did I understand correctly that I need to disable Patroni, disable failover, disable everything before doing anything with the hosts?

It depends on how many nodes we have in the DCS cluster. If there are many nodes and we disable only one of them (a replica), then the cluster keeps its quorum, Patroni remains operational, and nothing is triggered. If we have some complex operations that affect more nodes and whose absence could ruin the quorum, then yes, it might make sense to put Patroni on pause. It has a corresponding command: patronictl pause, patronictl resume. We simply pause it and automatic failover does not fire during that time. We do the maintenance on the DCS cluster, then lift the pause and continue living.

Thank you very much!

Thank you very much for your report! How does the product team feel about data being lost?

Product teams don't care, and team leads are worried.

What guarantees are there?

Guarantees are very difficult. Alexander Kukushkin has a report β€œHow to calculate RPO and RTO”, i.e. recovery time and how much data we can lose. I think we need to find these slides and study them. As far as I remember, there are specific steps on how to calculate these things. How many transactions we can lose, how much data we can lose. As an option, we can use synchronous replication at the Patroni level, but this is a double-edged sword: we either have data reliability, or we lose speed. There is synchronous replication, but it also does not guarantee 100% protection against data loss.

Alexey, thanks for the great report! Any experience of using Patroni for zero-data-loss protection, i.e. together with a synchronous standby? That is the first question. And the second question: you have used different solutions. We used Repmgr, but without automatic failover, and now we are planning to enable it. And we are considering Patroni as an alternative solution. What can you name as advantages compared to Repmgr?

The first question was about synchronous replicas. No one uses synchronous replication here, because everyone is scared (Several clients are already using it, in principle, they didn’t notice performance problems - Speaker's note). But we have developed a rule for ourselves that there should be at least three nodes in a synchronous replication cluster, because if we have two nodes and if the master or replica fails, then Patroni switches this node to Standalone mode so that the application continues to work. In this case, there is a risk of data loss.

Regarding the second question: we have used Repmgr and still do with some clients for historical reasons. What can be said? Patroni comes with automatic failover out of the box; with Repmgr, automatic failover is an additional feature that needs to be enabled. You need to run the Repmgr daemon on each node, and then you can configure automatic failover.

Repmgr checks whether Postgres nodes are alive, and the Repmgr processes check for each other's existence, which is not a very efficient approach: there can be complex cases of network isolation in which a large Repmgr cluster falls apart into several smaller ones that keep working. I have not been following Repmgr for a long time; maybe this has been fixed, or maybe not. But keeping the information about the cluster state in a DCS, as Stolon and Patroni do, is the most viable option.

Alexey, I have a question, maybe a silly one. In one of the first examples you moved the DCS from the local machines to a remote host. We understand that the network is a thing with its own characteristics, it lives its own life. And what happens if for some reason the DCS cluster becomes unavailable? I won't go into the reasons, there can be many of them: from the clumsy hands of network engineers to real problems.

I didn't say it out loud, but the DCS cluster must also be highly available, i.e. have an odd number of nodes so that a quorum can be reached. What happens if the DCS cluster becomes unavailable or a quorum cannot be reached, e.g. there is some kind of network split or node failure? In this case, the Patroni cluster goes into read-only mode. Patroni cannot determine the state of the cluster and what to do; it cannot contact the DCS and store the new cluster state there, so the entire cluster goes read-only and waits either for manual intervention from the operator or for the DCS to recover.

Roughly speaking, DCS becomes a service for us as important as the base itself?

Yes, yes. In many modern companies Service Discovery is an integral part of the infrastructure. It is often deployed even before there is a database in the infrastructure: relatively speaking, the infrastructure is launched, deployed in the DC, and Service Discovery is there right away. If it is Consul, DNS can be built on top of it; if it is Etcd, it may be part of a Kubernetes cluster in which everything else is deployed. It seems to me that Service Discovery is already an integral part of modern infrastructures, and people think about it much earlier than about databases.

Thank you!

Source: habr.com
