Modeling failover clusters based on PostgreSQL and Pacemaker

Introduction

Some time ago, I was given the task of developing a failover cluster for PostgreSQL, operating in several data centers connected by fiber within the same city and able to withstand the failure (for example, a blackout) of one data center. As the software responsible for fault tolerance I chose Pacemaker, because it is the official solution from RedHat for creating failover clusters. It is good because RedHat provides support for it, and because this solution is universal (modular). With its help it is possible to provide fault tolerance not only for PostgreSQL but also for other services, either using standard modules or writing them for specific needs.

A reasonable question arose about this decision: how fault-tolerant would such a failover cluster actually be? To investigate this, I developed a test bench that simulates various failures on the cluster nodes, waits for recovery, repairs the failed node, and continues testing in a loop. Initially this project was called hapgsql, but over time I got bored with the name, which has only one vowel. So I began to name the fault-tolerant databases (and the float IPs pointing to them) Krogan (a character from a computer game in which all important organs are duplicated), and the nodes, the clusters and the project itself Tuchanka (the planet where the krogans live).

Management has now approved opening the project to the open source community under the MIT license. The README will soon be translated into English (because Pacemaker and PostgreSQL developers are expected to be the main consumers), and I decided to publish the old Russian version of the README (partially) in the form of this article.


The clusters are deployed on VirtualBox virtual machines. In total, 12 virtual machines will be deployed (36 GiB in total), forming 4 failover clusters (different variants). The first two clusters consist of two PostgreSQL servers located in different data centers and a shared Witness server with a quorum device (hosted on a cheap virtual machine in a third data center) that resolves the 50%/50% uncertainty by casting its vote. The third cluster spans three data centers: one master, two slaves, no quorum device. The fourth cluster consists of four PostgreSQL servers, two per data center: one master, the rest replicas, and it also uses a Witness with a quorum device. The fourth survives the failure of two servers or of one data center. This solution can be scaled up to more replicas if necessary.

The ntpd time service is also reconfigured for fault tolerance, but it uses ntpd's own mechanism (orphan mode). The shared Witness server acts as the central NTP server, distributing its time to all clusters and thereby synchronizing all servers with each other. If Witness fails or becomes isolated, then one of the cluster servers (within its cluster) starts distributing its time. An auxiliary caching HTTP proxy is also raised on Witness; with its help, the other virtual machines have access to Yum repositories. In reality, services such as accurate time and the proxy would most likely be hosted on dedicated servers; in the test bench they are hosted on Witness only to save on the number of virtual machines and on space.
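
For reference, a hedged sketch of what orphan-mode settings might look like in /etc/ntp.conf; the stratum value and peer names are illustrative, not the project's exact configuration:

cat >> /etc/ntp.conf <<'EOF'
# serve time at this orphan stratum if no upstream source (Witness) is reachable
tos orphan 7
# cluster peers keep each other in sync while Witness is gone
peer tuchanka1a
peer tuchanka1b
EOF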

Versions

v0. Works with CentOS 7 and PostgreSQL 11 on VirtualBox 6.1.

Cluster structure

All clusters are designed to be located in several data centers, united into one flat network, and must withstand the failure or network isolation of a single data center. That is why it is impossible to use the standard Pacemaker split-brain protection called STONITH (Shoot The Other Node In The Head), or fencing. Its essence: if the nodes in the cluster begin to suspect that something is wrong with some node (it does not respond or behaves incorrectly), then they forcibly turn it off through "external" devices, for example an IPMI or UPS control card. But this only works in cases where, during a single failure, the IPMI server or UPS keeps working. Here the plan is to protect against a much more catastrophic failure, when the entire data center fails (for example, is de-energized). And with such a failure, all stonith devices (IPMI, UPS, etc.) will not work either.

Instead, the system is based on the idea of a quorum. All nodes have a vote, and only those that can see more than half of all nodes are allowed to work. This number, "half + 1", is called the quorum. If quorum is not reached, the node decides that it is in network isolation and must turn off its resources; this is the split-brain protection. If the software responsible for this behavior does not work, then a watchdog, for example one based on IPMI, should kick in.

If the number of nodes is even (a cluster in two data centers), the so-called 50%/50% (fifty-fifty) uncertainty may arise, when network isolation divides the cluster exactly in half. Therefore, for an even number of nodes, a quorum device is added: an undemanding daemon that can be run on the cheapest virtual machine in a third data center. It gives its vote to one of the segments (the one it can see), and thereby resolves the 50%/50% uncertainty. The server on which the quorum device runs I called Witness (terminology from repmgr, which I liked).
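
As a rough sketch (not the project's deployment scripts), wiring up such a quorum device with pcs on CentOS 7 could look roughly like this, assuming corosync-qnetd runs on a host named witness:

# on the Witness host: run the qnetd daemon that will cast the tie-breaking vote
pcs qdevice setup model net --enable --start
# on a cluster node: attach the cluster to the quorum device
pcs quorum device add model net host=witness algorithm=ffsplit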

Resources can move from place to place, for example from faulty servers to healthy ones, or at the command of system administrators. So that clients know where the resources they need are located (where to connect?), floating IPs (float IPs) are used. These are IPs that Pacemaker can move between nodes (everything is in one flat network). Each of them represents a resource (service) and is located wherever you need to connect to get access to that service (in our case, the database).
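
For illustration, such a float IP in Pacemaker is typically an IPaddr2 resource; a minimal sketch, with the resource name and address made up for the example:

# a floating IP that Pacemaker can move between nodes together with the service it represents
pcs resource create krogan2IP ocf:heartbeat:IPaddr2 ip=192.168.89.2 cidr_netmask=24 op monitor interval=10s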

Tuchanka1 (compressed scheme)

  Structure

[Diagram: Tuchanka1 structure]

The idea is that we have many small databases with low load, for which it is unprofitable to maintain a dedicated slave server in hot standby mode for read-only transactions (there is no need for such a waste of resources).

Each data center has one server. Each server has two PostgreSQL instances (in PostgreSQL terminology they are called clusters, but to avoid confusion I will call them instances, by analogy with other databases, and reserve the word cluster for Pacemaker clusters). One instance works in master mode, and only it provides services (only a float IP leads to it). The second instance works as a slave for the second data center and will only provide services if its master fails. Since most of the time only one of the two instances (the master) will be providing services (performing requests), all server resources are optimized for the master (memory is allocated for the shared_buffers cache, etc.), but in such a way that the second instance also has enough resources (albeit for non-optimal work through the file system cache) in case one of the data centers fails. The slave does not provide services (does not perform read-only requests) during normal cluster operation, so that there is no war for resources with the master on the same machine.

In the case of two nodes, fault tolerance is possible only with asynchronous replication, since with synchronous replication the failure of the slave would lead to the stop of the master.
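
As a hedged sketch only (in the project this file is typically managed by the pgsqlms (PAF) resource agent via a recovery template; the host and application_name here are illustrative), an asynchronous standby on PostgreSQL 11 is described by a recovery.conf like this, while synchronous_standby_names on the master is simply left empty:

cat > "$PGDATA/recovery.conf" <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=krogan1 user=replicator application_name=tuchanka1b'
recovery_target_timeline = 'latest'
EOF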

Witness failure

[Diagram: Witness failure]

I will consider the failure of Witness (the quorum device) only for the Tuchanka1 cluster; the story is the same for all the others. If Witness fails, nothing changes in the cluster structure; everything continues to work as it did. But the quorum becomes 2 out of 3, and therefore any next failure will be fatal for the cluster. So it still needs to be repaired urgently.

Tuchanka1 failure

[Diagram: Tuchanka1 failure]

Failure of one of the data centers for Tuchanka1. In this case, Witness casts its vote for the node in the second data center. There, the former slave turns into a master; as a result, both masters work on the same server and both of their float IPs point to it.

Tuchanka2 (classic)

  Structure

[Diagram: Tuchanka2 structure]

The classic scheme of two nodes. The master works on one, the slave works on the second. Both can execute requests (the slave is only read only), so both are pointed to by float IP: krogan2 is the master, krogan2s1 is the slave. Both the master and the slave will have fault tolerance.

In the case of two nodes, fault tolerance is only possible with asynchronous replication, because with synchronous replication, the failure of the slave will lead to the stop of the master.

Tuchanka2 failure

[Diagram: Tuchanka2 failure]

If one of the data centers fails, Witness votes for the second one. In the only working data center, the master will be raised, and both float IPs will point to it: the master's and the slave's. Of course, the instance must be configured in such a way that it has enough resources (connection limits, etc.) to simultaneously accept all connections and requests arriving via the master and slave float IPs. That is, during normal operation it should have a sufficient margin on its limits.

Tuchanka4 (many slaves)

  Structure

[Diagram: Tuchanka4 structure]

This is the other extreme. There are databases with a lot of read-only requests (the typical case of a highly loaded site). Tuchanka4 is a situation where there may be three or more slaves to handle such requests, but still not too many. With a very large number of slaves, a hierarchical replication system would have to be invented. In the minimal case (in the picture), each of the two data centers has two servers, each with a PostgreSQL instance.

Another feature of this scheme is that it is already possible to organize one synchronous replica here. Replication is configured, whenever possible, to go to the other data center rather than to a replica in the same data center as the master. The master and each slave are pointed to by a float IP. For good measure, some kind of request balancing between the slaves would be needed, an SQL proxy, for example on the client side. Different types of clients may require different types of SQL proxy, and only the client developers know who needs which one. This functionality can be implemented by an external daemon or by a client library (connection pool), etc. All this is beyond the scope of the database failover cluster (a failover SQL proxy can be implemented independently, along with client failover).
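
Purely to illustrate the idea (this is not the project's actual configuration, and the application_name values standby_dc2 and standby_dc1 are made up), such a preference for a synchronous standby in the other data center can be expressed in PostgreSQL as:

# prefer the standby in the other data center as the synchronous one,
# falling back to the local standby if the remote one is unavailable
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (standby_dc2, standby_dc1)';"
psql -U postgres -c "SELECT pg_reload_conf();"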

Tuchanka4 failure

[Diagram: Tuchanka4 failure]

If one data center (i.e. two servers) fails, witness votes for the second one. As a result, two servers work in the second data center: the master works on one, and the master float IP points to it (to receive read-write requests); and a slave with synchronous replication is running on the second server, and one of the slave float IPs points to it (for read only requests).

The first thing to note: not all slave float IPs will work, only one. And in order to work correctly with this, the SQL proxy has to redirect all requests to the only remaining float IP; if there is no SQL proxy, you can list all slave float IPs, separated by commas, in the connection URL. In that case, with libpq, the connection goes to the first working IP, as is done in the automatic testing system. Perhaps in other libraries, for example JDBC, this will not work and an SQL proxy will be necessary. This is done because float IPs for slaves are prohibited from rising simultaneously on one server, so that they are evenly distributed among the slave servers if there are several of them.
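
For example, with libpq (PostgreSQL 10 and later) the slave float IPs can simply be listed in the connection URL; the host names, database and user here are illustrative:

# libpq tries the hosts in order and connects to the first one that answers
psql 'postgresql://app@krogan4s1,krogan4s2/krogan4?target_session_attrs=any'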

Second: even in the event of a data center failure, synchronous replication will be maintained. And even if a secondary failure occurs, that is, one of the two servers in the remaining data center fails, the cluster, although it stops providing services, will still retain information about all committed transactions for which it confirmed the commit (there will be no loss of information on a secondary failure).

Tuchanka3 (3 data centers)

  Structure

[Diagram: Tuchanka3 structure]

This is a cluster for a situation where there are three fully functional data centers, each of which has a fully functional database server. In this case a quorum device is not needed. A master works in one data center, and slaves work in the other two. Replication is synchronous, of type ANY (slave1, slave2), that is, the client receives commit confirmation when whichever slave is first responds that it has accepted the commit. Resources are pointed to by one float IP for the master and two for the slaves. Unlike Tuchanka4, all three float IPs are fault-tolerant. To balance read-only SQL queries, you can use an SQL proxy (with its own fault tolerance), or assign one slave float IP to half of the clients and the other to the other half.
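
In PostgreSQL syntax this kind of quorum-based synchronous replication is written with an explicit count of required acknowledgements; a sketch using the standby names from the description above:

# commit is confirmed as soon as any one of the listed standbys acknowledges it
psql -U postgres -c "ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (slave1, slave2)';"
psql -U postgres -c "SELECT pg_reload_conf();"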

Tuchanka3 failure

[Diagram: Tuchanka3 failure]

If one of the data centers fails, two remain. In one, the master and the master's float IP are raised; in the second, the slave and both slave float IPs (there must be a double resource reserve on that instance to accept all connections from both slave float IPs). Replication between the master and the slave is synchronous. The cluster will also preserve information about committed and confirmed transactions (there will be no loss of information) in case two data centers are destroyed (as long as they are not destroyed at the same time).

I decided not to include a detailed description of the file structure and deployment here. If you want to play around, you can read all of it in the README. Below I describe only the automatic testing system.

Automatic testing system

To check the fault tolerance of the clusters by simulating various faults, an automatic testing system was made. It is launched by the test/failure script. The script can take as parameters the numbers of the clusters that you want to test. For example, this command:

test/failure 2 3

will test only the second and third clusters. If no parameters are specified, then all clusters will be tested. All clusters are tested in parallel and the result is displayed in a tmux panel. Tmux uses a dedicated tmux server, so the script can be run from under the default tmux, resulting in a nested tmux. I recommend using a terminal with a large window and a small font. Before testing starts, all virtual machines are rolled back to a snapshot taken at the moment the setup script finished.

[Screenshot: tmux panels of the automatic testing system]

The terminal is divided into columns according to the number of tested clusters, by default (in the screenshot) there are four of them. I will describe the contents of the columns using Tuchanka2 as an example. The panels in the screenshot are numbered:

  1. Test statistics are displayed here. Columns:
    • failure - the name of the test (a function in the script) that emulates the fault.
    • reaction - the arithmetic mean time, in seconds, in which the cluster restored its operation. It is measured from the start of the script that emulates the fault until the moment when the cluster restores its health and is able to continue providing services. If the time is very short, for example six seconds (this happens in clusters with several slaves, Tuchanka3 and Tuchanka4), it means that the fault landed on an asynchronous slave and did not affect operation in any way; there were no cluster state switches.
    • deviation - shows the spread (accuracy) of the reaction value, calculated as the standard deviation.
    • count - how many times this test has been performed.
  2. A short log that lets you see what the cluster is currently doing. The iteration (test) number, the timestamp, and the operation name are displayed. Too long an execution (> 5 minutes) indicates a problem.
  3. heart - the current time. To visually assess the master's operation, the current time is constantly written into its table via the master's float IP. If successful, the result is displayed in this panel.
  4. beat (pulse) - the "current time" previously recorded by the heart script on the master, now read from the slave through its float IP. It lets you visually assess the operation of the slave and of replication (a minimal sketch of these heart/beat checks follows this list). In Tuchanka1 there are no slaves with a float IP (no slaves providing services), but there are two instances (DBs), so instead of beat, the heart of the second instance is shown here.
  5. Monitoring of the cluster state with the pcs mon utility. It shows the structure, the distribution of resources across nodes, and other useful information.
  6. System monitoring from each virtual machine of the cluster. There may be more such panels, one per virtual machine in the cluster: two CPU load graphs (the virtual machines have two processors), the virtual machine name, System Load (labeled Load Average because it is averaged over 5, 10, and 15 minutes), process data, and memory allocation.
  7. Tracing of the script that performs the tests. In the event of a malfunction - a sudden interruption of work or an endless waiting loop - here you can see the reason for this behavior.
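
A minimal sketch of what these heart/beat checks amount to, assuming a table named heart with a single timestamp column and one row, and using the Tuchanka2 float IP names from above (the actual scripts in the project differ):

# heart: write the current time into the master via its float IP
psql -h krogan2 -U postgres -c "UPDATE heart SET beat = now();"
# beat: read it back from the slave via its float IP to check that replication is alive
psql -h krogan2s1 -U postgres -tAc "SELECT beat FROM heart;"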

Testing is carried out in two stages. First, the script goes through all types of tests, choosing at random a virtual machine to which each test is applied. Then an endless testing cycle is performed, in which a virtual machine and a fault are selected at random each time. Sudden termination of the test script (bottom panel) or an endless loop waiting for something (> 5 minutes for one operation, visible in the trace) indicates that one of the tests on this cluster has failed.

Each test consists of the following operations:

  1. Starting a function that emulates a fault.
  2. Ready? - waiting for the cluster to recover its health (when all services are provided again).
  3. The cluster recovery time is shown (reaction).
  4. Fix - the cluster is "repaired", after which it should return to a fully operational state, ready for the next fault (a simplified sketch of the whole test loop follows this list).
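
Roughly, the endless testing cycle can be pictured like this. This is a simplified sketch, not the actual test/failure code: apply_fault, wait_until_healthy and fix_node are hypothetical helpers, and the node and fault names are illustrative.

# pick a random VM and a random fault, apply it, wait for recovery, repair, repeat
FAULTS=(ForkBomb OutOfSpace Postgres-KILL Postgres-STOP PowerOff Reset SBD-STOP ShutDown UnLink)
VMS=(tuchanka2a tuchanka2b)
while true; do
    vm=${VMS[RANDOM % ${#VMS[@]}]}
    fault=${FAULTS[RANDOM % ${#FAULTS[@]}]}
    start=$(date +%s)
    apply_fault "$fault" "$vm"     # emulate the malfunction
    wait_until_healthy             # the "Ready?" step
    echo "$fault on $vm: reaction $(( $(date +%s) - start ))s"
    fix_node "$vm"                 # the "Fix" step
done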

Here is a list of tests with a description of what they do:

  • Forkbomb: creates an "Out of memory" condition using a fork bomb.
  • OutOfSpace: fills up the hard drive. But this test is rather symbolic: with the insignificant load created during testing, PostgreSQL usually does not fail when the hard drive overflows.
  • Postgres-KILL: kills PostgreSQL with the command killall -KILL postgres.
  • postgres-STOP: hangs PostgreSQL with the command killall -STOP postgres.
  • power off: "de-energizes" the virtual machine with the command VBoxManage controlvm "vm_name" poweroff.
  • Reset: reboots the virtual machine with the command VBoxManage controlvm "vm_name" reset.
  • SBD STOP: suspends the SBD daemon with the command killall -STOP sbd.
  • Shut Down: sends the systemctl poweroff command to the virtual machine via SSH; the system shuts down gracefully.
  • UnLink: network isolation, the command VBoxManage controlvm "vm_name" setlinkstate1 off.

Testing is ended either with the standard tmux "kill-window" command (Ctrl-b &) or with the "detach-client" command (Ctrl-b d): testing is then completed, tmux is closed, and the virtual machines are turned off.

Problems Identified During Testing

  • At the moment, the sbd watchdog daemon handles stops of the daemons it observes, but not their freezes. As a result, faults that freeze only Corosync and Pacemaker, while sbd itself keeps running, are handled incorrectly. For checking Corosync there is already PR#83 (in the SBD repository on GitHub), accepted into the master branch. They promised (in PR#83) that there would be something similar for Pacemaker; I hope it will be done by RedHat 8. But such "malfunctions" are speculative and easy to imitate artificially, for example with killall -STOP corosync, yet they never occur in real life.

  • In the Pacemaker version for CentOS 7, sync_timeout for the quorum device is set incorrectly; as a result, if one node failed, the second node, to which the master was supposed to move, would also reboot with some probability. Cured by increasing sync_timeout for the quorum device during deployment (in the setup/setup1 script). This fix was not accepted by the Pacemaker developers; instead they promised to rework the infrastructure (at some indefinite point in the future) so that this timeout is calculated automatically.

  • If during database configuration you specified that LC_MESSAGES (text messages) may use Unicode, for example ru_RU.UTF-8, then when postgres starts in an environment where the locale is not UTF-8, say an empty environment (which is how pacemaker + pgsqlms (PAF) starts postgres), the log will contain question marks instead of UTF-8 letters. The PostgreSQL developers have not agreed on what to do in this case. The workaround is to set LC_MESSAGES=en_US.UTF-8 when configuring (creating) a DB instance.

  • If wal_receiver_timeout is set (60s by default), then during the postgres-STOP test on the master in the tuchanka3 and tuchanka4 clusters, replication does not reconnect to the new master. Replication there is synchronous, so not only the slave stops, but also the new master. Worked around by setting wal_receiver_timeout=0 when configuring PostgreSQL.

  • I occasionally observed PostgreSQL replication hanging in the ForkBomb test (memory overflow). After ForkBomb, slaves sometimes fail to reconnect to the new master. I have seen this only in the tuchanka3 and tuchanka4 clusters, where the master hung because replication is synchronous. The problem went away by itself after quite a long time (about two hours). More research is needed to fix it. The symptoms are similar to the previous bug, which has a different cause but the same consequences.

Krogan picture taken from Deviant Art with the permission of the author:

[Image: Krogan]

Source: habr.com
