Cluster of two nodes - the devil is in the details

Hey Habr! Here is my translation of the article "Two Nodes - The Devil is in the Details" by Andrew Beekhof.

Many people prefer two-node clusters because they seem conceptually simpler and are also 33% cheaper than their three-node counterparts. And while it is possible to put together a good two-node cluster, in most cases such a configuration creates a number of non-obvious problems because of scenarios that were not accounted for.

The first step in creating any high availability system is to find and try to eliminate individual points of failure, often referred to as SPoF (single point of failure).

It is worth bearing in mind that in any system it is impossible to eliminate all possible risks of downtime. This stems at least from the fact that a typical defense against risk is the introduction of some redundancy, which leads to an increase in the complexity of the system and the appearance of new points of failure. Therefore, we initially compromise and focus on events associated with individual points of failure, and not on chains of related and, therefore, increasingly less likely events.

When considering trade-offs, we are not only looking for SPoFs but also balancing risks and consequences, so the conclusion about what is critical and what is not may differ for each deployment.

Not everyone needs alternative electricity providers with independent transmission lines, although that level of paranoia paid off for at least one customer whose monitoring found a faulty transformer. The customer was on the phone trying to alert the power company when the faulty transformer exploded.

The natural starting point is to have more than one node in the system. However, before the system can move services to a surviving node, it generally needs to be sure that the services being moved are not active elsewhere.

There is no downside if, as a result of a failure, both nodes of a two-node cluster end up serving copies of the same static website. However, things change if both sides end up independently managing a shared job queue or granting uncoordinated write access to a replicated database or shared file system.

Therefore, to prevent data corruption resulting from a single node failure, we rely on something called fencing.

The principle of fencing

The principle of fencing is based on the question: can a competing node cause data corruption? If data corruption is a likely scenario, isolating the node from both incoming requests and persistent storage is a good solution. The most common approach to fencing is to shut down failed nodes.

There are two categories of fencing methods, which I will call direct and indirect, but they can equally be called active and passive. Direct methods involve action by surviving peers, such as interacting with an IPMI (Intelligent Platform Management Interface - an interface for remote monitoring and management of the physical state of a server) or iLO (a mechanism for managing servers in the absence of physical access to them) device, while indirect methods rely on the failed node to somehow recognize that it is in an unhealthy state (or is at least preventing the remaining members from recovering) and to signal a hardware watchdog that the failed node needs to be shut down.

Quorum helps with both direct and indirect methods.

Direct fencing

In the case of direct fencing, we can use quorum to prevent fencing races in the event of a network failure.

With the concept of quorum in place, there is enough information in the system (even without connectivity to their peers) for nodes to automatically know whether they should initiate fencing and/or recovery.
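
As a rough illustration (this is not Pacemaker or Corosync code, just a sketch of the idea in Python), the decision a node can make on its own once quorum is defined might look like this:

import math

def has_quorum(votes_visible: int, total_votes: int) -> bool:
    # A partition has quorum when it holds a strict majority of the votes.
    return votes_visible >= math.floor(total_votes / 2) + 1

def on_peer_lost(votes_visible: int, total_votes: int) -> str:
    # What should a node do after losing contact with some of its peers?
    if has_quorum(votes_visible, total_votes):
        # Majority partition: fence the missing peers, then recover their services.
        return "fence missing peers and recover their services"
    # Minority partition: do nothing and assume the majority will fence us.
    return "stay quiet and wait"

print(on_peer_lost(votes_visible=2, total_votes=3))  # the majority side of a 3-node split
print(on_peer_lost(votes_visible=1, total_votes=3))  # the minority side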

Without quorum, both sides of a network split will reasonably assume that the other side is dead and will try to fence it. In the worst case, both sides succeed in shutting down the entire cluster. An alternative scenario is a deathmatch: an endless loop of nodes coming up, not seeing their peers, rebooting them, and initiating recovery, only to be rebooted themselves when their peer goes through the same logic.

The problem with fencing is that the most commonly used devices become unavailable due to the same failure events we want to use them to recover from. Most IPMI and iLO cards are installed in the hosts they control and, by default, use the same network, which makes them appear offline to the very peers that need to use them.

Unfortunately, the characteristics of IPMI and iLO devices are rarely considered at the time the hardware is purchased.

Indirect fencing

Quorum is also important for indirect fencing: done correctly, it allows survivors to assume that lost nodes have transitioned to a safe state after a certain period of time.

With this setup, the hardware watchdog timer is reset every N seconds unless quorum is lost. If the timer (usually some multiple of N) expires, the device performs an ungraceful power-off (not a clean shutdown).
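
A minimal sketch of that behavior in Python (in real deployments this is handled by something like SBD together with a kernel or hardware watchdog, not by application code; the names and numbers below are purely illustrative):

import time

PET_INTERVAL = 2                      # N: how often the node resets ("pets") the watchdog
WATCHDOG_TIMEOUT = 3 * PET_INTERVAL   # the hardware fires a few multiples of N later

class ToyWatchdog:
    # Stand-in for a real /dev/watchdog device.
    def __init__(self):
        self.last_pet = time.monotonic()
    def pet(self):
        self.last_pet = time.monotonic()
    def expired(self):
        return time.monotonic() - self.last_pet > WATCHDOG_TIMEOUT

def watchdog_loop(watchdog, have_quorum):
    # Pet the watchdog only while quorum is held. Once quorum is lost the node
    # stops petting, and when the timer expires the hardware cuts power
    # ungracefully (simulated here by simply leaving the loop).
    while not watchdog.expired():
        if have_quorum():
            watchdog.pet()
        time.sleep(PET_INTERVAL)

# Quorum lost from the start, so the "node" is powered off after roughly 3 * N seconds.
watchdog_loop(ToyWatchdog(), have_quorum=lambda: False)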

This approach is very effective, but without quorum there is not enough information within the cluster to manage it. It is not easy to tell the difference between a network outage and a node failure. The reason this matters is that without the ability to distinguish between the two cases, you are forced to choose the same behavior in both.

The problem with choosing one mode is that there is no course of action that maximizes availability and prevents data loss.

  • If you choose to assume that the peer node is up when it has actually failed, the cluster will unnecessarily stop (or never start) the services that should be running to compensate for the loss of the failed node's services.
  • If you choose to assume the node is down, but it was just a network outage and in fact the remote node is up, then at best you are subscribing to some future manual reconciliation of the result sets.

No matter what heuristic you use, it is trivial to construct a failure that either leaves both sides running or forces the cluster to shut down the surviving nodes. Not using quorum really does deprive the cluster of one of the most powerful tools in its arsenal.

If there is no other alternative, the best approach is to sacrifice availability (here the author is referring to the CAP theorem). High availability of corrupted data doesn't help anyone, and manually reconciling divergent data sets isn't fun either.

Quorum

Quorum sounds great, right?

The only drawback is that, in order to have quorum in a cluster of N members, you need connectivity between at least N / 2 + 1 of your nodes, which is not possible in a two-node cluster after one node fails.
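
In concrete numbers (a quick check of the usual majority formula, as a Python snippet):

import math

def votes_needed(total_nodes):
    # Strict majority required for quorum.
    return math.floor(total_nodes / 2) + 1

print(votes_needed(3))  # 2: one node can fail and the remaining two still have quorum
print(votes_needed(2))  # 2: a single survivor holds only 1 vote and can never reach quorum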

Which ultimately brings us to the fundamental two-node problem:
Quorum is meaningless in two-node clusters, and without it, it is not possible to reliably determine a course of action that maximizes availability and prevents data loss.
Even in a system of two nodes connected by a crossover cable, it is not possible to definitively distinguish between a network outage and the failure of the other node. The failure of one end of the link (whose probability is surely proportional to the distance between the nodes) is enough to invalidate any assumption that the health of the link equals the health of the partner node.

Making a two-node cluster work

Sometimes a client cannot or does not want to buy a third node, and we are forced to look for an alternative.

Option 1 - A backup fencing method

The node's iLO or IPMI device is itself a single point of failure because, if it fails, the survivor cannot use it to bring the node to a safe state. In a cluster of 3 or more nodes, we can mitigate this by calculating quorum and using a hardware watchdog (the indirect fencing mechanism discussed earlier). In the two-node case, we should instead use network power switches (power distribution units, PDUs).

After a failure, the survivor first attempts to communicate with the primary fencing device (the embedded iLO or IPMI). If successful, recovery continues as usual. Only if the iLO/IPMI device fails is the PDU used; if that succeeds, recovery can continue.
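
In Pacemaker this ordering is normally expressed as fencing levels (a fencing topology) rather than written by hand; the Python sketch below only illustrates the try-in-order logic, and the device functions are hypothetical stand-ins:

def fence_node(node, devices):
    # Try the fencing devices in priority order and stop at the first success.
    # `devices` is ordered: the built-in iLO/IPMI first, the PDU as a fallback.
    for device in devices:
        if device(node):
            return True   # the node is known to be off; recovery may proceed
    return False          # every device failed; recovery must NOT proceed

# Hypothetical device callables, for illustration only.
def ipmi_off(node):
    return False          # e.g. the IPMI card lost power together with its host

def pdu_off(node):
    return True           # the PDU cut power to the node's outlet

assert fence_node("node2", [ipmi_off, pdu_off])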

Be sure to place the PDU on a network separate from cluster traffic, otherwise a single network failure will block access to both fencing devices and prevent service recovery.

At this point, you might be asking - isn't the PDU a single point of failure? To which the answer is, of course.

If this risk is significant to you, you are not alone: connect both nodes to two PDUs and tell the cluster software to use both when powering the nodes up and down. The cluster now survives the failure of one PDU, and a second failure (of either the other PDU or the IPMI device) would be required to block recovery.

Option 2 - Adding an arbiter

In some scenarios, while redundant fencing is technically possible, it is politically difficult. Many companies like to have a certain separation between administrators and application owners, and security-conscious network administrators are not always enthusiastic about handing over PDU access credentials to anyone.

In this case, the recommended alternative is to create a neutral third party that can supplement the quorum calculation.

In the event of a failure, a node must be able to see either its peer or the arbiter in order to recover services. The arbiter also acts as a tie-breaker if both nodes can see the arbiter but cannot see each other.

This option should be used in conjunction with an indirect fencing method, such as a hardware watchdog timer configured to shut down the machine if it loses connection to both its peer node and the arbiter. That way, the survivor can assume with reasonable certainty that its peer node will be in a safe state after the hardware watchdog timer expires.
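
Real deployments typically implement this with something like corosync-qdevice plus a watchdog; the Python sketch below only captures the decision rule, and which of the two nodes actually wins the tie-break is decided by the arbiter itself:

def may_keep_running(sees_peer, sees_arbiter):
    # A node may keep (or recover) services only while it can see its peer or
    # the arbiter; otherwise it must assume it is the isolated side and let
    # the hardware watchdog power it off.
    return sees_peer or sees_arbiter

print(may_keep_running(sees_peer=False, sees_arbiter=True))   # True: network split, the arbiter breaks the tie
print(may_keep_running(sees_peer=False, sees_arbiter=False))  # False: fully isolated, stop petting the watchdog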

The practical difference between an arbiter and a third node is that an arbiter requires far fewer resources to run and potentially can serve more than one cluster.

Option 3 - Human factor

The last approach is to have the survivors continue to run any services they were already running, but not start any new ones, until either the problem resolves itself (network recovery, node reboot) or a person takes responsibility for manually confirming that the other side is dead.

Bonus option

Did I mention that you can add a third node?

Two racks

For the sake of argument, let's pretend I have convinced you of the merits of the third node; now we have to consider the physical layout of the nodes. If they are placed (and powered) in the same rack, that rack is also a SPoF, and one that cannot be solved by adding a second rack.

If this is surprising, consider what would happen if the rack containing two of the nodes failed, and how the surviving node would distinguish that case from a network failure between the racks.

Short answer: it can't, and we are again dealing with all the problems of the two-node case. Either the survivor:

  • ignores quorum and incorrectly attempts to initiate recovery during network outages (whether fencing can actually complete is a separate story and depends on whether a PDU is involved and whether it shares power with either of the racks), or
  • respects quorum and prematurely shuts itself down when its peer node fails.

In any case, two racks are no better than one, and the nodes must either receive independent power supplies or be spread across three (or more, depending on how many nodes you have) racks.

Two data centers

By this point, readers who are no longer risk-averse may consider disaster recovery. What happens when an asteroid hits the same data center with our three nodes spread across three different racks? Obviously Bad Things, but depending on your needs, adding a second data center might not be enough.

If done correctly, the second data center provides you (and this is reasonable) with an up-to-date and consistent copy of your services and their data. However, as in the two-node and two-rack scenarios, there is not enough information in the system to ensure maximum availability and prevent corruption (or data set divergence). Even with three nodes (or racks), spreading them across only two data centers leaves the system unable to reliably make the right decision in the (now much more likely) event that the two sides cannot communicate.

This does not mean that a two-data-center solution is never a good idea. Companies often want a human in the loop before taking the exceptional step of failing over to a backup data center. Just be aware that if you want automated failover, you will either need a third data center so that quorum makes sense (directly or via an arbiter), or you will need a way to reliably power off an entire data center.

