RabbitMQ vs. Kafka: Fault Tolerance and High Availability in Clusters

Fault tolerance and high availability are big topics, so we will devote separate articles to RabbitMQ and Kafka. This article covers RabbitMQ; the next one covers Kafka, in comparison with RabbitMQ. The article is long, so make yourself comfortable.

Let's take a look at the strategies for fault tolerance, consistency, and high availability (HA), and the trade-offs each strategy requires. RabbitMQ can run as a cluster of nodes, in which case it is a distributed system. And when it comes to distributed systems, we inevitably talk about consistency and availability.

These concepts describe how a system behaves under failure: a lost network connection, a server crash, a failed hard disk, a server temporarily unresponsive due to garbage collection, packet loss, or a slow network link. Any of these can lead to data loss or conflicts. It turns out to be practically impossible to build a system that is both fully consistent (no data loss, no data divergence) and available (accepts reads and writes) under every failure scenario.

We will see that consistency and availability sit at opposite ends of a spectrum, and you have to choose which way to optimize. The good news is that RabbitMQ lets you make this choice: it gives you a set of knobs to shift the balance towards more consistency or more availability.

We will pay special attention to configurations that can lose acknowledged writes. There is a chain of responsibility between publishers, brokers, and consumers. Once a message has been handed to the broker, it is the broker's job not to lose it. When the broker acknowledges a message to the publisher, we do not expect that message to ever be lost. But we will see that it can be, depending on how the broker and publisher are configured.

Single Node Resiliency Primitives

Durable Queues/Routing

There are two types of queue in RabbitMQ: durable and non-durable. All queue definitions are stored in the Mnesia database. Durable queues are redeclared at node startup and therefore survive a restart, a system crash, or a server failure (as long as the data survives). So as long as you declare the routing (exchanges and bindings) and the queues as durable, the queuing/routing infrastructure will come back online.

Non-durable queues and routing are removed when the node restarts.

Persistent messages

Just because a queue is durable does not mean that all of its messages will survive a node restart: only the messages that the publisher marked as persistent will. Persistent messages do put extra load on the broker, but if message loss is unacceptable, there is no other option.
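As a minimal sketch of what this looks like from a client, using the Python pika library (the exchange and queue names are placeholders, not from the article): the exchange and queue are declared durable, and the message itself is marked persistent with delivery_mode=2.

```python
import pika

# Minimal sketch: a durable exchange and queue survive a node restart;
# delivery_mode=2 marks the message itself as persistent.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="orders", exchange_type="fanout", durable=True)
channel.queue_declare(queue="orders.all", durable=True)
channel.queue_bind(queue="orders.all", exchange="orders")

channel.basic_publish(
    exchange="orders",
    routing_key="",
    body=b'{"id": 1}',
    properties=pika.BasicProperties(delivery_mode=2),  # persistent message
)
connection.close()
```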

Fig. 1. Durability matrix

Clustering with queue mirroring

To survive the loss of a broker, we need redundancy. We can cluster multiple RabbitMQ nodes and then add additional redundancy by replicating queues across multiple nodes. Thus, if one node goes down, we do not lose data and remain available.

Queue mirroring:

  • one main queue (master) that receives all write and read commands
  • one or more mirrors that receive all messages and metadata from the main queue. These mirrors do not exist for scaling, but purely for redundancy.

Fig. 2. Queue mirroring

Mirroring is configured with a policy, in which you can choose the replication factor and even the nodes on which the queue should be placed. Examples (a sketch of applying such a policy follows the list):

  • ha-mode: all
  • ha-mode: exactly, ha-params: 2 (one master and one mirror)
  • ha-mode: nodes, ha-params: rabbit@node1, rabbit@node2
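As a rough illustration (not from the original article), the second example above could be applied with rabbitmqctl set_policy or through the management HTTP API; the endpoint, port, and credentials below are assumptions.

```python
import requests

API = "http://localhost:15672/api"   # assumed management plugin endpoint
AUTH = ("guest", "guest")            # placeholder credentials

# "exactly two copies": one master plus one mirror for queues named "ha.*".
# Roughly equivalent to:
#   rabbitmqctl set_policy ha-two "^ha\." '{"ha-mode":"exactly","ha-params":2}' --apply-to queues
requests.put(f"{API}/policies/%2F/ha-two", auth=AUTH, json={
    "pattern": "^ha\\.",
    "apply-to": "queues",
    "definition": {"ha-mode": "exactly", "ha-params": 2},
}).raise_for_status()
```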

Publisher Confirms

Publisher confirms are required for consistent writes; without them there is a chance of losing messages. A confirmation is sent to the publisher after the message has been written to disk. RabbitMQ does not write messages to disk on receipt, but in periodic batches, roughly every few hundred milliseconds. When a queue is mirrored, the confirmation is sent only after all mirrors have also written their copy of the message to disk. Confirms therefore add latency, but if data safety matters, they are necessary.
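A minimal publisher-confirms sketch with pika (the queue name and error handling are illustrative assumptions): with confirms enabled, the blocking publish only returns once the broker has acknowledged the message, and raises if the broker returns or nacks it.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.confirm_delivery()  # enable publisher confirms on this channel

try:
    # With confirms on, this call blocks until the broker acks the message
    # (i.e. after it, and any mirrors, have persisted their copy).
    channel.basic_publish(
        exchange="",
        routing_key="orders.all",  # placeholder queue name
        body=b'{"id": 2}',
        properties=pika.BasicProperties(delivery_mode=2),
        mandatory=True,            # also fail if the message cannot be routed
    )
except pika.exceptions.UnroutableError:
    print("message returned: no queue is bound to this routing key")
except pika.exceptions.NackError:
    print("broker nacked the message: retry it or store it locally")
finally:
    connection.close()
```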

Failover Queue

When a broker shuts down or crashes, all the queue masters on that node go down with it. The cluster then takes the oldest mirror of each of those masters and promotes it to be the new master.

Fig. 3. Multiple mirrored queues and their policies

Broker 3 crashes. Note that the mirror of Queue C on Broker 2 is promoted to master. Also note that a new mirror has been created for Queue C on Broker 1. RabbitMQ always tries to maintain the replication factor specified in your policies.

Fig. 4. Broker 3 goes down, causing Queue C to fail over

Next, Broker 1 goes down! We only have one broker left. The Queue B mirror is promoted to master.

Fig. 5

We bring Broker 1 back. Regardless of whether its data survived the loss and recovery of the broker, all mirrored-queue messages on it are discarded on restart. This is important to note because it has consequences, which we will look at shortly. So Broker 1 is a member of the cluster again, and the cluster tries to satisfy the policies, so it creates mirrors on Broker 1.

In this case, the loss of Broker 1 was complete, as was the loss of data, so the non-mirrored Queue B is completely lost.

Fig. 6. Broker 1 returns to service

Broker 3 comes back online, so Queues A and B get their mirrors back to satisfy their HA policies. But now all the queue masters are on the same node! That is not ideal; an even distribution across the nodes is better. Unfortunately, there are no really good options for rebalancing masters. We will come back to this problem later, because first we need to look at queue synchronization.

Fig. 7. Broker 3 returns to service. All the queue masters are on one node!

So by now you should have an idea of how mirrors provide redundancy and fault tolerance: they keep the queue available when a single node fails and protect against data loss. But we are not done yet, because in reality it is all much more complicated.

Synchronization

When a new mirror is created, all new messages are always replicated to it (and to any other mirrors). As for the data already in the master queue, we can either replicate it to the new mirror, making it a complete copy of the master, or choose not to replicate the existing messages and let the master and the new mirror converge over time, as new messages arrive at the tail and existing messages leave the head of the master queue.

This synchronization can be automatic or manual and is controlled by a queue policy. Let's look at an example.

We have two mirrored queues. Queue A is synchronized automatically, while Queue B is synchronized manually. Both queues contain ten messages.
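A hedged sketch of how these two modes might be declared (queue names, vhost, and management-API details are my assumptions, not taken from the article):

```python
import requests

API = "http://localhost:15672/api"   # assumed management plugin endpoint
AUTH = ("guest", "guest")            # placeholder credentials

# Queue A: new mirrors catch up automatically as soon as they are created.
requests.put(f"{API}/policies/%2F/queue-a-ha", auth=AUTH, json={
    "pattern": "^QueueA$", "apply-to": "queues",
    "definition": {"ha-mode": "all", "ha-sync-mode": "automatic"},
}).raise_for_status()

# Queue B: new mirrors stay empty until synchronization is triggered by hand.
requests.put(f"{API}/policies/%2F/queue-b-ha", auth=AUTH, json={
    "pattern": "^QueueB$", "apply-to": "queues",
    "definition": {"ha-mode": "all", "ha-sync-mode": "manual"},
}).raise_for_status()

# Manual synchronization for Queue B, once you decide the blocking is acceptable
# (the same thing rabbitmqctl sync_queue does):
requests.post(f"{API}/queues/%2F/QueueB/actions", auth=AUTH,
              json={"action": "sync"}).raise_for_status()
```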

Fig. 8. Two queues with different synchronization modes

Now we lose Broker 3.

Fig. 9. Broker 3 goes down

Broker 3 is back in service. The cluster creates a mirror for each queue on the new node and automatically synchronizes the new Queue A mirror with the master. The new Queue B mirror, however, remains empty. So we have full redundancy for Queue A, but only a single copy of Queue B's existing messages.

Fig. 10. The new mirror of Queue A receives all existing messages, but the new mirror of Queue B does not

Both queues receive ten more messages. Then Broker 2 goes down and Queue A fails over to its oldest mirror, which is on Broker 1, without any data loss. Queue B now has twenty messages in the master but only ten in its mirror, because it never replicated the original ten messages.

Fig. 11. Queue A fails over to Broker 1 without message loss

Both queues receive another ten messages each. Now Broker 1 crashes. Queue A fails over to its mirror without losing any messages, but Queue B has a problem. At this point we can optimize for either availability or consistency.

If we want to optimize for availability, the ha-promote-on-failure policy should be set to always. This is the default value, so you can simply omit it. In effect, we allow failover to unsynchronized mirrors. Some messages will be lost, but the queue remains readable and writable.

Fig. 12. Queue A fails over to Broker 3 without message loss. Queue B fails over to Broker 3 with ten messages lost

We can also set ha-promote-on-failure to when-synced. In this case, instead of failing over to the mirror, the queue waits until Broker 1 comes back online with its data. Once it returns, the master queue is back on Broker 1 with no data loss. Availability is sacrificed for data safety. But this is a risky mode that can even lead to total data loss, as we will see shortly.

Fig. 13. Queue B remains unavailable after losing Broker 1

You might ask: "Maybe it is better never to use automatic synchronization?" The answer is that synchronization is a blocking operation: while it is running, the master queue cannot perform any reads or writes!

Consider an example with very long queues. How do they grow that large? For several reasons:

  • Queues that are not actively consumed
  • High-throughput queues whose consumers are running slowly right now
  • High-throughput queues where there was an outage and consumers are catching up

Fig. 14. Two large queues with different synchronization modes

Broker 3 now crashes.

Fig. 15. Broker 3 crashes, leaving each queue with one master and one mirror

Broker 3 comes back online and new mirrors are created. The Queue A master starts replicating its existing messages to the new mirror, and during this time the queue is unavailable. If replicating the data takes two hours, that is two hours of downtime for this queue!

Queue B, however, remains available the whole time. It sacrificed some redundancy for availability.

Fig. 16. Queue A remains unavailable during synchronization

After two hours, Queue A also becomes available and can start accepting reads and writes again.

Updates

This blocking behavior during synchronization makes it difficult to upgrade clusters with very large queues. At some point the master node has to be restarted, which means either failing over to a mirror or taking the queue offline for the duration of the server upgrade. If we choose failover, we will lose messages if the mirrors are not synchronized. By default, during a controlled broker shutdown, failover to an unsynchronized mirror is not performed. That means that once the broker comes back, we lose no messages; the only damage was queue downtime. The behavior when a broker is shut down is governed by the ha-promote-on-shutdown policy, which can take one of two values (a combined policy sketch follows the list):

  • always: failover to unsynchronized mirrors is allowed
  • when-synced: fail over only to a synchronized mirror; otherwise the queue becomes unavailable for reads and writes, and it returns to service as soon as the broker comes back
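Combining the settings discussed so far, a consistency-leaning policy might look roughly like this (pattern, names, and API details are assumptions):

```python
import requests

API = "http://localhost:15672/api"   # assumed management plugin endpoint
AUTH = ("guest", "guest")            # placeholder credentials

# Hypothetical "consistency over availability" policy: three copies, automatic sync,
# and promotion of only synchronized mirrors on both failure and controlled shutdown.
requests.put(f"{API}/policies/%2F/ha-consistent", auth=AUTH, json={
    "pattern": "^ha\\.",
    "apply-to": "queues",
    "definition": {
        "ha-mode": "exactly",
        "ha-params": 3,
        "ha-sync-mode": "automatic",
        "ha-promote-on-failure": "when-synced",
        "ha-promote-on-shutdown": "when-synced",
    },
}).raise_for_status()
```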

One way or another, with large queues, you have to choose between data loss and unavailability.

When Availability Improves Data Security

There is one more complication to consider before making a decision. While automatic sync is better for redundancy, how does it affect data security? Of course, with better redundancy, RabbitMQ is less likely to lose existing messages, but what about new messages from publishers?

Here you need to consider the following:

  • Can the publisher just return an error and have the upstream service or user try again later?
  • Can the publisher save the message locally or in the database to try again later?

If the publisher can only drop the message, then improving availability in fact also improves data safety.
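As a rough sketch of the second option from the list above (buffer locally, retry later), under the assumption of a publisher with access to a local SQLite file; none of the names below come from the article:

```python
import json
import sqlite3
import pika

db = sqlite3.connect("outbox.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (body TEXT)")

def publish_or_buffer(channel, message: dict):
    """Publish on a channel with confirm_delivery() enabled; if the broker nacks
    the message or is unreachable, park it in a local outbox so a background
    job can retry it later."""
    try:
        channel.basic_publish(
            exchange="", routing_key="orders.all",   # placeholder queue name
            body=json.dumps(message),
            properties=pika.BasicProperties(delivery_mode=2),
        )
    except (pika.exceptions.NackError, pika.exceptions.AMQPError):
        db.execute("INSERT INTO outbox (body) VALUES (?)", (json.dumps(message),))
        db.commit()
```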

Thus, a balance must be sought, and the decision depends on the specific situation.

Problems with ha-promote-on-failure=when-synced

The idea of ha-promote-on-failure=when-synced is that we prevent failover to an unsynchronized mirror and thus avoid data loss; the queue simply remains unavailable for reads and writes. Instead, we try to bring the failed broker back with its data intact, so that it can resume as master without losing anything.

But (and this is a big but) if the broker has lost its data, then we have a big problem: the queue is lost! All the data is gone! Even if you have mirrors that are almost caught up with the master, those mirrors are discarded too.

To re-add a node with the same name, we tell the cluster to forget the orphaned node (with rabbitmqctl forget_cluster_node) and start a new broker with the same hostname. While the cluster remembers the lost node, it remembers the old queue and the unsynchronized mirrors. When the cluster is told to forget the lost node, that queue is forgotten too, and now it has to be re-declared. We have lost all the data even though we had mirrors holding a partial set of it. It would have been better to fail over to an unsynchronized mirror!

So manual synchronization (or simply never synchronizing) combined with ha-promote-on-failure=when-synced is quite risky in my opinion. The docs say the option exists for data safety, but it is a double-edged sword.

Master rebalancing

As promised, let's return to the problem of all the masters accumulating on one or a few nodes. This can even happen as a result of a rolling cluster upgrade: in a three-node cluster, all the queue masters can end up on one or two nodes.

Rebalancing masters can be problematic for two reasons:

  • There are no good tools to perform rebalancing
  • Queue synchronization

There is a third-party plugin for rebalancing, which is not officially supported. Regarding third-party plugins, the RabbitMQ manual says: "The plugin provides some additional configuration and reporting tools, but is not supported or tested by the RabbitMQ team. Use at your own risk."

There is another trick to move the master queue using HA policies. The manual mentions a script for this. It works like this (a rough Python sketch follows the list):

  • Removes all mirrors using a temporary policy with a higher priority than the existing HA policy.
  • Changes the HA temporary policy to use "nodes" mode, specifying the node to which you want to move the main queue.
  • Synchronizes the queue for forced migration.
  • After the migration is complete, it deletes the temporary policy. The original HA policy takes effect and the required number of mirrors are created.
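I have not reproduced the script itself, but a rough Python sketch of the same sequence of steps via the management API might look like this (endpoint, credentials, and the naive wait are all assumptions):

```python
import time
import requests

API = "http://localhost:15672/api"   # assumed management plugin endpoint
AUTH = ("guest", "guest")            # placeholder credentials
VHOST = "%2F"                        # "/" URL-encoded

def move_master(queue: str, target_node: str) -> None:
    # 1-2. Temporary higher-priority policy that pins the queue to the target node only.
    requests.put(f"{API}/policies/{VHOST}/temp-move-{queue}", auth=AUTH, json={
        "pattern": f"^{queue}$",
        "apply-to": "queues",
        "priority": 100,  # must outrank the regular HA policy
        "definition": {"ha-mode": "nodes", "ha-params": [target_node]},
    }).raise_for_status()

    # 3. Force synchronization so the mirror on the target node can take over.
    requests.post(f"{API}/queues/{VHOST}/{queue}/actions", auth=AUTH,
                  json={"action": "sync"}).raise_for_status()
    time.sleep(5)  # naive wait; a real script would poll the queue's state

    # 4. Drop the temporary policy; the original HA policy applies again
    #    and the required number of mirrors is recreated.
    requests.delete(f"{API}/policies/{VHOST}/temp-move-{queue}", auth=AUTH).raise_for_status()
```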

The downside is that this approach may not work if you have large queues or strict redundancy requirements.

Now let's see how RabbitMQ clusters work with network partitions.

Connectivity Disruption

The nodes of a distributed system are connected by network links, and network links can and will be disconnected. The frequency of outages depends on the local infrastructure or the reliability of the selected cloud. Either way, distributed systems should be able to handle them. Again we have a choice between availability and consistency, and again the good news is that RabbitMQ provides both options (just not at the same time).

With RabbitMQ we have two main options:

  • Allow a split-brain. This preserves availability but can cause data loss.
  • Prevent a split-brain. This may cause a short loss of availability, depending on how clients connect to the cluster, and it can make a two-node cluster completely unavailable.

So what is a split-brain? It is when the cluster splits in two because of lost network links. On each side, mirrors are promoted to master, so that in the end there are several masters per queue.

Fig. 17. A master queue and two mirrors, each on a separate node. Then a network failure occurs and one mirror is cut off. The separated node sees that the other two have gone down and promotes its mirror to master. We now have two master queues, both readable and writable.

If publishers send data to both masters, we will have two diverging copies of the queue.

RabbitMQ's different partition handling modes provide either availability or consistency.
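The mode itself is a node-level setting (cluster_partition_handling in the RabbitMQ configuration), not a queue policy. As a small hedged illustration, the sketch below only detects a partition by polling the management API; the endpoint and credentials are assumptions:

```python
import requests

API = "http://localhost:15672/api"   # assumed management plugin endpoint
AUTH = ("guest", "guest")            # placeholder credentials

# The behaviour on partition is chosen in the broker configuration, e.g.
#   cluster_partition_handling = ignore | autoheal | pause_minority
# This sketch only detects a split: each node reports the peers it believes
# it is partitioned from.
def partitioned_nodes():
    nodes = requests.get(f"{API}/nodes", auth=AUTH).json()
    return {n["name"]: n["partitions"] for n in nodes if n.get("partitions")}

if __name__ == "__main__":
    split = partitioned_nodes()
    print("split-brain detected:", split) if split else print("no partitions reported")
```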

Ignore mode (default)

This mode favors availability. When connectivity is lost, a split-brain occurs. Once connectivity is restored, the administrator has to decide which partition wins. The losing side is restarted and all the data accumulated on that side is lost.

Fig. 18. Three publishers are connected to three brokers. Internally, the cluster routes all requests to the master queue on Broker 2.

Now we lose connectivity to Broker 3. It sees that the other brokers have gone away and promotes its mirror to master. This is the split-brain.

Fig. 19. Split-brain. Writes go to both masters and the two copies diverge.

Connectivity is restored, but the split-brain remains. The administrator has to manually choose the losing side. In the case below, the administrator restarts Broker 3. Any messages it did not manage to hand over are lost.

Fig. 20. The administrator shuts down Broker 3.

Fig. 21. The administrator starts Broker 3 and it rejoins the cluster, losing all the messages that were left on it.

Both during the loss of connectivity and after it was restored, the cluster and this queue remained available for reads and writes.

Autoheal mode

This works like ignore mode, except that after the split and reconnection the cluster itself automatically chooses the losing side. The losing side rejoins the cluster empty, and the queue loses all messages that were sent only to that side.

Pause Minority mode

If we don't want to allow a split-brain, our only option is to refuse reads and writes on the minority side of the partition. When a broker sees that it is on the minority side, it pauses: it closes all existing connections and refuses any new ones. Once per second it checks whether connectivity has returned; as soon as it has, the broker resumes work and rejoins the cluster.

Fig. 22. Three publishers are connected to three brokers. Internally, the cluster routes all requests to the master queue on Broker 2.

Brokers 1 and 2 then separate from Broker 3. Instead of promoting its mirror to master, Broker 3 pauses and becomes unavailable.

Fig. 23. Broker 3 pauses, disconnects all clients, and rejects connection requests.

As soon as connectivity is restored, it returns to the cluster.

Let's look at another example where the main queue is on Broker 3.

Fig. 24. The master queue is on Broker 3.

Then the same loss of connectivity occurs. Broker 3 pauses because it is on the minority side. On the majority side, the nodes see that Broker 3 has gone, so the oldest of the mirrors on Brokers 1 and 2 is promoted to master.

Fig. 25. Failover to Broker 2 while Broker 3 is unavailable.

When connectivity is restored, Broker 3 rejoins the cluster.

Fig. 26. The cluster returns to normal operation.

The important point here is that we get consistency, but we can also keep availability if we can successfully redirect clients to the majority partition. For most situations I would personally choose Pause Minority mode, but it really depends on the particular case.

For availability, it is important that clients can successfully connect to a live node. Let's look at the options.

Ensuring client connectivity

We have several options for directing clients to the majority part of the cluster, or to working nodes, after a loss of connectivity or the failure of a single node. First, remember that a given queue lives on a particular node, but routing and policies are replicated across all nodes. Clients can connect to any node, and internal routing forwards requests where they need to go. But a paused node rejects connections, so clients have to connect to a different node. And a node that has gone down cannot do much of anything at all.

Our options:

  • Access the cluster through a load balancer that simply cycles through the nodes, while clients retry connecting until they succeed. If a node is down or paused, attempts to connect to it will fail, but subsequent attempts go to other servers (in round-robin fashion). This is suitable for a brief loss of connectivity or a downed server that will be brought back quickly.
  • Access the cluster through a load balancer and remove paused or failed nodes from the pool as soon as they are detected. If this happens quickly, and clients can retry their connections, we get continuous availability.
  • Give each client a list of all nodes, and have the client pick one at random when connecting. If it gets an error while trying to connect, it moves to the next node in the list until it connects (see the sketch after this list).
  • Steer traffic away from a failed or paused node using DNS, with a short TTL.
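The third option from the list above might look roughly like this on the client side (hostnames, retry counts, and delays are placeholders):

```python
import random
import time
import pika

NODES = ["rabbit1.example.com", "rabbit2.example.com", "rabbit3.example.com"]  # placeholder hostnames

def connect(nodes, attempts=10, delay=2.0):
    """Try the nodes in random order until one accepts the connection."""
    last_error = None
    for _ in range(attempts):
        for host in random.sample(nodes, len(nodes)):
            try:
                return pika.BlockingConnection(pika.ConnectionParameters(host=host))
            except pika.exceptions.AMQPConnectionError as error:
                last_error = error  # node is down or paused (pause_minority); try the next one
        time.sleep(delay)
    raise last_error

connection = connect(NODES)
```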

Conclusions

RabbitMQ clustering has its advantages and disadvantages. The most serious disadvantages are that:

  • when joining a cluster, nodes discard their data;
  • blocking synchronization causes the queue to become unavailable.

All the difficult decisions stem from these two architectural features. If RabbitMQ kept its data when rejoining the cluster, synchronization would be faster. If it were capable of non-blocking synchronization, it would handle large queues better. Fixing these two issues would greatly improve RabbitMQ as a fault-tolerant, highly available messaging technology. I would hesitate to recommend clustered RabbitMQ in the following situations:

  • Unreliable network.
  • Unreliable storage.
  • Very long queues.

As for settings for high availability, consider these:

  • ha-promote-on-failure=always
  • ha-sync-mode=manual
  • cluster_partition_handling=ignore (or autoheal)
  • persistent messages
  • make sure clients reconnect to a live node when some node goes down

For consistency (data security), consider the following settings:

  • Publisher Confirms and Manual Acknowledgments on the consumer side
  • ha-promote-on-failure=when-synced, if publishers can retry later and you have very reliable storage! Otherwise set it to always.
  • ha-sync-mode=automatic (but for large, mostly idle queues manual mode may be needed; also consider whether the unavailability will cause messages to be lost)
  • Pause Minority mode
  • persistent messages

We have not yet covered all the issues of fault tolerance and high availability; for example, how to safely perform administrative procedures (such as rolling updates). We also need to talk about federation and the Shovel plugin.

If I've missed anything else, please let me know.

See also my post where I wreck a RabbitMQ cluster with Docker and Blockade to test some of the message-loss scenarios described in this article.

Previous articles in the series:
No. 1 - habr.com/ru/company/itsumma/blog/416629
No. 2 - habr.com/ru/company/itsumma/blog/418389
No. 3 - habr.com/ru/company/itsumma/blog/437446

Source: habr.com
